Jake, take a look at Vowpal Wabbit 5.0. I saw an incremental LDA
implementation there. Might be scalable.


On Tue, Jan 4, 2011 at 6:21 AM, Jake Mannix <[email protected]> wrote:

> Hey all,
>
>  tl;dr Learning about how to work with, improve, and understand our LDA
> impl.  If you don't care about dimensional reduction with LDA, stop here.
>
>  Long time, not much see (although the excuse of "new job" starts to get
> old after 6 months, so I'll try to stop using that one).  I've been starting
> to do some large-scale dimensional reduction tests lately, and as things go
> with me, I write a bunch of map-reduce code to do something fancy in one
> place (SVD with both rows and columns scaling out to 10^8 or so), and end up
> wanting to test against some other state of the art (e.g. LDA), and now I'm
> all sidetracked: I want to understand better both a) how LDA works in
> general, and b) how our impl scales.
>
>  I've been making some minor changes (in particular, implementing
> MAHOUT-458 <https://issues.apache.org/jira/browse/MAHOUT-458>, among other
> things; that issue seems to have been closed even though it was never
> committed, nor was its functionality ever put into trunk in another patch,
> from what I can tell), and noticed that one of the major slowdowns is
> probably due to
> the fact that the p(word|topic) matrix is loaded via side-channel means on
> every pass, on every node, and as has been mentioned in previous threads
> (for example, "Running LDA over Wikipedia"
> <http://markmail.org/message/ua5hckybpkj3stdl>), this puts an absolute cap
> on the size of the possible vocabulary (numTerms * numTopics * 8 bytes <
> memory per mapper), in addition to being a huge
> slowdown: every mapper must load this data via HDFS after every iteration.
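>
>  To put numbers on that (purely an illustration, not measurements from a
> real corpus): with a vocabulary of 10^6 terms and 200 topics, that matrix is
>
>    10^6 terms * 200 topics * 8 bytes = 1.6 GB
>
> which would already blow past a mapper heap of, say, 1 GB before a single
> document has been read, and it gets pulled down again on every iteration.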
>
>  To improve this, the scalable thing is to load only the columns of
> p(word|topic) for the words in a document.  Doing a reduce-side join
> (effectively running a transpose operation on the documents together with a
> "project onto the word_id" mapper on the word/topic matrix), and doing
> inference and parameter update in the reducer would accomplish this, but it
> would have to be followed up by another map-reduce pass of an identity
> mapper and a log-summing reducer (the old LDAReducer).  It's an extra
> shuffle per MR pass, but the data transfer seems like it would be
> phenomenally less, and it should scale properly now - you would never hold
> anything in memory the size of numTerms (or numDocs), and would instead be
> limited only by docLength * numTopics in memory at the inference step.
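>
>  To make the shape of that join concrete, here is a very rough sketch.  The
> key/value types and input layouts below are assumptions for illustration
> (e.g. that both the corpus and the topic-term matrix sit in SequenceFiles of
> <IntWritable, VectorWritable>), not what's in trunk, and the two mappers
> would get wired together with something like MultipleInputs:
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.Iterator;
> import java.util.List;
>
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.Reducer;
> import org.apache.mahout.math.Vector;
> import org.apache.mahout.math.VectorWritable;
>
> public class WordTopicJoinSketch {
>
>   // Side 1: "project onto the word_id" over the p(word|topic) matrix.
>   // Input: one row per topic; output: one record per (word, topic) pair,
>   // keyed by word id.
>   public static class TopicMatrixMapper
>       extends Mapper<IntWritable, VectorWritable, IntWritable, Text> {
>     @Override
>     protected void map(IntWritable topicId, VectorWritable termProbs,
>         Context ctx) throws IOException, InterruptedException {
>       Iterator<Vector.Element> it = termProbs.get().iterateNonZero();
>       while (it.hasNext()) {
>         Vector.Element e = it.next();
>         ctx.write(new IntWritable(e.index()),
>                   new Text("M\t" + topicId.get() + "\t" + e.get()));
>       }
>     }
>   }
>
>   // Side 2: effectively a transpose of the corpus.  Input: one term-count
>   // vector per document; output: one record per (word, doc) pair, keyed by
>   // word id.
>   public static class DocumentMapper
>       extends Mapper<IntWritable, VectorWritable, IntWritable, Text> {
>     @Override
>     protected void map(IntWritable docId, VectorWritable termCounts,
>         Context ctx) throws IOException, InterruptedException {
>       Iterator<Vector.Element> it = termCounts.get().iterateNonZero();
>       while (it.hasNext()) {
>         Vector.Element e = it.next();
>         ctx.write(new IntWritable(e.index()),
>                   new Text("D\t" + docId.get() + "\t" + e.get()));
>       }
>     }
>   }
>
>   // For one word id, the reducer sees that word's column of p(word|topic)
>   // plus every (doc, count) pair that needs it, so nothing of size numTerms
>   // is ever resident.  The per-word piece of inference / parameter update
>   // goes where the comment is; its partial results then feed the second
>   // pass (identity map + log-summing reduce).  A real version would use a
>   // secondary sort so the "M" records arrive first and the doc records can
>   // be streamed instead of buffered.
>   public static class JoinReducer
>       extends Reducer<IntWritable, Text, IntWritable, Text> {
>     @Override
>     protected void reduce(IntWritable wordId, Iterable<Text> values,
>         Context ctx) throws IOException, InterruptedException {
>       List<String> topicProbs = new ArrayList<String>();
>       List<String> docCounts = new ArrayList<String>();
>       for (Text v : values) {
>         if (v.toString().startsWith("M")) {
>           topicProbs.add(v.toString());
>         } else {
>           docCounts.add(v.toString());
>         }
>       }
>       // ... per-word slice of inference / parameter update here ...
>     }
>   }
> }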
>
>  Thoughts on this, for anyone who's been playing around with the LDA code
> (*crickets*...)?
>
>  Another matter, more on the theoretical side of LDA:  the algorithm
> "prefers sparse results", from what I understand of the lore (you're
> optimizing under an L_1 constraint instead of L_2, so...), but how can you
> make this explicit?  The current implementation has some "noise" factor of a
> minimal probability that most words have in any topic, and so when you run
> the Reuters example (50k terms, 23k documents, 20 topics) and look at the
> actual probabilities (not possible to do on trunk, but I'll put up a patch
> shortly), what you find is that while the probability that topic_0 will generate one
> of the top 175 terms is about 50%, the long tail is rather long: the
> probability that topic_0 will generate one of the top 1000 terms is up to
> 83%, and the probability that topic_0 will generate one of the top 5000 is
> finally 99.6%.
>
>  This is probably the wrong way to look at what the "true" top terms for a
> latent factor are, because they're probabilities, and if the total
> dictionary size is 50,000, you'd expect that random noise would say
> p(term|topic) >= 1/50,000, right?  In which case the right cutoff would be
> "ignore all terms with probabilities less than (say) 10x this value".
> Given this constraint, it looks like on the Reuters data set, each topic
> ends up with about 800 terms above this cutoff.  Does this sound right?
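>
>  In code, the pruning I have in mind is nothing fancier than this (plain
> arrays purely for illustration, the 10x noise multiple is obviously a knob,
> and renormalizing / re-smoothing the dropped mass is left out):
>
>   /** Zero out entries of one topic's p(word|topic) row that are
>    *  indistinguishable from uniform noise over the vocabulary. */
>   static double[] pruneTopic(double[] pWordGivenTopic, double noiseMultiple) {
>     int numTerms = pWordGivenTopic.length;       // e.g. 50,000 for Reuters
>     double cutoff = noiseMultiple / numTerms;    // 10 / 50,000 = 2e-4
>     double[] pruned = new double[numTerms];
>     for (int term = 0; term < numTerms; term++) {
>       if (pWordGivenTopic[term] >= cutoff) {
>         pruned[term] = pWordGivenTopic[term];    // keeps ~800 terms per topic
>       }                                          // drops the 2e-11 noise
>     }
>     return pruned;
>   }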
>
>  The reason I ask is that, to help with scaling, we really shouldn't be
> hanging onto probabilities like 2e-11 - they just don't help with anything,
> and dropping them would greatly shrink our representation at the end of the
> day and, more importantly, the data transferred among mappers and reducers
> while actually doing the calculation.  Also, I've been led to believe by my
> more-in-the-know-about-LDA colleagues that since LDA is so L_1-centric, you
> really can try to throw tons of topics at it, and it doesn't really overfit
> that much, because it basically just splits the topics up into sub-topics
> (the topics get more sparse).  You don't necessarily get much better
> generalization error, but not necessarily any worse either.
>  Has anyone else seen this kind of effect?
>
>  -jake
>
