Jake, take a look at Vowpal Wabbit 5.0. I saw an incremental LDA implementation there. Might be scalable.
On Tue, Jan 4, 2011 at 6:21 AM, Jake Mannix <[email protected]> wrote:
> Hey all,
>
> tl;dr Learning about how to work with, improve, and understand our LDA impl. If you don't care about dimensional reduction with LDA, stop here.
>
> Long time, no see (although the excuse of "new job" starts to get old after 6 months, so I'll try to stop using that one). I've been starting to do some large-scale dimensional reduction tests lately, and as things go with me, I write a bunch of map-reduce code to do something fancy in one place (SVD with both rows and columns scaling out to 10^8 or so), end up wanting to test against some other state of the art (e.g. LDA), and now I'm all sidetracked: I want to understand better both a) how LDA works in general, and b) how our impl scales.
>
> I've been making some minor changes (in particular, implementing MAHOUT-458 <https://issues.apache.org/jira/browse/MAHOUT-458> among other things, which seems to have been closed even though it was never committed, nor was its functionality ever put into trunk in another patch, from what I can tell), and noticed that one of the major slowdowns is probably due to the fact that the p(word|topic) matrix is loaded via side-channel means on every pass, on every node. As has been mentioned in previous threads (for example, Running LDA over Wikipedia <http://markmail.org/message/ua5hckybpkj3stdl>), this puts an absolute cap on the size of the possible vocabulary (numTerms * numTopics * 8 bytes < memory per mapper), in addition to being a huge slowdown: every mapper must load this data via HDFS after every iteration.
>
> To improve this, the scalable thing is to load only the columns of p(word|topic) for the words in a document. Doing a reduce-side join (effectively running a transpose operation on the documents together with a "project onto the word_id" mapper on the word/topic matrix) and doing inference and parameter update in the reducer would accomplish this, but it would have to be followed up by another map-reduce pass with an identity mapper and a log-summing reducer (the old LDAReducer). It's an extra shuffle per MR pass, but the data transfer looks like it would be phenomenally smaller, and it should now scale properly: you would never hold anything in memory the size of numTerms (or numDocs), and would instead be limited to only docLength * numTopics in memory at the inference step.
>
> Thoughts on this, for anyone who's been playing around with the LDA code (*crickets*...)?
>
> Another matter, more on the theoretical side of LDA: the algorithm "prefers sparse results", from what I understand of the lore (you're optimizing under an L_1 constraint instead of L_2, so...), but how can you make this explicit? The current implementation has some "noise" factor of a minimal probability that most words have in any topic, and when you run the Reuters example (50k terms, 23k documents, 20 topics) and look at the actual probabilities (not possible to do on trunk, but I'll put up a patch shortly), what you see is that while the probability that topic_0 will generate one of its top 175 terms is about 50%, the long tail is rather long: the probability that topic_0 will generate one of the top 1000 terms is up to 83%, and the probability that it will generate one of the top 5000 only finally reaches 99.6%.
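For the reduce-side join you sketch above, I'm picturing something roughly like the two mappers below, both keyed by word id. The class names, the Text-based tagging, and the assumption that both inputs are (IntWritable, VectorWritable) sequence files are just my guesses at how it might look, not anything in trunk:

// Sketch only: class names, key/tag layout, and input formats are my own
// assumptions, not existing Mahout code.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/** "Transpose" side: re-key each document by the words it contains. */
class DocByWordMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, Text> {
  private final IntWritable wordId = new IntWritable();
  private final Text tagged = new Text();

  @Override
  protected void map(IntWritable docId, VectorWritable doc, Context ctx)
      throws IOException, InterruptedException {
    Iterator<Vector.Element> it = doc.get().iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      wordId.set(e.index());
      // tag document-side records so the reducer can tell them apart
      // from topic-matrix rows: DOC <docId> <termCount>
      tagged.set("DOC\t" + docId.get() + "\t" + e.get());
      ctx.write(wordId, tagged);
    }
  }
}

/** "Project onto word_id" side: pass each row of p(word|topic) through. */
class TopicRowMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, Text> {
  private final Text tagged = new Text();

  @Override
  protected void map(IntWritable wordId, VectorWritable topicRow, Context ctx)
      throws IOException, InterruptedException {
    // serialize the numTopics probabilities for this word: TOPICS p_0 ... p_k
    StringBuilder sb = new StringBuilder("TOPICS");
    Vector row = topicRow.get();
    for (int k = 0; k < row.size(); k++) {
      sb.append('\t').append(row.get(k));
    }
    tagged.set(sb.toString());
    ctx.write(wordId, tagged);
  }
}

A reducer keyed on wordId would then see one TOPICS record plus a DOC record per document containing that word, so it only ever holds numTopics values plus that word's document entries, and nothing the size of numTerms * numTopics has to sit in a single JVM; the follow-up identity-map / log-summing-reduce pass you mention would handle the normalization.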
> This is probably the wrong way to look at what the "true" top terms for a latent factor are, because they're probabilities, and if the total dictionary size is 50,000, you'd expect that random noise alone would give p(term|topic) >= 1/50,000, right? In which case the right cutoff would be "ignore all terms with probabilities less than (say) 10x this value". Given this constraint, it looks like on the Reuters data set each topic ends up with about 800 terms above the cutoff. Does this sound right?
>
> The reason I ask is that, to help with scaling, we really shouldn't be hanging onto probabilities like 2e-11; they just don't help with anything, and our representation at the end of the day, and more importantly during data transfer among mappers and reducers while actually doing the calculation, would be much smaller. More importantly, I've been led to believe by my more-in-the-know-about-LDA colleagues that since LDA is so L_1-centric, you really can throw tons of topics at it and it doesn't overfit that much, because it basically just splits topics up into sub-topics (the topics get more sparse). You don't necessarily get much better generalization error, but not necessarily any worse either. Has anyone else seen this kind of effect?
>
> -jake
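And on the noise-floor cutoff: something like the helper below, applied per topic before the model is written out, is the kind of pruning I'd imagine; the helper and its signature are mine, purely illustrative, but it matches the "drop everything below 10x the uniform 1/numTerms floor" rule you describe.

import java.util.Iterator;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

class TopicPruning {
  /**
   * Drop p(term|topic) entries below cutoffFactor / numTerms,
   * e.g. 10 / 50,000 = 2e-4 for the Reuters vocabulary.
   */
  static Vector prune(Vector pTermGivenTopic, int numTerms, double cutoffFactor) {
    double floor = cutoffFactor / numTerms;
    Vector pruned = new RandomAccessSparseVector(numTerms);
    Iterator<Vector.Element> it = pTermGivenTopic.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      if (e.get() >= floor) {
        pruned.set(e.index(), e.get());
      }
    }
    return pruned;
  }
}

If your Reuters numbers hold up, that keeps roughly 800 terms per topic out of 50k, which would also shrink what gets shuffled between iterations.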
