Hey all,

  tl;dr Learning how to work with, improve, and understand our LDA
impl.  If you don't care about dimensionality reduction with LDA, stop here.

  Long time, not much see (although the excuse of "new job" starts to get
old after 6 months, so I'll try to stop using that one).  I've been starting
to do some large-scale dimensionality reduction tests lately, and as things
go with me, I write a bunch of map-reduce code to do something fancy in one
place (SVD with both rows and columns scaling out to 10^8 or so), end up
wanting to test against some other state of the art (e.g. LDA), and now I'm
all sidetracked: I want to better understand both a) how LDA works in
general, and b) how our impl scales.

  I've been making some minor changes (implementing, among other things,
MAHOUT-458 <https://issues.apache.org/jira/browse/MAHOUT-458>, which seems
to have been closed even though it was never committed, nor was its
functionality ever put into trunk in another patch, from what I can tell),
and noticed that one of the major slowdowns is probably due to the fact
that the p(word|topic) matrix is loaded via side-channel means on every
pass, on every node.  As has been mentioned in previous threads (for
example, Running LDA over Wikipedia
<http://markmail.org/message/ua5hckybpkj3stdl>), this puts an absolute cap
on the size of the possible vocabulary (numTerms * numTopics * 8 bytes <
memory per mapper), in addition to being a huge slowdown: every mapper must
reload this data over HDFS after every iteration.
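
  (For concreteness: a vocabulary of 10^6 terms with 200 topics means
10^6 * 200 * 8 bytes = 1.6GB of dense doubles sitting in every single
mapper, before it has done any work at all.)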

  To improve this, the scalable thing is to load only the columns of
p(word|topic) for the words which actually occur in a given document.  A
reduce-side join (effectively a transpose operation on the documents,
together with a "project onto the word_id" mapper on the word/topic
matrix), with inference and parameter update done in the reducer, would
accomplish this, but it would have to be followed by another map-reduce
pass of an identity mapper and a log-summing reducer (the old LDAReducer).
That's an extra shuffle per MR pass, but the data transfer seems like it
would be phenomenally less, and it should now scale nearly perfectly: you
would never hold anything of size numTerms (or numDocs) in memory, only
docLength * numTopics at the inference step.
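
  To make the join concrete, here's a rough sketch of the two input
mappers and the reducer.  The class names, value types, and the trick of
using isDense() to tell the model column apart from document entries are
all my own illustration, not anything in trunk; it assumes documents are
stored as (docId, term-count vector) pairs, the model as one dense topic
column per word id, and the two inputs wired together with something like
MultipleInputs:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    /** "Transpose" side: explode each document into (wordId, {docId: count}). */
    class DocTransposeMapper
        extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
      private static final int NUM_DOCS = 1 << 25; // hypothetical corpus size
      @Override
      protected void map(IntWritable docId, VectorWritable doc, Context ctx)
          throws IOException, InterruptedException {
        Iterator<Vector.Element> it = doc.get().iterateNonZero();
        while (it.hasNext()) {
          Vector.Element e = it.next();
          Vector entry = new RandomAccessSparseVector(NUM_DOCS, 1);
          entry.set(docId.get(), e.get()); // sparse: just this doc's count
          ctx.write(new IntWritable(e.index()), new VectorWritable(entry));
        }
      }
    }

    /** Model side: pass each word's dense topic column through unchanged. */
    class ModelColumnMapper
        extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
      @Override
      protected void map(IntWritable wordId, VectorWritable topicColumn,
          Context ctx) throws IOException, InterruptedException {
        ctx.write(wordId, topicColumn); // p(word|topic) for this word only
      }
    }

    /** Join: each word's topic column meets the (doc, count) pairs needing it. */
    class LdaJoinReducer
        extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
      @Override
      protected void reduce(IntWritable wordId, Iterable<VectorWritable> values,
          Context ctx) throws IOException, InterruptedException {
        Vector topicColumn = null;
        List<Vector> docCounts = new ArrayList<Vector>();
        for (VectorWritable vw : values) {
          Vector v = vw.get().clone(); // Hadoop reuses value objects; copy
          if (v.isDense()) {
            topicColumn = v;           // the lone dense model column
          } else {
            docCounts.add(v);          // sparse {docId: count} entries
          }
        }
        // inference + partial parameter update for this word would go here,
        // emitting per-topic sufficient statistics for the follow-up
        // identity-map / log-summing-reduce pass.
      }
    }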

  Thoughts on this, for anyone who's been playing around with the LDA code
(*crickets*...)?

  Another matter, more on the theoretical side of LDA: the algorithm
"prefers sparse results", from what I understand of the lore (you're
optimizing under an L_1 constraint instead of L_2, so...), but how can you
make this explicit?  The current implementation has a "noise" factor of a
minimal probability that most words have in any topic, so when you run the
Reuters example (50k terms, 23k documents, 20 topics) and look at the
actual probabilities (not possible on trunk, but I'll put up a patch
shortly), you find that while the probability that topic_0 generates one of
its top 175 terms is about 50%, the long tail is rather long: the
probability that topic_0 generates one of its top 1000 terms is up to 83%,
and only by the top 5000 does it finally reach 99.6%.
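
  For the record, numbers like the above can be reproduced with something
like this (a throwaway sketch, not the forthcoming patch): sort one topic's
term distribution and accumulate the top k:

    import java.util.Arrays;

    class TopicMass {
      /** Total probability mass of the top-k terms of one p(term|topic) row. */
      static double topKMass(double[] pTermGivenTopic, int k) {
        double[] p = pTermGivenTopic.clone();
        Arrays.sort(p); // ascending
        double mass = 0.0;
        for (int i = 0; i < k; i++) {
          mass += p[p.length - 1 - i]; // walk down from the largest
        }
        return mass; // k=175 -> ~0.50, k=1000 -> ~0.83 on Reuters topic_0
      }
    }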

  This is probably the wrong way to identify the "true" top terms for a
latent factor, though, because these are probabilities: if the total
dictionary size is 50,000, you'd expect random noise alone to put
p(term|topic) >= 1/50,000, right?  In which case the right cutoff would be
"ignore all terms with probability less than (say) 10x this value".  With
that cutoff, it looks like on the Reuters data set each topic ends up with
about 800 terms above the threshold.  Does this sound right?
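
  In code, the pruning heuristic I have in mind is just the following (the
10x factor being the guess above, nothing principled, and pTermGivenTopic /
numTerms are stand-in names):

    // Drop p(term|topic) entries below 10x the uniform 1/numTerms noise floor.
    double cutoff = 10.0 / numTerms; // e.g. 10/50,000 = 2e-4 on Reuters
    Vector pruned = new RandomAccessSparseVector(numTerms);
    Iterator<Vector.Element> it = pTermGivenTopic.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      if (e.get() >= cutoff) {
        pruned.set(e.index(), e.get()); // keep only above-noise terms
      }
    }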

  The reason I ask is that, to help with scaling, we really shouldn't be
hanging onto probabilities like 2e-11 - they just don't help with anything,
and dropping them would shrink our representation at the end of the day
and, more importantly, during data transfer among mappers and reducers
while actually doing the calculation.  Beyond that, I've been led to
believe by my more-in-the-know-about-LDA colleagues that since LDA is so
L_1-centric, you really can throw tons of topics at it and it doesn't
overfit that much, because it basically just splits topics into sub-topics
(the topics get more sparse).  You don't necessarily get much better
generalization error, but not necessarily any worse either.  Has anyone
else seen this kind of effect?

  -jake
