I won't be able to complain today, so silent assent applies to me.
On Fri, Dec 2, 2011 at 12:53 PM, Jake Mannix (Commented) (JIRA)
<j...@apache.org> wrote:

> [ https://issues.apache.org/jira/browse/MAHOUT-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161849#comment-13161849 ]
>
> Jake Mannix commented on MAHOUT-897:
> ------------------------------------
>
> I'm going to commit the latest patch later today if I don't hear any
> complaints.
>
> > New implementation for LDA: Collapsed Variational Bayes (0th derivative
> > approximation), with map-side model caching
> > -----------------------------------------------------------------------
> >
> >                 Key: MAHOUT-897
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-897
> >             Project: Mahout
> >          Issue Type: New Feature
> >          Components: Clustering
> >    Affects Versions: 0.6
> >            Reporter: Jake Mannix
> >            Assignee: Jake Mannix
> >              Labels: clustering, lda
> >             Fix For: 0.6
> >
> >         Attachments: MAHOUT-897.diff, MAHOUT-897.diff
> >
> >
> > The current LDA implementation in Mahout suffers from a few issues:
> >
> > 1) It's based on the original Variational Bayes E/M training methods of
> > Blei et al. (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf),
> > which are a) significantly more complex to implement/maintain, and
> > b) significantly slower than subsequently discovered techniques.
> >
> > 2) The entire "current working model" is held in memory in each Mapper,
> > which limits the scalability of the implementation: numTerms in
> > vocabulary * numTopics * 8 bytes per double must be less than the
> > mapper heap size.
> >
> > 3) The sufficient statistics which need to be emitted by the mappers
> > scale as numTopics * numNonZeroEntries in the corpus. Even with
> > judicious use of Combiners (currently implemented), this can get
> > prohibitively expensive in terms of network + disk usage.
> >
> > In particular, point 3 looks like: a 1B-nonzero-entry corpus in Mahout
> > would take up about 12GB of RAM in total, but if you wanted 200 topics,
> > you'd be generating 2.5TB of disk + network traffic *per E/M iteration*
> > (1B entries * 200 topics * 8 bytes per double is already 1.6TB, before
> > any key or serialization overhead). Running a moderate 40 iterations,
> > we're talking about 100TB. Having tried this implementation on a
> > 6B-nonzero-entry input corpus with 100 topics (500k term vocabulary, so
> > memory wasn't an issue), I've seen this in practice: even on our
> > production Hadoop cluster with many thousands of map slots available,
> > one iteration was taking more than 3.5 hours to get to 50% completion
> > of the mapper tasks.
> >
> > Point 1) was simple to improve: switch from VB to the algorithm labeled
> > CVB0 ("Collapsed Variational Bayes, 0th derivative approximation") in
> > Asuncion et al. (http://www.datalab.uci.edu/papers/uai_2009.pdf). I
> > tried many approaches to get the overall distributed side of the
> > algorithm to scale better, originally aiming at removing point 2), but
> > it turned out that point 3) was what kept rearing its ugly head. The
> > way that YahooLDA (https://github.com/shravanmn/Yahoo_LDA) and many
> > others have achieved high scalability is by doing distributed Gibbs
> > sampling, but that requires holding the model in distributed memory and
> > querying it continually via RPC. This could be done in something like
> > Giraph or Spark, but not in vanilla Hadoop M/R.
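(Aside for anyone reading along in the archives: the CVB0 update from the
Asuncion et al. paper really is pleasantly simple, which is a big part of the
appeal over full VB. Here is a minimal sketch of the per-token update in Java;
this is my own illustration, not code from the patch, and the dense-array
names are invented:

    // Sketch only: one CVB0 update for a single token occurrence, following
    // Asuncion et al. (2009). All names are illustrative, not from MAHOUT-897.
    final class Cvb0Sketch {
      static double[] cvb0Update(double[] gamma,           // current q(z=k) for this token
                                 double[] docTopicCounts,  // expected topic counts for the doc
                                 double[] topicTermCounts, // expected counts of this term per topic
                                 double[] topicTotals,     // expected total counts per topic
                                 double alpha, double eta, int numTerms) {
        int numTopics = gamma.length;
        double[] next = new double[numTopics];
        double norm = 0.0;
        for (int k = 0; k < numTopics; k++) {
          // "leave one out": subtract this token's current soft assignment
          double dk = docTopicCounts[k] - gamma[k];
          double wk = topicTermCounts[k] - gamma[k];
          double tk = topicTotals[k] - gamma[k];
          next[k] = (dk + alpha) * (wk + eta) / (tk + numTerms * eta);
          norm += next[k];
        }
        for (int k = 0; k < numTopics; k++) {
          next[k] /= norm; // renormalize to a proper topic distribution
        }
        return next;
      }
    }

As the issue says, the hard part isn't this update, it's the distributed
bookkeeping around it.)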
> > The end result was to actually make point 2) even *worse*: instead of
> > relying on Hadoop combiners to aggregate sufficient statistics for the
> > model, you do a full map-side cache of (this mapper's slice of) the
> > next iteration's model, emit nothing in each map() call, emit the
> > entire model at cleanup(), and then the reducers simply sum the
> > sub-models. This effectively becomes a form of ensemble learning: each
> > mapper learns its own sequential model and emits it, the reducers (one
> > for each topic) sum these models into one, and that summed model is fed
> > out to all the mappers in the next iteration.
> >
> > In its current form, this LDA implementation can churn through about
> > two M/R iterations per hour on the same cluster/data set mentioned
> > above, which makes it at least 15x faster on larger data sets.
> >
> > It probably requires a fair amount of documentation / cleanup, but it
> > comes with a nice end-to-end unit test (the same one added in
> > MAHOUT-399), and it also comes with an "in-memory" version of the same
> > algorithm for smaller datasets (i.e. those which can fit in memory).
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
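(One more aside, since the map-side-caching trick is the interesting part:
here is a rough sketch of what "emit nothing in map(), emit the whole model
in cleanup()" looks like in Hadoop terms. Again, this is my own
simplification, not the MAHOUT-897 patch: the class names, config keys, and
the dense double[][] are invented, and the actual inference is elided.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Accumulates this mapper's slice of the *next* model in memory and
    // emits nothing until cleanup(), as described above.
    class CachingCVB0Mapper
        extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

      private int numTopics;
      private double[][] nextTopicTermCounts; // [topic][term]: the map-side cache

      @Override
      protected void setup(Context ctx) {
        numTopics = ctx.getConfiguration().getInt("cvb0.numTopics", 100);
        int numTerms = ctx.getConfiguration().getInt("cvb0.numTerms", 500000);
        nextTopicTermCounts = new double[numTopics][numTerms];
        // real code would also load the *current* model (e.g. from the
        // distributed cache) to run inference against
      }

      @Override
      protected void map(IntWritable docId, VectorWritable doc, Context ctx) {
        // run CVB0 inference on this document against the cached current
        // model, folding the resulting expected counts into
        // nextTopicTermCounts; crucially, nothing is written to the context
      }

      @Override
      protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // emit the whole accumulated sub-model exactly once, keyed by topic,
        // so one reducer per topic can sum the per-mapper sub-models
        for (int k = 0; k < numTopics; k++) {
          ctx.write(new IntWritable(k),
                    new VectorWritable(new DenseVector(nextTopicTermCounts[k])));
        }
      }
    }

    // One reducer per topic sums the per-mapper sub-models into one row of
    // the next iteration's model.
    class SubModelSummingReducer
        extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

      @Override
      protected void reduce(IntWritable topic, Iterable<VectorWritable> parts,
                            Context ctx) throws IOException, InterruptedException {
        Vector sum = null;
        for (VectorWritable part : parts) {
          sum = (sum == null) ? new DenseVector(part.get()) : sum.plus(part.get());
        }
        ctx.write(topic, new VectorWritable(sum));
      }
    }

The point of the pattern is visible in the shape alone: map() produces no
intermediate records at all, so shuffle volume is one model copy per mapper
rather than numTopics * numNonZeroEntries records.)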