[
https://issues.apache.org/jira/browse/MAHOUT-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jake Mannix updated MAHOUT-897:
-------------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
Closed indeed!
> New implementation for LDA: Collapsed Variational Bayes (0th derivative
> approximation), with map-side model caching
> -------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-897
> URL: https://issues.apache.org/jira/browse/MAHOUT-897
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.6
> Reporter: Jake Mannix
> Assignee: Jake Mannix
> Labels: clustering, lda
> Fix For: 0.6
>
> Attachments: MAHOUT-897.diff, MAHOUT-897.diff
>
>
> The current LDA implementation in Mahout suffers from a few issues:
> 1) It's based on the original Variational Bayes E/M training method of
> Blei et al. (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf),
> which is a) significantly more complex to implement/maintain, and b)
> significantly slower than subsequently discovered techniques.
> 2) The entire "current working model" is held in memory in each Mapper,
> which limits scalability: numTerms in vocabulary * numTopics * 8 bytes per
> double must be less than the mapper heap size.
> 3) The sufficient statistics that need to be emitted by the mappers scale
> as numTopics * numNonZeroEntries in the corpus. Even with judicious use of
> Combiners (currently implemented), this can get prohibitively expensive in
> terms of network + disk usage.
> In particular, point 3 looks like this: a 1B nonzero entry corpus in Mahout
> would take up about 12GB of RAM in total, but if you wanted 200 topics, you'd
> be using 2.5TB of disk+network traffic *per E/M iteration*. Running a moderate
> 40 iterations, we're talking about 100TB. Having tried this implementation on
> a 6B nonzero entry input corpus with 100 topics (500k term vocabulary, so
> memory wasn't an issue), I've seen this in practice: even on our production
> Hadoop cluster with many thousands of map slots available, a single iteration
> took more than 3.5 hours to get to 50% completion of the mapper tasks.
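> To make that arithmetic concrete, here is a minimal sketch of the
> back-of-envelope calculation. The raw doubles alone account for most of the
> quoted figures; attributing the remainder to SequenceFile keys and record
> framing is my assumption, not a measured number:
> {code:java}
> public class ShuffleVolumeEstimate {
>   public static void main(String[] args) {
>     long nonZeroEntries = 1000000000L; // 1B nonzero (doc, term) entries
>     int numTopics = 200;
>     int bytesPerDouble = 8;
>     int iterations = 40;
>
>     // Each nonzero entry emits one partial count per topic, so the raw
>     // payload of sufficient statistics per E/M iteration is:
>     long payloadBytes = nonZeroEntries * numTopics * bytesPerDouble;
>     // ~1.6e12 bytes of doubles alone; keys and record framing (assumed)
>     // make up the rest of the ~2.5TB per-iteration figure quoted above.
>     System.out.printf("raw payload per iteration: %.1f TB%n",
>         payloadBytes / 1e12);
>     System.out.printf("raw payload over %d iterations: %.0f TB%n",
>         iterations, iterations * payloadBytes / 1e12);
>   }
> }
> {code}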
> Point 1) was simple to improve: switch from VB to an algorithm labeled CVB0
> ("Collapsed Variational Bayes, 0th derivative approximation") in Asuncion
> et al. (http://www.datalab.uci.edu/papers/uai_2009.pdf). I tried many
> approaches to get the overall distributed side of the algorithm to scale
> better, originally aiming at removing point 2), but it turned out that point
> 3) was what kept rearing its ugly head. The way that YahooLDA
> (https://github.com/shravanmn/Yahoo_LDA) and many others have achieved high
> scalability is by doing distributed Gibbs sampling, but that requires that
> you hold onto the model in distributed memory and query it continually via
> RPC. This could be done in something like Giraph or Spark, but not in
> vanilla Hadoop M/R.
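> For reference, the heart of CVB0 is a very simple per-token update: the
> topic posterior for each (doc, term) occurrence is proportional to
> (termTopicCount + eta) / (topicCount + numTerms * eta) * (docTopicCount +
> alpha). A minimal sketch follows; the variable names are mine, not the
> patch's, and the "exclude this token's own previous contribution"
> correction is omitted for brevity:
> {code:java}
> public class Cvb0Update {
>   /**
>    * CVB0 topic posterior for one (document, term) occurrence, using only
>    * the 0th-order (mean) term of the Taylor expansion over the expected
>    * topic counts -- no variance bookkeeping, unlike full CVB.
>    */
>   static double[] topicPosterior(double[] docTopicCounts,
>                                  double[] termTopicCounts,
>                                  double[] topicCounts,
>                                  int numTerms, double alpha, double eta) {
>     int numTopics = topicCounts.length;
>     double[] gamma = new double[numTopics];
>     double norm = 0;
>     for (int k = 0; k < numTopics; k++) {
>       gamma[k] = (termTopicCounts[k] + eta)
>           / (topicCounts[k] + numTerms * eta)
>           * (docTopicCounts[k] + alpha);
>       norm += gamma[k];
>     }
>     for (int k = 0; k < numTopics; k++) {
>       gamma[k] /= norm; // normalize to a distribution over topics
>     }
>     return gamma;
>   }
> }
> {code}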
> The end result was to actually make point 2) even *worse*: instead of
> relying on Hadoop combiners to aggregate sufficient statistics for the model,
> you do a full map-side cache of (this mapper's slice of) the next
> iteration's model, emit nothing in each map() call, emit the entire
> model at cleanup(), and then the reducer simply sums the sub-models. This
> effectively becomes a form of ensemble learning: each mapper learns its own
> sequential model and emits it, the reducers (one for each topic) sum these
> models into one, and that is fed out to all the mappers in the next iteration.
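> Concretely, the map-side caching pattern looks roughly like this (class
> names, config keys, and field names here are illustrative, not the actual
> ones from the patch, and the per-document CVB0 inference is elided):
> {code:java}
> import java.io.IOException;
>
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.mahout.math.DenseVector;
> import org.apache.mahout.math.Vector;
> import org.apache.mahout.math.VectorWritable;
>
> public class CachingLdaMapper
>     extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
>
>   // This mapper's slice of the next iteration's model: one term-count
>   // vector per topic, accumulated across all map() calls and emitted
>   // only once, in cleanup().
>   private Vector[] topicTermSums;
>   private int numTopics;
>
>   @Override
>   protected void setup(Context ctx) {
>     numTopics = ctx.getConfiguration().getInt("cvb.numTopics", 20);
>     int numTerms = ctx.getConfiguration().getInt("cvb.numTerms", 500000);
>     topicTermSums = new Vector[numTopics];
>     for (int k = 0; k < numTopics; k++) {
>       topicTermSums[k] = new DenseVector(numTerms);
>     }
>   }
>
>   @Override
>   protected void map(IntWritable docId, VectorWritable doc, Context ctx) {
>     // Run CVB0 inference on this document against the cached model and
>     // fold the resulting expected counts into topicTermSums. Nothing is
>     // written to the Context here.
>   }
>
>   @Override
>   protected void cleanup(Context ctx)
>       throws IOException, InterruptedException {
>     // Emit the entire locally trained sub-model once, keyed by topic, so
>     // one reducer per topic can sum the mappers' sub-models.
>     for (int k = 0; k < numTopics; k++) {
>       ctx.write(new IntWritable(k), new VectorWritable(topicTermSums[k]));
>     }
>   }
> }
> {code}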
> In its current form, this LDA implementation can churn through about two M/R
> iterations per hour on the same cluster/data set mentioned above, which makes
> it at least 15x faster on larger data sets.
> It probably requires a fair amount of documentation/cleanup, but it comes
> with a nice end-to-end unit test (the same one added in MAHOUT-399), and
> also comes with an "in-memory" version of the same algorithm for smaller
> datasets (i.e. those that fit in memory).