[
https://issues.apache.org/jira/browse/MAHOUT-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160226#comment-13160226
]
[email protected] commented on MAHOUT-897:
------------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2944/
-----------------------------------------------------------
(Updated 2011-11-30 18:42:21.747592)
Review request for mahout and Ted Dunning.
Changes
-------
More license headers, and some javadocs.
Also: factor out an "LDASampler" which builds a sampler based on a Matrix
representation of a topic model.
Summary
-------
See MAHOUT-897
This addresses bug MAHOUT-897.
https://issues.apache.org/jira/browse/MAHOUT-897
Diffs (updated)
-----
trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDADriver.java
1208294
trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDASampler.java
PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0DocInferenceMapper.java
PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java
PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0TopicTermVectorNormalizerMapper.java
PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CachingCVB0Mapper.java
PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CachingCVB0PerplexityMapper.java
PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/InMemoryCollapsedVariationalBayes0.java
PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/ModelTrainer.java
PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/TopicModel.java
PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/common/MemoryUtil.java
PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/common/Pair.java 1208294
trunk/core/src/main/java/org/apache/mahout/math/DistributedRowMatrixWriter.java
PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/math/MatrixUtils.java PRE-CREATION
trunk/core/src/main/java/org/apache/mahout/math/stats/Sampler.java
PRE-CREATION
trunk/core/src/test/java/org/apache/mahout/clustering/ClusteringTestUtils.java
1208294
trunk/core/src/test/java/org/apache/mahout/clustering/lda/TestMapReduce.java
1208294
trunk/core/src/test/java/org/apache/mahout/clustering/lda/cvb/TestCVBModelTrainer.java
PRE-CREATION
trunk/core/src/test/java/org/apache/mahout/math/stats/SamplerTest.java
PRE-CREATION
trunk/src/conf/driver.classes.props 1208294
Diff: https://reviews.apache.org/r/2944/diff
Testing
-------
mvn clean test
Thanks,
Jake
> New implementation for LDA: Collapsed Variational Bayes (0th derivative
> approximation), with map-side model caching
> -------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-897
> URL: https://issues.apache.org/jira/browse/MAHOUT-897
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.6
> Reporter: Jake Mannix
> Assignee: Jake Mannix
> Labels: clustering, lda
> Fix For: 0.6
>
> Attachments: MAHOUT-897.diff
>
>
> Current LDA implementation in Mahout suffers from a few issues:
> 1) it's based on the original Variational Bayes E/M training methods of
> Blei et al (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf),
> which are a) significantly more complex to implement/maintain, and b)
> significantly slower than subsequently discovered techniques
> 2) the entire "current working model" is held in memory in each Mapper,
> which limits the scalability of the implementation by numTerms in vocabulary
> * numTopics * 8bytes per double being less than the mapper heap size.
> 3) the sufficient statistics which need to be emitted by the mappers scale
> as numTopics * numNonZeroEntries in the corpus. Even with judicious use of
> Combiners (currently implemented), this can get prohibitively expensive in
> terms of network + disk usage.
> In particular, point 3 looks like: a 1B nonzero entry corpus in Mahout would
> take up about 12GB of RAM in total, but if you wanted 200 topics, you'd be
> using 2.5TB if disk+network traffic *per E/M iteration*. Running a moderate
> 40 iterations we're talking about 100TB. Having tried this implementation on
> a 6B nonzero entry input corpus with 100 topics (500k term vocabulary, so
> memory wasn't an issue), I've seen this in practice: even with our production
> Hadoop cluster with many thousands of map slots available, even one iteration
> was taking more than 3.5hours to get to 50% completion of the mapper tasks.
> Point 1) was simple to improve: switch from VB to an algorithm labeled CVB0
> ("Collapsed Variational Bayes, 0th derivative approximation") in Ascuncion,
> et al ( http://www.datalab.uci.edu/papers/uai_2009.pdf ). I tried many
> approaches to get the overall distributed side of the algorithm to scale
> better, originally aiming at removing point 2), but it turned out that point
> 3) was what kept rearing its ugly head. The way that YahooLDA (
> https://github.com/shravanmn/Yahoo_LDA ) and many others have achieved high
> scalability is by doing distributed Gibbs sampling, but that requires that
> you hold onto the model in distributed memory and query it continually via
> RPC. This could be done in something like Giraph or Spark, but not in
> vanilla Hadoop M/R.
> The end result was to actually make point 2) even *worse*, and instead of
> relying on Hadoop combiners to aggregate sufficient statistics for the model,
> you instead do a full map-side cache of (this mapper's slice of) the next
> iteration's model, and emit nothing in each map() call, emitting the entire
> model at cleanup(), and then the reducer simply sums the sub-models. This
> effectively becomes a form of ensemble learning: each mapper learns its own
> sequential model, emits it, the reducers (one for each topic) sum up these
> models into one, which is fed out to all the models in the next iteration.
> In its current form, this LDA implementation can churn through about two M/R
> iterations per hour on the same cluster/data set mentioned above (which makes
> it at least 15x faster on larger data sets).
> It probably requires a fair amount of documentation / cleanup, but it comes
> with a nice end-to-end unit test (same as the one added to MAHOUT-399), and
> also comes with an "in-memory" version of the same algorithm, for smaller
> datasets (i.e. those which can fit in memory).
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira