[ https://issues.apache.org/jira/browse/MAHOUT-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162055#comment-13162055 ]
Hudson commented on MAHOUT-897:
-------------------------------

Integrated in Mahout-Quality #1219 (See [https://builds.apache.org/job/Mahout-Quality/1219/])
    fixes MAHOUT-897 New Latent Dirichlet Allocation implementation, etc.

jmannix : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209794
Files :
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDADriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDASampler.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0DocInferenceMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0TopicTermVectorNormalizerMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CachingCVB0Mapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CachingCVB0PerplexityMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/InMemoryCollapsedVariationalBayes0.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/ModelTrainer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/TopicModel.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/common/MemoryUtil.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/common/Pair.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/DistributedRowMatrixWriter.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/MatrixUtils.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/stats/Sampler.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/ClusteringTestUtils.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/lda/TestMapReduce.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/lda/cvb
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/lda/cvb/TestCVBModelTrainer.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/stats/SamplerTest.java
* /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/vectors/VectorDumper.java
* /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/vectors/VectorHelper.java
* /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/vectors/VectorHelperTest.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/AbstractVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/NamedVector.java
* /mahout/trunk/src/conf/driver.classes.props

> New implementation for LDA: Collapsed Variational Bayes (0th derivative
> approximation), with map-side model caching
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-897
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-897
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>              Labels: clustering, lda
>             Fix For: 0.6
>
>         Attachments: MAHOUT-897.diff, MAHOUT-897.diff
>
>
> The current LDA implementation in Mahout suffers from a few issues:
> 1) It is based on the original Variational Bayes E/M training methods of Blei et al. (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf), which are a) significantly more complex to implement/maintain, and b) significantly slower than subsequently discovered techniques.
> 2) The entire "current working model" is held in memory in each Mapper, which limits the scalability of the implementation: numTerms in the vocabulary * numTopics * 8 bytes per double must stay below the mapper heap size.
> 3) The sufficient statistics which need to be emitted by the mappers scale as numTopics * numNonZeroEntries in the corpus. Even with judicious use of Combiners (currently implemented), this can get prohibitively expensive in terms of network + disk usage.
> In particular, point 3) looks like this: a 1B nonzero entry corpus in Mahout would take up about 12GB of RAM in total, but if you wanted 200 topics, you'd be using 2.5TB of disk + network traffic *per E/M iteration*. Running a moderate 40 iterations, we're talking about 100TB. Having tried this implementation on a 6B nonzero entry input corpus with 100 topics (500k term vocabulary, so memory wasn't an issue), I've seen this in practice: even with our production Hadoop cluster with many thousands of map slots available, one iteration was taking more than 3.5 hours to get to 50% completion of the mapper tasks.
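> As a back-of-the-envelope check on the numbers above (every figure here comes from this description except the per-entry overhead factor, which is an assumption):
>
> \[ 10^{9}\ \text{nonzeros} \times 200\ \text{topics} \times 8\ \text{B/double} = 1.6\ \text{TB of raw doubles per iteration} \]
> \[ 1.6\ \text{TB} \times {\sim}1.5\ \text{(assumed key + serialization overhead)} \approx 2.5\ \text{TB}, \qquad 2.5\ \text{TB} \times 40\ \text{iterations} = 100\ \text{TB} \]
>
> and for point 2)'s memory bound, \( 5 \times 10^{5}\ \text{terms} \times 100\ \text{topics} \times 8\ \text{B} = 400\ \text{MB} \), which is why memory wasn't the issue in the 6B-entry run.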
entire "current working model" is held in memory in each Mapper, > which limits the scalability of the implementation by numTerms in vocabulary > * numTopics * 8bytes per double being less than the mapper heap size. > 3) the sufficient statistics which need to be emitted by the mappers scale > as numTopics * numNonZeroEntries in the corpus. Even with judicious use of > Combiners (currently implemented), this can get prohibitively expensive in > terms of network + disk usage. > In particular, point 3 looks like: a 1B nonzero entry corpus in Mahout would > take up about 12GB of RAM in total, but if you wanted 200 topics, you'd be > using 2.5TB if disk+network traffic *per E/M iteration*. Running a moderate > 40 iterations we're talking about 100TB. Having tried this implementation on > a 6B nonzero entry input corpus with 100 topics (500k term vocabulary, so > memory wasn't an issue), I've seen this in practice: even with our production > Hadoop cluster with many thousands of map slots available, even one iteration > was taking more than 3.5hours to get to 50% completion of the mapper tasks. > Point 1) was simple to improve: switch from VB to an algorithm labeled CVB0 > ("Collapsed Variational Bayes, 0th derivative approximation") in Ascuncion, > et al ( http://www.datalab.uci.edu/papers/uai_2009.pdf ). I tried many > approaches to get the overall distributed side of the algorithm to scale > better, originally aiming at removing point 2), but it turned out that point > 3) was what kept rearing its ugly head. The way that YahooLDA ( > https://github.com/shravanmn/Yahoo_LDA ) and many others have achieved high > scalability is by doing distributed Gibbs sampling, but that requires that > you hold onto the model in distributed memory and query it continually via > RPC. This could be done in something like Giraph or Spark, but not in > vanilla Hadoop M/R. > The end result was to actually make point 2) even *worse*, and instead of > relying on Hadoop combiners to aggregate sufficient statistics for the model, > you instead do a full map-side cache of (this mapper's slice of) the next > iteration's model, and emit nothing in each map() call, emitting the entire > model at cleanup(), and then the reducer simply sums the sub-models. This > effectively becomes a form of ensemble learning: each mapper learns its own > sequential model, emits it, the reducers (one for each topic) sum up these > models into one, which is fed out to all the models in the next iteration. > In its current form, this LDA implementation can churn through about two M/R > iterations per hour on the same cluster/data set mentioned above (which makes > it at least 15x faster on larger data sets). > It probably requires a fair amount of documentation / cleanup, but it comes > with a nice end-to-end unit test (same as the one added to MAHOUT-399), and > also comes with an "in-memory" version of the same algorithm, for smaller > datasets (i.e. those which can fit in memory). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira