[jira] [Commented] (MAHOUT-897) New implementation for LDA: Collapsed Variational Bayes (0th derivative approximation), with map-side model caching

[email protected] (Commented) (JIRA) Wed, 30 Nov 2011 10:44:04 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160226#comment-13160226
 ]


[email protected] commented on MAHOUT-897:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2944/
-----------------------------------------------------------

(Updated 2011-11-30 18:42:21.747592)


Review request for mahout and Ted Dunning.


Changes
-------

More license headers, and some javadocs.

Also: factor out an "LDASampler" which builds a sampler based on a Matrix 
representation of a topic model.  


Summary
-------

See MAHOUT-897


This addresses bug MAHOUT-897.
    https://issues.apache.org/jira/browse/MAHOUT-897


Diffs (updated)
-----

  trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDADriver.java 
1208294 
  trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDASampler.java 
PRE-CREATION 
  
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0DocInferenceMapper.java
 PRE-CREATION 
  trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java 
PRE-CREATION 
  
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0TopicTermVectorNormalizerMapper.java
 PRE-CREATION 
  
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CachingCVB0Mapper.java
 PRE-CREATION 
  
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CachingCVB0PerplexityMapper.java
 PRE-CREATION 
  
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/InMemoryCollapsedVariationalBayes0.java
 PRE-CREATION 
  
trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/ModelTrainer.java 
PRE-CREATION 
  trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/TopicModel.java 
PRE-CREATION 
  trunk/core/src/main/java/org/apache/mahout/common/MemoryUtil.java 
PRE-CREATION 
  trunk/core/src/main/java/org/apache/mahout/common/Pair.java 1208294 
  
trunk/core/src/main/java/org/apache/mahout/math/DistributedRowMatrixWriter.java 
PRE-CREATION 
  trunk/core/src/main/java/org/apache/mahout/math/MatrixUtils.java PRE-CREATION 
  trunk/core/src/main/java/org/apache/mahout/math/stats/Sampler.java 
PRE-CREATION 
  
trunk/core/src/test/java/org/apache/mahout/clustering/ClusteringTestUtils.java 
1208294 
  trunk/core/src/test/java/org/apache/mahout/clustering/lda/TestMapReduce.java 
1208294 
  
trunk/core/src/test/java/org/apache/mahout/clustering/lda/cvb/TestCVBModelTrainer.java
 PRE-CREATION 
  trunk/core/src/test/java/org/apache/mahout/math/stats/SamplerTest.java 
PRE-CREATION 
  trunk/src/conf/driver.classes.props 1208294 

Diff: https://reviews.apache.org/r/2944/diff


Testing
-------

mvn clean test


Thanks,

Jake


                
> New implementation for LDA: Collapsed Variational Bayes (0th derivative 
> approximation), with map-side model caching
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-897
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-897
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>              Labels: clustering, lda
>             Fix For: 0.6
>
>         Attachments: MAHOUT-897.diff
>
>
> Current LDA implementation in Mahout suffers from a few issues:
>   1) it's based on the original Variational Bayes E/M training methods of 
> Blei et al (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf), 
> which are a) significantly more complex to implement/maintain, and b) 
> significantly slower than subsequently discovered techniques
>   2) the entire "current working model" is held in memory in each Mapper, 
> which limits the scalability of the implementation by numTerms in vocabulary 
> * numTopics * 8bytes per double being less than the mapper heap size.
>   3) the sufficient statistics which need to be emitted by the mappers scale 
> as numTopics * numNonZeroEntries in the corpus.  Even with judicious use of 
> Combiners (currently implemented), this can get prohibitively expensive in 
> terms of network + disk usage.
> In particular, point 3 looks like: a 1B nonzero entry corpus in Mahout would 
> take up about 12GB of RAM in total, but if you wanted 200 topics, you'd be 
> using 2.5TB if disk+network traffic *per E/M iteration*.  Running a moderate 
> 40 iterations we're talking about 100TB.  Having tried this implementation on 
> a 6B nonzero entry input corpus with 100 topics (500k term vocabulary, so 
> memory wasn't an issue), I've seen this in practice: even with our production 
> Hadoop cluster with many thousands of map slots available, even one iteration 
> was taking more than 3.5hours to get to 50% completion of the mapper tasks.
> Point 1) was simple to improve: switch from VB to an algorithm labeled CVB0 
> ("Collapsed Variational Bayes, 0th derivative approximation") in Ascuncion, 
> et al ( http://www.datalab.uci.edu/papers/uai_2009.pdf ).  I tried many 
> approaches to get the overall distributed side of the algorithm to scale 
> better, originally aiming at removing point 2), but it turned out that point 
> 3) was what kept rearing its ugly head.  The way that YahooLDA ( 
> https://github.com/shravanmn/Yahoo_LDA ) and many others have achieved high 
> scalability is by doing distributed Gibbs sampling, but that requires that 
> you hold onto the model in distributed memory and query it continually via 
> RPC.  This could be done in something like Giraph or Spark, but not in 
> vanilla Hadoop M/R.
> The end result was to actually make point 2) even *worse*, and instead of 
> relying on Hadoop combiners to aggregate sufficient statistics for the model, 
> you instead do a full map-side cache of (this mapper's slice of) the next 
> iteration's model, and emit nothing in each map() call, emitting the entire 
> model at cleanup(), and then the reducer simply sums the sub-models.  This 
> effectively becomes a form of ensemble learning: each mapper learns its own 
> sequential model, emits it, the reducers (one for each topic) sum up these 
> models into one, which is fed out to all the models in the next iteration.
> In its current form, this LDA implementation can churn through about two M/R 
> iterations per hour on the same cluster/data set mentioned above (which makes 
> it at least 15x faster on larger data sets).
> It probably requires a fair amount of documentation / cleanup, but it comes 
> with a nice end-to-end unit test (same as the one added to MAHOUT-399), and 
> also comes with an "in-memory" version of the same algorithm, for smaller 
> datasets (i.e. those which can fit in memory).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-897) New implementation for LDA: Collapsed Variational Bayes (0th derivative approximation), with map-side model caching

Reply via email to