[ https://issues.apache.org/jira/browse/MAHOUT-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158602#comment-13158602 ]

[email protected] commented on MAHOUT-897:
------------------------------------------------------



bq.  On 2011-11-28 07:54:29, Ted Dunning wrote:
bq.  > Generally this looks like pretty clean code.  Some more comments about intent would be nice.
bq.  > 
bq.  > My review so far is very superficial.

I'm pretty blind to which places need more docs, since it all does and I know the code.  If you could point out the places most in need of docs, I'll know where to start. :)


bq.  On 2011-11-28 07:54:29, Ted Dunning wrote:
bq.  > trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDADriver.java, line 217
bq.  > <https://reviews.apache.org/r/2944/diff/1/?file=60151#file60151line217>
bq.  >
bq.  >     Why return double?  Main ignores this.

Because any other program running this (currently just the unit test I added) may want to know the final converged perplexity, so it's now available from the run() call.
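A toy sketch of the pattern (hypothetical names, not the actual LDADriver API): run() returns the final metric, a main() is free to discard it, and a calling program or test can assert on it.

```java
/** Toy illustration (not the real LDADriver) of run() returning the final metric. */
public class ToyDriver {

  /** Pretend training loop; returns the final converged "perplexity". */
  public double run(int iterations) {
    double perplexity = 100.0;
    for (int i = 0; i < iterations; i++) {
      perplexity *= 0.9; // each iteration "improves" the model
    }
    return perplexity;
  }

  public static void main(String[] args) {
    new ToyDriver().run(10); // a main() may simply ignore the return value
    // ...while a programmatic caller can use it:
    System.out.println("final perplexity: " + new ToyDriver().run(10));
  }
}
```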


bq.  On 2011-11-28 07:54:29, Ted Dunning wrote:
bq.  > trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java, line 103
bq.  > <https://reviews.apache.org/r/2944/diff/1/?file=60153#file60153line103>
bq.  >
bq.  >     I think the convention is @Override on a separate line.

Ah yes, I'll fix that.


bq.  On 2011-11-28 07:54:29, Ted Dunning wrote:
bq.  > trunk/core/src/main/java/org/apache/mahout/math/stats/Sampler.java, line 9
bq.  > <https://reviews.apache.org/r/2944/diff/1/?file=60164#file60164line9>
bq.  >
bq.  >     Javadoc here would be nice.  Why is this sampler different from samplers we already have?
bq.  >     
bq.  >     Also, I don't see test code for this.

I'll add documentation like the following:

/**
 * Samples from a given discrete distribution: you provide a source of
 * randomness and a Vector (cardinality N) which describes a distribution
 * over [0, N), and calls to sample() return values drawn from [0, N)
 * according to this distribution.
 */

Do we already have a sampler which does this?

I can add tests, good point.
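For reference, here is a minimal sketch of what such a discrete (multinomial) sampler does, using a plain double[] of weights instead of a Mahout Vector (hypothetical class name, not the patch's actual Sampler code): build a cumulative-weight array once, then invert a uniform draw against it.

```java
import java.util.Random;

/**
 * Minimal sketch of sampling from a discrete distribution given as an
 * array of non-negative weights: index i is returned with probability
 * weights[i] / sum(weights).
 */
public class DiscreteSampler {
  private final double[] cumulative; // cumulative (unnormalized) weights
  private final Random random;

  public DiscreteSampler(double[] weights, Random random) {
    this.cumulative = new double[weights.length];
    double sum = 0;
    for (int i = 0; i < weights.length; i++) {
      sum += weights[i];
      cumulative[i] = sum;
    }
    this.random = random;
  }

  /** Returns an index in [0, N) drawn according to the weights. */
  public int sample() {
    double u = random.nextDouble() * cumulative[cumulative.length - 1];
    // Linear scan for clarity; a binary search over cumulative would be O(log N).
    for (int i = 0; i < cumulative.length; i++) {
      if (u < cumulative[i]) {
        return i;
      }
    }
    return cumulative.length - 1; // guard against floating-point edge cases
  }
}
```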


bq.  On 2011-11-28 07:54:29, Ted Dunning wrote:
bq.  > trunk/core/src/main/java/org/apache/mahout/math/stats/Sampler.java, line 58
bq.  > <https://reviews.apache.org/r/2944/diff/1/?file=60164#file60164line58>
bq.  >
bq.  >     So this looks like a multinomial sampler.  Why not fit it into what already exists?

Point me to the class!


- Jake


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2944/#review3531
-----------------------------------------------------------


On 2011-11-27 20:37:25, Jake Mannix wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2944/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-11-27 20:37:25)
bq.  
bq.  
bq.  Review request for mahout.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  See MAHOUT-897
bq.  
bq.  
bq.  This addresses bug MAHOUT-897.
bq.      https://issues.apache.org/jira/browse/MAHOUT-897
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDADriver.java 1206835 
bq.    trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0DocInferenceMapper.java PRE-CREATION 
bq.    trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0Driver.java PRE-CREATION 
bq.    trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0TopicTermVectorNormalizerMapper.java PRE-CREATION 
bq.    trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CachingCVB0Mapper.java PRE-CREATION 
bq.    trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CachingCVB0PerplexityMapper.java PRE-CREATION 
bq.    trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/InMemoryCollapsedVariationalBayes0.java PRE-CREATION 
bq.    trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/MemoryUtil.java PRE-CREATION 
bq.    trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/ModelTrainer.java PRE-CREATION 
bq.    trunk/core/src/main/java/org/apache/mahout/clustering/lda/cvb/TopicModel.java PRE-CREATION 
bq.    trunk/core/src/main/java/org/apache/mahout/common/Pair.java 1206835 
bq.    trunk/core/src/main/java/org/apache/mahout/math/DistributedRowMatrixWriter.java PRE-CREATION 
bq.    trunk/core/src/main/java/org/apache/mahout/math/MatrixUtils.java PRE-CREATION 
bq.    trunk/core/src/main/java/org/apache/mahout/math/stats/Sampler.java PRE-CREATION 
bq.    trunk/core/src/test/java/org/apache/mahout/clustering/ClusteringTestUtils.java 1206835 
bq.    trunk/core/src/test/java/org/apache/mahout/clustering/lda/TestMapReduce.java 1206835 
bq.    trunk/core/src/test/java/org/apache/mahout/clustering/lda/cvb/TestCVBModelTrainer.java PRE-CREATION 
bq.    trunk/src/conf/driver.classes.props 1206835 
bq.  
bq.  Diff: https://reviews.apache.org/r/2944/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  mvn clean test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Jake
bq.  
bq.


                
> New implementation for LDA: Collapsed Variational Bayes (0th derivative 
> approximation), with map-side model caching
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-897
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-897
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>              Labels: clustering, lda
>             Fix For: 0.6
>
>         Attachments: MAHOUT-897.diff
>
>
> Current LDA implementation in Mahout suffers from a few issues:
>   1) it's based on the original Variational Bayes E/M training methods of 
> Blei et al (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf), 
> which are a) significantly more complex to implement/maintain, and b) 
> significantly slower than subsequently discovered techniques
>   2) the entire "current working model" is held in memory in each Mapper, 
> which limits the scalability of the implementation: numTerms in vocabulary 
> * numTopics * 8 bytes per double must be less than the mapper heap size.
>   3) the sufficient statistics which need to be emitted by the mappers scale 
> as numTopics * numNonZeroEntries in the corpus.  Even with judicious use of 
> Combiners (currently implemented), this can get prohibitively expensive in 
> terms of network + disk usage.
> In particular, point 3 looks like: a 1B nonzero entry corpus in Mahout would 
> take up about 12GB of RAM in total, but if you wanted 200 topics, you'd be 
> using 2.5TB of disk+network traffic *per E/M iteration*.  Running a moderate 
> 40 iterations, we're talking about 100TB.  Having tried this implementation on 
> a 6B nonzero entry input corpus with 100 topics (500k term vocabulary, so 
> memory wasn't an issue), I've seen this in practice: even with our production 
> Hadoop cluster with many thousands of map slots available, even one iteration 
> was taking more than 3.5 hours to get to 50% completion of the mapper tasks.
> Point 1) was simple to improve: switch from VB to an algorithm labeled CVB0 
> ("Collapsed Variational Bayes, 0th derivative approximation") in Asuncion 
> et al ( http://www.datalab.uci.edu/papers/uai_2009.pdf ).  I tried many 
> approaches to get the overall distributed side of the algorithm to scale 
> better, originally aiming at removing point 2), but it turned out that point 
> 3) was what kept rearing its ugly head.  The way that YahooLDA ( 
> https://github.com/shravanmn/Yahoo_LDA ) and many others have achieved high 
> scalability is by doing distributed Gibbs sampling, but that requires that 
> you hold onto the model in distributed memory and query it continually via 
> RPC.  This could be done in something like Giraph or Spark, but not in 
> vanilla Hadoop M/R.
> The end result was to actually make point 2) even *worse*, and instead of 
> relying on Hadoop combiners to aggregate sufficient statistics for the model, 
> you instead do a full map-side cache of (this mapper's slice of) the next 
> iteration's model, and emit nothing in each map() call, emitting the entire 
> model at cleanup(), and then the reducer simply sums the sub-models.  This 
> effectively becomes a form of ensemble learning: each mapper learns its own 
> sequential model and emits it; the reducers (one for each topic) sum these 
> models into one, which is fed out to all the mappers in the next iteration.
> In its current form, this LDA implementation can churn through about two M/R 
> iterations per hour on the same cluster/data set mentioned above (which makes 
> it at least 15x faster on larger data sets).
> It probably requires a fair amount of documentation / cleanup, but it comes 
> with a nice end-to-end unit test (same as the one added to MAHOUT-399), and 
> also comes with an "in-memory" version of the same algorithm, for smaller 
> datasets (i.e. those which can fit in memory).
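The sufficient-statistics cost in point 3 above can be checked with back-of-envelope arithmetic (my sketch, not from the patch): one 8-byte double per (topic, nonzero entry) pair per E/M iteration. That raw payload alone comes to 1.6TB for the 1B-entry, 200-topic example; the ~2.5TB figure quoted above would additionally include key and framework serialization overhead.

```java
/**
 * Back-of-envelope sketch of the per-iteration shuffle cost of the old
 * implementation: sufficient statistics scale as one double per
 * (topic, nonzero entry) pair in the corpus.
 */
public class ShuffleCost {
  static final long BYTES_PER_DOUBLE = 8;

  /** Raw bytes of emitted doubles for one E/M iteration (payload only). */
  public static long rawShuffleBytes(long numNonZeroEntries, int numTopics) {
    return numNonZeroEntries * numTopics * BYTES_PER_DOUBLE;
  }

  public static void main(String[] args) {
    long perIteration = rawShuffleBytes(1_000_000_000L, 200); // 1B entries, 200 topics
    System.out.printf("raw payload per iteration: %.1f TB%n", perIteration / 1e12);
    System.out.printf("over 40 iterations: %.0f TB%n", 40 * perIteration / 1e12);
  }
}
```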
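The map-side caching scheme described above can be sketched as follows (hypothetical names and shapes, not the actual CachingCVB0Mapper code): map() only folds each document's statistics into an in-memory model slice and emits nothing; the whole cached model is emitted once at cleanup() time, and the reducers simply sum the per-mapper sub-models.

```java
/**
 * Sketch of the map-side caching pattern: per-record calls accumulate
 * into memory, and the accumulated model is emitted once, at the end.
 */
public class CachingMapperSketch {
  private final double[][] topicTermCounts; // numTopics x numTerms model slice

  public CachingMapperSketch(int numTopics, int numTerms) {
    this.topicTermCounts = new double[numTopics][numTerms];
  }

  /** Per-document call: fold this document's statistics into the cache; emit nothing. */
  public void map(int[] termIds, double[][] docTopicTermStats) {
    for (int k = 0; k < topicTermCounts.length; k++) {
      for (int j = 0; j < termIds.length; j++) {
        topicTermCounts[k][termIds[j]] += docTopicTermStats[k][j];
      }
    }
  }

  /** End-of-mapper call: emit the entire cached model for the reducers to sum. */
  public double[][] cleanup() {
    return topicTermCounts;
  }

  /** Reducer side (one reducer per topic): sum the per-mapper sub-models into one. */
  public static double[] reduceTopicRow(double[]... rows) {
    double[] sum = new double[rows[0].length];
    for (double[] row : rows) {
      for (int j = 0; j < row.length; j++) {
        sum[j] += row[j];
      }
    }
    return sum;
  }
}
```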

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira