[ 
https://issues.apache.org/jira/browse/MAHOUT-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161849#comment-13161849
 ] 

Jake Mannix commented on MAHOUT-897:
------------------------------------

I'm going to commit the latest patch later today if I don't hear any complaints.
                
> New implementation for LDA: Collapsed Variational Bayes (0th derivative 
> approximation), with map-side model caching
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-897
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-897
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>              Labels: clustering, lda
>             Fix For: 0.6
>
>         Attachments: MAHOUT-897.diff, MAHOUT-897.diff
>
>
> The current LDA implementation in Mahout suffers from a few issues:
>   1) it's based on the original Variational Bayes E/M training methods of 
> Blei et al (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf), 
> which are a) significantly more complex to implement/maintain, and b) 
> significantly slower than subsequently discovered techniques
>   2) the entire "current working model" is held in memory in each Mapper, 
> which limits the scalability of the implementation: numTerms in vocabulary 
> * numTopics * 8 bytes per double must be less than the mapper heap size 
> (see the worked example after this list).
>   3) the sufficient statistics which need to be emitted by the mappers scale 
> as numTopics * numNonZeroEntries in the corpus.  Even with judicious use of 
> Combiners (currently implemented), this can get prohibitively expensive in 
> terms of network + disk usage.
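To make the point 2) bound concrete, here is the arithmetic for the vocabulary 
and topic counts used in the experiment described below:

    500,000 terms * 100 topics * 8 bytes/double = 400MB of model per mapper

That fits in a reasonably sized mapper heap (hence "memory wasn't an issue" 
below), but, say, a 5M-term vocabulary at 200 topics would need 8GB and would not.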
> In particular, point 3 looks like: a 1B nonzero entry corpus in Mahout would 
> take up about 12GB of RAM in total, but if you wanted 200 topics, you'd be 
> using 2.5TB of disk+network traffic *per E/M iteration*.  Running a moderate 
> 40 iterations, we're talking about 100TB.  Having tried this implementation on 
> a 6B nonzero entry input corpus with 100 topics (500k term vocabulary, so 
> memory wasn't an issue), I've seen this in practice: even with our production 
> Hadoop cluster with many thousands of map slots available, a single iteration 
> was taking more than 3.5 hours to get to 50% completion of the mapper tasks.
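For reference, the arithmetic behind those traffic figures, assuming the same 
~12 bytes per emitted (topic, term, count) entry implied by the 12GB in-RAM 
figure (an inference, not a number from the patch):

    1B nonzero entries * 200 topics * ~12 bytes ≈ 2.4TB of sufficient
    statistics shuffled per E/M iteration
    ≈2.5TB/iteration * 40 iterations ≈ 100TB total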
> Point 1) was simple to improve: switch from VB to an algorithm labeled CVB0 
> ("Collapsed Variational Bayes, 0th derivative approximation") in Ascuncion, 
> et al ( http://www.datalab.uci.edu/papers/uai_2009.pdf ).  I tried many 
> approaches to get the overall distributed side of the algorithm to scale 
> better, originally aiming at removing point 2), but it turned out that point 
> 3) was what kept rearing its ugly head.  The way that YahooLDA ( 
> https://github.com/shravanmn/Yahoo_LDA ) and many others have achieved high 
> scalability is by doing distributed Gibbs sampling, but that requires that 
> you hold onto the model in distributed memory and query it continually via 
> RPC.  This could be done in something like Giraph or Spark, but not in 
> vanilla Hadoop M/R.
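For reference, a minimal sketch of the CVB0 per-token update from the Asuncion 
et al. paper; the variable names are illustrative rather than Mahout's actual 
API, and the paper's exclusion of the current token's own counts (the "¬ij" 
superscripts) is omitted for brevity:

    // One CVB0 update: recompute this token's topic distribution from the
    // current expected counts, then renormalize.  alpha and eta are the
    // Dirichlet smoothing parameters on doc-topic and topic-term counts.
    static double[] cvb0Update(int term, double[][] topicTermCounts,
                               double[] topicTotals, double[] docTopicCounts,
                               double alpha, double eta, int numTerms) {
      int numTopics = topicTotals.length;
      double[] gamma = new double[numTopics];
      double norm = 0;
      for (int k = 0; k < numTopics; k++) {
        gamma[k] = (topicTermCounts[k][term] + eta)    // topic k's weight for this term
                 / (topicTotals[k] + numTerms * eta)   // normalized over the vocabulary
                 * (docTopicCounts[k] + alpha);        // this doc's weight for topic k
        norm += gamma[k];
      }
      for (int k = 0; k < numTopics; k++) {
        gamma[k] /= norm;                              // gamma[k] = p(topic k | doc, term)
      }
      return gamma;
    }

The resulting fractional gamma values are folded back into the expected counts, 
which is what makes CVB0 both simpler and cheaper than the digamma-heavy VB 
updates.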
> The end result was to actually make point 2) even *worse*: instead of 
> relying on Hadoop combiners to aggregate sufficient statistics for the model, 
> you do a full map-side cache of (this mapper's slice of) the next 
> iteration's model, emit nothing in each map() call, emit the entire 
> model at cleanup(), and then the reducers simply sum the sub-models.  This 
> effectively becomes a form of ensemble learning: each mapper learns its own 
> sequential model and emits it, the reducers (one for each topic) sum these 
> models into one, and that is fed out to all the mappers in the next iteration.
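Structurally, that pattern looks roughly like the sketch below; the class and 
the method bodies are illustrative only, not the actual code in the patch:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.VectorWritable;

    // Map-side model caching: map() only folds statistics into an in-memory
    // sub-model, nothing is emitted per document, and the whole sub-model is
    // emitted once in cleanup() for the per-topic reducers to sum.
    public class CachingLdaMapper
        extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

      private Matrix nextModel;  // numTopics x numTerms: this mapper's sub-model

      @Override
      protected void setup(Context ctx) {
        // In the real job, the previous iteration's model would be read from
        // the distributed cache here; the dimensions below are illustrative.
        nextModel = new DenseMatrix(100 /* numTopics */, 500000 /* numTerms */);
      }

      @Override
      protected void map(IntWritable docId, VectorWritable doc, Context ctx) {
        // Run CVB0 inference on this document (see the update sketched above)
        // and accumulate its expected topic-term counts into nextModel.
        // Deliberately emits nothing here.
      }

      @Override
      protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Emit the cached sub-model once: one (topicId, topic row) pair per topic.
        for (int k = 0; k < nextModel.numRows(); k++) {
          ctx.write(new IntWritable(k), new VectorWritable(nextModel.viewRow(k)));
        }
      }
    }

With this shape, shuffle volume is roughly numMappers * (per-mapper model size) 
rather than numTopics * numNonZeroEntries, which is where the speedup comes from.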
> In its current form, this LDA implementation can churn through about two M/R 
> iterations per hour on the same cluster/data set mentioned above (which makes 
> it at least 15x faster on larger data sets).
> It probably requires a fair amount of documentation / cleanup, but it comes 
> with a nice end-to-end unit test (same as the one added to MAHOUT-399), and 
> also comes with an "in-memory" version of the same algorithm, for smaller 
> datasets (i.e. those which can fit in memory).
