I am going to be unable to complain today.  So silent assent applies to me.

On Fri, Dec 2, 2011 at 12:53 PM, Jake Mannix (Commented) (JIRA) <
j...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161849#comment-13161849]
>
> Jake Mannix commented on MAHOUT-897:
> ------------------------------------
>
> I'm going to commit the latest patch later today if I don't hear any
> complaints.
>
> > New implementation for LDA: Collapsed Variational Bayes (0th derivative
> approximation), with map-side model caching
> >
> -------------------------------------------------------------------------------------------------------------------
> >
> >                 Key: MAHOUT-897
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-897
> >             Project: Mahout
> >          Issue Type: New Feature
> >          Components: Clustering
> >    Affects Versions: 0.6
> >            Reporter: Jake Mannix
> >            Assignee: Jake Mannix
> >              Labels: clustering, lda
> >             Fix For: 0.6
> >
> >         Attachments: MAHOUT-897.diff, MAHOUT-897.diff
> >
> >
> > The current LDA implementation in Mahout suffers from a few issues:
> >   1) it's based on the original Variational Bayes E/M training methods
> of Blei et al (
> http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf), which are
> a) significantly more complex to implement/maintain, and b) significantly
> slower than subsequently discovered techniques
>   2) the entire "current working model" is held in memory in each
> Mapper, which caps the scalability of the implementation: numTerms in
> vocabulary * numTopics * 8 bytes per double must fit within the mapper
> heap.
> >   3) the sufficient statistics which need to be emitted by the mappers
> scale as numTopics * numNonZeroEntries in the corpus.  Even with judicious
> use of Combiners (currently implemented), this can get prohibitively
> expensive in terms of network + disk usage.
> > In particular, point 3 plays out like this: a 1B-nonzero-entry corpus
> in Mahout would take up about 12GB of RAM in total, but with 200 topics
> you'd be generating about 2.5TB of disk+network traffic *per E/M
> iteration*.  Running a moderate 40 iterations, we're talking about
> 100TB.  Having tried this implementation on a 6B-nonzero-entry input
> corpus with 100 topics (500k-term vocabulary, so memory wasn't an
> issue), I've seen this in practice: even on our production Hadoop
> cluster, with many thousands of map slots available, a single iteration
> took more than 3.5 hours to reach 50% completion of the mapper tasks.
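> > To make those numbers concrete, here is a back-of-the-envelope sketch
> > of the arithmetic (my own byte-per-entry assumptions, not anything
> > taken from the patch; all names are illustrative):
> >
> >   // Rough LDA cost model: assumes a 4-byte int index plus an 8-byte
> >   // double per stored/emitted entry (12 bytes total).
> >   public class LdaCostEstimate {
> >     public static void main(String[] args) {
> >       long nnz = 1000000000L;   // 1B nonzero (doc, term) entries
> >       long numTerms = 500000L;  // vocabulary size
> >       long numTopics = 200L;
> >       long corpusBytes  = nnz * 12L;                  // corpus held in RAM
> >       long modelBytes   = numTerms * numTopics * 8L;  // point 2's per-mapper heap bound
> >       long perIterBytes = nnz * numTopics * 12L;      // stats emitted per E/M iteration
> >       System.out.println("corpus GB:   " + corpusBytes / 1e9);        // 12.0
> >       System.out.println("model GB:    " + modelBytes / 1e9);         // 0.8
> >       System.out.println("per-iter TB: " + perIterBytes / 1e12);      // 2.4, i.e. ~2.5TB
> >       System.out.println("40-iter TB:  " + 40 * perIterBytes / 1e12); // 96, i.e. ~100TB
> >     }
> >   }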
> > Point 1) was simple to improve: switch from VB to an algorithm labeled
> CVB0 ("Collapsed Variational Bayes, 0th derivative approximation") in
> Asuncion et al. ( http://www.datalab.uci.edu/papers/uai_2009.pdf ).  I
> tried many approaches to get the overall distributed side of the algorithm
> to scale better, originally aiming at removing point 2), but it turned out
> that point 3) was what kept rearing its ugly head.  The way that YahooLDA (
> https://github.com/shravanmn/Yahoo_LDA ) and many others have achieved
> high scalability is by doing distributed Gibbs sampling, but that requires
> that you hold onto the model in distributed memory and query it continually
> via RPC.  This could be done in something like Giraph or Spark, but not in
> vanilla Hadoop M/R.
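> > For reference, the CVB0 update itself is only a few lines; here is my
> > paraphrase of the 0th-order update from the Asuncion et al. paper (it
> > skips the usual exclusion of the current token's own counts for
> > brevity, and all names are illustrative rather than the patch's API):
> >
> >   public final class Cvb0Sketch {
> >     // gamma[k] is proportional to
> >     //   (docTopicCounts[k] + alpha) * (wordTopicCounts[k] + eta)
> >     //       / (topicTotals[k] + numTerms * eta)
> >     static double[] cvb0Gamma(double[] docTopicCounts,  // expected topic counts, this doc
> >                               double[] wordTopicCounts, // expected topic counts, this term
> >                               double[] topicTotals,     // expected total count per topic
> >                               double alpha, double eta, int numTerms) {
> >       int numTopics = topicTotals.length;
> >       double[] gamma = new double[numTopics];
> >       double norm = 0.0;
> >       for (int k = 0; k < numTopics; k++) {
> >         gamma[k] = (docTopicCounts[k] + alpha) * (wordTopicCounts[k] + eta)
> >             / (topicTotals[k] + numTerms * eta);
> >         norm += gamma[k];
> >       }
> >       for (int k = 0; k < numTopics; k++) {
> >         gamma[k] /= norm; // normalize into this token's topic distribution
> >       }
> >       return gamma;
> >     }
> >   }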
> > The end result was to actually make point 2) even *worse*: instead of
> relying on Hadoop combiners to aggregate sufficient statistics for the
> model, each mapper keeps a full map-side cache of (its slice of) the
> next iteration's model, emits nothing in each map() call, and emits the
> entire model at cleanup(); the reducer then simply sums the sub-models.
>  This effectively becomes a form of ensemble learning: each mapper
> learns its own sequential model and emits it, the reducers (one for
> each topic) sum these models into one, and the result is fed out to all
> the mappers in the next iteration.
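> > In Hadoop terms the pattern is roughly the following (a sketch of the
> > idea only; the model type, its loading, and the training step are
> > placeholders, not the actual classes in the patch):
> >
> >   import java.io.IOException;
> >   import org.apache.hadoop.io.IntWritable;
> >   import org.apache.hadoop.mapreduce.Mapper;
> >   import org.apache.mahout.math.DenseMatrix;
> >   import org.apache.mahout.math.Matrix;
> >   import org.apache.mahout.math.VectorWritable;
> >
> >   public class ModelCachingLdaMapper
> >       extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
> >
> >     private Matrix readModel;  // this iteration's model, loaded once in setup()
> >     private Matrix writeModel; // next iteration's stats, accumulated locally
> >
> >     @Override
> >     protected void setup(Context ctx) {
> >       int numTopics = 200, numTerms = 500000; // really read from the job config
> >       writeModel = new DenseMatrix(numTopics, numTerms);
> >       // readModel would be loaded here, e.g. from the distributed cache
> >     }
> >
> >     @Override
> >     protected void map(IntWritable docId, VectorWritable doc, Context ctx) {
> >       // run CVB0 inference on this doc against readModel and fold the
> >       // expected counts into writeModel; nothing is emitted per document
> >     }
> >
> >     @Override
> >     protected void cleanup(Context ctx) throws IOException, InterruptedException {
> >       // emit the locally learned model once, keyed by topic, so that
> >       // reducer k just sums row k across all mappers
> >       for (int k = 0; k < writeModel.numRows(); k++) {
> >         ctx.write(new IntWritable(k), new VectorWritable(writeModel.viewRow(k)));
> >       }
> >     }
> >   }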
> > In its current form, this LDA implementation can churn through about two
> M/R iterations per hour on the same cluster/data set mentioned above (which
> makes it at least 15x faster on larger data sets).
> > It probably requires a fair amount of documentation / cleanup, but it
> comes with a nice end-to-end unit test (same as the one added to
> MAHOUT-399), and also comes with an "in-memory" version of the same
> algorithm, for smaller datasets (i.e. those which can fit in memory).
