Re: Yahoo's LDA code

Jake Mannix Wed, 29 Jun 2011 09:05:39 -0700

On Mon, Jun 27, 2011 at 5:27 PM, Hector Yee <[email protected]> wrote:
>
> Mahout LDA using 20 nodes and a dictionary of 30k takes more than a day for
> an iteration and didn't complete (something about output error during the
> reduce step - this may be a CDHbeta3 issue not sure, since reuters clusters
> fine).
>


So this sounds just like a bug, and we should look into it.  I would be
very surprised if a 30k dictionary, even on hundreds of millions of
documents, took that long on a 20-node cluster with Mahout's LDA.

A single iteration of LDA with Mahout is just "for each document, do
inference using the current model, calculate some derivatives, emit some
deltas".

This could be done a lot faster than it currently is, but an iteration
that takes a day points to an infinite loop somewhere.
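
To make the shape of that loop concrete, here is a rough sketch in Python.
It is not Mahout's actual code: the function names (infer_document,
run_iteration), the toy dictionary-based document format, and the choice of
a plain variational update are all assumptions made for illustration. The
"map" side infers topic responsibilities for one document against the
current model and emits per-(topic, word) deltas; the "reduce" side sums
them into the next model.

```python
import math
from collections import defaultdict


def digamma(x):
    """Digamma for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def infer_document(doc, topic_word, alpha, inner_iters=20):
    """Hypothetical map step.

    doc is {word_id: count}; topic_word[k][w] is p(word w | topic k) and is
    assumed to be smoothed (no zero entries). Returns {(k, w): expected count}.
    """
    num_topics = len(topic_word)
    total = sum(doc.values())
    gamma = [alpha + total / num_topics] * num_topics
    phi = {}
    for _ in range(inner_iters):
        weights = [math.exp(digamma(g)) for g in gamma]
        new_gamma = [alpha] * num_topics
        for w, count in doc.items():
            scores = [weights[k] * topic_word[k][w] for k in range(num_topics)]
            norm = sum(scores)
            phi[w] = [s / norm for s in scores]
            for k in range(num_topics):
                new_gamma[k] += count * phi[w][k]
        gamma = new_gamma
    # The "deltas": expected number of times word w was drawn from topic k.
    return {(k, w): count * phi[w][k]
            for w, count in doc.items() for k in range(num_topics)}


def run_iteration(docs, topic_word, alpha=0.1, smoothing=0.01):
    """One full pass: infer every document, sum the deltas, renormalize."""
    num_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    totals = defaultdict(float)
    for doc in docs:  # "map": each document is handled independently
        for key, delta in infer_document(doc, topic_word, alpha).items():
            totals[key] += delta  # "reduce": sum deltas per (topic, word)
    new_model = []
    for k in range(num_topics):
        row = [totals[(k, w)] + smoothing for w in range(vocab_size)]
        norm = sum(row)
        new_model.append([v / norm for v in row])
    return new_model


if __name__ == "__main__":
    docs = [{0: 3, 1: 2}, {2: 4, 3: 1}]
    model = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
    for row in run_iteration(docs, model):
        print(row)
```

In a sketch like this each document is visited once per iteration, so the
work grows roughly with (tokens x topics x inner inference steps), which is
why a day per iteration on 20 nodes looks like a bug rather than an inherent
cost.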


> Hopefully the ideas from the Yahoo version can be incorporated into the
> Mahout LDA.
>

This I definitely agree with.

  -jake


>
> On Fri, Jun 10, 2011 at 6:49 AM, Federico Castanedo <[email protected]> wrote:
>
> > Hi all,
> >
> > I went through the referenced paper, and it seems that, besides all the
> > distributed tasks, the way the inference for \alpha and \beta is
> > performed was the key element in improving LDA training performance.
> > They use SGD for the hyperparameter adjustment of \alpha.
> >
> > bests,
> > Federico
> >
> > 2011/6/10 Jake Mannix <[email protected]>
> >
> > > It's all C++, custom distributed processing, custom distributed
> > > coordination and storage.
> > >
> > > We can certainly try to port over the algorithmic ideas, but the
> > > distributed systems stuff would be a significant departure from our
> > > current setup - it's not a web service and it's not Hadoop, and it's
> > > not a command line utility - it's a cluster of long-running processes
> > > all intercommunicating.  Sounds awesome, but that's a ways off from
> > > where we are now.
> > >
> > >  -jake
> > >
> > > On Thu, Jun 9, 2011 at 7:52 PM, Stanley Xu <[email protected]> wrote:
> > >
> > > > > Awesome! Guess it would be much faster than the current version in
> > > > > Mahout. Is it possible to just use this version in Mahout?
> > > >
> > > > On Fri, Jun 10, 2011 at 8:12 AM, <[email protected]> wrote:
> > > >
> > > > > Yahoo released its hadoop code for LDA
> > > > >
> > > > >
> > > >
> > >
> >
> http://blog.smola.org/post/6359713161/speeding-up-latent-dirichlet-allocation
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Yee Yang Li Hector
> http://hectorgon.blogspot.com/ (tech + travel)
> http://hectorgon.com (book reviews)
>
