Re: Yahoo's LDA code

Hector Yee Wed, 29 Jun 2011 08:31:34 -0700

A lot less. It handled 200k terms, 10k topics, and 50M docs on one
machine, and it took about 1.2 GB of RAM.
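
For a rough sense of scale, here is a back-of-envelope sketch comparing that 1.2 GB figure with what a dense word-topic count table would need at the same dictionary and topic sizes. The dense layout and the 4-byte counters are assumptions made only for illustration; nothing in this thread says how Yahoo LDA actually stores its counts.

```python
# Back-of-envelope only. The dense layout and 4-byte counters are
# assumptions for illustration, not a description of Yahoo LDA internals.
TERMS = 200_000          # dictionary size quoted above
TOPICS = 10_000          # topic count quoted above
BYTES_PER_COUNT = 4      # assumed int32 counter
REPORTED_GB = 1.2        # RAM usage quoted above


def dense_table_bytes(terms, topics, bytes_per_count=BYTES_PER_COUNT):
    """Bytes needed for a dense terms x topics table of counters."""
    return terms * topics * bytes_per_count


dense_gb = dense_table_bytes(TERMS, TOPICS) / 1e9
print(f"dense word-topic table: {dense_gb:.1f} GB")
print(f"reported usage:         {REPORTED_GB} GB")
```

Under those assumptions the dense table alone would be larger than the reported footprint, which would be consistent with the counts being stored sparsely, but that is an inference and not something confirmed here.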



On Wed, Jun 29, 2011 at 8:20 AM, Ian Upright <[email protected]> wrote:

> I also wonder what memory limitations it may have as compared to the Mahout
> implementation.  (with regards to number of terms/topics/documents)
>
> Ian
>
> >So I tried Yahoo LDA on 52M documents with 1000 topics.
> >
> >Yahoo LDA with a dictionary of 100k terms does 1 iteration every 30 minutes
> >on a single machine using 4 cores.
> >
> >Mahout LDA using 20 nodes and a dictionary of 30k terms takes more than a
> >day for an iteration and didn't complete (something about an output error
> >during the reduce step; this may be a CDHbeta3 issue, not sure, since
> >Reuters clusters fine).
> >
> >Hopefully the ideas from the Yahoo version can be incorporated into the
> >Mahout LDA.
> >
> >On Fri, Jun 10, 2011 at 6:49 AM, Federico Castanedo <[email protected]> wrote:
> >
> >> Hi all,
> >>
> >> I went through the referenced paper, and it seems that, besides all the
> >> distributed work, the way inference for \alpha and \beta is performed
> >> was the key element in improving LDA training performance.
> >> They use SGD for the hyperparameter adjustment of \alpha.
> >>
> >> Best,
> >> Federico
> >>
> >> 2011/6/10 Jake Mannix <[email protected]>
> >>
> >> > It's all C++: custom distributed processing, custom distributed
> >> > coordination and storage.
> >> >
> >> > We can certainly try to port over the algorithmic ideas, but the
> >> > distributed systems stuff would be a significant departure from our
> >> > current setup. It's not a web service, it's not Hadoop, and it's not a
> >> > command-line utility; it's a cluster of long-running processes all
> >> > intercommunicating. Sounds awesome, but that's a ways off from where
> >> > we are now.
> >> >
> >> >  -jake
> >> >
> >> > On Thu, Jun 9, 2011 at 7:52 PM, Stanley Xu <[email protected]> wrote:
> >> >
> >> > > Awesome! I'd guess it is much faster than the current version in
> >> > > Mahout. Is it possible to just use this version in Mahout?
> >> > >
> >> > > On Fri, Jun 10, 2011 at 8:12 AM, <[email protected]> wrote:
> >> > >
> >> > > > Yahoo released its Hadoop code for LDA
> >> > > >
> >> > > > http://blog.smola.org/post/6359713161/speeding-up-latent-dirichlet-allocation
>



-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)
