On Mon, Jun 27, 2011 at 5:27 PM, Hector Yee <[email protected]> wrote:

> Mahout LDA using 20 nodes and a dictionary of 30k takes more than a day for
> an iteration and didn't complete (something about output error during the
> reduce step - this may be a CDHbeta3 issue not sure, since reuters clusters
> fine).
>
So this sounds just like a bug, and we should look into it. I would be very
surprised if a 30k dictionary, even on hundreds of millions of documents, took
that long on a 20-node cluster with Mahout's LDA. A single iteration of LDA
with Mahout is just "for each document, do inference using the current model,
calculate some derivatives, emit some deltas". This could be done a lot faster
than it currently is, but taking a day for an iteration means there is an
infinite loop somewhere.

> Hopefully the ideas from the Yahoo version can be incorporated into the
> Mahout LDA.

This I definitely agree with.

  -jake

> On Fri, Jun 10, 2011 at 6:49 AM, Federico Castanedo <[email protected]> wrote:
>
> > Hi all,
> >
> > i got through the referenced paper and seems that besides all the
> > distributed tasks the way the inference for \alpha and \beta
> > is performed was the key element on improved the LDA trained performance.
> > They use SGD for the hyperparameter adjustment of \alpha.
> >
> > bests,
> > Federico
> >
> > 2011/6/10 Jake Mannix <[email protected]>
> >
> > > It's all c++, custom distributed processing, custom distributed
> > > coordination and storage.
> > >
> > > We can certainly try to port over the algorithmic ideas, but the
> > > distributed systems stuff would be a significant departure from our
> > > current setup - it's not a web service and it's not hadoop, and it's
> > > not a command line utility - it's a cluster of long-running processes
> > > all intercommunicating. Sounds awesome, but that's a way's off from
> > > where we are now.
> > >
> > > -jake
> > >
> > > On Thu, Jun 9, 2011 at 7:52 PM, Stanley Xu <[email protected]> wrote:
> > >
> > > > Awesome! Guess it would be much faster than then current version in
> > > > Mahout.
> > > > Is that possible to just use this version in mahout?
> > > >
> > > > On Fri, Jun 10, 2011 at 8:12 AM, <[email protected]> wrote:
> > > >
> > > > > Yahoo released its hadoop code for LDA
> > > > >
> > > > > http://blog.smola.org/post/6359713161/speeding-up-latent-dirichlet-allocation
>
> --
> Yee Yang Li Hector
> http://hectorgon.blogspot.com/ (tech + travel)
> http://hectorgon.com (book reviews)
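
For reference, the shape of the single iteration described in the reply above ("for each document, do inference using the current model ... emit some deltas") can be sketched roughly as below. This is an illustration in plain Python, not Mahout's code. The function names (`infer_document`, `map_document`, `reduce_deltas`, `run_iteration`) are hypothetical. The per-document inference is a simplified EM-style update, swapped in for the digamma-based variational update a real LDA implementation would use. The hyperparameter `alpha` is held fixed rather than tuned.

```python
from collections import defaultdict

def infer_document(word_counts, topic_word, alpha=0.1, n_steps=20):
    """Simplified per-document inference (assumption: EM-style, not the
    digamma-based variational update). Returns the topic mixture and the
    per-word topic responsibilities."""
    n_topics = len(topic_word)
    theta = [1.0 / n_topics] * n_topics
    resp = {}
    for _ in range(n_steps):
        totals = [alpha] * n_topics
        for word, count in word_counts.items():
            # Small floor keeps unseen (topic, word) pairs from zeroing out.
            weights = [theta[k] * topic_word[k].get(word, 1e-12)
                       for k in range(n_topics)]
            z = sum(weights)
            resp[word] = [w / z for w in weights]
            for k in range(n_topics):
                totals[k] += count * resp[word][k]
        norm = sum(totals)
        theta = [t / norm for t in totals]
    return theta, resp

def map_document(word_counts, topic_word, alpha=0.1):
    """Map step: emit ((topic, word), expected count) deltas for one document."""
    _, resp = infer_document(word_counts, topic_word, alpha)
    for word, count in word_counts.items():
        for k, r in enumerate(resp[word]):
            yield (k, word), count * r

def reduce_deltas(emitted):
    """Reduce step: sum the deltas for each (topic, word) key."""
    sums = defaultdict(float)
    for key, value in emitted:
        sums[key] += value
    return dict(sums)

def run_iteration(docs, topic_word, alpha=0.1):
    """One full iteration: map every document, reduce, renormalise per topic."""
    emitted = (pair for doc in docs
               for pair in map_document(doc, topic_word, alpha))
    sums = reduce_deltas(emitted)
    new_model = [dict() for _ in topic_word]
    for (k, word), value in sums.items():
        new_model[k][word] = value
    for row in new_model:
        total = sum(row.values())
        for word in row:
            row[word] /= total
    return new_model
```

In this sketch the work per iteration grows with (distinct words per document) x (topics) x (inference steps), summed over documents, and the map step needs no communication between documents.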
