OK, thanks. Maybe one more question: is it possible to train an already trained model on just a few new documents with the provided algorithms, or do you have to train through the whole corpus again? In other words, can an LDA model be trained incrementally or not?

What I would actually like to do is train a topical classifier on top of an LDA model. Do you have any experience with that? My concern is that by changing the LDA model, the inputs for the classifier would also change. Do I have to train the classifier from scratch again, or can I reuse the classifier trained on top of the older LDA model and just adjust it slightly?
2012/3/27 Dirk Weissenborn <[email protected]>

> no problem. I'll post it
>
> 2012/3/27 Jake Mannix <[email protected]>
>
>> Hey Dirk,
>>
>> Do you mind continuing this discussion on the mailing list? Lots of our users may ask this kind of question in the future...
>>
>> On Mon, Mar 26, 2012 at 3:36 PM, Dirk Weissenborn <[email protected]> wrote:
>>
>>> Ok thanks,
>>>
>>> maybe one question. Is it possible to train an already trained model on just a few new documents with the provided algorithms, or do you have to train through the whole corpus again? What I mean is whether you can train an LDA model incrementally or not.
>>>
>>> What I would actually like to do is train a topical classifier on top of an LDA model. Do you have any experience with that? I mean, by changing the LDA model, the inputs for the classifier would also change. Do I have to train a classifier from scratch again, or can I reuse the classifier trained on top of the older LDA model and just adjust that one?
>>>
>>> 2012/3/26 Jake Mannix <[email protected]>
>>>
>>>> On Mon, Mar 26, 2012 at 12:58 PM, Dirk Weissenborn <[email protected]> wrote:
>>>>
>>>>> Thank you for the quick response! It is possible that I will need it in the not too distant future; maybe I'll implement it on top of what already exists, which should not be that hard, as you mentioned. I'll provide a patch when the time comes.
>>>>
>>>> Feel free to email any questions about using the InMemoryCollapsedVariationalBayes0 class - it's mainly been used for testing so far, but if you want to take that class and clean it up and look into fixing the online learning aspect of it, that'd be excellent. Let me know if you make any progress, because I'll probably be looking to work on this at some point as well, but I won't if you're already working on it. :)
>>>>
>>>>> 2012/3/26 Jake Mannix <[email protected]>
>>>>>
>>>>>> Hi Dirk,
>>>>>>
>>>>>> This has not been implemented in Mahout, but the version of map-reduce (batch)-learned LDA which is done via (approximate + collapsed) variational Bayes [1] is reasonably easily modifiable to the methods in this paper, as the LDA learner we currently do via iterative MR passes is essentially an ensemble learner: each subset of the data partially trains a full LDA model starting from the aggregate (summed) counts of all of the data from previous iterations (see essentially the method named "approximately distributed LDA" / AD-LDA in Ref-[2]).
>>>>>>
>>>>>> The method in the paper you refer to turns traditional VB (the slower, uncollapsed kind, with the nasty digamma functions all over the place) into a streaming learner, by accreting the word-counts of each document onto the model you're using for inference on the next documents. The same exact idea can be done on the CVB0 inference technique, almost without change - as VB differs from CVB0 only in the E-step, not the M-step.
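A rough sketch of that count-accretion idea, in plain Java rather than Mahout's actual classes (the class, all names, and the eta/alpha prior values below are placeholders of mine): each incoming document gets a few CVB0-style fixed-point passes against the counts accumulated so far, and its expected counts are then folded into the model before the next document is processed.

import java.util.Arrays;

/**
 * Illustrative sketch only -- not Mahout's API. It shows the "accrete each
 * document's expected counts onto the model" idea behind a streaming
 * CVB0-style learner: each new document is inferred against the counts
 * accumulated from all previously seen documents, then folded in.
 */
public class StreamingTopicCounts {

  private final int numTopics;
  private final int numTerms;
  private final double[][] topicTermCounts; // expected counts n(topic, term)
  private final double[] topicTotals;       // n(topic), summed over terms
  private final double eta = 0.1;   // assumed topic-term smoothing prior
  private final double alpha = 0.1; // assumed document-topic smoothing prior

  public StreamingTopicCounts(int numTopics, int numTerms) {
    this.numTopics = numTopics;
    this.numTerms = numTerms;
    this.topicTermCounts = new double[numTopics][numTerms];
    this.topicTotals = new double[numTopics];
  }

  /**
   * One streaming step: a few CVB0-like fixed-point passes estimate
   * p(topic | doc, term), then the document's expected counts are added
   * to the global model before the next document arrives.
   */
  public void update(int[] termIds, int[] termFreqs, int passes) {
    double[] docTopic = new double[numTopics];
    Arrays.fill(docTopic, 1.0 / numTopics);
    double[][] gamma = new double[termIds.length][numTopics];

    for (int pass = 0; pass < passes; pass++) {
      double[] newDocTopic = new double[numTopics];
      for (int i = 0; i < termIds.length; i++) {
        int w = termIds[i];
        double norm = 0.0;
        for (int k = 0; k < numTopics; k++) {
          // responsibility of topic k for term w in this document
          gamma[i][k] = (topicTermCounts[k][w] + eta)
              / (topicTotals[k] + eta * numTerms)
              * (docTopic[k] + alpha);
          norm += gamma[i][k];
        }
        for (int k = 0; k < numTopics; k++) {
          gamma[i][k] /= norm;
          newDocTopic[k] += gamma[i][k] * termFreqs[i];
        }
      }
      double total = 0.0;
      for (double v : newDocTopic) {
        total += v;
      }
      for (int k = 0; k < numTopics; k++) {
        docTopic[k] = newDocTopic[k] / total;
      }
    }

    // "M-step" accretion: fold this document's expected counts into the model.
    for (int i = 0; i < termIds.length; i++) {
      for (int k = 0; k < numTopics; k++) {
        double c = gamma[i][k] * termFreqs[i];
        topicTermCounts[k][termIds[i]] += c;
        topicTotals[k] += c;
      }
    }
  }
}

In a distributed setting each worker would accumulate its own counts this way, which is exactly where the merge problem described next comes in.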
>>>>>> The problem which comes up when I've considered doing this kind of thing in the past is that if you do this in a distributed fashion, each member of the ensemble starts learning different topics simultaneously, and then the merge gets trickier. You can avoid this by doing some of the techniques mentioned in [2] for HDP, where you swap topic-ids on merge to make sure they match up, but I haven't investigated that very thoroughly. The other way to avoid this problem is to use the parameter denoted \rho_t in Hoffman et al - this parameter is telling us how much to weight the model as it was up until now, against the updates from the latest document (alternatively, how much to "decay" previous documents). If you don't let the topics drift *too much* during parallel learning, you could probably make sure that they match up just fine on each merge, while still speeding up the process faster than fully batch learning.
>>>>>>
>>>>>> So yeah, this is a great idea, but getting it to work in a distributed fashion is tricky. In a non-distributed form, this idea is almost completely implemented in the class InMemoryCollapsedVariationalBayes0. I say "almost" because it's technically in there already, as a parameter choice (initialModelCorpusFraction != 0), but I don't think it's working properly yet. If you're interested in the problem, playing with this class would be a great place to start!
>>>>>>
>>>>>> References:
>>>>>> 1) http://eprints.pascal-network.org/archive/00006729/01/AsuWelSmy2009a.pdf
>>>>>> 2) http://www.csee.ogi.edu/~zak/cs506-pslc/dist_lda.pdf
>>>>>>
>>>>>> On Mon, Mar 26, 2012 at 11:54 AM, Dirk Weissenborn <[email protected]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I wanted to ask whether there is already an online learning algorithm implementation for LDA or not?
>>>>>>>
>>>>>>> http://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf
>>>>>>>
>>>>>>> cheers,
>>>>>>> Dirk
>>>>>>
>>>>>> --
>>>>>>
>>>>>> -jake
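The \rho_t weighting described above amounts to interpolating between the model accumulated so far and the estimate computed from the latest document or mini-batch. Here is a minimal sketch, assuming the (tau0, kappa) step-size schedule from Hoffman et al. and a dense topic-term matrix; the class and method names are illustrative, not anything in Mahout:

/**
 * Sketch of the \rho_t step-size interpolation from Hoffman et al.'s online VB,
 * applied to a topic-term parameter matrix. Names and the (tau0, kappa)
 * schedule are illustrative, not Mahout code.
 */
public final class RhoInterpolation {

  private RhoInterpolation() {}

  /** rho_t = (tau0 + t)^(-kappa); kappa in (0.5, 1] gives the usual convergence guarantees. */
  public static double rho(long t, double tau0, double kappa) {
    return Math.pow(tau0 + t, -kappa);
  }

  /**
   * model <- (1 - rho) * model + rho * batchEstimate: a small rho mostly keeps
   * the model as it was up until now, a large rho lets the latest batch dominate
   * (i.e. decays previous documents faster).
   */
  public static void interpolate(double[][] model, double[][] batchEstimate, double rho) {
    for (int k = 0; k < model.length; k++) {
      for (int w = 0; w < model[k].length; w++) {
        model[k][w] = (1.0 - rho) * model[k][w] + rho * batchEstimate[k][w];
      }
    }
  }
}

Keeping \rho_t modest once the model has stabilized is what would keep each ensemble member's topics from drifting too far apart between merges, as suggested above.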
