PS the danger of using an overly specific corpus is that training may not be able to learn polysemy very well unless it also sees documents where the industry-jargon words are used in their other senses. But you definitely want to include documents that have words pertinent to the domain, as a first requirement. Actually, I think you may want a corpus that is balanced across all the ways the words specific to your domain are used. It's kind of hard to find the right balance, but it helps to skew towards the more relevant material.
On Thu, Nov 17, 2011 at 4:49 PM, Dmitriy Lyubimov <[email protected]> wrote:
> The only way to build the model incrementally is to do a 'fold-in' of new
> observations, that I know of.
>
> However, folding in (which is just a multiplication of a new vector
> over the matrices, as Ted explained somewhere else) is just a
> projection into the already-trained space of factors, not a repetition
> of the training, which is what the actual SVD is.
>
> This may work in more ways than just for the sake of incremental
> compilation. E.g. what I am kind of doing: I am working on a particular
> industry domain, so I look for a large corpus of documents pertinent
> to that domain. That's what I run LSA on.
>
> Then I actually throw away the entire document matrix and build a 100%
> fold-in document index based on the term and singular value matrices.
>
> The reason for this is that often the actual corpus of documents that
> you are working with and need proximity comparisons on is not nearly
> big enough to provide a true picture of the specific domain you want to
> fit the stuff into. So the generic workflow I figured out in my case is
> to take a really big corpus relevant to your business, fit a term
> dictionary to it using LSA, and then use that dictionary to fold in
> new documents, on the assumption that that big corpus is more relevant
> and much bigger than the documents you actually want to retrieve and
> compare.
>
> Then you may have one fairly infrequent training and a fairly simple
> fold-in procedure.
>
> On Thu, Nov 17, 2011 at 1:47 PM, Grant Ingersoll <[email protected]> wrote:
>> I've never implemented LSI. Is there a way to incrementally build the model
>> (by simply indexing documents) or is it something that one only runs after
>> the fact once one has built up the much bigger matrix? If it's the former,
>> I bet it wouldn't be that hard to just implement the appropriate new codecs
>> and similarity, assuming Lucene trunk. If it's the latter, then Ted's
>> comment about pushing back into Lucene gets a bit hairier. Still, I wonder
>> if the Codecs/Similarity could help here, too.
>>
>> What does a typical workflow look like for building all of this?
>>
>> On Nov 13, 2011, at 3:58 PM, Ted Dunning wrote:
>>
>>> Essentially not.
>>>
>>> And I would worry about how to push the LSI vectors back into Lucene in a
>>> coherent and usable way.
>>>
>>> On Sun, Nov 13, 2011 at 10:47 AM, Sebastian Schelter <[email protected]>
>>> wrote:
>>>
>>>> Is there some documentation/tutorial available on how to build an LSI
>>>> pipeline with Mahout and Lucene?
>>>>
>>>> --sebastian
>>
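
PPS for anyone who wants to see the fold-in described above in concrete
terms, here is a minimal numpy sketch (the matrix shapes and random data
are made up for illustration; a real pipeline would use Mahout's
distributed SVD rather than numpy): train the truncated SVD once on the
big background corpus, keep only the term matrix U_k and the singular
values, and project each new document as d_k = Sigma_k^-1 U_k^T d.

    import numpy as np

    def train_lsa(A, k):
        # One infrequent, expensive step: truncated SVD of the big
        # background corpus (terms as rows, documents as columns).
        # Only U_k and the singular values are kept; the document
        # matrix V_k can be thrown away entirely.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :k], s[:k]

    def fold_in(Uk, sk, d):
        # Cheap per-document step: project a new term vector d into
        # the already-trained factor space, d_k = Sigma_k^-1 U_k^T d.
        # This is a matrix multiplication, not a re-run of the SVD.
        return (Uk.T @ d) / sk

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Train once on the big background corpus...
    rng = np.random.default_rng(0)
    A = rng.random((2000, 500))   # stand-in: 2000 terms x 500 docs
    Uk, sk = train_lsa(A, k=100)

    # ...then index and compare the (much smaller) working set by
    # fold-in only, e.g. with cosine proximity in the factor space.
    d1 = fold_in(Uk, sk, rng.random(2000))
    d2 = fold_in(Uk, sk, rng.random(2000))
    print(cosine(d1, d2))

The point of the split is exactly what's described above: the expensive
training runs rarely, on the big corpus, while indexing new documents is
just a cheap projection against the stored factors.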
