The pieces are lying around. I had a framework like this for recommendations and text analysis at Veoh, and Jake has something at LinkedIn.
But the amount of code is relatively small and could probably be rewritten before Jake can get clearance to release anything.

The first step is just to count n-grams. The input should be relatively flexible; if you assume parameterized use of Lucene analyzers, then all that is necessary is a small step up from word counting. This should count all n-grams from 1 up to a limit n, and it should allow suppression of any n-gram whose count falls below a threshold. The total number of n-grams of each size observed should also be accumulated. There should also be some provision for counting co-occurrence pairs within windows or between two fields.

The second step is to detect interesting n-grams. This takes the counts of words and (n-1)-grams, together with the relevant totals, as input for the LLR code.

The final (optional) step is creation of a Bloom filter table. Options should control the size of the table and the number of probes.

Building up all these pieces and connecting them is a truly worthy task. Rough sketches of all three steps are appended at the end of this message.

On Thu, Jan 7, 2010 at 3:44 PM, zaki rahaman <[email protected]> wrote:

> @Ted, where is the partial framework you're referring to? And yes, this is
> definitely something I would like to work on if pointed in the right
> direction. I wasn't quite sure, though, because I remember a long-winded
> discussion/debate a while back on the listserv about what Mahout's purpose
> should be. N-gram LLR for collocations seems like a very NLP type of thing
> to have (obviously it could also be used in other applications as well, but
> by itself it's NLP to me), and from my understanding the "consensus" is
> that Mahout should focus on scalable machine learning.

--
Ted Dunning, CTO
DeepDyve
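
To make the counting step concrete, here is a minimal single-machine sketch. Whitespace tokenization stands in for a pluggable Lucene analyzer, and the class name NGramCounter, the maxN limit, and the minCount threshold are illustrative choices, not anything that exists today; windowed co-occurrence counting is left out to keep it short.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Sketch of step one: tally all n-grams from 1 up to maxN, keep a
 *  per-size total, and suppress rare n-grams on output.  Whitespace
 *  tokenization stands in for a pluggable Lucene analyzer. */
public class NGramCounter {
  private final int maxN;
  private final Map<String, Long> counts = new HashMap<>();
  private final long[] totals;            // totals[k] = total k-grams seen

  public NGramCounter(int maxN) {
    this.maxN = maxN;
    this.totals = new long[maxN + 1];
  }

  public void add(String text) {
    String[] tokens = text.toLowerCase().split("\\s+");
    for (int n = 1; n <= maxN; n++) {
      for (int i = 0; i + n <= tokens.length; i++) {
        String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
        counts.merge(gram, 1L, Long::sum);
        totals[n]++;
      }
    }
  }

  /** Emit only the n-grams whose count meets the threshold. */
  public void emit(long minCount) {
    counts.forEach((gram, c) -> {
      if (c >= minCount) {
        System.out.println(gram + "\t" + c);
      }
    });
  }

  public long total(int n) { return totals[n]; }
}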
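
For the scoring step, the LLR computation over a 2x2 contingency table looks roughly like this. The entropy-based form is the standard way to compute it; scoreBigram shows how unigram and bigram counts plus the bigram total map onto the table. Class and method names are made up for illustration.

/** Sketch of step two: log-likelihood ratio over the 2x2 contingency
 *  table for a bigram "a b", built from unigram and bigram counts. */
public class LlrScorer {

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized entropy in the form used by the LLR identity.
  private static double entropy(long... counts) {
    long sum = 0;
    double sumXLogX = 0.0;
    for (long c : counts) {
      sum += c;
      sumXLogX += xLogX(c);
    }
    return xLogX(sum) - sumXLogX;
  }

  /** k11: a and b together; k12: a without b; k21: b without a; k22: neither. */
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    if (rowEntropy + colEntropy < matEntropy) {
      return 0.0;                 // guard against rounding going negative
    }
    return 2.0 * (rowEntropy + colEntropy - matEntropy);
  }

  /** Score a bigram from raw counts: c(ab), c(a), c(b), and total bigrams n.
   *  Unigram counts approximate the bigram-position marginals here. */
  public static double scoreBigram(long ab, long a, long b, long n) {
    long k11 = ab;
    long k12 = a - ab;            // a followed by something other than b
    long k21 = b - ab;            // something other than a followed by b
    long k22 = n - a - b + ab;    // neither a nor b involved
    return logLikelihoodRatio(k11, k12, k21, k22);
  }
}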
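
And a sketch of the optional Bloom filter table, with the table size and probe count as the two options mentioned above. Double hashing approximates k independent hash functions here, which is an implementation choice rather than a requirement.

import java.util.BitSet;

/** Sketch of step three: a Bloom filter over the accepted n-grams,
 *  with table size and probe count exposed as options. */
public class NGramBloomFilter {
  private final BitSet bits;
  private final int size;    // number of bits in the table
  private final int probes;  // number of probes per key

  public NGramBloomFilter(int size, int probes) {
    this.bits = new BitSet(size);
    this.size = size;
    this.probes = probes;
  }

  // Double hashing: probe i is h1 + i*h2; h2 is forced odd so it is nonzero.
  private int probe(String key, int i) {
    int h1 = key.hashCode();
    int h2 = (h1 >>> 16) | 1;
    return Math.floorMod(h1 + i * h2, size);
  }

  public void add(String ngram) {
    for (int i = 0; i < probes; i++) {
      bits.set(probe(ngram, i));
    }
  }

  /** May return false positives, never false negatives. */
  public boolean mightContain(String ngram) {
    for (int i = 0; i < probes; i++) {
      if (!bits.get(probe(ngram, i))) {
        return false;
      }
    }
    return true;
  }
}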
