On Wed, Jan 6, 2010 at 3:35 PM, Grant Ingersoll <[email protected]> wrote:
>
> On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:
>
>> No. We really don't.
>
> FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some basic LLR
> stuff that we use in utils.lucene.ClusterLabels. Would be great to see this
> stuff expanded.
>
So, doing something like this would involve some number of M/R passes for ngram generation, counting, and LLR calculation using o.a.m.math.stats.LogLikelihood, but what to do about tokenization? I've seen the approach of using a list of filenames as input to the first mapper, which slurps in each file and tokenizes / generates ngrams from its text, but is there something that works better? Would Lucene's StandardAnalyzer be sufficient for generating tokens?
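For the LLR step itself, once the counting pass has run, I was picturing something along these lines for each bigram "a b" (a rough, untested sketch; the int counts and the exact LogLikelihood signature are assumptions on my part):

  import org.apache.mahout.math.stats.LogLikelihood;

  // Build the 2x2 contingency table for a bigram "a b" from the counts the
  // counting pass would emit, then hand it to LogLikelihood.
  public final class BigramLlr {
    /**
     * @param countAB      occurrences of the bigram "a b"
     * @param countA       bigrams whose first token is "a"
     * @param countB       bigrams whose second token is "b"
     * @param totalBigrams total bigrams in the corpus
     */
    public static double llr(int countAB, int countA, int countB, int totalBigrams) {
      int k11 = countAB;                                   // a followed by b
      int k12 = countA - countAB;                          // a followed by something else
      int k21 = countB - countAB;                          // something other than a, followed by b
      int k22 = totalBigrams - countA - countB + countAB;  // neither a first nor b second
      return LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    }
  }

And on the tokenization side, something like this inside the first mapper (again just a sketch, assuming the Lucene 2.9/3.0 attribute API; the field name "text" is arbitrary):

  import java.io.IOException;
  import java.io.Reader;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;
  import org.apache.lucene.util.Version;

  // Tokenize one document's text with StandardAnalyzer; ngram generation
  // would then just slide a window over the returned token list.
  public final class Tokenize {
    public static List<String> tokens(Reader text) throws IOException {
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
      TokenStream ts = analyzer.tokenStream("text", text);
      TermAttribute term = ts.addAttribute(TermAttribute.class);
      List<String> out = new ArrayList<String>();
      while (ts.incrementToken()) {
        out.add(term.term());
      }
      ts.close();
      return out;
    }
  }

If that holds together, the first mapper would just emit the ngrams keyed for the counting pass, and the LLR calculation would live in a later reduce.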
