On Jan 6, 2010, at 3:52 PM, Drew Farris wrote:

> On Wed, Jan 6, 2010 at 3:35 PM, Grant Ingersoll <[email protected]> wrote:
>> 
>> On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:
>> 
>>> No.  We really don't.
>> 
>> FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some basic LLR 
>> stuff that we use in utils.lucene.ClusterLabels.  Would be great to see this 
>> stuff expanded.
>> 
> 
> So, doing something like this would involve some number of M/R passes
> to do the ngram generation and counting, and then to calculate LLR using
> o.a.m.math.stats.LogLikelihood, but what should we do about tokenization?
> 
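For the LLR step itself, once you have the contingency counts for a candidate
ngram the call is basically a one-liner.  A minimal sketch (the counts here are
made up, and I'm assuming the static logLikelihoodRatio(k11, k12, k21, k22)
method as checked in):

import org.apache.mahout.math.stats.LogLikelihood;

public class LlrExample {
  public static void main(String[] args) {
    // Hypothetical counts for the bigram "new york":
    // k11 = "new" followed by "york", k12 = "new" followed by anything else,
    // k21 = anything else followed by "york", k22 = all remaining bigrams.
    int k11 = 110, k12 = 2442, k21 = 950, k22 = 96410;
    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    System.out.println("LLR(new york) = " + llr);
  }
}
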
> I've seen the approach of using a list of filenames as input to the
> first mapper, which slurps in the text of each file, tokenizes it, and
> generates ngrams, but is there something that works better?
> 
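For the tokenize-and-generate-ngrams part, wrapping whatever Analyzer you pick
in Lucene's ShingleFilter gets you the ngrams straight off the token stream.
A rough sketch against the new Hadoop mapreduce API and Lucene 3.0
(NGramMapper and the "text" field name are made up, and it assumes each input
value already holds the full text of one document):

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class NGramMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private final Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    TokenStream tokens = analyzer.tokenStream("text", new StringReader(value.toString()));
    ShingleFilter shingles = new ShingleFilter(tokens, 2); // emit bigrams
    shingles.setOutputUnigrams(false);                     // ngrams only here
    TermAttribute term = shingles.addAttribute(TermAttribute.class);
    shingles.reset();
    while (shingles.incrementToken()) {
      context.write(new Text(term.term()), ONE);
    }
  }
}

A reducer would then sum the counts; you'd want a similar pass (or
setOutputUnigrams(true)) for the unigram marginals before computing the
k's that go into LogLikelihood.
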
> Would Lucene's StandardAnalyzer be sufficient for generating tokens?

Why not allow passing in the Analyzer?  I think the classifier stuff does, 
assuming the Analyzer has a no-arg constructor, which many do.  It's the one 
place, however, where I think we could benefit from something like Spring or Guice.
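
Until then, plain reflection covers the no-arg case, along these lines (a
sketch; AnalyzerFactory and the "analyzer.class" config key are made up):

import org.apache.lucene.analysis.Analyzer;

// Hypothetical helper: build the Analyzer named in the job config,
// relying on its having a no-arg constructor.
public final class AnalyzerFactory {
  private AnalyzerFactory() {}

  public static Analyzer create(String analyzerClassName) {
    try {
      return Class.forName(analyzerClassName).asSubclass(Analyzer.class).newInstance();
    } catch (Exception e) {
      throw new IllegalStateException("Could not instantiate " + analyzerClassName, e);
    }
  }
}

Usage, e.g. in the mapper's setup():

Analyzer analyzer = AnalyzerFactory.create(conf.get("analyzer.class"));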
