Zaki, NLP does fall under the Mahout umbrella, I'd say. Future subproject perhaps?
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message ----
> From: zaki rahaman <[email protected]>
> To: [email protected]
> Sent: Thu, January 7, 2010 6:44:22 PM
> Subject: Re: Collocations in Mahout?
>
> Ideally, yeah, I think it would be nice to be able to pass in a custom
> analyzer, or at least be able to provide some options. I saw the
> LogLikelihood class Grant was referring to in math.stats, but I don't
> see any M/R LLR piece, at least not one that's nicely abstracted and
> extracted out.
>
> @Ted, where is the partial framework you're referring to? And yes, this
> is definitely something I would like to work on if pointed in the right
> direction. I wasn't quite sure, though, because I remember a long-winded
> discussion/debate a while back on the listserv about what Mahout's
> purpose should be. N-gram LLR for collocations seems like a very NLP
> kind of thing to have (obviously it could be used in other applications
> as well, but by itself it's NLP to me), and from my understanding the
> "consensus" is that Mahout should focus on scalable machine learning.
>
> On Wed, Jan 6, 2010 at 4:04 PM, Grant Ingersoll wrote:
> >
> > On Jan 6, 2010, at 3:52 PM, Drew Farris wrote:
> >
> > > On Wed, Jan 6, 2010 at 3:35 PM, Grant Ingersoll wrote:
> > >>
> > >> On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:
> > >>
> > >>> No. We really don't.
> > >>
> > >> FWIW, I checked in math/o.a.m.math.stats.LogLikelihood with some
> > >> basic LLR stuff that we use in utils.lucene.ClusterLabels. Would be
> > >> great to see this stuff expanded.
> > >
> > > So, doing something like this would involve some number of M/R
> > > passes to do the n-gram generation and counting, and to calculate
> > > LLR using o.a.m.math.stats.LogLikelihood -- but what to do about
> > > tokenization?
> > >
> > > I've seen the approach of using a list of filenames as input to the
> > > first mapper, which slurps in and tokenizes / generates n-grams for
> > > the text of each file, but is there something that works better?
> > >
> > > Would Lucene's StandardAnalyzer be sufficient for generating tokens?
> >
> > Why not be able to pass in the Analyzer? I think the classifier stuff
> > does, assuming it takes a no-arg constructor, which many do. It's the
> > one place, however, where I think we could benefit from something like
> > Spring or Guice.
>
> --
> Zaki Rahaman
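To make the thread concrete: the per-document work Drew's first mapper would do is just tokenize the text and slide a window over the tokens. A minimal self-contained sketch, assuming a naive regex tokenizer standing in for a pluggable Lucene Analyzer (class and method names here are illustrative, not Mahout's API); in the M/R job each emitted n-gram would become a (ngram, 1) pair for a reducer to sum:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-document step of an n-gram counting mapper.
public class NGrams {

    // Stand-in for Analyzer-based tokenization: lowercase and split
    // on anything that isn't a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Slide a window of size n over the tokens, emitting each n-gram
    // as a space-joined string (duplicates kept -- the reducer counts).
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("The quick brown fox, the quick dog.");
        // prints [the quick, quick brown, brown fox, fox the, the quick, quick dog]
        System.out.println(ngrams(tokens, 2));
    }
}
```

Swapping the regex for a real Analyzer would just mean replacing tokenize() with a TokenStream loop, which is exactly why passing the Analyzer in (as Grant suggests) keeps the job generic.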
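The scoring side the thread keeps circling is Dunning's log-likelihood ratio over a 2x2 contingency table of bigram counts, which reduces to an entropy calculation. A standalone sketch of that G^2 statistic (the class below is illustrative, not the checked-in o.a.m.math.stats.LogLikelihood, though it follows the same entropy formulation):

```java
// Dunning's log-likelihood ratio for a 2x2 contingency table:
//   k11 = count(A followed by B)   k12 = count(A followed by not-B)
//   k21 = count(not-A, then B)     k22 = count(neither)
public class Llr {

    // x * log(x), with the usual convention 0 * log(0) == 0
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts:
    // N*log(N) - sum(x*log(x))
    static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long x : counts) {
            sum += x;
            sumXLogX += xLogX(x);
        }
        return xLogX(sum) - sumXLogX;
    }

    // G^2 = 2 * (H(row sums) + H(column sums) - H(cells));
    // clamped at 0 to absorb floating-point noise.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    public static void main(String[] args) {
        // e.g. bigram "new york": seen together 100 times, "new" with
        // other words 900 times, "york" after other words 100 times,
        // 98900 other bigrams. A large score flags a strong collocation.
        System.out.println(logLikelihoodRatio(100, 900, 100, 98900));
    }
}
```

The M/R LLR piece Zaki is asking about would just feed the per-n-gram counts from the counting reducer into this function as a final pass.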
