Zaki, NLP does fall under the Mahout umbrella, I'd say. Future subproject perhaps?
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message ----
> From: zaki rahaman <[email protected]>
> To: [email protected]
> Sent: Thu, January 7, 2010 6:44:22 PM
> Subject: Re: Collocations in Mahout?
>
> Ideally, yeah, I think it would be nice to be able to pass in a custom
> analyzer, or at least be able to provide some options. I saw the
> LogLikelihood class Grant was referring to in math.stats, but I don't
> see any M/R LLR piece, at least not one that's nicely abstracted and
> extracted out.
>
> @Ted, where is the partial framework you're referring to? And yes, this
> is definitely something I would like to work on if pointed in the right
> direction. I wasn't quite sure, though, because I remember a long-winded
> discussion/debate a while back on the listserv about what Mahout's
> purpose should be. N-gram LLR for collocations seems like a very NLP
> kind of thing to have (obviously it could be used in other applications
> as well, but by itself it's NLP to me), and from my understanding the
> "consensus" is that Mahout should focus on scalable machine learning.
>
> On Wed, Jan 6, 2010 at 4:04 PM, Grant Ingersoll wrote:
> >
> > On Jan 6, 2010, at 3:52 PM, Drew Farris wrote:
> >
> > > On Wed, Jan 6, 2010 at 3:35 PM, Grant Ingersoll wrote:
> > >>
> > >> On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:
> > >>
> > >>> No. We really don't.
> > >>
> > >> FWIW, I checked in math/o.a.m.math.stats.LogLikelihood with some
> > >> basic LLR stuff that we use in utils.lucene.ClusterLabels. Would be
> > >> great to see this stuff expanded.
> > >
> > > So, doing something like this would involve some number of M/R
> > > passes to do the n-gram generation and counting, and to calculate
> > > LLR using o.a.m.math.stats.LogLikelihood -- but what to do about
> > > tokenization?
> > >
> > > I've seen the approach of using a list of filenames as input to the
> > > first mapper, which slurps in and tokenizes / generates n-grams for
> > > the text of each file, but is there something that works better?
> > >
> > > Would Lucene's StandardAnalyzer be sufficient for generating tokens?
> >
> > Why not be able to pass in the Analyzer? I think the classifier stuff
> > does, assuming it takes a no-arg constructor, which many do. It's the
> > one place, however, where I think we could benefit from something like
> > Spring or Guice.
>
> --
> Zaki Rahaman
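To make the thread concrete: the per-document work Drew's first mapper would do is just tokenize the text and slide a window over the tokens. A minimal self-contained sketch, assuming a naive regex tokenizer standing in for a pluggable Lucene Analyzer (class and method names here are illustrative, not Mahout's API); in the M/R job each emitted n-gram would become a (ngram, 1) pair for a reducer to sum:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-document step of an n-gram counting mapper.
public class NGrams {

    // Stand-in for Analyzer-based tokenization: lowercase and split
    // on anything that isn't a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Slide a window of size n over the tokens, emitting each n-gram
    // as a space-joined string (duplicates kept -- the reducer counts).
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("The quick brown fox, the quick dog.");
        // prints [the quick, quick brown, brown fox, fox the, the quick, quick dog]
        System.out.println(ngrams(tokens, 2));
    }
}
```

Swapping the regex for a real Analyzer would just mean replacing tokenize() with a TokenStream loop, which is exactly why passing the Analyzer in (as Grant suggests) keeps the job generic.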
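The scoring side the thread keeps circling is Dunning's log-likelihood ratio over a 2x2 contingency table of bigram counts, which reduces to an entropy calculation. A standalone sketch of that G^2 statistic (the class below is illustrative, not the checked-in o.a.m.math.stats.LogLikelihood, though it follows the same entropy formulation):

```java
// Dunning's log-likelihood ratio for a 2x2 contingency table:
//   k11 = count(A followed by B)   k12 = count(A followed by not-B)
//   k21 = count(not-A, then B)     k22 = count(neither)
public class Llr {

    // x * log(x), with the usual convention 0 * log(0) == 0
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts:
    // N*log(N) - sum(x*log(x))
    static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long x : counts) {
            sum += x;
            sumXLogX += xLogX(x);
        }
        return xLogX(sum) - sumXLogX;
    }

    // G^2 = 2 * (H(row sums) + H(column sums) - H(cells));
    // clamped at 0 to absorb floating-point noise.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    public static void main(String[] args) {
        // e.g. bigram "new york": seen together 100 times, "new" with
        // other words 900 times, "york" after other words 100 times,
        // 98900 other bigrams. A large score flags a strong collocation.
        System.out.println(logLikelihoodRatio(100, 900, 100, 98900));
    }
}
```

The M/R LLR piece Zaki is asking about would just feed the per-n-gram counts from the counting reducer into this function as a final pass.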
