On Jan 7, 2010, at 7:57 PM, Ted Dunning wrote:

> The pieces are lying around.
> 
> I had a framework like this for recs and text analysis at Veoh, and Jake has
> something at LinkedIn.
> 
> But the amount of code is relatively small and could probably be rewritten
> before Jake can get clearance to release anything.
> 
> The first step is just to count n-grams.  I think the input should be
> relatively flexible, and if you assume parameterized use of Lucene analyzers,
> then all that is necessary is a small step up from word counting.

The classification stuff has this already, in MR form, independent of Lucene.
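
If we do want the Lucene-analyzer route Ted describes, the tokenize-and-shingle
part is only a few lines against a recent Lucene.  A minimal sketch; the
StandardAnalyzer and the shingle sizes here are placeholder choices, and a real
job would take both as parameters:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NGramCounter {
  // Counts all word n-grams of size 1..3 in a piece of text.
  // ShingleFilter emits the unigrams too, since outputUnigrams
  // defaults to true.
  public static Map<String, Long> count(String text) throws IOException {
    Map<String, Long> counts = new HashMap<>();
    Analyzer analyzer = new StandardAnalyzer();  // placeholder analyzer
    try (TokenStream ts =
             new ShingleFilter(analyzer.tokenStream("text", text), 2, 3)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        counts.merge(term.toString(), 1L, Long::sum);
      }
      ts.end();
    }
    return counts;
  }
}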

> This
> should count all n-grams from 1 up to a size limit.  It should also allow
> suppressing the output of any n-gram whose count falls below a threshold.
> The total number of n-grams of each size observed should be accumulated.

I believe it does this, too.  Robin?
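
The threshold and totals bookkeeping is cheap to do reducer-side in any case.
A sketch of what I mean, with a made-up config key and counter group:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the partial counts for each n-gram, suppresses anything below a
// minimum support, and accumulates per-size totals in Hadoop counters
// (the totals include the suppressed n-grams, which the LLR step needs).
public class NGramCountReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {
  private long minSupport;

  @Override
  protected void setup(Context context) {
    minSupport = context.getConfiguration().getLong("ngram.minSupport", 2);
  }

  @Override
  protected void reduce(Text ngram, Iterable<LongWritable> values,
      Context context) throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    int size = ngram.toString().split(" ").length;
    context.getCounter("ngram-totals", "size-" + size).increment(sum);
    if (sum >= minSupport) {
      context.write(ngram, new LongWritable(sum));
    }
  }
}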

> There should also be
> some provision for counting cooccurrence pairs within windows or between two
> fields.
> 
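
Windowed cooccurrence counting is essentially the same job with a different
mapper.  Roughly, with an arbitrary window size and key format:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one (pair, 1) record for every ordered pair of tokens that fall
// within a fixed window of each other; "cooc.window" is a made-up key.
public class CooccurrenceMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  private int window;

  @Override
  protected void setup(Context context) {
    window = context.getConfiguration().getInt("cooc.window", 5);
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().split("\\s+");
    for (int i = 0; i < tokens.length; i++) {
      for (int j = i + 1; j <= i + window && j < tokens.length; j++) {
        context.write(new Text(tokens[i] + "\t" + tokens[j]), ONE);
      }
    }
  }
}
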
> The second step is to detect interesting n-grams.  This is done using the
> counts of words and (n-1)-grams, plus the relevant totals, as input to the
> log-likelihood ratio (LLR) code.
> 
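
For bigrams that step boils down to building a 2x2 contingency table per
candidate and scoring it.  Assuming the LogLikelihood class that lives in
org.apache.mahout.math.stats (going from memory on the package name):

import org.apache.mahout.math.stats.LogLikelihood;

public class BigramScorer {
  // countAB = occurrences of the bigram "A B"
  // countA  = bigrams starting with A
  // countB  = bigrams ending with B
  // total   = total number of bigrams observed
  public static double llr(long countAB, long countA, long countB, long total) {
    long k11 = countAB;                            // A followed by B
    long k12 = countA - countAB;                   // A followed by not-B
    long k21 = countB - countAB;                   // not-A followed by B
    long k22 = total - countA - countB + countAB;  // neither
    return LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
  }
}
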
> The final (optional) step is the creation of a Bloom filter table.  Options
> should control the size of the table and the number of probes.
> 
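
Hadoop already ships a Bloom filter we could lean on rather than writing our
own, and its constructor takes exactly those two knobs (the values below are
arbitrary):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class NGramFilter {
  public static void main(String[] args) {
    // 2^24 bits in the table, 5 probes (hash functions) per key.
    BloomFilter filter = new BloomFilter(1 << 24, 5, Hash.MURMUR_HASH);
    filter.add(new Key("machine learning".getBytes(StandardCharsets.UTF_8)));
    // May return a false positive, but never a false negative.
    System.out.println(filter.membershipTest(
        new Key("machine learning".getBytes(StandardCharsets.UTF_8))));
  }
}
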
> Building up all these pieces and connecting them is a truly worthy task.
> 
> On Thu, Jan 7, 2010 at 3:44 PM, zaki rahaman <[email protected]> wrote:
> 
>> @Ted, where is the partial framework you're referring to?  And yes, this is
>> definitely something I would like to work on if pointed in the right
>> direction.  I wasn't quite sure, though, because I remember a long-winded
>> discussion/debate a while back on the listserv about what Mahout's purpose
>> should be.  N-gram LLR for collocations seems like a very NLP type of thing
>> to have (it could be used in other applications as well, but by itself it's
>> NLP to me), and from my understanding the "consensus" is that Mahout should
>> focus on scalable machine learning.
>> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
