The pieces are lying around.

I had a framework like this for recs and text analysis at Veoh, and Jake
has something similar at LinkedIn.

But the amount of code is relatively small and probably could be rewritten
before Jake can get clearance to release anything.

The first step is just to count n-grams.  I think the input should be
relatively flexible; if you assume parametrized use of Lucene analyzers,
then all that is necessary is a small step up from word counting.  This
should count all n-grams from size 1 up to a limit.  It should also allow
suppression of output for any counts below a threshold, and the total
number of n-grams of each size observed should be accumulated.  There
should also be some provision for counting cooccurrence pairs within
windows or between two fields.
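
Roughly, something like the following (a sketch only; the class and
parameter names are made up, and a real version would take its tokens from
a parametrized Lucene analyzer and run as a map-reduce rather than in
memory):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of step one: count all n-grams up to maxN,
    // keep a total per n-gram size, and suppress output below minCount.
    public class NGramCounter {
      private final int maxN;
      private final int minCount;
      private final Map<String, Integer> counts = new HashMap<String, Integer>();
      private final long[] totals;  // totals[n] = n-grams of size n observed

      public NGramCounter(int maxN, int minCount) {
        this.maxN = maxN;
        this.minCount = minCount;
        this.totals = new long[maxN + 1];
      }

      // tokens would come from the analyzer; here they are just strings
      public void add(List<String> tokens) {
        for (int n = 1; n <= maxN; n++) {
          for (int i = 0; i + n <= tokens.size(); i++) {
            String gram = join(tokens.subList(i, i + n));
            Integer old = counts.get(gram);
            counts.put(gram, old == null ? 1 : old + 1);
            totals[n]++;
          }
        }
      }

      public void dump() {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          if (e.getValue() >= minCount) {  // threshold suppression
            System.out.println(e.getKey() + "\t" + e.getValue());
          }
        }
      }

      private static String join(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
          if (sb.length() > 0) {
            sb.append(' ');
          }
          sb.append(p);
        }
        return sb.toString();
      }
    }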

The second step is to detect interesting n-grams.  This is done using the
counts of the n-grams, the counts of their constituent words and
(n-1)-grams, and the relevant totals as input to the LLR code.
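
The LLR computation itself is tiny.  One standard entropy-based way to
write it (the class name here is made up) is:

    // For a candidate bigram AB over a corpus of `total' bigrams:
    //   k11 = count(AB)
    //   k12 = count(A) - count(AB)
    //   k21 = count(B) - count(AB)
    //   k22 = total - count(A) - count(B) + count(AB)
    public final class Llr {
      public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + colEntropy - matrixEntropy);
      }

      private static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
          result += xLogX(c);
          sum += c;
        }
        return xLogX(sum) - result;
      }

      private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
      }
    }

A large score means AB occurs together far more often than independence of
A and B would predict.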

The final (optional) step is the creation of a Bloom filter table.  Options
should control the size of the table and the number of probes.
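
A toy version, just to show where those two options plug in (the hashing
here is a placeholder, not a considered choice):

    import java.util.BitSet;

    // Minimal Bloom filter sketch: `size' bits, `probes' hash probes per key.
    public class BloomTable {
      private final BitSet bits;
      private final int size;
      private final int probes;

      public BloomTable(int size, int probes) {
        this.size = size;
        this.probes = probes;
        this.bits = new BitSet(size);
      }

      public void add(String key) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16);  // cheap second hash for double hashing
        for (int i = 0; i < probes; i++) {
          bits.set(index(h1 + i * h2));
        }
      }

      public boolean mightContain(String key) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16);
        for (int i = 0; i < probes; i++) {
          if (!bits.get(index(h1 + i * h2))) {
            return false;  // definitely not present
          }
        }
        return true;       // probably present
      }

      private int index(int hash) {
        return (hash & 0x7fffffff) % size;
      }
    }

Sizing follows the usual Bloom filter rule of thumb: for n keys and a
target false positive rate p, about n * ln(1/p) / (ln 2)^2 bits and
log2(1/p) probes.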

Building up all these pieces and connecting them is a truly worthy task.

On Thu, Jan 7, 2010 at 3:44 PM, zaki rahaman <[email protected]> wrote:

> @Ted, where is the partial framework you're referring to? And yes, this is
> definitely something I would like to work on if pointed in the right
> direction. I wasn't quite sure, though, just b/c I remember a long-winded
> discussion/debate a while back on the listserv about what Mahout's purpose
> should be. N-gram LLR for collocations seems like a very NLP type of thing
> to have (obviously it could be used in other applications as well, but by
> itself it's NLP to me), and from my understanding the "consensus" is that
> Mahout should focus on scalable machine learning.
>



-- 
Ted Dunning, CTO
DeepDyve
