On Fri, Jan 8, 2010 at 7:03 AM, Grant Ingersoll <[email protected]> wrote:

>
> On Jan 7, 2010, at 7:57 PM, Ted Dunning wrote:
>
> > The pieces are laying around.
> >
> > I had a framework like this for recs and text analysis at Veoh, Jake has
> > something in LinkedIn.
> >
> > But the amount of code is relatively small and probably could be rewritten
> > before Jake can get clearance to release anything.
> >
> > The first step is to just count n-grams.  I think that the input should be
> > relatively flexible and if you assume parametrized use of Lucene analyzers,
> > then all that is necessary is a small step up from word counting.
>
> The classification stuff has this already, in MR form, independent of
> Lucene.
>
> > This should count all n-grams from 0 up to a limit.  It should also allow
> > suppression of output of any counts less than a threshold.  Total number of
> > n-grams of each size observed should be accumulated.
>
> I believe it does this, too.  Robin?
>
Yeah, brute-force n-gram generation is done by the Bayes classifier. Beware:
it's practically a combinatorial explosion of data. But enough machines can
tame it well.
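
To make the counting step concrete, here is a minimal single-machine sketch
of what Ted describes: count every n-gram up to a limit, suppress counts
below a threshold, and accumulate the total number of n-grams seen at each
size. This is not the Bayes classifier's MapReduce code; the function name
count_ngrams and its parameters are made up for illustration, and it starts
at n = 1 rather than 0.

```python
from collections import Counter


def count_ngrams(docs, max_n, min_count=1):
    """Count all n-grams of size 1..max_n over tokenized documents.

    docs is an iterable of token lists, so n-grams never cross a
    document boundary. Returns (kept, totals): kept holds only the
    n-grams seen at least min_count times, totals maps each size n to
    the number of n-grams observed before any suppression.
    """
    counts = Counter()
    totals = Counter()
    for tokens in docs:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
                totals[n] += 1
    kept = {gram: c for gram, c in counts.items() if c >= min_count}
    return kept, dict(totals)


if __name__ == "__main__":
    docs = [["the", "quick", "fox"], ["the", "quick", "dog"]]
    kept, totals = count_ngrams(docs, max_n=2, min_count=2)
    print(kept)
    print(totals)
```

The totals are kept separate from the thresholded counts because an LLR
step needs the overall number of n-grams of each size, including the ones
whose individual counts were suppressed.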

Take a look at the DictionaryVectorizer. If an LLR job could be added in a
chain, I could use that information while creating vectors.
https://issues.apache.org/jira/browse/MAHOUT-237

I like the formulation that Drew made, using (n-1)-grams to generate n-grams.
It is the same approach I used to generate n-grams here:
http://thinking.me/ (Himanshu and I built it when I was still in college).
But that was just a PHP script that iterates over a sample of Twitter data
:).
One interesting thing I found was that any n-gram with LLR < 1 is practically
junk, and anything with LLR > 50 is pretty awesome. Between 1 and 50, it's
always debatable. This holds approximately true for large and small datasets.
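
For reference, a small sketch of how a bigram's LLR score could be computed
from a 2x2 contingency table and then bucketed with the rough cut-offs
above. The entropy-based form of the log-likelihood ratio is standard; the
helper names (x_log_x, entropy, llr, bucket) and the bucket labels are only
illustrative, and the cut-offs of 1 and 50 are just the rule of thumb from
this mail, not tuned values.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of bigram counts.

    k11: A followed by B      k12: A followed by not-B
    k21: not-A followed by B  k22: neither A nor B
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def bucket(score, low=1.0, high=50.0):
    """Apply the rule-of-thumb thresholds (assumed, not tuned)."""
    if score < low:
        return "junk"
    if score > high:
        return "strong"
    return "debatable"


if __name__ == "__main__":
    score = llr(110, 2442, 111, 29114)
    print(score, bucket(score))
```

A table whose rows and columns are independent scores close to zero, so it
lands in the "junk" bucket regardless of how frequent the bigram is.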

I will be really happy if Drew can work on the LLR-based bigram generation
code and help me attach it to the rest of the DictionaryVectorizer.

Also, I would prefer that the entire Mahout codebase agree upon a single
format for document input. I would suggest we stick to SequenceFiles with
the docid as key and the document content as value. That way, we leave the
creation of sequence files to the user.


Robin
