This conversation has been pretty inspiring, thanks everyone.

I spent some time thinking about the steps involved in the M/R LLR
job. I'm a bit of a greenhorn when it comes to this, so it would be
great to see what you all think. Here's what I've been able to piece
together so far:

To perform the LLR calculation, we need 4 values for each ngram, based
on the (n-1)grams it contains: A+B, A+!B, !A+B, !A+!B. For the ngram
'the best', with A=the, B=best, this would mean:

A+B = the number of times the ngram 'the best' appears
A+!B = the number of times 'the' appears in an ngram without 'best'
!A+B = the number of times 'best' appears in an ngram without 'the'
!A+!B = the number of ngrams that contain neither A nor B (are not 'the best')

It is also necessary to have N:
N = the total number of ngrams

(Ted's blog post referenced in the LLR class really helped me
understand this concretely, thanks!)

Input to the job is arranged so that the first mapper gets a single
entire text document for each map call. The mapper runs it through the
Analyzer + Lucene ShingleFilter combo.

The output of that map task is something like:

k:(n-1)gram v:ngram

For the input: 'the best of times'
The mapper output is:

k:the v:the best
k:best v:the best
k:best v:best of
k:of    v:best of
k:of    v:of times
k:times v:of times

As an aside, the shingles here were 'the best', 'best of', 'of times' --
we preserve the total number of ngrams/shingles (N). In this case, it
would be 3.
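
For what it's worth, here's roughly the shape of that first mapper.
This is only a sketch: it assumes the old (0.19) mapred API, a Lucene
2.9-style TermAttribute, and that the whole document arrives as a Text
value; the class name, field name and counter names are made up, and
accumulating N in a counter is just one option.

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class ShingleMapper extends MapReduceBase
    implements Mapper<Writable, Text, Text, Text> {

  private final Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);

  public void map(Writable key, Text doc,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    TokenStream ts =
        analyzer.tokenStream("text", new StringReader(doc.toString()));
    ShingleFilter shingles = new ShingleFilter(ts, 2);  // bigrams only here
    shingles.setOutputUnigrams(false);                  // shingles, no unigrams
    TermAttribute term = shingles.addAttribute(TermAttribute.class);
    while (shingles.incrementToken()) {
      String ngram = term.term();                       // e.g. "the best"
      reporter.incrCounter("llr", "ngram total", 1);    // accumulate N (made-up counter)
      for (String subgram : ngram.split(" ")) {         // the two (n-1)grams for n=2
        output.collect(new Text(subgram), new Text(ngram));
      }
    }
  }
}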

In the reducer, we count the number of ngrams each (n-1)gram appears
in and the number of times each ngram appears. (I wind up counting
each ngram 'n' times, however -- there's probably a way around that.)
Once we have this info, we can output the following from the reducer:

k:ngram,ngram-frequency v:(n-1)gram,(n-1)gram-frequency

e.g:
k:the best,1 v:best,2
k:best of,1  v:best,2
k:best of,1  v:of,2
k:of times,1 v:of,2
k:the best,1 v:the,1
k:of times,1 v:times,1
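
And a sketch of the matching reducer (same caveats -- old API, made-up
names). Since the key is an (n-1)gram and the values are every ngram it
appeared in, both frequencies fall out of a single pass over the values:

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SubgramReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text subgram, Iterator<Text> ngrams,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    Map<String, Integer> ngramCounts = new HashMap<String, Integer>();
    int subgramFreq = 0;                // number of ngrams this (n-1)gram is in
    while (ngrams.hasNext()) {
      String ngram = ngrams.next().toString();
      Integer c = ngramCounts.get(ngram);
      ngramCounts.put(ngram, c == null ? 1 : c + 1);
      subgramFreq++;
    }
    for (Map.Entry<String, Integer> e : ngramCounts.entrySet()) {
      // k:ngram,ngram-frequency  v:(n-1)gram,(n-1)gram-frequency
      output.collect(new Text(e.getKey() + "," + e.getValue()),
                     new Text(subgram.toString() + "," + subgramFreq));
    }
  }
}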

The next mapper could just be a no-op, because we have the data in the
right shape to do the LLR in the next reduction pass:

k:the best,1  v:best,2; v:the,1
k:best of,1   v:best,2; v:of,2
k:of times,1  v:of,2; v:times,1

(the (n-1)gram values are shown sorted)
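
(For the no-op pass itself I think Hadoop's stock identity mapper would
do rather than writing one, assuming the old-API job setup:

conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);

so it's an extra pass but at least no extra code.)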

nf = the ngram frequency
ln1f = the left (n-1)gram frequency
rn1f = the right (n-1)gram frequency
N = the total number of ngrams

A+B = nf
A+!B = ln1f - nf
!A+B = rn1f - nf
!A+!B = N - (ln1f + rn1f - nf)
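
Sanity-checking against the toy example above, for 'the best' we have
nf=1, ln1f=1 ('the'), rn1f=2 ('best') and N=3, so:

A+B = 1
A+!B = 1 - 1 = 0
!A+B = 2 - 1 = 1
!A+!B = 3 - (1 + 2 - 1) = 1

and the four cells add back up to N = 3, which seems right.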

With these we can calculate LLR using the class in o.a.m.stats, and the
reducer output can be:

k:LLR v:ngram
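
Concretely, I was picturing something like this for that last reducer.
Again only a sketch: it assumes N gets passed in through the JobConf
under a made-up key, that the keys/values are the comma-delimited
strings above, and that logLikelihoodRatio(k11, k12, k21, k22) is the
right entry point on the o.a.m.stats class (I've left its import out
since I'm not sure of the exact package):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
// plus the LLR class from o.a.m.stats mentioned above

public class LLRReducer extends MapReduceBase
    implements Reducer<Text, Text, DoubleWritable, Text> {

  private int n;                                   // total number of ngrams

  public void configure(JobConf job) {
    n = job.getInt("llr.ngram.total", 0);          // made-up config key
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<DoubleWritable, Text> output, Reporter reporter)
      throws IOException {
    String[] k = key.toString().split(",");        // ngram, ngram-frequency
    String ngram = k[0];
    int nf = Integer.parseInt(k[1]);

    String left = ngram.split(" ")[0];             // left word of the bigram
    int ln1f = 0, rn1f = 0;
    while (values.hasNext()) {
      String[] v = values.next().toString().split(",");  // (n-1)gram, frequency
      if (v[0].equals(left)) {
        ln1f = Integer.parseInt(v[1]);
      } else {
        rn1f = Integer.parseInt(v[1]);
      }
    }

    int k11 = nf;                                  // A+B
    int k12 = ln1f - nf;                           // A+!B
    int k21 = rn1f - nf;                           // !A+B
    int k22 = n - (ln1f + rn1f - nf);              // !A+!B
    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    output.collect(new DoubleWritable(llr), new Text(ngram));
  }
}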

Does this work, or did I miss something critical?  I've only thought
this through for n=2, but I suspect it extends to other cases. I'm
curious as to whether it has a problem when both 'best of' and 'of
best' occur in the text, but I haven't thought that through yet.

I have the first map/reduce pass implemented using the Hadoop 0.19
API, but the code is brutally inefficient in a couple of spots,
especially because it keeps the *grams as strings. There should
probably be a pass to convert them to integers or something more
compact, right?

I'm also wondering about the best way to handle input. Line-by-line
processing would miss ngrams spanning lines, but full-document
processing with the StandardAnalyzer+ShingleFilter will form ngrams
across sentence boundaries.

I followed the WholeFileInputFormat example from the Hadoop in Action
book to slurp in data, but I don't like the idea of reading the entire
file into a buffer and then wrapping that up into something that can
be accessed via the Reader consumed by Lucene's Analyzer. I would much
rather pass a Reader/Stream into the mapper if possible.
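
(If I stay with the buffer approach for now, the wrapping itself is at
least cheap -- assuming the input format hands the mapper a
BytesWritable and the files are UTF-8, something like:

Reader reader = new InputStreamReader(
    new ByteArrayInputStream(value.getBytes(), 0, value.getLength()), "UTF-8");
TokenStream ts = analyzer.tokenStream("text", reader);

but a record reader that exposes a stream directly would obviously be
nicer.)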

I'm interested in whether there's a more efficient way to structure
the M/R passes. It feels a little funny to no-op a whole map cycle. It
would almost be better if one could chain two reduces together.

On Thu, Jan 7, 2010 at 7:57 PM, Ted Dunning <[email protected]> wrote:
> The pieces are laying around.
>
> I had a framework like this for recs and text analysis at Veoh, Jake has
> something in LinkedIn.
>
> But the amount of code is relatively small and probably could be rewritten
> before Jake can get clearance to release anything.
>
> The first step is to just count n-grams.  I think that the input should be
> relatively flexible and if you assume parametrized use of Lucene analyzers,
> then all that is necessary is a small step up from word counting.  This
> should count all n-grams from 0 up to a limit.  It should also allow
> suppression of output of any counts less than a threshold.  Total number of
> n-grams of each size observed should be accumulated.  There should also be
> some provision for counting cooccurrence pairs within windows or between two
> fields.
>
> The second step is to detect interesting n-grams.  This is done using the
> counts of words and (n-1)-grams and the relevant totals as input for the LLR
> code.
>
> The final (optional) step is creation of a Bloom filter table.  Options
> should control size of the table and number of probes.
>
> Building up all these pieces and connecting them is a truly worthy task.
>
> On Thu, Jan 7, 2010 at 3:44 PM, zaki rahaman <[email protected]> wrote:
>
>> @Ted, where is the partial framework you're referring to. And yes this is
>> definitely something I would like to work on if pointed in the right
>> direction. I wasn't quite sure though just b/c I remember a long-winded
>> discussion/debate a while back on the listserv about what Mahout's purpose
>> should be. N-gram LLR for collocations seems like a very NLP type of thing
>> to have (obviously it could also be used in other applications as well but
>> by itself its NLP to me) and from my understanding the "consensus" is that
>> Mahout should focus on scalable machine learning.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
