No.  We really don't.

The most straightforward implementation uses separate passes: one to compute
the overall total, one to count the unigrams, and one to count the bigrams.
It is cooler, of course, to count all sizes of ngrams in one pass and write
them to separate files.  A second pass can then do a map-side join against
the unigram table, if it is small enough to hold in memory (it usually is),
and compute the LLR scores.  All of this is very straightforward programming
and makes a great introduction to map-reduce programming.
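
For concreteness, here is a rough single-machine sketch of the same two steps
in plain Python.  It is not a Hadoop job and not anything that exists in
Mahout.  The function names (count_ngrams, score_bigrams and so on) are made
up for illustration.  One assumption to flag: instead of a plain unigram
table, the sketch counts how often each word starts or ends a bigram, so the
2x2 contingency table for every bigram adds up exactly.

```python
from collections import Counter
from math import log


def count_ngrams(tokens):
    """One pass: count bigrams and how often each word starts or ends one."""
    heads, tails, bigrams = Counter(), Counter(), Counter()
    for a, b in zip(tokens, tokens[1:]):
        heads[a] += 1          # word seen in first position of a bigram
        tails[b] += 1          # word seen in second position of a bigram
        bigrams[(a, b)] += 1
    total = sum(bigrams.values())
    return heads, tails, bigrams, total


def x_log_x(x):
    return x * log(x) if x > 0 else 0.0


def entropy(*counts):
    # Unnormalized entropy: N log N - sum(k log k)
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def score_bigrams(tokens):
    """Second step: join each bigram with its marginal counts and score it."""
    heads, tails, bigrams, total = count_ngrams(tokens)
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = heads[a] - k11            # a followed by something other than b
        k21 = tails[b] - k11            # b preceded by something other than a
        k22 = total - k11 - k12 - k21   # neither a first nor b second
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores
```

Calling score_bigrams("a b a b c d".split()) returns a dict keyed by bigram
tuples.  In a map-reduce version the in-memory lookups of heads and tails
would play the role of the map-side join.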

On Tue, Jan 5, 2010 at 12:09 PM, Jake Mannix <[email protected]> wrote:

> Ted, we don't have a MR job to scan through a corpus and output
> [ngram : LLR] key-value pairs, do we?  I've got one we use at LinkedIn
> that I could try and pull out if we don't have one.
>
> (I actually used to give this MR job as an interview question, because
> it's a cute problem you can work out the basics of in not too long).
>



-- 
Ted Dunning, CTO
DeepDyve
