On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:

> No. We really don't.

FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some basic LLR
stuff that we use in utils.lucene.ClusterLabels. Would be great to see this
stuff expanded.

> The most straightforward implementation does a separate pass for computing
> the overall total, for counting the unigrams and then counting the bigrams.
> It is cooler, of course, to count all sizes of ngrams in one pass and output
> them to separate files. Then a second pass can do a map-side join if the
> unigram table is small enough (it usually is) and compute the results. All
> of this is very straightforward programming and is a great introduction to
> map-reduce programming.
>
> On Tue, Jan 5, 2010 at 12:09 PM, Jake Mannix <[email protected]> wrote:
>
>> Ted, we don't have a MR job to scan through a corpus and output
>> [ngram : LLR] key-value pairs, do we? I've got one we use at LinkedIn
>> that I could try and pull out if we don't have one.
>>
>> (I actually used to give this MR job as an interview question, because
>> it's a cute problem you can work out the basics of in not too long.)
>>
>
> --
> Ted Dunning, CTO
> DeepDyve
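For anyone who wants to see the arithmetic behind the counting Ted describes: here is a minimal, self-contained sketch (plain Java, no Hadoop) of how bigram and unigram counts turn into the 2x2 contingency table and a raw LLR (G^2) score. The class and method names are illustrative only, not the actual Mahout API; the checked-in LogLikelihood class may expose this differently.

    // Sketch only: given corpus-wide counts, build the 2x2 contingency table
    // for a bigram "A B" and compute the raw log-likelihood ratio (G^2).
    // All names here are illustrative, not taken from the Mahout code base.
    public final class BigramLlrSketch {

      // k11 = count(A B), k12 = count(A *) - count(A B),
      // k21 = count(* B) - count(A B), k22 = everything else.
      public static double llr(long countAB, long countA, long countB, long totalBigrams) {
        long k11 = countAB;
        long k12 = countA - countAB;
        long k21 = countB - countAB;
        long k22 = totalBigrams - countA - countB + countAB;
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        double llr = 2.0 * (rowEntropy + colEntropy - matrixEntropy);
        return llr < 0.0 ? 0.0 : llr; // guard against small rounding errors
      }

      // "Unnormalized" entropy: x*log(x) of the total minus the sum over the cells.
      private static double entropy(long... counts) {
        long total = 0;
        double sumXLogX = 0.0;
        for (long x : counts) {
          total += x;
          sumXLogX += xLogX(x);
        }
        return xLogX(total) - sumXLogX;
      }

      private static double xLogX(long x) {
        return x <= 0 ? 0.0 : x * Math.log(x);
      }

      private BigramLlrSketch() {}
    }

In the two-pass job Ted outlines, countA, countB, and totalBigrams would come from the unigram/total counts produced by the first pass, joined map-side against the bigram counts in the second pass before calling something like the llr() above.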
