On Tue, Jan 5, 2010 at 12:18 PM, Ted Dunning <[email protected]> wrote:
> No. We really don't. > > The most straightforward implementation does a separate pass for computing > the overall total, for counting the unigrams and then counting the bigrams. > It is cooler, of course, to count all sizes of ngrams in one pass and > output > them to separate files. Then a second pass can do a map-side join if the > unigram table is small enough (it usually is) and compute the results. All > of this is very straightforward programming and is a great introduction to > map-reduce programming. > Oh you and your map-side join! That's no fun - to pass my interview you need to do it in the case where the unigrams *don't* fit in memory (because lets say you wanted to compute a variation of LLR for trigrams which used the sub-bigram counts in the computation). It's much more fun to do an n-way self-join with composite keys and secondary sorts! :) -jake
