Ok, I lied - I think what you described here is way *faster* than what I
was doing, because I wasn't starting with the original corpus: I had
something like Google's ngram terabyte data (a massive HDFS file with
just "ngram ngram-frequency" on each line), which meant I had to do
a multi-way join (which is where I needed to do a secondary sort by
value).

Starting with the corpus itself (the case we're talking about) you have
some nice tricks in here:

On Thu, Jan 7, 2010 at 6:46 PM, Drew Farris <[email protected]> wrote:
>
>
> The output of that map task is something like:
>
> k:(n-1)gram v:ngram
>

This is great right here - it helps you kill two birds with one stone: the
join and the wordcount phases.
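
For what it's worth, that first map might look something like the sketch
below (the class name and record layout are made up, and it assumes the
ngrams arrive one per line from an earlier shingling step):

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // For every ngram occurrence, emit the full ngram keyed by each of its
    // (n-1)gram subgrams, so a single reduce can count the subgram and join
    // that count back onto the ngram in the same pass.
    public class SubgramMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String ngram = line.toString().trim();           // e.g. "best of"
        String[] tokens = ngram.split("\\s+");
        if (tokens.length < 2) {
          return;
        }
        // leading (n-1)gram: everything but the last token
        String head = String.join(" ",
            Arrays.copyOfRange(tokens, 0, tokens.length - 1));
        // trailing (n-1)gram: everything but the first token
        String tail = String.join(" ",
            Arrays.copyOfRange(tokens, 1, tokens.length));
        ctx.write(new Text(head), new Text(ngram));
        ctx.write(new Text(tail), new Text(ngram));
      }
    }

The reduce for that key is then just: count the values to get the (n-1)gram
frequency, and emit each ngram back out with that count attached.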


> k:ngram,ngram-frequency v:(n-1)gram,(n-1) gram freq
>
> e.g:
> k:the best,1, v:best,2
> k:best of,1, v:best,2
> k:best of,1, v:of,2
> k:of times,1 v:of,2
> k:the best,1, v:the,1
> k:of times,1 v:times,1
>

Yeah, once you're here, you're home free.  This should really be a rather
quick set of jobs, even on really big data, and even dealing with it as
text.
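
Once you have records shaped like the example above, the scoring reduce is
short.  Here's a sketch - the field delimiters, the total-ngram count, and
the class name are all made up, and the LLR call is Mahout's LogLikelihood
helper (whose package location may differ by version):

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.math.stats.LogLikelihood;

    // key   = "ngram<TAB>ngramFreq"
    // value = "subgram<TAB>subgramFreq"  (one for the head, one for the tail)
    public class LlrReducer extends Reducer<Text, Text, Text, DoubleWritable> {

      private long totalNgrams = 1000000L;  // placeholder; really from a counter or side file

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context ctx)
          throws IOException, InterruptedException {
        String[] k = key.toString().split("\t");
        String ngram = k[0];
        long ngramFreq = Long.parseLong(k[1]);
        String lastToken = ngram.substring(ngram.lastIndexOf(' ') + 1);

        long headFreq = 0;
        long tailFreq = 0;
        for (Text value : values) {
          String[] v = value.toString().split("\t");
          // the subgram that ends where the ngram ends is the tail; the other is the head
          if (v[0].endsWith(lastToken)) {
            tailFreq = Long.parseLong(v[1]);
          } else {
            headFreq = Long.parseLong(v[1]);
          }
        }

        // 2x2 contingency table for "head followed by tail"
        long k11 = ngramFreq;
        long k12 = Math.max(0L, headFreq - ngramFreq);
        long k21 = Math.max(0L, tailFreq - ngramFreq);
        long k22 = Math.max(0L, totalNgrams - headFreq - tailFreq + ngramFreq);
        double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);

        ctx.write(new Text(ngram), new DoubleWritable(llr));
      }
    }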


> I'm also wondering about the best way to handle input. Line by line
> processing would miss ngrams spanning lines, but full document
> processing with the StandardAnalyzer+ShingleFilter will form ngrams
> across sentence boundaries.
>

These effects are just minor issues: you lose a little bit of signal at
line endings, and you pick up some noise from ngrams that cross
sentence boundaries, but it's fractional compared to your whole set.
Don't try to be too fancy and cram tons of lines together.  If your
data comes in different chunks than just one huge HDFS text file, you
could certainly break it into bigger chunks (10, 100, 1000 lines, maybe)
to reduce the newline error if necessary, but it's probably not needed.
The sentence boundary part gets washed out in the LLR step anyway
(those ngrams will almost always turn out to have a low LLR score).
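
For completeness, the per-chunk shingling itself is only a handful of lines
of Lucene.  This sketch uses the StandardAnalyzer + ShingleFilter combination
mentioned above, though attribute and constructor details shift a bit between
Lucene versions:

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ChunkShingler {
      // Produce word bigrams from one chunk of text (a line, or a few
      // hundred lines glued together - the chunking choice discussed above).
      public static List<String> shingles(String chunk) throws IOException {
        List<String> ngrams = new ArrayList<String>();
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("text", new StringReader(chunk));
        ShingleFilter shingled = new ShingleFilter(ts, 2);  // bigrams
        shingled.setOutputUnigrams(false);                  // emit only the ngrams
        CharTermAttribute term = shingled.addAttribute(CharTermAttribute.class);
        shingled.reset();
        while (shingled.incrementToken()) {
          ngrams.add(term.toString());
        }
        shingled.end();
        shingled.close();
        analyzer.close();
        return ngrams;
      }
    }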

One thing I've found I've had to do sometimes is take some care with the
stop words.  If you don't use stop words at all, you end up with a lot of
relatively high-LLR-scoring ngrams like "up into" and "he would" - in
general, pairings of a relatively rare unigram with a pronoun or
preposition.  Maybe there are other ways of avoiding that, but I've found
you do need to handle the stop words deliberately (though removing them
altogether leads to some weird-looking ngrams if you want to display them
somewhere).
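
One crude way to make that concrete (the class name and the tiny stop list
here are purely illustrative, not a recommendation): keep the stop words in
the shingles so the displayed ngrams still read naturally, but only score
candidates that contain at least two non-stop-word tokens:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;

    public final class StopwordGate {
      // Tiny illustrative stop list - a real one would be much longer.
      private static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList(
          "the", "of", "and", "a", "to", "in", "he", "she", "it", "up", "into", "would"));

      // Keep stop words in the shingles for display, but only pass an ngram
      // on to LLR scoring if it has at least two non-stop-word tokens.
      public static boolean worthScoring(String ngram) {
        int contentTokens = 0;
        for (String token : ngram.split("\\s+")) {
          if (!STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
            contentTokens++;
          }
        }
        return contentTokens >= 2;
      }
    }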


> I'm interested in whether there's a more efficient way to structure
> the M/R passes. It feels a little funny to no-op a whole map cycle. It
> would almost be better if one could chain two reduces together.
>

Beware premature optimization - try this on a nice big monster set on
a real cluster, and see how long it takes.  I have a feeling you'll be
pleasantly surprised.  But even before that - show us a patch; maybe
someone will have some easy low-hanging-fruit optimization tricks.

  -jake
