Jake, thanks for the review, the running narrative, and the comments.
The Analyzer in use should be up to the user, so there will be
flexibility to mess around with lots of alternatives there, but it
would be nice to provide reasonable defaults and include this sort of
discussion in the wiki page for the algorithm. I'll finish up the rest
of the code for it and post a patch to JIRA.

Robin, I'll take a look at the DictionaryVectorizer and see how the
two jobs can work together. I think something like a
SequenceFile<documentId, Text or BytesWritable> makes sense as input
for this job, and it's probably easier to work with than what I had to
whip up to slurp in files whole.
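
Roughly what I'm picturing for writing that input, using Text values
to start (untested sketch; the class name, ids, and documents are just
placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Placeholder driver: writes one record per document,
// key = document id, value = the whole document as Text.
public class DocsToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[0]), Text.class, Text.class);
    try {
      // toy documents, just to show the shape of the input
      writer.append(new Text("doc-1"),
          new Text("it was the best of times it was the worst of times"));
      writer.append(new Text("doc-2"), new Text("call me Ishmael"));
    } finally {
      writer.close();
    }
  }
}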

Does anyone know if there is a stream-based alternative to Text or
BytesWritable?

On Thu, Jan 7, 2010 at 11:46 PM, Jake Mannix <[email protected]> wrote:
> Ok, I lied - I think what you described here is way *faster* than what I
> was doing, because I wasn't starting with the original corpus, I had
> something like google's ngram terabyte data (a massive HDFS file with
> just "ngram ngram-frequency" on each line), which mean I had to do
> a multi-way join (which is where I needed to do a secondary sort by
> value).
>
> Starting with the corpus itself (the case we're talking about) you have
> some nice tricks in here:
>
> On Thu, Jan 7, 2010 at 6:46 PM, Drew Farris <[email protected]> wrote:
>>
>>
>> The output of that map task is something like:
>>
>> k:(n-1)gram v:ngram
>>
>
> This is great right here - it helps you kill two birds with one stone: the
> join
> and the wordcount phases.
>
>
>> k:ngram,ngram-frequency  v:(n-1)gram,(n-1)gram-frequency
>>
>> e.g.:
>> k:the best,1  v:best,2
>> k:best of,1   v:best,2
>> k:best of,1   v:of,2
>> k:of times,1  v:of,2
>> k:the best,1  v:the,1
>> k:of times,1  v:times,1
>>
>
> Yeah, once you're here, you're home free.  This should really be a rather
> quick set of jobs, even on really big data, and even dealing with it as
> text.
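
For that first pass, the mapper I have in mind looks roughly like this
(untested sketch; the whitespace split is just a stand-in for the real
Analyzer/ShingleFilter tokenization):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each ngram under each of its (n-1)gram subgrams, so the reducer
// can count subgram frequencies and regroup by ngram for the next pass.
public class NGramSubgramMapper extends Mapper<Text, Text, Text, Text> {

  private static final int N = 2; // bigrams, as in the example above

  private final Text subgram = new Text();
  private final Text ngram = new Text();

  @Override
  protected void map(Text docId, Text doc, Context context)
      throws IOException, InterruptedException {
    String[] tokens = doc.toString().split("\\s+"); // placeholder tokenizer
    for (int i = 0; i + N <= tokens.length; i++) {
      ngram.set(join(tokens, i, N));

      // leading (n-1)gram, e.g. "the" for "the best"
      subgram.set(join(tokens, i, N - 1));
      context.write(subgram, ngram);

      // trailing (n-1)gram, e.g. "best" for "the best"
      subgram.set(join(tokens, i + 1, N - 1));
      context.write(subgram, ngram);
    }
  }

  private static String join(String[] tokens, int start, int len) {
    StringBuilder sb = new StringBuilder(tokens[start]);
    for (int i = start + 1; i < start + len; i++) {
      sb.append(' ').append(tokens[i]);
    }
    return sb.toString();
  }
}

The reducer under each subgram key can then just count the values it
sees (and the duplicates of each distinct ngram) and re-emit them keyed
by ngram, which I think gives exactly the records in the example above.
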
>
>
>> I'm also wondering about the best way to handle input. Line by line
>> processing would miss ngrams spanning lines, but full document
>> processing with the StandardAnalyzer+ShingleFilter will form ngrams
>> across sentence boundaries.
>>
>
> These effects are just minor issues: you lose a little bit of signal on
> line endings, and you pick up some noise catching ngrams across
> sentence boundaries, but it's fractional compared to your whole set.
> Don't try to be too fancy and cram tons of lines together.  If your
> data comes in different chunks than just one huge HDFS text file, you
> could certainly chunk it into bigger chunks (10, 100, 1000 lines, maybe)
> to reduce the newline error if necessary, but it's probably not needed.
> The sentence boundary part gets washed out in the LLR step anyways
> (because they'll almost always turn out to have a low LLR score).
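
For reference, this is the LLR being talked about, in the entropy form
of Dunning's G^2 test over the 2x2 contingency table for a bigram
(a self-contained sketch of just the math; I still need to check
whether there's an existing utility in Mahout we should reuse instead):

// For a bigram "A B": k11 = count(A B), k12 = count(A, not B),
// k21 = count(not A, B), k22 = count(neither), so the four sum to N.
public class Llr {

  /** Shannon entropy (in nats) of a set of counts. */
  private static double entropy(long... counts) {
    long total = 0;
    for (long c : counts) {
      total += c;
    }
    double h = 0.0;
    for (long c : counts) {
      if (c > 0) {
        double p = (double) c / total;
        h -= p * Math.log(p);
      }
    }
    return h;
  }

  /** LLR = 2 * N * (H(rowSums) + H(colSums) - H(matrix)). */
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    long n = k11 + k12 + k21 + k22;
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * n * (rowEntropy + colEntropy - matrixEntropy);
  }
}
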
>
> What I've found I've had to do sometimes is something with stop words.
> If you don't use a stop word list at all, you end up getting a lot of
> relatively high-LLR ngrams like "up into", "he would", and in general
> pairings of a relatively rare unigram with a pronoun or preposition.
> Maybe there are other ways of avoiding that, but I've found that you do
> need to take some care with the stop words (though removing them
> altogether leads to some weird-looking ngrams if you want to display
> them somewhere).
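
For the default chain I was thinking of putting a StopFilter ahead of
the ShingleFilter, roughly like this (untested sketch against the
Lucene 2.9/3.0 style APIs; if I remember right, ShingleFilter marks
removed positions with its '_' filler token, which may be part of the
weirdness you mean):

import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class ShingleChainDemo {
  public static void main(String[] args) throws Exception {
    String doc = "it was the best of times it was the worst of times";

    TokenStream ts =
        new StandardTokenizer(Version.LUCENE_CURRENT, new StringReader(doc));
    ts = new LowerCaseFilter(ts);
    // enablePositionIncrements=true keeps gaps where stop words were
    // removed; ShingleFilter then fills those gaps with '_' (I think).
    ts = new StopFilter(true, ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    ts = new ShingleFilter(ts, 2); // emit unigrams and bigrams

    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      System.out.println(term.term());
    }
  }
}
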
>
>
>> I'm interested in whether there's a more efficient way to structure
>> the M/R passes. It feels a little funny to no-op a whole map cycle. It
>> would almost be better if one could chain two reduces together.
>>
>
> Beware premature optimization - try this on a nice big monster set on
> a real cluster, and see how long it takes.  I have a feeling you'll be
> pleasantly surprised.  But even before that - show us a patch; maybe
> someone will spot some easy, low-hanging optimizations.
>
>  -jake
>
