Interesting... you have more tokens per line than total lines?

LineRecordReader actually conveys the byte offset of each line as the key in
the mapper, not a true line number. You could still derive a unique per-line
ID, though: count lines locally in each mapper (that count is relative to the
input split) and combine it with the task ID.
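To make that concrete, here is a minimal sketch of the idea. It is not real Hadoop API code; the task-ID string and the way you obtain it are assumptions, and the class name is illustrative. The point is just that a per-mapper counter plus a task/split identifier yields an ID that is unique across the whole job.

```java
// Hypothetical sketch: build a globally unique line ID inside a mapper by
// combining the task/split ID (assumed to be available from the framework)
// with a line counter the mapper increments itself. Names are illustrative,
// not actual Hadoop calls.
public class SplitLineId {
    private long localLine = 0;   // lines seen so far in this split
    private final String taskId;  // e.g. "attempt_..._m_000003" (assumed)

    public SplitLineId(String taskId) {
        this.taskId = taskId;
    }

    // Call once per input line; returns an ID unique across the job,
    // since no two tasks share a task ID.
    public String nextLineId() {
        return taskId + ":" + (localLine++);
    }
}
```

Any downstream phase that needs to distinguish lines can then key on this string instead of a byte offset.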

However, based on what you said, if you don't have enough memory to keep a
set of tokens, you might be running into other problems (like having lines
that cannot fit in memory).

If memory is an issue, you could split Todd's original algorithm into two M/R
phases: one to filter out duplicate tokens per line, and another that does the
word count.
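A rough sketch of the two-phase logic, written as plain Java rather than actual Hadoop mapper/reducer classes (the class and method names here are my own, for illustration). Phase 1 dedups tokens within a single line, so the in-memory set only ever holds one line's worth of tokens; phase 2 is an ordinary count over the deduplicated output, so a token's final count equals the number of distinct lines it appears on, and tokens whose count equals the total line count appear on every line.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch (not real Hadoop code) of the two-phase approach.
public class TwoPhaseCommonTokens {

    // Phase 1 "mapper": one line in, its distinct tokens out.
    // Only one line's tokens are held in memory at a time.
    static Set<String> dedupLine(String line) {
        return new HashSet<>(Arrays.asList(line.trim().split("\\s+")));
    }

    // Phase 2 "reducer": count how many lines each token appeared on.
    static Map<String, Integer> countTokens(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : dedupLine(line)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // A token whose line count equals the total number of lines
    // appears on every line -- the original goal.
    static Set<String> tokensOnEveryLine(List<String> lines) {
        return countTokens(lines).entrySet().stream()
                .filter(e -> e.getValue() == lines.size())
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}
```

In a real job, phase 1's output (token, lineId) pairs would be written to HDFS and phase 2 would be a standard word count over them.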


On Sun, Nov 29, 2009 at 9:07 PM, James R. Leek <le...@llnl.gov> wrote:

>
>  Oh, that's a good idea.  Put the hashset in the mapper rather than the
>> reducer.  Thanks.
>>
> Actually, thinking a little deeper about this, while this will work for my
> smaller examples, the lines get really long for the larger ones.  I'm not
> sure I'll have enough memory to keep all the tokens in a line in one hash
> table.  Although this will do for now, I'd appreciate any additional ideas.
>
> My original idea was to have an array in the reducer that kept line number.
>  So:
>
> Mapper:
>  for each(token):
>   emit(token, linenumber)
>
> Reducer:
>  for linenumber in values:
>   lines[linenumber] = 1;
>
>  count = 0;
>  for existsInThisLine in lines:
>    count += existsInThisLine
>
>  if(count == totalNumberOfLines):
>   result(token, count)
>
> When I found I couldn't get the line number, I tried a set of line positions,
> but the idea is the same.  Of course, it turned out I was getting input
> split positions rather than line positions, so that didn't help.  Anyway, this
> way the size of the array or set is limited by the number of lines rather
> than the number of tokens.
> Jim
>
>
