Oh, that's a good idea. Put the HashSet in the mapper rather than the reducer. Thanks.
Actually, thinking a little more about this: while it will work for my smaller examples, the lines get really long in the larger ones, and I'm not sure I'll have enough memory to keep all of a line's tokens in one hash table. It will do for now, but I'd appreciate any additional ideas.
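
In case it helps to see it spelled out, here is roughly how I read the "HashSet in the mapper" suggestion. This is only a sketch against the new org.apache.hadoop.mapreduce API with a plain whitespace tokenizer; the class name DistinctTokenMapper is made up, and I'm using the byte offset that TextInputFormat hands the mapper as the line's identifier.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistinctTokenMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Dedupe tokens within this line so each (token, line) pair
        // is emitted at most once.
        Set<String> seen = new HashSet<String>();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            if (seen.add(token)) {
                // The input key is the byte offset of the line, used here
                // as the line identifier since real line numbers aren't
                // available.
                context.write(new Text(token), key);
            }
        }
    }
}

The set is recreated on every call to map(), so it only ever holds the distinct tokens of a single line, which is exactly where the memory concern comes in for very long lines.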

My original idea was to have an array in the reducer that kept track of line numbers. So:

Mapper:
  for each token in the line:
    emit(token, lineNumber)

Reducer:
  lines = array of size totalNumberOfLines, initialized to 0
  for lineNumber in values:
    lines[lineNumber] = 1

  count = 0
  for existsInThisLine in lines:
    count += existsInThisLine

  if (count == totalNumberOfLines):
    result(token, count)

When I found I couldn't get the line number, I tried a set of line positions instead, but the idea is the same. Of course, it turned out I was getting input split positions rather than line positions, which didn't help. Anyway, this way the size of the array or set is limited by the number of lines rather than the number of tokens.
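
For what it's worth, here is a rough sketch of the reducer side to go with the pseudocode above, assuming the mapper emits (token, byte offset of the line) as sketched earlier and that the total line count is passed in through a configuration property; the class name AllLinesReducer and the property name "total.lines" are just placeholders.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AllLinesReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private long totalNumberOfLines;

    @Override
    protected void setup(Context context) {
        // "total.lines" is a made-up property name; the total line count
        // would have to be computed beforehand and set on the job config.
        totalNumberOfLines =
                context.getConfiguration().getLong("total.lines", -1);
    }

    @Override
    protected void reduce(Text token, Iterable<LongWritable> offsets,
            Context context) throws IOException, InterruptedException {
        // One entry per distinct line the token appears in, so the set's
        // size is bounded by the number of lines, not the number of tokens.
        Set<Long> lines = new HashSet<Long>();
        for (LongWritable offset : offsets) {
            lines.add(offset.get());
        }
        if (lines.size() == totalNumberOfLines) {
            context.write(token, new LongWritable(lines.size()));
        }
    }
}
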
Jim
