Oh, that's a good idea. Put the hashset in the mapper rather than the
reducer. Thanks.
Actually, thinking a little more about this: while it will work for my
smaller examples, the lines get really long in the larger ones, and I'm
not sure I'll have enough memory to keep all the tokens from a single
line in one hash table. It will do for now, but I'd appreciate any
additional ideas.
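For reference, here's roughly what I understand the mapper-side version
to look like. Just a sketch: the class name is mine, and I'm assuming
the usual TextInputFormat where each value is one line and the key is a
byte offset.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each distinct token at most once per line, so the reducer only
// has to count lines per token instead of deduplicating them itself.
public class DistinctTokenMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Set<String> seen = new HashSet<String>();   // one set per line
        StringTokenizer tok = new StringTokenizer(value.toString());
        while (tok.hasMoreTokens()) {
            String t = tok.nextToken();
            if (seen.add(t)) {                      // first time in this line
                word.set(t);
                context.write(word, ONE);
            }
        }
    }
}

The set only ever holds the distinct tokens of a single line, which is
exactly the memory question above.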
My original idea was to have an array in the reducer, indexed by line
number. So:
Mapper:
    for each(token):
        emit(token, linenumber)

Reducer:
    for linenumber in values:
        lines[linenumber] = 1
    count = 0
    for existsInThisLine in lines:
        count += existsInThisLine
    if (count == totalNumberOfLines):
        result(token, count)
When I found I couldn't get the line number, I tried a set of line
positions instead, but the idea is the same. Of course, it turned out I
was getting input split positions rather than line positions, which
didn't help. Anyway, this way the size of the array or set is limited
by the number of lines rather than the number of tokens.
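In Hadoop terms, the reducer side of that idea would look roughly like
this. Again only a sketch: the class name and the "total.lines"
configuration key are made up, and I'm assuming the mapper emits some
per-line identifier (for example a byte offset) as the value.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Collects the distinct line identifiers a token was seen on; if that
// count equals the total number of lines, the token occurs in every line.
public class EveryLineReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private long totalNumberOfLines;    // assumed to come in via the job config

    @Override
    protected void setup(Context context) {
        // "total.lines" is a made-up key; it would have to be set when the job is configured
        totalNumberOfLines =
                context.getConfiguration().getLong("total.lines", -1);
    }

    @Override
    protected void reduce(Text token, Iterable<LongWritable> lineIds,
                          Context context)
            throws IOException, InterruptedException {
        Set<Long> lines = new HashSet<Long>();  // grows with the number of lines, not tokens
        for (LongWritable id : lineIds) {
            lines.add(id.get());
        }
        if (lines.size() == totalNumberOfLines) {
            context.write(token, new LongWritable(lines.size()));
        }
    }
}

The set here grows with the number of distinct lines a token appears
on, which is the bound I was after.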
Jim