Oh, that's a good idea. Put the hashset in the mapper rather than the
reducer. Thanks.
Actually, thinking a little more about this: while it will work for my
smaller examples, the lines get really long in the larger ones, and I'm
not sure I'll have enough memory to keep all the tokens from a single
line in one hash table. It will do for now, but I'd appreciate any
additional ideas.
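For reference, here's roughly what I understand the mapper-side version
to look like. Just a sketch: the class name is mine, and I'm assuming
the usual TextInputFormat where each value is one line and the key is a
byte offset.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each distinct token at most once per line, so the reducer only
// has to count lines per token instead of deduplicating them itself.
public class DistinctTokenMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Set<String> seen = new HashSet<String>();   // one set per line
        StringTokenizer tok = new StringTokenizer(value.toString());
        while (tok.hasMoreTokens()) {
            String t = tok.nextToken();
            if (seen.add(t)) {                      // first time in this line
                word.set(t);
                context.write(word, ONE);
            }
        }
    }
}

The set only ever holds the distinct tokens of a single line, which is
exactly the memory question above.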
My original idea was to have an array in the reducer, indexed by line
number. So:
Mapper:
    for each(token):
        emit(token, linenumber)

Reducer:
    for linenumber in values:
        lines[linenumber] = 1
    count = 0
    for existsInThisLine in lines:
        count += existsInThisLine
    if (count == totalNumberOfLines):
        result(token, count)
When I found I couldn't get the line number, I tried a set of line
positions instead, but the idea is the same. Of course, it turned out I
was getting input split positions rather than line positions, which
didn't help. Anyway, this way the size of the array or set is limited
by the number of lines rather than the number of tokens.
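In Hadoop terms, the reducer side of that idea would look roughly like
this. Again only a sketch: the class name and the "total.lines"
configuration key are made up, and I'm assuming the mapper emits some
per-line identifier (for example a byte offset) as the value.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Collects the distinct line identifiers a token was seen on; if that
// count equals the total number of lines, the token occurs in every line.
public class EveryLineReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private long totalNumberOfLines;    // assumed to come in via the job config

    @Override
    protected void setup(Context context) {
        // "total.lines" is a made-up key; it would have to be set when the job is configured
        totalNumberOfLines =
                context.getConfiguration().getLong("total.lines", -1);
    }

    @Override
    protected void reduce(Text token, Iterable<LongWritable> lineIds,
                          Context context)
            throws IOException, InterruptedException {
        Set<Long> lines = new HashSet<Long>();  // grows with the number of lines, not tokens
        for (LongWritable id : lineIds) {
            lines.add(id.get());
        }
        if (lines.size() == totalNumberOfLines) {
            context.write(token, new LongWritable(lines.size()));
        }
    }
}

The set here grows with the number of distinct lines a token appears
on, which is the bound I was after.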
Jim