Identifying lines in map()

James R. Leek Sun, 29 Nov 2009 17:01:06 -0800

I want to use hadoop to discover if there is any token that appears in every line of a file. I thought that this should be pretty straightforward, but I'm having a heck of a time with it. (I'm pretty new to hadoop. I've been using it for about two weeks.)

My original idea was to have the mapper produce every token as the key, with the line number as the value. But I couldn't find any InputFormat that would give me line numbers.

However, it seemed that FileInputFormat would give me the position in the file as the key, and the line as the value. I assumed that the key would be the position in the file of the beginning of the line. With that I could have the token be the key and the line position be the value, and use a hash table in the reducer to determine if the token appeared in every line. However, I found that it actually seems to give the position of the input split. I figured this out because, rather than getting 50,000 unique keys to the mapper (the number of lines in the file), I was getting 220 unique keys (the number of mappers/input splits).
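
For reference, the intended approach can be sketched as a plain in-memory Python simulation. This is not Hadoop code and uses no Hadoop API. The names map_line and tokens_in_every_line are hypothetical. It assumes each line is keyed by the byte offset of its first character, that tokens are whitespace-separated, and that the reduce step knows the total number of lines (in a real job that count would have to come from somewhere else, for example a counter or a separate pass).

```python
from collections import defaultdict


def map_line(offset, line):
    """Map step: emit (token, line offset) once per distinct token on the line."""
    for token in set(line.split()):
        yield token, offset


def tokens_in_every_line(text):
    """Return the sorted tokens that occur on every line of text."""
    groups = defaultdict(set)  # token -> set of line offsets (stands in for the shuffle)
    total_lines = 0
    offset = 0
    for line in text.splitlines(keepends=True):
        total_lines += 1
        for token, key in map_line(offset, line):
            groups[token].add(key)
        # Assumption: the key is the byte offset at which the line starts.
        offset += len(line.encode("utf-8"))
    # Reduce step: keep a token only if it was seen at as many distinct
    # offsets as there are lines. Needs total_lines (an assumption, see above).
    return sorted(t for t, offsets in groups.items() if len(offsets) == total_lines)


if __name__ == "__main__":
    print(tokens_in_every_line("a b c\nb c d\nc b e\n"))  # ['b', 'c']
```

The scheme only works if every line gets a distinct key. If the key identifies the input split instead, many lines collapse onto the same value and the per-token count can never reach the number of lines.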


So, what should I do?

Thanks,
Jim
