I want to use Hadoop to find out whether any token appears in every
line of a file. I thought this would be pretty straightforward, but I'm
having a heck of a time with it. (I'm pretty new to Hadoop; I've been
using it for about two weeks.)
My original idea was to have the mapper produce every token as the key,
with the line number as the value. But I couldn't find any InputFormat
that would give me line numbers.
However, it seemed that FileInputFormat would give me the position in
the file as the key and the line as the value. I assumed that the key
would be the position in the file of the beginning of the line. With
that, I could emit the token as the key and the line position as the
value, and use a hash table in the reducer to determine whether the
token appeared in every line. However, I found that the key actually
seems to be the position of the input split. I figured this out because,
rather than 50,000 unique keys reaching the mapper (the number of lines
in the file), I was getting 220 unique keys (the number of
mappers/input splits).
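For concreteness, here is a minimal sketch of the logic I have in mind,
written as plain Java with no Hadoop classes so it runs on its own. The
class and method names (TokenInEveryLine, map, reduce) are made up for
illustration and are not Hadoop APIs. It assumes tokens are separated by
whitespace, that every line gets a unique position key, and that the
reducer somehow knows the total number of lines, which is the part I'm
not sure how to get in a real job.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TokenInEveryLine {

    // "Map" step: emit (token, linePosition) once per distinct token in the line.
    public static List<Map.Entry<String, Long>> map(long linePosition, String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String token : line.trim().split("\\s+")) {
            // Skip the empty string produced by a blank line; skip repeats.
            if (!token.isEmpty() && seen.add(token)) {
                out.add(new AbstractMap.SimpleEntry<>(token, linePosition));
            }
        }
        return out;
    }

    // "Reduce" step: a token is in every line if it was seen at as many
    // distinct line positions as there are lines.
    // ASSUMPTION: totalLines is known to the reducer.
    public static boolean reduce(String token, Collection<Long> positions, long totalLines) {
        return new HashSet<>(positions).size() == totalLines;
    }

    // Local driver standing in for the framework's shuffle/group-by-key.
    public static Set<String> tokensInEveryLine(List<String> lines) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        long position = 0;
        for (String line : lines) {
            for (Map.Entry<String, Long> kv : map(position, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
            // ASSUMPTION: one char per byte and a single '\n' terminator,
            // so each line starts at a distinct position.
            position += line.length() + 1;
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            if (reduce(e.getKey(), e.getValue(), lines.size())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat sat", "the dog ran", "over the hill");
        System.out.println(tokensInEveryLine(lines)); // prints [the]
    }
}
```

This only works if the key handed to the mapper is different for every
line, which is exactly what I'm not seeing.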
So, what should I do?
Thanks,
Jim