I want to use Hadoop to find out whether any token appears in every line of a file. I thought this would be pretty straightforward, but I'm having a heck of a time with it. (I'm pretty new to Hadoop; I've been using it for about two weeks.)

My original idea was to have the mapper produce every token as the key, with the line number as the value. But I couldn't find any InputFormat that would give me line numbers.

However, it seemed that FileInputFormat would give me the position in the file as the key and the line as the value. I assumed the key would be the position in the file of the beginning of the line. With that, I could emit the token as the key and the line position as the value, then use a hash table in the reducer to determine whether the token appeared in every line. However, I found that the key actually seems to be the position of the input split. I figured this out because, rather than getting 50,000 unique keys in the mapper (the number of lines in the file), I was getting 220 unique keys (the number of mappers/input splits).
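
To make the idea concrete, here is a rough sketch of the logic I have in mind, written in plain Java with no Hadoop classes so it can be run on its own. The class and method names (TokenInEveryLine, map, tokensInEveryLine) are made up for illustration, the line offsets are computed by hand assuming single-byte characters and '\n' line endings, and tokens are assumed to be whitespace-separated. It also assumes the total number of lines is known, which a real job would have to obtain separately.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TokenInEveryLine {

    // "Map" step: for one line, record (token -> line offset) pairs.
    // The map of sets stands in for the shuffle that groups values by key.
    static void map(long lineOffset, String line, Map<String, Set<Long>> grouped) {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                grouped.computeIfAbsent(token, k -> new HashSet<>()).add(lineOffset);
            }
        }
    }

    // "Reduce" step: a token qualifies if the set of distinct line offsets
    // it was seen with is as large as the total number of lines.
    static Set<String> tokensInEveryLine(List<String> lines) {
        Map<String, Set<Long>> grouped = new HashMap<>();
        long offset = 0;
        for (String line : lines) {
            map(offset, line, grouped);
            // Assumption: one byte per character plus a single '\n' terminator.
            offset += line.length() + 1;
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<Long>> entry : grouped.entrySet()) {
            if (entry.getValue().size() == lines.size()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the cat sat", "on the mat", "the end");
        System.out.println(tokensInEveryLine(lines));
    }
}
```

The hash set per token is what removes duplicates when a token occurs more than once on the same line, which is why this only works if each line gets its own unique key.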

So, what should I do?

Thanks,
Jim
