Hi James,

Something like the following pseudocode:

Mapper:
  configure:
    set instance variable "seenOnlyMatches" = true
  map:
    save OutputCollector
    if current line doesn't match, set seenOnlyMatches to false
  close:
    output a single record containing the value of seenOnlyMatches (and a null key)
    super.close()

Reducer:
  if any input records are false, output false.
  otherwise output true

Make sense?

-Todd

On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <le...@llnl.gov> wrote:

> I want to use hadoop to discover if there is any token that appears in
> every line of a file. I thought that this should be pretty straightforward,
> but I'm having a heck of a time with it. (I'm pretty new to hadoop. I've
> been using it for about two weeks.)
>
> My original idea was to have the mapper produce every token as the key,
> with the line number as the value. But I couldn't find any InputFormat that
> would give me line numbers.
>
> However, it seemed that FileInputFormat would give me the position in the
> file as the key, and the line as the value. I assume that the key would be
> the position in the file of the beginning of the line. With that I could
> have the token be the key, and the line position as the value, and use a
> hash table in the reducer to determine if the token appeared in every line.
> However, I found that it actually seems to give the position of the input
> split. I figured this out because, rather than getting 50,000 unique keys
> to the mapper (the number of lines in the file), I was getting 220 unique
> keys. (The number of mappers/input splits.)
>
> So, what should I do?
>
> Thanks,
> Jim
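
P.S. In case it helps, here is a rough sketch of the mapper/reducer pseudocode at the top of this reply, written as plain Java that simulates the flow rather than as a real Hadoop job. It deliberately uses no Hadoop classes. The names AllLinesMatchSketch, SplitMapper, reduce and run are made up for illustration. It also assumes that "match" means "a regular expression is found somewhere in the line", which the pseudocode leaves open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java simulation of the pseudocode; no Hadoop classes are used.
// All class and method names here are hypothetical.
class AllLinesMatchSketch {

    // Stands in for one map task: it sees the lines of a single input split.
    static class SplitMapper {
        private final Pattern pattern;
        private boolean seenOnlyMatches = true;   // the "configure" step

        SplitMapper(Pattern pattern) {
            this.pattern = pattern;
        }

        // The "map" step: emits nothing, only updates the flag.
        // Assumption: "match" means the regex is found anywhere in the line.
        void map(String line) {
            if (!pattern.matcher(line).find()) {
                seenOnlyMatches = false;
            }
        }

        // The "close" step: the single record this task outputs.
        boolean close() {
            return seenOnlyMatches;
        }
    }

    // The "Reducer": false if any map task reported false, otherwise true.
    static boolean reduce(List<Boolean> records) {
        for (boolean record : records) {
            if (!record) {
                return false;
            }
        }
        return true;
    }

    // Runs one SplitMapper per split, then a single reduce over their outputs.
    static boolean run(List<List<String>> splits, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<Boolean> records = new ArrayList<>();
        for (List<String> split : splits) {
            SplitMapper mapper = new SplitMapper(pattern);
            for (String line : split) {
                mapper.map(line);
            }
            records.add(mapper.close());
        }
        return reduce(records);
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("foo bar", "bar baz"),
                Arrays.asList("qux bar"));
        System.out.println(run(splits, "\\bbar\\b"));   // prints true
        System.out.println(run(splits, "\\bfoo\\b"));   // prints false
    }
}
```

The point of the design is that each map task emits exactly one record no matter how many lines it reads, so the reducer only has to combine one boolean per split. Note that this checks one given pattern at a time.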