Ah, I misunderstood.

How about this?

mapper:
  for each word in line:
    add word to a set()
  for each word in set:
    emit (word, 1)
  emit (null, 1)

combiner:
  sum up input values (just like word count)

reducer:
  same as the combiner, except you can skip emitting any word whose count is
  less than a count you've already emitted (a word that appears on every
  line has to tie the maximum count, so anything smaller can be dropped)

Then post-process: the "null" key's value is the total line count, and any
key whose count equals that value appeared on every line.
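
In case it's useful, here's roughly what that looks like in Java against the
new (org.apache.hadoop.mapreduce) API. This is an untested sketch: I'm using
an empty Text as a stand-in for the "null" key (a Text key can't literally
be null), and assuming whitespace-separated tokens.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EveryLine {

  // Stand-in for the "null" key. An empty Text sorts before any real
  // token, so its total is the first thing a (single) reducer sees.
  static final Text MARKER = new Text("");
  static final LongWritable ONE = new LongWritable(1);

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Dedupe within the line so each word counts at most once per line.
      Set<String> seen = new HashSet<String>();
      for (String tok : line.toString().split("\\s+")) {
        if (tok.length() > 0) {
          seen.add(tok);
        }
      }
      for (String tok : seen) {
        ctx.write(new Text(tok), ONE);
      }
      ctx.write(MARKER, ONE); // one per line => total line count
    }
  }

  // Plain word-count-style summer; this one is safe as the combiner.
  public static class SumCombiner
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : vals) {
        sum += v.get();
      }
      ctx.write(key, new LongWritable(sum));
    }
  }

  // Same as the combiner plus the pruning: a word whose total is below a
  // total we've already emitted can't be on every line. (Only prune in the
  // reducer -- the combiner sees partial sums, so dropping there loses data.)
  public static class PruningReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    private long maxSeen = 0;

    @Override
    protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : vals) {
        sum += v.get();
      }
      if (sum >= maxSeen) {
        maxSeen = sum;
        ctx.write(key, new LongWritable(sum));
      }
    }
  }
}

The post-process is then just a scan of the reducer output for keys whose
count equals the MARKER key's count.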

If you want to do away with the post-process step, you might be able to do
something clever with a counter: increment it once per line in the map, then
read its value from the reduce task. I've never tried it, but I *think* you
should be able to get at your job's map counter values from the reduce
execution (though not necessarily from the combiner). If you can, you'd know
the total line count in the reducer and could avoid emitting any keys that
don't appear on every line.
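
For what it's worth, the increment half of that is easy; it's the
read-from-the-reduce-task half I can't vouch for. Something like this for
the map side (the "everyline"/"lines" group and counter names are made up),
plus reading it back in the driver after the job completes, which definitely
works:

  // In the mapper, next to the other emits: count lines in a custom counter.
  ctx.getCounter("everyline", "lines").increment(1);

  // In the driver, after job.waitForCompletion(true): read it back.
  long totalLines = job.getCounters()
      .findCounter("everyline", "lines").getValue();

Getting at that aggregated value from inside reduce() is the part I haven't
tried.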

-Todd

On Sun, Nov 29, 2009 at 5:18 PM, James R. Leek <le...@llnl.gov> wrote:

> Thanks Todd, but your code seems to check if a given token exists on every
> line.  (Like the regex example.)  I want to find any tokens that exist on
> every line.  So, given the input:
>
> Amy Sue Fred John
> Jack Joe Sue John
> Alice Bob Fred Sue John
>
> The output should be:
> Sue
> John
>
> because Sue and John appear on every line.  I don't know Sue and John in
> advance.
>
> Thanks,
> Jim
>
> Todd Lipcon wrote:
>
>> Hi James,
>>
>> Something like the following pseudocode:
>>
>> Mapper:
>>  configure:
>>    set instance variable "seenOnlyMatches" = true
>>  map:
>>    save OutputCollector
>>    if current line doesn't match, set seenOnlyMatches to false
>>  close:
>>    output a single record containing the value of seenOnlyMatches (and a
>> null key)
>>    super.close()
>>
>> Reducer:
>>  if any input records are false, output false. otherwise output true
>>
>> Make sense?
>>
>> -Todd
>> On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <le...@llnl.gov> wrote:
>>
>>> I want to use hadoop to discover if there is any token that appears in
>>> every line of a file.  I thought that this should be pretty
>>> straightforward, but I'm having a heck of a time with it.  (I'm pretty
>>> new to hadoop.  I've been using it for about two weeks.)
>>>
>>> My original idea was to have the mapper produce every token as the key,
>>> with the line number as the value.  But I couldn't find any InputFormat
>>> that would give me line numbers.
>>>
>>> However, it seemed that FileInputFormat would give me the position in
>>> the file as the key, and the line as the value.  I assume that the key
>>> would be the position in the file of the beginning of the line.  With
>>> that I could have the token be the key, and the line position as the
>>> value, and use a hash table in the reducer to determine if the token
>>> appeared in every line.  However, I found that it actually seems to give
>>> the position of the input split.  I figured this out because, rather
>>> than getting 50,000 unique keys to the mapper (the number of lines in
>>> the file), I was getting 220 unique keys.  (The number of mappers/input
>>> splits.)
>>>
>>> So, what should I do?
>>>
>>> Thanks,
>>> Jim
