Re: Identifying lines in map()

Todd Lipcon Sun, 29 Nov 2009 17:07:13 -0800

Hi James,

Something like the following pseudocode:


Mapper:
  configure:
    set instance variable "seenOnlyMatches" = true
  map:
    save the OutputCollector in an instance variable (close() isn't passed one)
    if the current line doesn't match, set seenOnlyMatches to false
  close:
    output a single record containing the value of seenOnlyMatches (and a
    null key) through the saved OutputCollector
    super.close()

Reducer:
  if any input record is false, output false; otherwise output true
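
To make the pattern above concrete, here is a minimal sketch in plain Python
that simulates it without Hadoop. It is only an illustration under some
assumptions, not code from a real job: the names MatchAllMapper, reduce_flags,
and run_job are hypothetical, "match" is assumed to mean that one candidate
token appears as a whitespace-separated word in the line, and a plain list
stands in for the saved OutputCollector.

```python
from typing import Iterable, List


class MatchAllMapper:
    """Simulates one map task: configure, then map() per line, then close()."""

    def __init__(self, token: str) -> None:
        # "configure": start out assuming every line matches.
        self.token = token  # hypothetical: the candidate token under test
        self.seen_only_matches = True
        self.output: List[bool] = []  # stand-in for the saved OutputCollector

    def map(self, line: str) -> None:
        # Emit nothing per line; only remember whether a non-match was seen.
        if self.token not in line.split():
            self.seen_only_matches = False

    def close(self) -> None:
        # Emit exactly one record for the whole split.
        self.output.append(self.seen_only_matches)


def reduce_flags(flags: Iterable[bool]) -> bool:
    """Reducer: false if any mapper reported false, otherwise true."""
    return all(flags)


def run_job(splits: List[List[str]], token: str) -> bool:
    """Runs one simulated mapper per input split, then the single reducer."""
    flags: List[bool] = []
    for split in splits:
        mapper = MatchAllMapper(token)
        for line in split:
            mapper.map(line)
        mapper.close()
        flags.extend(mapper.output)
    return reduce_flags(flags)
```

Because each map task emits one record no matter how many lines it reads, the
reducer only sees as many records as there are input splits.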

Make sense?

-Todd

On Sun, Nov 29, 2009 at 5:00 PM, James R. Leek <le...@llnl.gov> wrote:

> I want to use Hadoop to discover if there is any token that appears in
> every line of a file.  I thought that this should be pretty straightforward,
> but I'm having a heck of a time with it.  (I'm pretty new to Hadoop.  I've
> been using it for about two weeks.)
>
> My original idea was to have the mapper produce every token as the key,
> with the line number as the value.  But I couldn't find any InputFormat that
> would give me line numbers.
>
> However, it seemed that FileInputFormat would give me the position in the
> file as the key, and the line as the value.  I assume that the key would be
> the position in the file of the beginning of the line.  With that I could
> have the token be the key, and the line position as the value, and use a
> hash table in the reducer to determine if the token appeared in every line.
> However, I found that it actually seems to give the position of the input
> split.  I figured this out because, rather than getting 50,000 unique keys
> to the mapper (the number of lines in the file), I was getting 220 unique
> keys.  (The number of mappers/input splits.)
>
> So, what should I do?
>
> Thanks,
> Jim
>
