The behavior you saw in Streaming (a list of <key, value> pairs instead of <key, list of values>) is indeed intentional; it's one of the design differences between Streaming and the Java API. In Streaming, your reducer is responsible for "grouping" the values of the same key itself, whereas in Java the framework does the grouping for you.
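If what you actually want is the <key, list of values> view, your reducer can rebuild it itself. Here is a minimal sketch (mine, not part of the original exchange; it assumes plain tab-separated key/value lines on stdin) using Python's itertools.groupby, which works because of the sorted input described just below:

#!/usr/bin/env python
# Sketch: regroup the sorted "key<TAB>value" lines a Streaming reducer reads
# on stdin back into <key, list of values>.
import sys
from itertools import groupby
from operator import itemgetter

def parse(stream):
    for line in stream:
        key, val = line.rstrip("\n").split("\t", 1)
        yield key, val

for key, group in groupby(parse(sys.stdin), key=itemgetter(0)):
    values = [val for _, val in group]
    # for the input in your question this prints e.g. "the<TAB>1 1 1"
    print key + "\t" + " ".join(values)

Once you have the list in hand you can sum, count, or do anything else the Iterator of values in a Java reducer would let you do.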
However, the input to your reducer is still sorted (and partitioned) on the key, so all key/value pairs with the same key arrive at your reducer in one contiguous chunk. Your reducer only needs a last_key variable to notice when it has moved on to a new key. In Python, a reducer that sums up all the values of a key looks like this:

#!/usr/bin/env python
import sys

(last_key, sum) = (None, 0.0)
for line in sys.stdin:
    (key, val) = line.split("\t")
    # a new key means we have seen every value of the previous one
    if last_key and last_key != key:
        print last_key + "\t" + str(sum)
        sum = 0.0
    last_key = key
    sum += float(val)
# flush the final key
print last_key + "\t" + str(sum)

Streaming is covered in all three upcoming Hadoop books. The above is an example from mine ;) http://www.manning.com/lam/ . Tom White has the definitive guide from O'Reilly: http://www.hadoopbook.com/ . Jason has http://www.apress.com/book/view/9781430219422 .

On Tue, May 12, 2009 at 7:55 PM, Alan Drew <drewsk...@yahoo.com> wrote:
>
> Hi,
>
> I have a question about the <key, values> that the reducer gets in Hadoop
> Streaming.
>
> I wrote simple mapper.sh and reducer.sh script files:
>
> mapper.sh:
>
> #!/bin/bash
>
> while read data
> do
>   # tokenize the line and output <word, 1> pairs
>   echo $data | awk '{token=0; while(++token<=NF) print $token"\t1"}'
> done
>
> reducer.sh:
>
> #!/bin/bash
>
> while read data
> do
>   echo -e $data
> done
>
> The mapper tokenizes a line of input and outputs <word, 1> pairs to
> standard output. The reducer just outputs what it gets from standard input.
>
> I have a simple input file:
>
> cat in the hat
> ate my mat the
>
> I was expecting the final output to be something like:
>
> the   1 1 1
> cat   1
>
> etc.
>
> but instead each word has its own line, which makes me think that
> <key, value> is being given to the reducer and not <key, values>, which is
> the default for normal Hadoop (in Java), right?
>
> the   1
> the   1
> the   1
> cat   1
>
> Is there any way to get <key, values> for the reducer and not a bunch of
> <key, value> pairs? I looked into the -reducer aggregate option, but there
> doesn't seem to be a way to customize what the reducer does with the <key,
> values> other than max/min functions.
>
> Thanks.
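One follow-up on the -reducer aggregate option you mention: as far as I know it only applies the built-in aggregators (sums, max/min, unique counts and the like), so it won't give you a general <key, values> hook, but for a plain word count it is enough. A rough sketch of a mapper written for it (again mine, not from this thread; the "LongValueSum:" prefix tells the aggregate reducer to sum the counts per word):

#!/usr/bin/env python
# Sketch of a mapper for Streaming's built-in aggregate reducer: the
# "LongValueSum:" prefix on each key asks it to sum the values per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print "LongValueSum:" + word + "\t1"

The reduce side is then just -reducer aggregate on the streaming command line; no reducer script is needed.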