Thanks Chuck, I didn't read the post closely and just focused on the commas.
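
To answer the original question directly: Streaming has no built-in way to
hand the reducer a <key, values> record, but the last_key pattern Chuck
shows below extends naturally from summing to collecting. A rough,
untested sketch (I'm calling it group_reducer.py purely for illustration;
it assumes tab-separated <key, value> lines already sorted by key):

#!/usr/bin/env python
# group_reducer.py - untested sketch, not from either book.
# Collapses sorted <key, value> lines into one line per key,
# e.g. "the<TAB>1 1 1". Relies on Streaming delivering all
# records for a given key in one contiguous chunk.

import sys

last_key = None
vals = []

for line in sys.stdin:
    key, val = line.rstrip("\n").split("\t", 1)

    if last_key is not None and key != last_key:
        # the key changed, so every value for last_key has been seen
        print last_key + "\t" + " ".join(vals)
        vals = []

    last_key = key
    vals.append(val)

if last_key is not None:
    # flush the final key
    print last_key + "\t" + " ".join(vals)

On Alan's sample input this should print one line per word, in sorted key
order (ate, cat, hat, ...), with "the" coming out as "the<TAB>1 1 1".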
On Wed, May 13, 2009 at 2:38 PM, Chuck Lam <chuck....@gmail.com> wrote:

> The behavior you saw in Streaming (a list of <key, value> instead of
> <key, list of values>) is indeed intentional, and it's part of the design
> differences between Streaming and Hadoop Java. That is, in Streaming your
> reducer is responsible for "grouping" values of the same key, whereas in
> Java the grouping is done for you.
>
> However, the input to your reducer is still sorted (and partitioned) on
> the key, so all key/value pairs of the same key will arrive at your
> reducer in one contiguous chunk. Your reducer can keep a last_key
> variable to track whether all records of the same key have been read in.
> In Python, a reducer that sums up all values of a key looks like this:
>
> #!/usr/bin/env python
>
> import sys
>
> (last_key, sum) = (None, 0.0)
>
> for line in sys.stdin:
>     (key, val) = line.split("\t")
>
>     if last_key and last_key != key:
>         print last_key + "\t" + str(sum)
>         sum = 0.0
>
>     last_key = key
>     sum += float(val)
>
> print last_key + "\t" + str(sum)
>
> Streaming is covered in all 3 upcoming Hadoop books. The above is an
> example from mine ;) http://www.manning.com/lam/ . Tom White has the
> definitive guide from O'Reilly - http://www.hadoopbook.com/ . Jason has
> http://www.apress.com/book/view/9781430219422
>
> On Tue, May 12, 2009 at 7:55 PM, Alan Drew <drewsk...@yahoo.com> wrote:
>
> > Hi,
> >
> > I have a question about the <key, values> that the reducer gets in
> > Hadoop Streaming.
> >
> > I wrote simple mapper.sh and reducer.sh script files:
> >
> > mapper.sh:
> >
> > #!/bin/bash
> >
> > while read data
> > do
> >   # tokenize the data and output the values <word, 1>
> >   echo $data | awk '{token=0; while(++token<=NF) print $token"\t1"}'
> > done
> >
> > reducer.sh:
> >
> > #!/bin/bash
> >
> > while read data
> > do
> >   echo -e $data
> > done
> >
> > The mapper tokenizes a line of input and outputs <word, 1> pairs to
> > standard output. The reducer just outputs what it gets from standard
> > input.
> >
> > I have a simple input file:
> >
> > cat in the hat
> > ate my mat the
> >
> > I was expecting the final output to be something like:
> >
> > the   1 1 1
> > cat   1
> >
> > etc.
> >
> > Instead, each word has its own line, which makes me think that
> > <key, value> is being given to the reducer and not <key, values>, which
> > is the default for normal Hadoop (in Java), right?
> >
> > the   1
> > the   1
> > the   1
> > cat   1
> >
> > Is there any way to get <key, values> for the reducer and not a bunch
> > of <key, value> pairs? I looked into the -reducer aggregate option, but
> > there doesn't seem to be a way to customize what the reducer does with
> > the <key, values> other than max/min functions.
> >
> > Thanks.
> > --
> > View this message in context:
> > http://www.nabble.com/hadoop-streaming-reducer-values-tp23514523p23514523.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.

--
Alpha Chapters of my book on Hadoop are available:
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com - a community for Hadoop Professionals
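
P.S. For completeness, a job like Alan's would be launched with something
along these lines (the streaming jar path varies by release, and
group_reducer.py is just the sketch above, so treat the exact names as
illustrative):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input in.txt \
    -output out \
    -mapper mapper.sh \
    -reducer group_reducer.py \
    -file mapper.sh \
    -file group_reducer.py

The -file options ship each script out to the task nodes so the mappers
and reducers can execute them.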