Thanks Chuck, I didn't read the post closely and just focused on the commas.
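
To answer the original question directly: Streaming has no built-in way to
hand the reducer a <key, values> record, but the last_key pattern Chuck
shows below extends naturally from summing to collecting. A rough,
untested sketch (I'm calling it group_reducer.py purely for illustration;
it assumes tab-separated <key, value> lines already sorted by key):

#!/usr/bin/env python
# group_reducer.py - untested sketch, not from either book.
# Collapses sorted <key, value> lines into one line per key,
# e.g. "the<TAB>1 1 1". Relies on Streaming delivering all
# records for a given key in one contiguous chunk.

import sys

last_key = None
vals = []

for line in sys.stdin:
    key, val = line.rstrip("\n").split("\t", 1)

    if last_key is not None and key != last_key:
        # the key changed, so every value for last_key has been seen
        print last_key + "\t" + " ".join(vals)
        vals = []

    last_key = key
    vals.append(val)

if last_key is not None:
    # flush the final key
    print last_key + "\t" + " ".join(vals)

On Alan's sample input this should print one line per word, in sorted key
order (ate, cat, hat, ...), with "the" coming out as "the<TAB>1 1 1".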
On Wed, May 13, 2009 at 2:38 PM, Chuck Lam <chuck....@gmail.com> wrote:

> The behavior you saw in Streaming (a list of <key, value> instead of
> <key, list of values>) is indeed intentional, and it's part of the design
> differences between Streaming and Hadoop Java. That is, in Streaming your
> reducer is responsible for "grouping" values of the same key, whereas in
> Java the grouping is done for you.
>
> However, the input to your reducer is still sorted (and partitioned) on
> the key, so all key/value pairs of the same key will arrive at your
> reducer in one contiguous chunk. Your reducer can keep a last_key
> variable to track whether all records of the same key have been read in.
> In Python, a reducer that sums up all values of a key looks like this:
>
> #!/usr/bin/env python
>
> import sys
>
> (last_key, sum) = (None, 0.0)
>
> for line in sys.stdin:
>     (key, val) = line.split("\t")
>
>     if last_key and last_key != key:
>         print last_key + "\t" + str(sum)
>         sum = 0.0
>
>     last_key = key
>     sum += float(val)
>
> print last_key + "\t" + str(sum)
>
> Streaming is covered in all 3 upcoming Hadoop books. The above is an
> example from mine ;) http://www.manning.com/lam/ . Tom White has the
> definitive guide from O'Reilly - http://www.hadoopbook.com/ . Jason has
> http://www.apress.com/book/view/9781430219422
>
> On Tue, May 12, 2009 at 7:55 PM, Alan Drew <drewsk...@yahoo.com> wrote:
>
> > Hi,
> >
> > I have a question about the <key, values> that the reducer gets in
> > Hadoop Streaming.
> >
> > I wrote simple mapper.sh and reducer.sh script files:
> >
> > mapper.sh:
> >
> > #!/bin/bash
> >
> > while read data
> > do
> >   # tokenize the data and output the values <word, 1>
> >   echo $data | awk '{token=0; while(++token<=NF) print $token"\t1"}'
> > done
> >
> > reducer.sh:
> >
> > #!/bin/bash
> >
> > while read data
> > do
> >   echo -e $data
> > done
> >
> > The mapper tokenizes a line of input and outputs <word, 1> pairs to
> > standard output. The reducer just outputs what it gets from standard
> > input.
> >
> > I have a simple input file:
> >
> > cat in the hat
> > ate my mat the
> >
> > I was expecting the final output to be something like:
> >
> > the   1 1 1
> > cat   1
> >
> > etc.
> >
> > Instead, each word has its own line, which makes me think that
> > <key, value> is being given to the reducer and not <key, values>, which
> > is the default for normal Hadoop (in Java), right?
> >
> > the   1
> > the   1
> > the   1
> > cat   1
> >
> > Is there any way to get <key, values> for the reducer and not a bunch
> > of <key, value> pairs? I looked into the -reducer aggregate option, but
> > there doesn't seem to be a way to customize what the reducer does with
> > the <key, values> other than max/min functions.
> >
> > Thanks.
> > --
> > View this message in context:
> > http://www.nabble.com/hadoop-streaming-reducer-values-tp23514523p23514523.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.

--
Alpha Chapters of my book on Hadoop are available:
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com - a community for Hadoop Professionals
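
P.S. For completeness, a job like Alan's would be launched with something
along these lines (the streaming jar path varies by release, and
group_reducer.py is just the sketch above, so treat the exact names as
illustrative):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input in.txt \
    -output out \
    -mapper mapper.sh \
    -reducer group_reducer.py \
    -file mapper.sh \
    -file group_reducer.py

The -file options ship each script out to the task nodes so the mappers
and reducers can execute them.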