The behavior you saw in Streaming (a list of <key, value> pairs instead of <key, list of values>) is indeed intentional; it's one of the design differences between Streaming and the Java API. In Streaming, your reducer is responsible for "grouping" the values of the same key itself, whereas in Java the framework does the grouping for you.
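If what you actually want is the <key, list of values> view, your reducer can rebuild it itself. Here is a minimal sketch (mine, not part of the original exchange; it assumes plain tab-separated key/value lines on stdin) using Python's itertools.groupby, which works because of the sorted input described just below:

#!/usr/bin/env python
# Sketch: regroup the sorted "key<TAB>value" lines a Streaming reducer reads
# on stdin back into <key, list of values>.
import sys
from itertools import groupby
from operator import itemgetter

def parse(stream):
    for line in stream:
        key, val = line.rstrip("\n").split("\t", 1)
        yield key, val

for key, group in groupby(parse(sys.stdin), key=itemgetter(0)):
    values = [val for _, val in group]
    # for the input in your question this prints e.g. "the<TAB>1 1 1"
    print key + "\t" + " ".join(values)

Once you have the list in hand you can sum, count, or do anything else the Iterator of values in a Java reducer would let you do.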
However, the input to your reducer is still sorted (and partitioned) on the key, so all key/value pairs with the same key arrive at your reducer in one contiguous chunk. Your reducer only needs a last_key variable to notice when it has moved on to a new key. In Python, a reducer that sums up all the values of a key looks like this:

#!/usr/bin/env python
import sys

(last_key, sum) = (None, 0.0)
for line in sys.stdin:
    (key, val) = line.split("\t")
    # a new key means we have seen every value of the previous one
    if last_key and last_key != key:
        print last_key + "\t" + str(sum)
        sum = 0.0
    last_key = key
    sum += float(val)
# flush the final key
print last_key + "\t" + str(sum)

Streaming is covered in all three upcoming Hadoop books. The above is an example from mine ;) http://www.manning.com/lam/ . Tom White has the definitive guide from O'Reilly: http://www.hadoopbook.com/ . Jason has http://www.apress.com/book/view/9781430219422 .

On Tue, May 12, 2009 at 7:55 PM, Alan Drew <drewsk...@yahoo.com> wrote:
>
> Hi,
>
> I have a question about the <key, values> that the reducer gets in Hadoop
> Streaming.
>
> I wrote simple mapper.sh and reducer.sh script files:
>
> mapper.sh:
>
> #!/bin/bash
>
> while read data
> do
>   # tokenize the line and output <word, 1> pairs
>   echo $data | awk '{token=0; while(++token<=NF) print $token"\t1"}'
> done
>
> reducer.sh:
>
> #!/bin/bash
>
> while read data
> do
>   echo -e $data
> done
>
> The mapper tokenizes a line of input and outputs <word, 1> pairs to
> standard output. The reducer just outputs what it gets from standard input.
>
> I have a simple input file:
>
> cat in the hat
> ate my mat the
>
> I was expecting the final output to be something like:
>
> the   1 1 1
> cat   1
>
> etc.
>
> but instead each word has its own line, which makes me think that
> <key, value> is being given to the reducer and not <key, values>, which is
> the default for normal Hadoop (in Java), right?
>
> the   1
> the   1
> the   1
> cat   1
>
> Is there any way to get <key, values> for the reducer and not a bunch of
> <key, value> pairs? I looked into the -reducer aggregate option, but there
> doesn't seem to be a way to customize what the reducer does with the <key,
> values> other than max/min functions.
>
> Thanks.
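One follow-up on the -reducer aggregate option you mention: as far as I know it only applies the built-in aggregators (sums, max/min, unique counts and the like), so it won't give you a general <key, values> hook, but for a plain word count it is enough. A rough sketch of a mapper written for it (again mine, not from this thread; the "LongValueSum:" prefix tells the aggregate reducer to sum the counts per word):

#!/usr/bin/env python
# Sketch of a mapper for Streaming's built-in aggregate reducer: the
# "LongValueSum:" prefix on each key asks it to sum the values per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print "LongValueSum:" + word + "\t1"

The reduce side is then just -reducer aggregate on the streaming command line; no reducer script is needed.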