Thanks Subir. "-D stream.mapred.output.field.separator=*" is not an available option; my bad. What I should have done is:

-D stream.map.output.field.separator=*
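For the record, the full corrected command would look something like this
(a sketch: the '*' is quoted so the shell can't glob-expand it, and
stream.num.map.output.key.fields is spelled out even though 1 is already
the default):

hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
  -D stream.map.output.field.separator='*' \
  -D stream.num.map.output.key.fields=1 \
  -D mapred.reduce.tasks=2 \
  -mapper ./map.py \
  -reducer ./reducer.py \
  -file ./map.py \
  -file ./reducer.py \
  -input /user/inputdata \
  -output /user/outputdata \
  -verbose

I think that also explains the odd output below: with the misspelled option
the framework silently fell back to the default tab separator, and since my
mapper output contains no tabs, each whole "key*value" line was treated as
the key. The same key with different values then hashed to different
reducers, which is exactly what I saw.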
On Tue, Feb 28, 2012 at 2:36 PM, Subir S <subir.sasiku...@gmail.com> wrote:
>
> http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs
>
> Read this link; your options below are wrong.
>
> On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath <austi...@gmail.com> wrote:
>
> > When I use more than one reducer in Hadoop Streaming with my custom
> > separator rather than the tab, it looks like the Hadoop shuffling
> > process is not happening as it should.
> >
> > This is the reducer output when I use '\t' to separate the key/value
> > pairs output from the mapper:
> >
> > *output from reducer 1:*
> > 10321,22
> > 23644,37
> > 41231,42
> > 23448,20
> > 12325,39
> > 71234,20
> >
> > *output from reducer 2:*
> > 24123,43
> > 33213,46
> > 11321,29
> > 21232,32
> >
> > The above output is as expected: the first column is the key and the
> > second is the count. There are 10 unique keys; 6 of them are in the
> > output of the first reducer and the remaining 4 in the output of the
> > second.
> >
> > But now I use a custom separator for the key/value pairs output from
> > my mapper. Here I am using '*' as the separator:
> >
> > -D stream.mapred.output.field.separator=*
> > -D mapred.reduce.tasks=2
> >
> > *output from reducer 1:*
> > 10321,5
> > 21232,19
> > 24123,16
> > 33213,28
> > 23644,21
> > 41231,12
> > 23448,18
> > 11321,29
> > 12325,24
> > 71234,9
> >
> > *output from reducer 2:*
> > 10321,17
> > 21232,13
> > 33213,18
> > 23644,16
> > 41231,30
> > 23448,2
> > 24123,27
> > 12325,15
> > 71234,11
> >
> > Now both reducers are getting all the keys, and part of the values go
> > to reducer 1 while the rest go to reducer 2.
> > Why does it behave like this with a custom separator? Shouldn't each
> > reducer get a unique key after the shuffle?
> > I am using Hadoop 0.20.205.0, and below is the command I use to run
> > Hadoop Streaming. Are there more options I should specify for Hadoop
> > Streaming to work properly with a custom separator?
> >
> > hadoop jar
> > $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
> > -D stream.mapred.output.field.separator=*
> > -D mapred.reduce.tasks=2
> > -mapper ./map.py
> > -reducer ./reducer.py
> > -file ./map.py
> > -file ./reducer.py
> > -input /user/inputdata
> > -output /user/outputdata
> > -verbose
> >
> > Any help is much appreciated,
> > Thanks,
> > Austin
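In case it helps anyone searching the archives later, here is a minimal
sketch of the kind of map.py / reducer.py involved (simplified, not my
exact scripts; in particular the field extraction in the mapper, and the
assumption that the reducer's input lines still use the '*' separator,
are illustrative):

map.py:

#!/usr/bin/env python
# Emit "<key>*1" per record, '*' being the separator declared via
# stream.map.output.field.separator.
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    key = line.split()[0]   # assumption: the key is the record's first field
    print('%s*1' % key)

reducer.py:

#!/usr/bin/env python
# Sum counts per key. Streaming hands the reducer the mapper output
# sorted by key, one "key*value" line at a time.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.strip().split('*', 1)   # split on the first '*' only
    if key != current_key:
        if current_key is not None:
            print('%s,%d' % (current_key, count))   # the key,count lines above
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print('%s,%d' % (current_key, count))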