http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs
Read this link; the options you are passing below are wrong.

On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath <austi...@gmail.com> wrote:

> When I am using more than one reducer in Hadoop streaming with my custom
> separator rather than the tab, it looks like the Hadoop shuffling process
> is not happening as it should.
>
> This is the reducer output when I am using '\t' to separate the key/value
> pairs output from the mapper.
>
> *output from reducer 1:*
> 10321,22
> 23644,37
> 41231,42
> 23448,20
> 12325,39
> 71234,20
>
> *output from reducer 2:*
> 24123,43
> 33213,46
> 11321,29
> 21232,32
>
> The above output is as expected: the first column is the key and the
> second is the count. There are 10 unique keys; 6 of them are in the output
> of the first reducer and the remaining 4 in the output of the second.
>
> But now I use a custom separator for the key/value pairs output from my
> mapper. Here I am using '*' as the separator:
>
> -D stream.mapred.output.field.separator=*
> -D mapred.reduce.tasks=2
>
> *output from reducer 1:*
> 10321,5
> 21232,19
> 24123,16
> 33213,28
> 23644,21
> 41231,12
> 23448,18
> 11321,29
> 12325,24
> 71234,9
>
> *output from reducer 2:*
> 10321,17
> 21232,13
> 33213,18
> 23644,16
> 41231,30
> 23448,2
> 24123,27
> 12325,15
> 71234,11
>
> Now both reducers are getting all the keys, with part of the values going
> to reducer 1 and part going to reducer 2.
> Why is it behaving like this when I am using a custom separator? Shouldn't
> each reducer get its own unique keys after the shuffling?
> I am using Hadoop 0.20.205.0 and below is the command that I am using to
> run Hadoop streaming. Are there more options that I should specify for
> Hadoop streaming to work properly with a custom separator?
>
> hadoop jar
> $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
> -D stream.mapred.output.field.separator=*
> -D mapred.reduce.tasks=2
> -mapper ./map.py
> -reducer ./reducer.py
> -file ./map.py
> -file ./reducer.py
> -input /user/inputdata
> -output /user/outputdata
> -verbose
>
> Any help is much appreciated,
> Thanks,
> Austin
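In particular, stream.mapred.output.field.separator is not one of the documented streaming properties; the page above uses stream.map.output.field.separator, with stream.num.map.output.key.fields controlling how many fields make up the key. With an unrecognized property the separator presumably stays at the default tab, and since your '*'-separated mapper output contains no tab, the whole line is taken as the key for partitioning, which would explain why the same keys land on both reducers. A sketch of what the corrected invocation might look like, reusing the jar, scripts and paths from your command:

hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
  -D stream.map.output.field.separator='*' \
  -D stream.num.map.output.key.fields=1 \
  -D mapred.reduce.tasks=2 \
  -mapper ./map.py \
  -reducer ./reducer.py \
  -file ./map.py \
  -file ./reducer.py \
  -input /user/inputdata \
  -output /user/outputdata \
  -verbose

The separator is quoted only to keep the shell from glob-expanding the '*'. With that option set, everything before the first '*' on each mapper output line becomes the key that the partitioner hashes, so each key should end up on exactly one reducer.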