When I use more than one reducer in Hadoop streaming with a custom separator instead of the default tab, it looks like the shuffle/partitioning is not happening the way it should.
This is the reducer output when I use '\t' to separate the key/value pairs emitted by the mapper.

*Output from reducer 1:*

    10321,22
    23644,37
    41231,42
    23448,20
    12325,39
    71234,20

*Output from reducer 2:*

    24123,43
    33213,46
    11321,29
    21232,32

This output is as expected: the first column is the key and the second is the count. There are 10 unique keys, 6 of them end up in the first reducer's output and the remaining 4 in the second reducer's output.

But now I use a custom separator for the key/value pairs emitted by the mapper, '*' in this case:

    -D stream.mapred.output.field.separator=*
    -D mapred.reduce.tasks=2

*Output from reducer 1:*

    10321,5
    21232,19
    24123,16
    33213,28
    23644,21
    41231,12
    23448,18
    11321,29
    12325,24
    71234,9
    * *

*Output from reducer 2:*

    10321,17
    21232,13
    33213,18
    23644,16
    41231,30
    23448,2
    24123,27
    12325,15
    71234,11

Now both reducers receive all of the keys, with part of the values for each key going to reducer 1 and part to reducer 2. Why does it behave like this when I use a custom separator? Shouldn't each reducer still get a unique set of keys after shuffling?

I am using Hadoop 0.20.205.0, and below is the command I use to run Hadoop streaming. Are there more options I should specify for Hadoop streaming to work properly with a custom separator?

    hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
        -D stream.mapred.output.field.separator=* \
        -D mapred.reduce.tasks=2 \
        -mapper ./map.py \
        -reducer ./reducer.py \
        -file ./map.py \
        -file ./reducer.py \
        -input /user/inputdata \
        -output /user/outputdata \
        -verbose

Any help is much appreciated.

Thanks,
Austin
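In case it is relevant, here is roughly what map.py does. This is a simplified sketch, not my exact script: it assumes the key is the first field of each input line and emits one key/count pair per record, where the separator is '\t' in the first run and '*' in the second.

    #!/usr/bin/env python
    # map.py -- simplified sketch, not my exact script.
    # Emits one "key<SEP>1" pair per input record. SEP was '\t' in the
    # first run and '*' in the second run.
    import sys

    SEP = '*'  # '\t' for the default-separator run

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        key = line.split()[0]  # assumption: the key is the first whitespace-separated field
        print(key + SEP + '1')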
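reducer.py, again as a simplified sketch: it sums the values for each key and writes key,count lines, which is the format shown in the outputs above. Whether the reducer should split its input on '\t' or on '*' is part of what I am unsure about, so the SEP below is an assumption.

    #!/usr/bin/env python
    # reducer.py -- simplified sketch, not my exact script.
    # Sums the values per key and writes "key,count" lines.
    import sys

    SEP = '*'  # assumption: same separator the mapper used

    counts = {}
    for line in sys.stdin:
        line = line.strip()
        if not line or SEP not in line:
            continue
        key, value = line.split(SEP, 1)
        counts[key] = counts.get(key, 0) + int(value)

    for key in counts:
        print('%s,%d' % (key, counts[key]))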