When I use more than one reducer in Hadoop streaming with a custom
separator rather than the default tab, it looks like the Hadoop shuffling
process is not happening as it should.

This is the reducer output when I use '\t' to separate the key/value
pairs output from the mapper.

*output from reducer 1:*
10321,22
23644,37
41231,42
23448,20
12325,39
71234,20
*output from reducer 2:*
24123,43
33213,46
11321,29
21232,32

The above output is as expected: the first column is the key and the
second is the count. There are 10 unique keys; 6 of them appear in the
output of the first reducer and the remaining 4 in the output of the second.
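
For context, my map.py and reducer.py are essentially of this shape (a
simplified sketch; the real scripts do more, but the key/value handling is
the same idea, and the key here is assumed to be the first comma-separated
field of the input):

#!/usr/bin/env python
# map.py (sketch): emit one tab-separated key/value pair per input record
import sys
for line in sys.stdin:
    key = line.split(',')[0]
    print(key + '\t' + '1')

#!/usr/bin/env python
# reducer.py (sketch): input arrives sorted by key, so equal keys are
# adjacent; sum the values per key and emit "key,count"
import sys
current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key != current_key:
        if current_key is not None:
            print(current_key + ',' + str(count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print(current_key + ',' + str(count))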

But now I use a custom separator for the key/value pairs output from my
mapper. Here I am using '*' as the separator:
-D stream.mapred.output.field.separator=*
-D mapred.reduce.tasks=2
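
The only change on the mapper side is the character between key and value,
i.e. something like:

print(key + '*' + '1')

with the reducer splitting on '*' instead of '\t'.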

*output from reducer 1:*
10321,5
21232,19
24123,16
33213,28
23644,21
41231,12
23448,18
11321,29
12325,24
71234,9
*output from reducer 2:*
10321,17
21232,13
33213,18
23644,16
41231,30
23448,2
24123,27
12325,15
71234,11

Now both reducers are getting all the keys, and part of each key's count
goes to reducer 1 while the rest goes to reducer 2.
Why does it behave like this when I use a custom separator? Shouldn't each
key end up at exactly one reducer after the shuffling?
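
My understanding of the default partitioning is that the key is hashed and
taken modulo the number of reducers, so the same key should always land on
the same reducer. Roughly this (a sketch in Python; Hadoop's
HashPartitioner uses Java's hashCode(), but the principle is the same):

def partition(key, num_reducers=2):
    # same key -> same hash -> same reducer, every time
    return (hash(key) & 0x7fffffff) % num_reducers

# e.g. every record with key '10321' should go to exactly one reducer:
assert partition('10321') == partition('10321')
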
I am using Hadoop 0.20.205.0, and below is the command I use to run
Hadoop streaming. Are there more options I should specify for Hadoop
streaming to work properly with a custom separator?

hadoop jar \
$HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
-D stream.mapred.output.field.separator=* \
-D mapred.reduce.tasks=2 \
-mapper ./map.py \
-reducer ./reducer.py \
-file ./map.py \
-file ./reducer.py \
-input /user/inputdata \
-output /user/outputdata \
-verbose


Any help is much appreciated,
Thanks,
Austin
