Thanks Subir. "-D stream.mapred.output.field.separator=*" is not an available option; my bad. What I should have done is:

-D stream.map.output.field.separator=*
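For the record, the full corrected command would look something like this
(a sketch: the '*' is quoted so the shell can't glob-expand it, and
stream.num.map.output.key.fields is spelled out even though 1 is already
the default):

hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
  -D stream.map.output.field.separator='*' \
  -D stream.num.map.output.key.fields=1 \
  -D mapred.reduce.tasks=2 \
  -mapper ./map.py \
  -reducer ./reducer.py \
  -file ./map.py \
  -file ./reducer.py \
  -input /user/inputdata \
  -output /user/outputdata \
  -verbose

I think that also explains the odd output below: with the misspelled option
the framework silently fell back to the default tab separator, and since my
mapper output contains no tabs, each whole "key*value" line was treated as
the key. The same key with different values then hashed to different
reducers, which is exactly what I saw.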
On Tue, Feb 28, 2012 at 2:36 PM, Subir S <subir.sasiku...@gmail.com> wrote:
>
> http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs
>
> Read this link; your options below are wrong.
>
> On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath <austi...@gmail.com> wrote:
>
> > When I use more than one reducer in Hadoop Streaming with my custom
> > separator rather than the tab, it looks like the Hadoop shuffling
> > process is not happening as it should.
> >
> > This is the reducer output when I use '\t' to separate the key/value
> > pairs output from the mapper:
> >
> > *output from reducer 1:*
> > 10321,22
> > 23644,37
> > 41231,42
> > 23448,20
> > 12325,39
> > 71234,20
> >
> > *output from reducer 2:*
> > 24123,43
> > 33213,46
> > 11321,29
> > 21232,32
> >
> > The above output is as expected: the first column is the key and the
> > second is the count. There are 10 unique keys; 6 of them are in the
> > output of the first reducer and the remaining 4 in the output of the
> > second.
> >
> > But now I use a custom separator for the key/value pairs output from
> > my mapper. Here I am using '*' as the separator:
> >
> > -D stream.mapred.output.field.separator=*
> > -D mapred.reduce.tasks=2
> >
> > *output from reducer 1:*
> > 10321,5
> > 21232,19
> > 24123,16
> > 33213,28
> > 23644,21
> > 41231,12
> > 23448,18
> > 11321,29
> > 12325,24
> > 71234,9
> >
> > *output from reducer 2:*
> > 10321,17
> > 21232,13
> > 33213,18
> > 23644,16
> > 41231,30
> > 23448,2
> > 24123,27
> > 12325,15
> > 71234,11
> >
> > Now both reducers are getting all the keys, and part of the values go
> > to reducer 1 while the rest go to reducer 2.
> > Why does it behave like this with a custom separator? Shouldn't each
> > reducer get a unique key after the shuffle?
> > I am using Hadoop 0.20.205.0, and below is the command I use to run
> > Hadoop Streaming. Are there more options I should specify for Hadoop
> > Streaming to work properly with a custom separator?
> >
> > hadoop jar
> > $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
> > -D stream.mapred.output.field.separator=*
> > -D mapred.reduce.tasks=2
> > -mapper ./map.py
> > -reducer ./reducer.py
> > -file ./map.py
> > -file ./reducer.py
> > -input /user/inputdata
> > -output /user/outputdata
> > -verbose
> >
> > Any help is much appreciated,
> > Thanks,
> > Austin
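In case it helps anyone searching the archives later, here is a minimal
sketch of the kind of map.py / reducer.py involved (simplified, not my
exact scripts; in particular the field extraction in the mapper, and the
assumption that the reducer's input lines still use the '*' separator,
are illustrative):

map.py:

#!/usr/bin/env python
# Emit "<key>*1" per record, '*' being the separator declared via
# stream.map.output.field.separator.
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    key = line.split()[0]   # assumption: the key is the record's first field
    print('%s*1' % key)

reducer.py:

#!/usr/bin/env python
# Sum counts per key. Streaming hands the reducer the mapper output
# sorted by key, one "key*value" line at a time.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.strip().split('*', 1)   # split on the first '*' only
    if key != current_key:
        if current_key is not None:
            print('%s,%d' % (current_key, count))   # the key,count lines above
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print('%s,%d' % (current_key, count))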