Thanks. It worked. It might be annoying to you, but I am quite new to Java.
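
In case anyone else hits the same NoSuchMethodException, the fix was just declaring the partitioner as a static nested class with a public no-argument constructor. A minimal sketch of the shape it ends up with (the getPartition body is only an assumed example of routing on the leading key field, not my actual code):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class globalSort {
    // Static nested class, so the framework can instantiate it via reflection.
    // A non-static inner class has no usable no-arg constructor, which is what
    // the NoSuchMethodException on globalSort$MOPartition.<init>() was about.
    public static class MOPartition extends Partitioner<Text, Text> {
        public MOPartition() {}

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            // Sketch only: the first tab-separated field of the key ("1", "2", "3", ...)
            // selects the reducer, so each partition value lands in its own output file.
            int partition = Integer.parseInt(key.toString().split("\t")[0]);
            return (partition - 1) % numPartitions;
        }
    }
}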
On Fri, Feb 24, 2012 at 4:14 PM, Joey Echeverria <j...@cloudera.com> wrote:
> It looks like your partitioner is an inner class. Try making it static:
>
> public static class MOPartition extends Partitioner<Text, Text>
> public MOPartition() {}
>
> On Fri, Feb 24, 2012 at 3:48 PM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
> > Hi,
> >
> > I am right now stuck with an issue while extending the Partitioner class:
> >
> > public class MOPartition extends Partitioner<Text, Text>
> > public MOPartition() {}
> >
> > java.lang.RuntimeException: java.lang.NoSuchMethodException:
> > globalSort$MOPartition.<init>()
> >
> > I tried defining an empty constructor but it still didn't help. My JRE
> > version is 1.6.0.26.
> >
> > Can you please suggest what the issue can be?
> >
> > On Mon, Feb 20, 2012 at 4:12 AM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
> >> Thanks Harsh. I will try it and will get back to you.
> >>
> >> On Mon, Feb 20, 2012 at 3:55 AM, Harsh J <ha...@cloudera.com> wrote:
> >>> I do not think you can do it out of the box with streaming, but
> >>> last.fm's Dumbo (highly recommended if you use Python M/R) and its
> >>> add-on Feathers library can apparently do it.
> >>>
> >>> See Erik Forsberg's detailed answer (the second one) on
> >>> http://stackoverflow.com/questions/1626786/generating-separate-output-files-in-hadoop-streaming
> >>> for more.
> >>>
> >>> On Mon, Feb 20, 2012 at 1:57 PM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
> >>> > Thanks for the immediate reply Harsh. I will try using it.
> >>> >
> >>> > By the way, can't we achieve the same goal with Hadoop Streaming (using
> >>> > Python)?
> >>> >
> >>> > On Mon, Feb 20, 2012 at 2:59 AM, Harsh J <ha...@cloudera.com> wrote:
> >>> >> Piyush,
> >>> >>
> >>> >> Yes. Currently the partitioned data is always sorted by (and then
> >>> >> grouped by) keys before the reduce() calls begin.
> >>> >>
> >>> >> On Mon, Feb 20, 2012 at 12:51 PM, Piyush Kansal
> >>> >> <piyush.kan...@gmail.com> wrote:
> >>> >> > Thanks Harsh.
> >>> >> >
> >>> >> > But will it also sort the data as the Partitioner does?
> >>> >> >
> >>> >> > On Sun, Feb 19, 2012 at 10:54 PM, Harsh J <ha...@cloudera.com> wrote:
> >>> >> >> Hi,
> >>> >> >>
> >>> >> >> You would find it easier to use the Java API's MultipleOutputs (and/or
> >>> >> >> MultipleOutputFormat, which works directly on a configured key field)
> >>> >> >> to write each key partition out to its own file.
> >>> >> >>
> >>> >> >> On Mon, Feb 20, 2012 at 7:38 AM, Piyush Kansal
> >>> >> >> <piyush.kan...@gmail.com> wrote:
> >>> >> >> > Hi Friends,
> >>> >> >> >
> >>> >> >> > I have to sort a huge amount of data in the minimum possible time,
> >>> >> >> > probably using partitioning. The key is composed of 3 fields
> >>> >> >> > (partition, text and number).
> >>> >> >> > This is how the partition is defined:
> >>> >> >> >
> >>> >> >> > Partition "1" for range 1-10
> >>> >> >> > Partition "2" for range 11-20
> >>> >> >> > Partition "3" for range 21-30
> >>> >> >> >
> >>> >> >> > I/P file format: partition[tab]text[tab]range-start[tab]range-end
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ cat input1
> >>> >> >> > 1 chr1 1 10
> >>> >> >> > 1 chr1 2 8
> >>> >> >> > 2 chr1 11 18
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ cat input2
> >>> >> >> > 1 chr1 3 7
> >>> >> >> > 2 chr1 12 19
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ cat input3
> >>> >> >> > 3 chr1 22 30
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ cat input4
> >>> >> >> > 3 chr1 22 30
> >>> >> >> > 1 chr1 9 10
> >>> >> >> > 2 chr1 15 16
> >>> >> >> >
> >>> >> >> > Then I ran the following command:
> >>> >> >> >
> >>> >> >> > hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
> >>> >> >> >     -D stream.map.output.field.separator='\t' \
> >>> >> >> >     -D stream.num.map.output.key.fields=3 \
> >>> >> >> >     -D map.output.key.field.separator='\t' \
> >>> >> >> >     -D mapred.text.key.partitioner.options=-k1 \
> >>> >> >> >     -D mapred.reduce.tasks=3 \
> >>> >> >> >     -input /usr/pkansal/kMer2/ip \
> >>> >> >> >     -output /usr/pkansal/kMer2/op \
> >>> >> >> >     -mapper /home/cloudera/kMer2/kMer2Map.py \
> >>> >> >> >     -file /home/cloudera/kMer2/kMer2Map.py \
> >>> >> >> >     -reducer /home/cloudera/kMer2/kMer2Red.py \
> >>> >> >> >     -file /home/cloudera/kMer2/kMer2Red.py
> >>> >> >> >
> >>> >> >> > Both the mapper and reducer scripts just echo each input line:
> >>> >> >> >
> >>> >> >> > import sys
> >>> >> >> >
> >>> >> >> > for line in sys.stdin:
> >>> >> >> >     line = line.strip()
> >>> >> >> >     print "%s" % (line)
> >>> >> >> >
> >>> >> >> > Following is the o/p:
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00000
> >>> >> >> > 2 chr1 12 19
> >>> >> >> > 2 chr1 15 16
> >>> >> >> > 3 chr1 22 30
> >>> >> >> > 3 chr1 22 30
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00001
> >>> >> >> > 1 chr1 2 8
> >>> >> >> > 1 chr1 3 7
> >>> >> >> > 1 chr1 9 10
> >>> >> >> > 2 chr1 11 18
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00002
> >>> >> >> > 1 chr1 1 10
> >>> >> >> > 3 chr1 22 29
> >>> >> >> >
> >>> >> >> > This is not the o/p which I expected. I expected all records with:
> >>> >> >> >
> >>> >> >> > partition 1 in one single file, e.g. part-m-00000
> >>> >> >> > partition 2 in one single file, e.g. part-m-00001
> >>> >> >> > partition 3 in one single file, e.g. part-m-00002
> >>> >> >> >
> >>> >> >> > Can you please suggest if I am doing it the right way?
> >>> >> >> >
> >>> >> >> > --
> >>> >> >> > Regards,
> >>> >> >> > Piyush Kansal
> >>> >> >>
> >>> >> >> --
> >>> >> >> Harsh J
> >>> >> >> Customer Ops. Engineer
> >>> >> >> Cloudera | http://tiny.cloudera.com/about
> >>> >> >
> >>> >> > --
> >>> >> > Regards,
> >>> >> > Piyush Kansal
> >>> >>
> >>> >> --
> >>> >> Harsh J
> >>> >> Customer Ops. Engineer
> >>> >> Cloudera | http://tiny.cloudera.com/about
> >>> >
> >>> > --
> >>> > Regards,
> >>> > Piyush Kansal
> >>>
> >>> --
> >>> Harsh J
> >>> Customer Ops. Engineer
> >>> Cloudera | http://tiny.cloudera.com/about
> >>
> >> --
> >> Regards,
> >> Piyush Kansal
> >
> > --
> > Regards,
> > Piyush Kansal
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434

--
Regards,
Piyush Kansal
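
P.S. For anyone finding this thread in the archives, here is a rough sketch of the MultipleOutputs approach Harsh suggested, assuming the new org.apache.hadoop.mapreduce API; the reducer class name and the "part" + partition file naming are illustrative only:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PartitionFileReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The leading tab-separated field of the key is the partition number,
        // so records for partition "1" go to part1-r-*, "2" to part2-r-*, and so on.
        String partition = key.toString().split("\t")[0];
        for (Text value : values) {
            out.write(key, value, "part" + partition);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}

With this, each distinct leading key field ends up in its own part<N>-r-* file under the job output directory, regardless of how many reduce tasks the job uses.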