Thanks. It worked. It might be annoying to you, but I am quite new to Java.
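
In case anyone else hits the same NoSuchMethodException, the fix was just declaring the partitioner as a static nested class with a public no-argument constructor. A minimal sketch of the shape it ends up with (the getPartition body is only an assumed example of routing on the leading key field, not my actual code):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class globalSort {
    // Static nested class, so the framework can instantiate it via reflection.
    // A non-static inner class has no usable no-arg constructor, which is what
    // the NoSuchMethodException on globalSort$MOPartition.<init>() was about.
    public static class MOPartition extends Partitioner<Text, Text> {
        public MOPartition() {}

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            // Sketch only: the first tab-separated field of the key ("1", "2", "3", ...)
            // selects the reducer, so each partition value lands in its own output file.
            int partition = Integer.parseInt(key.toString().split("\t")[0]);
            return (partition - 1) % numPartitions;
        }
    }
}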
On Fri, Feb 24, 2012 at 4:14 PM, Joey Echeverria <j...@cloudera.com> wrote:
> It looks like your partitioner is an inner class. Try making it static:
>
> public static class MOPartition extends Partitioner<Text, Text>
> public MOPartition() {}
>
> On Fri, Feb 24, 2012 at 3:48 PM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
> > Hi,
> >
> > I am right now stuck with an issue while extending the Partitioner class:
> >
> > public class MOPartition extends Partitioner<Text, Text>
> > public MOPartition() {}
> >
> > java.lang.RuntimeException: java.lang.NoSuchMethodException:
> > globalSort$MOPartition.<init>()
> >
> > I tried defining an empty constructor but it still didn't help. My JRE
> > version is 1.6.0.26.
> >
> > Can you please suggest what the issue can be?
> >
> > On Mon, Feb 20, 2012 at 4:12 AM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
> >> Thanks Harsh. I will try it and will get back to you.
> >>
> >> On Mon, Feb 20, 2012 at 3:55 AM, Harsh J <ha...@cloudera.com> wrote:
> >>> I do not think you can do it out of the box with streaming, but
> >>> last.fm's Dumbo (highly recommended if you use Python M/R) and its
> >>> add-on Feathers library can apparently do it.
> >>>
> >>> See Erik Forsberg's detailed answer (the second one) on
> >>> http://stackoverflow.com/questions/1626786/generating-separate-output-files-in-hadoop-streaming
> >>> for more.
> >>>
> >>> On Mon, Feb 20, 2012 at 1:57 PM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
> >>> > Thanks for the immediate reply Harsh. I will try using it.
> >>> >
> >>> > By the way, can't we achieve the same goal with Hadoop Streaming (using
> >>> > Python)?
> >>> >
> >>> > On Mon, Feb 20, 2012 at 2:59 AM, Harsh J <ha...@cloudera.com> wrote:
> >>> >> Piyush,
> >>> >>
> >>> >> Yes. Currently the partitioned data is always sorted by (and then
> >>> >> grouped by) keys before the reduce() calls begin.
> >>> >>
> >>> >> On Mon, Feb 20, 2012 at 12:51 PM, Piyush Kansal
> >>> >> <piyush.kan...@gmail.com> wrote:
> >>> >> > Thanks Harsh.
> >>> >> >
> >>> >> > But will it also sort the data as the Partitioner does?
> >>> >> >
> >>> >> > On Sun, Feb 19, 2012 at 10:54 PM, Harsh J <ha...@cloudera.com> wrote:
> >>> >> >> Hi,
> >>> >> >>
> >>> >> >> You would find it easier to use the Java API's MultipleOutputs (and/or
> >>> >> >> MultipleOutputFormat, which works directly on a configured key field)
> >>> >> >> to write each key partition out to its own file.
> >>> >> >>
> >>> >> >> On Mon, Feb 20, 2012 at 7:38 AM, Piyush Kansal
> >>> >> >> <piyush.kan...@gmail.com> wrote:
> >>> >> >> > Hi Friends,
> >>> >> >> >
> >>> >> >> > I have to sort a huge amount of data in the minimum possible time,
> >>> >> >> > probably using partitioning. The key is composed of 3 fields
> >>> >> >> > (partition, text and number).
> >>> >> >> > This is how the partition is defined:
> >>> >> >> >
> >>> >> >> > Partition "1" for range 1-10
> >>> >> >> > Partition "2" for range 11-20
> >>> >> >> > Partition "3" for range 21-30
> >>> >> >> >
> >>> >> >> > I/P file format: partition[tab]text[tab]range-start[tab]range-end
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ cat input1
> >>> >> >> > 1 chr1 1 10
> >>> >> >> > 1 chr1 2 8
> >>> >> >> > 2 chr1 11 18
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ cat input2
> >>> >> >> > 1 chr1 3 7
> >>> >> >> > 2 chr1 12 19
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ cat input3
> >>> >> >> > 3 chr1 22 30
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ cat input4
> >>> >> >> > 3 chr1 22 30
> >>> >> >> > 1 chr1 9 10
> >>> >> >> > 2 chr1 15 16
> >>> >> >> >
> >>> >> >> > Then I ran the following command:
> >>> >> >> >
> >>> >> >> > hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
> >>> >> >> >     -D stream.map.output.field.separator='\t' \
> >>> >> >> >     -D stream.num.map.output.key.fields=3 \
> >>> >> >> >     -D map.output.key.field.separator='\t' \
> >>> >> >> >     -D mapred.text.key.partitioner.options=-k1 \
> >>> >> >> >     -D mapred.reduce.tasks=3 \
> >>> >> >> >     -input /usr/pkansal/kMer2/ip \
> >>> >> >> >     -output /usr/pkansal/kMer2/op \
> >>> >> >> >     -mapper /home/cloudera/kMer2/kMer2Map.py \
> >>> >> >> >     -file /home/cloudera/kMer2/kMer2Map.py \
> >>> >> >> >     -reducer /home/cloudera/kMer2/kMer2Red.py \
> >>> >> >> >     -file /home/cloudera/kMer2/kMer2Red.py
> >>> >> >> >
> >>> >> >> > Both the mapper and reducer scripts just echo each input line:
> >>> >> >> >
> >>> >> >> > import sys
> >>> >> >> >
> >>> >> >> > for line in sys.stdin:
> >>> >> >> >     line = line.strip()
> >>> >> >> >     print "%s" % (line)
> >>> >> >> >
> >>> >> >> > Following is the o/p:
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00000
> >>> >> >> > 2 chr1 12 19
> >>> >> >> > 2 chr1 15 16
> >>> >> >> > 3 chr1 22 30
> >>> >> >> > 3 chr1 22 30
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00001
> >>> >> >> > 1 chr1 2 8
> >>> >> >> > 1 chr1 3 7
> >>> >> >> > 1 chr1 9 10
> >>> >> >> > 2 chr1 11 18
> >>> >> >> >
> >>> >> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00002
> >>> >> >> > 1 chr1 1 10
> >>> >> >> > 3 chr1 22 29
> >>> >> >> >
> >>> >> >> > This is not the o/p which I expected. I expected all records with:
> >>> >> >> >
> >>> >> >> > partition 1 in one single file, e.g. part-m-00000
> >>> >> >> > partition 2 in one single file, e.g. part-m-00001
> >>> >> >> > partition 3 in one single file, e.g. part-m-00002
> >>> >> >> >
> >>> >> >> > Can you please suggest if I am doing it the right way?
> >>> >> >> >
> >>> >> >> > --
> >>> >> >> > Regards,
> >>> >> >> > Piyush Kansal
> >>> >> >>
> >>> >> >> --
> >>> >> >> Harsh J
> >>> >> >> Customer Ops. Engineer
> >>> >> >> Cloudera | http://tiny.cloudera.com/about
> >>> >> >
> >>> >> > --
> >>> >> > Regards,
> >>> >> > Piyush Kansal
> >>> >>
> >>> >> --
> >>> >> Harsh J
> >>> >> Customer Ops. Engineer
> >>> >> Cloudera | http://tiny.cloudera.com/about
> >>> >
> >>> > --
> >>> > Regards,
> >>> > Piyush Kansal
> >>>
> >>> --
> >>> Harsh J
> >>> Customer Ops. Engineer
> >>> Cloudera | http://tiny.cloudera.com/about
> >>
> >> --
> >> Regards,
> >> Piyush Kansal
> >
> > --
> > Regards,
> > Piyush Kansal
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434

--
Regards,
Piyush Kansal
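
P.S. For anyone finding this thread in the archives, here is a rough sketch of the MultipleOutputs approach Harsh suggested, assuming the new org.apache.hadoop.mapreduce API; the reducer class name and the "part" + partition file naming are illustrative only:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PartitionFileReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The leading tab-separated field of the key is the partition number,
        // so records for partition "1" go to part1-r-*, "2" to part2-r-*, and so on.
        String partition = key.toString().split("\t")[0];
        for (Text value : values) {
            out.write(key, value, "part" + partition);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}

With this, each distinct leading key field ends up in its own part<N>-r-* file under the job output directory, regardless of how many reduce tasks the job uses.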