Thanks for the immediate reply, Harsh. I will try using it. By the way, can't we achieve the same goal with Hadoop Streaming (using Python)?
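
To make the goal concrete, here is a minimal local sketch (plain Python 3, no Hadoop involved) of the routing I am after: each record goes to a reducer chosen from its first field only, and each reducer's records come out sorted. The names partition_for and route are hypothetical helpers made up for this illustration, and comparing the numeric fields as integers is my own assumption. It only simulates the desired result and says nothing about how Hadoop itself partitions or sorts.

```python
from collections import defaultdict

# Records copied from the input files quoted below
# (tab-separated: partition, text, range-start, range-end).
RECORDS = [
    "1\tchr1\t1\t10", "1\tchr1\t2\t8", "2\tchr1\t11\t18",   # input1
    "1\tchr1\t3\t7", "2\tchr1\t12\t19",                     # input2
    "3\tchr1\t22\t30",                                      # input3
    "3\tchr1\t22\t30", "1\tchr1\t9\t10", "2\tchr1\t15\t16", # input4
]


def partition_for(record, num_reducers=3):
    """Choose a reducer index from the first field only (hypothetical helper)."""
    # Assumption: the partition field is an integer label "1", "2", "3", ...
    label = int(record.split("\t")[0])
    return (label - 1) % num_reducers


def route(records, num_reducers=3):
    """Bucket records per reducer, then sort each bucket by the 3-field key."""
    buckets = defaultdict(list)
    for record in records:
        buckets[partition_for(record, num_reducers)].append(record)

    def sort_key(record):
        partition, text, start, _end = record.split("\t")
        # Assumption: numeric fields are compared as integers, not as text.
        return (int(partition), text, int(start))

    return {index: sorted(buckets[index], key=sort_key) for index in sorted(buckets)}


if __name__ == "__main__":
    for index, rows in route(RECORDS).items():
        print("part-%05d" % index)
        for row in rows:
            print("  " + row)
```

With the sample inputs, every "1" record lands in the first bucket, every "2" record in the second and every "3" record in the third, which is the layout I expected from the streaming job.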

On Mon, Feb 20, 2012 at 2:59 AM, Harsh J <ha...@cloudera.com> wrote:
> Piyush,
>
> Yes. Currently the partitioned data is always sorted by (and then
> grouped by) keys before the reduce() calls begin.
>
> On Mon, Feb 20, 2012 at 12:51 PM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
> > Thanks Harsh.
> >
> > But will it also sort the data as Partitioner does.
> >
> > On Sun, Feb 19, 2012 at 10:54 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Hi,
> >>
> >> You would find it easier to use the Java API's MultipleOutputs (and/or
> >> MultipleOutputFormat, which directly works on a configured key field),
> >> to write each key-partition out in its own file.
> >>
> >> On Mon, Feb 20, 2012 at 7:38 AM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
> >> > Hi Friends,
> >> >
> >> > I have to sort huge amount of data in minimum possible time probably
> >> > using partitioning. The key is composed of 3 fields(partition, text
> >> > and number).
> >> > This is how partition is defined:
> >> >
> >> > Partition "1" for range 1-10
> >> > Partition "2" for range 11-20
> >> > Partition "3" for range 21-30
> >> >
> >> > I/P file format: partition[tab]text[tab]range-start[tab]range-end
> >> >
> >> > [cloudera@localhost kMer2]$ cat input1
> >> >
> >> > 1 chr1 1 10
> >> > 1 chr1 2 8
> >> > 2 chr1 11 18
> >> >
> >> > [cloudera@localhost kMer2]$ cat input2
> >> >
> >> > 1 chr1 3 7
> >> > 2 chr1 12 19
> >> >
> >> > [cloudera@localhost kMer2]$ cat input3
> >> >
> >> > 3 chr1 22 30
> >> >
> >> > [cloudera@localhost kMer2]$ cat input4
> >> >
> >> > 3 chr1 22 30
> >> > 1 chr1 9 10
> >> > 2 chr1 15 16
> >> >
> >> > Then I ran following command:
> >> >
> >> > hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
> >> > -D stream.map.output.field.separator='\t' \
> >> > -D stream.num.map.output.key.fields=3 \
> >> > -D map.output.key.field.separator='\t' \
> >> > -D mapred.text.key.partitioner.options=-k1 \
> >> > -D mapred.reduce.tasks=3 \
> >> > -input /usr/pkansal/kMer2/ip \
> >> > -output /usr/pkansal/kMer2/op \
> >> > -mapper /home/cloudera/kMer2/kMer2Map.py \
> >> > -file /home/cloudera/kMer2/kMer2Map.py \
> >> > -reducer /home/cloudera/kMer2/kMer2Red.py \
> >> > -file /home/cloudera/kMer2/kMer2Red.py
> >> >
> >> > Both mapper and reducer scripts just contain one line of code:
> >> >
> >> > for line in sys.stdin:
> >> >     line = line.strip()
> >> >     print "%s" % (line)
> >> >
> >> > Following is the o/p:
> >> >
> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00000
> >> >
> >> > 2 chr1 12 19
> >> > 2 chr1 15 16
> >> > 3 chr1 22 30
> >> > 3 chr1 22 30
> >> >
> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00001
> >> >
> >> > 1 chr1 2 8
> >> > 1 chr1 3 7
> >> > 1 chr1 9 10
> >> > 2 chr1 11 18
> >> >
> >> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00002
> >> >
> >> > 1 chr1 1 10
> >> > 3 chr1 22 29
> >> >
> >> > This is not the o/p which I expected. I expected all records with:
> >> >
> >> > partition 1 in one single file eg part-m-00000
> >> > partition 2 in one single file eg part-m-00001
> >> > partition 3 in one single file eg part-m-00002
> >> >
> >> > Can you please suggest if I am doing it in a right way?
> >> >
> >> > --
> >> > Regards,
> >> > Piyush Kansal
> >>
> >> --
> >> Harsh J
> >> Customer Ops. Engineer
> >> Cloudera | http://tiny.cloudera.com/about
> >
> > --
> > Regards,
> > Piyush Kansal
>
> --
> Harsh J
> Customer Ops. Engineer
> Cloudera | http://tiny.cloudera.com/about

--
Regards,
Piyush Kansal