Thanks Utkarsh. But I can't find such a function in Hadoop. Moreover, is there any reason why the default partitioning won't work? I mean, if it doesn't work, then why is it even there? Maybe I am missing something?
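For context on why the default does not give per-partition files: with no -partitioner specified, streaming falls back to the default HashPartitioner, and with stream.num.map.output.key.fields=3 the key it hashes is all three fields together, so records that share the first field but differ in the others land on different reducers. That matches the scattered output quoted below. A minimal plain-Java illustration of the effect (it mimics HashPartitioner's getPartition formula on String keys rather than using Hadoop's Text class):

    public class DefaultPartitionDemo {
        // Same formula the default HashPartitioner uses:
        // (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
        static int partition(String key, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            // Three map output keys that all belong to logical partition "1".
            String[] keys = { "1\tchr1\t1", "1\tchr1\t2", "1\tchr1\t9" };
            for (String key : keys) {
                System.out.println(key.replace('\t', ' ')
                        + " -> reducer " + partition(key, 3));
            }
        }
    }

Running it shows the three keys spread across different reducers even though their first field is identical.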
On Sun, Feb 19, 2012 at 10:40 PM, Utkarsh Gupta <utkarsh_gu...@infosys.com> wrote:

> Hi Piyush,
>
> I think you need to override the inbuilt partitioning function.
> You can use a function like (first field of key) % 3.
> This will send all the keys with the same first field to a separate
> reduce process.
> Please correct me if I am wrong.
>
> Thanks
> Utkarsh
>
> From: Piyush Kansal [mailto:piyush.kan...@gmail.com]
> Sent: Monday, February 20, 2012 7:39 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: Query regarding Hadoop Partitioning
>
> Hi Friends,
>
> I have to sort a huge amount of data in the minimum possible time,
> probably using partitioning. The key is composed of 3 fields (partition,
> text and number). This is how the partition is defined:
>
> - Partition "1" for range 1-10
> - Partition "2" for range 11-20
> - Partition "3" for range 21-30
>
> I/P file format: partition[tab]text[tab]range-start[tab]range-end
>
> [cloudera@localhost kMer2]$ cat input1
>     1 chr1 1 10
>     1 chr1 2 8
>     2 chr1 11 18
>
> [cloudera@localhost kMer2]$ cat input2
>     1 chr1 3 7
>     2 chr1 12 19
>
> [cloudera@localhost kMer2]$ cat input3
>     3 chr1 22 30
>
> [cloudera@localhost kMer2]$ cat input4
>     3 chr1 22 30
>     1 chr1 9 10
>     2 chr1 15 16
>
> Then I ran the following command:
>
>     hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
>         -D stream.map.output.field.separator='\t' \
>         -D stream.num.map.output.key.fields=3 \
>         -D map.output.key.field.separator='\t' \
>         -D mapred.text.key.partitioner.options=-k1 \
>         -D mapred.reduce.tasks=3 \
>         -input /usr/pkansal/kMer2/ip \
>         -output /usr/pkansal/kMer2/op \
>         -mapper /home/cloudera/kMer2/kMer2Map.py \
>         -file /home/cloudera/kMer2/kMer2Map.py \
>         -reducer /home/cloudera/kMer2/kMer2Red.py \
>         -file /home/cloudera/kMer2/kMer2Red.py
>
> Both the mapper and reducer scripts just pass their input straight through:
>
>     import sys
>
>     for line in sys.stdin:
>         line = line.strip()
>         print "%s" % (line)
>
> Following is the o/p:
>
> [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00000
>     2 chr1 12 19
>     2 chr1 15 16
>     3 chr1 22 30
>     3 chr1 22 30
>
> [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00001
>     1 chr1 2 8
>     1 chr1 3 7
>     1 chr1 9 10
>     2 chr1 11 18
>
> [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00002
>     1 chr1 1 10
>     3 chr1 22 29
>
> This is not the o/p which I expected. I expected all records with:
>
> - partition 1 in one single file, e.g. part-m-00000
> - partition 2 in one single file, e.g. part-m-00001
> - partition 3 in one single file, e.g. part-m-00002
>
> Can you please suggest if I am doing it the right way?
>
> --
> Regards,
> Piyush Kansal
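For reference, the kind of partitioner Utkarsh describes would look roughly like this against the Java API. This is a minimal sketch only: the class name is invented here, and it assumes the first tab-separated field of the key is an integer id, as in the inputs above.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstFieldPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            // Route by (first field of key) % numPartitions, so every
            // record with the same first field reaches the same reducer.
            String firstField = key.toString().split("\t", 2)[0];
            return Integer.parseInt(firstField) % numPartitions;
        }
    }

For a streaming job, though, custom Java is probably not needed: mapred.text.key.partitioner.options only takes effect when the job also passes -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner. Note too that -k1 in sort-style syntax means "field 1 through the end of the line"; partitioning on just the first field would be -k1,1.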
--
Regards,
Piyush Kansal