Re: Query regarding Hadoop Partitioning

Harsh J Mon, 20 Feb 2012 00:00:34 -0800

Piyush,

Yes. Currently the partitioned data is always sorted by (and then
grouped by) keys before the reduce() calls begin.
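
To make that ordering concrete, here is a minimal plain-Python sketch
(not Hadoop code) of what happens to the sample records quoted below
between map and reduce. It assumes the job is set up so that only the
first key field decides the partition. The names split_key,
toy_partition and shuffle are made up for illustration, and
toy_partition is a hypothetical stand-in for the real hash-based
partitioner. Keys are compared as text here, which is an assumption
about how a streaming job with text keys behaves.

```python
from itertools import groupby

# Sample records from the thread: partition, text, range-start, range-end.
RECORDS = [
    "1\tchr1\t1\t10", "1\tchr1\t2\t8", "2\tchr1\t11\t18",
    "1\tchr1\t3\t7", "2\tchr1\t12\t19",
    "3\tchr1\t22\t30",
    "3\tchr1\t22\t30", "1\tchr1\t9\t10", "2\tchr1\t15\t16",
]

NUM_REDUCERS = 3
NUM_KEY_FIELDS = 3  # mirrors stream.num.map.output.key.fields=3


def split_key(record):
    """Split a record into (key, value) after the first NUM_KEY_FIELDS fields."""
    fields = record.split("\t")
    return "\t".join(fields[:NUM_KEY_FIELDS]), "\t".join(fields[NUM_KEY_FIELDS:])


def toy_partition(key):
    """Hypothetical partitioner that looks only at the first key field.

    This is NOT Hadoop's hash-based logic; it is chosen so that
    partition "1" lands on reducer 0, "2" on reducer 1, and so on.
    """
    return (int(key.split("\t")[0]) - 1) % NUM_REDUCERS


def shuffle(records):
    """Return {reducer: [(key, [values, ...]), ...]} with keys sorted as text."""
    buckets = {r: [] for r in range(NUM_REDUCERS)}
    for record in records:
        key, value = split_key(record)
        buckets[toy_partition(key)].append((key, value))
    grouped = {}
    for reducer, pairs in buckets.items():
        pairs.sort(key=lambda kv: kv[0])  # sort by key within the partition
        grouped[reducer] = [
            (key, [v for _, v in group])  # one group per distinct key
            for key, group in groupby(pairs, key=lambda kv: kv[0])
        ]
    return grouped


if __name__ == "__main__":
    for reducer, groups in sorted(shuffle(RECORDS).items()):
        print("reducer %d" % reducer)
        for key, values in groups:
            print("  %s -> %s" % (key.replace("\t", " "), values))
```

Each reducer sees its keys in sorted order, and records that share a
key (the two identical partition-3 records, for example) reach a single
reduce() call together.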


On Mon, Feb 20, 2012 at 12:51 PM, Piyush Kansal <piyush.kan...@gmail.com> wrote:
> Thanks Harsh.
>
> But will it also sort the data, as the Partitioner does?
>
>
> On Sun, Feb 19, 2012 at 10:54 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi,
>>
>> You would find it easier to use the Java API's MultipleOutputs (and/or
>> MultipleOutputFormat, which directly works on a configured key field),
>> to write each key-partition out in its own file.
>>
>> On Mon, Feb 20, 2012 at 7:38 AM, Piyush Kansal <piyush.kan...@gmail.com>
>> wrote:
>> > Hi Friends,
>> >
>> > I have to sort a huge amount of data in the minimum possible time,
>> > probably using partitioning. The key is composed of 3 fields
>> > (partition, text and number).
>> > This is how the partitions are defined:
>> >
>> > Partition "1" for range 1-10
>> > Partition "2" for range 11-20
>> > Partition "3" for range 21-30
>> >
>> > I/P file format: partition[tab]text[tab]range-start[tab]range-end
>> >
>> > [cloudera@localhost kMer2]$ cat input1
>> >
>> > 1 chr1 1 10
>> > 1 chr1 2 8
>> > 2 chr1 11 18
>> >
>> > [cloudera@localhost kMer2]$ cat input2
>> >
>> > 1 chr1 3 7
>> > 2 chr1 12 19
>> >
>> > [cloudera@localhost kMer2]$ cat input3
>> >
>> > 3 chr1 22 30
>> >
>> > [cloudera@localhost kMer2]$ cat input4
>> >
>> > 3 chr1 22 30
>> > 1 chr1 9 10
>> > 2 chr1 15 16
>> >
>> > Then I ran the following command:
>> >
>> > hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
>> > -D stream.map.output.field.separator='\t' \
>> > -D stream.num.map.output.key.fields=3 \
>> > -D map.output.key.field.separator='\t' \
>> > -D mapred.text.key.partitioner.options=-k1 \
>> > -D mapred.reduce.tasks=3 \
>> > -input /usr/pkansal/kMer2/ip \
>> > -output /usr/pkansal/kMer2/op \
>> > -mapper /home/cloudera/kMer2/kMer2Map.py \
>> > -file /home/cloudera/kMer2/kMer2Map.py \
>> > -reducer /home/cloudera/kMer2/kMer2Red.py \
>> > -file /home/cloudera/kMer2/kMer2Red.py
>> >
>> > Both the mapper and reducer scripts simply echo each input line:
>> >
>> > import sys
>> >
>> > # Identity pass-through: strip surrounding whitespace and echo the line.
>> > for line in sys.stdin:
>> >     line = line.strip()
>> >     print(line)
>> >
>> > The output is as follows:
>> >
>> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00000
>> >
>> > 2 chr1 12 19
>> > 2 chr1 15 16
>> > 3 chr1 22 30
>> > 3 chr1 22 30
>> >
>> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00001
>> >
>> > 1 chr1 2 8
>> > 1 chr1 3 7
>> > 1 chr1 9 10
>> > 2 chr1 11 18
>> >
>> > [cloudera@localhost kMer2]$ hadoop dfs -cat /usr/pkansal/kMer2/op/part-00002
>> >
>> > 1 chr1 1 10
>> > 3 chr1 22 29
>> >
>> > This is not the output I expected. I expected all records with:
>> >
>> > partition 1 in one single file eg part-m-00000
>> > partition 2 in one single file eg part-m-00001
>> > partition 3 in one single file eg part-m-00002
>> >
>> > Can you please tell me whether I am doing this the right way?
>> >
>> > --
>> > Regards,
>> > Piyush Kansal
>> >
>>
>>
>>
>> --
>> Harsh J
>> Customer Ops. Engineer
>> Cloudera | http://tiny.cloudera.com/about
>
>
>
>
> --
> Regards,
> Piyush Kansal
>



-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
