I think there's room to optimize this case. If the script is
a = load '1.txt' as (a0:int, a1:int);
b = order a by $0;

we could optimize it to use only two jobs rather than three.
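
To make the suggestion concrete: as I understand Daniel's explanation below (and
the "New For Each" with casts in the first job of Jeff's plan), the typed load is
effectively rewritten to something like the following before planning. This is
only an illustrative sketch on my part; the alias name a_typed is made up, not
what Pig uses internally:

a = load '1.txt';
-- the implicit foreach Pig inserts to apply the declared types
a_typed = foreach a generate (int)$0 as a0, (int)$1 as a1;
b = order a_typed by a0;

If that cast foreach could be pushed into the map phases of the sampling job and
the sort job, the extra map-only pass (and the intermediate BinStorage write)
would disappear. The casts would then run twice, once in each map phase, but for
a cheap projection/cast that seems likely to be cheaper than a full extra pass
over the data.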



On Thu, May 5, 2011 at 7:47 AM, Daniel Dai <jiany...@yahoo-inc.com> wrote:

> The first job processes any operators before "order by". If there is nothing
> before "order by", there will be only 2 jobs.
>
> If your script is:
>
> a = load '1.txt' as (a0:int, a1:int);
> b = order a by $0;
>
> Pig will insert a "foreach" after "load", so you will have 3 jobs; the
> first job will process that "foreach".
>
> If your script is:
> a = load '1.txt';
> b = order a by $0;
>
> Then you only have two jobs.
>
> Daniel
>
>
> On 05/04/2011 12:14 AM, Jeff Zhang wrote:
>
>> Hi all,
>>
>> I find that an order by operation is split into three MapReduce jobs in
>> Pig, as shown below.
>>
>> As I understand it, only two MapReduce jobs should be enough: the first job
>> is a sample job, and the second job is the real sort job.
>>
>> But from the plan below I see that the first job is a trivial job which only
>> converts the data into Pig's intermediate data format (BinStorage), and the
>> next two jobs use its output as input.
>>
>> I guess maybe this is a performance consideration (Pig's intermediate data
>> format is much more compact), but I doubt whether the three-job plan really
>> performs better than a two-job plan.
>>
>> Has anyone done such a comparison?
>>
>>
>>
>> #--------------------------------------------------
>> # Map Reduce Plan
>> #--------------------------------------------------
>> MapReduce node 1-22
>> Map Plan
>>
>> Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-23
>> |
>> |---New For Each(false,false)[bag] - 1-18
>>     |   |
>>     |   Cast[chararray] - 1-15
>>     |   |
>>     |   |---Project[bytearray][0] - 1-14
>>     |   |
>>     |   Cast[int] - 1-17
>>     |   |
>>     |   |---Project[bytearray][1] - 1-16
>>     |
>>
>>
>> |---Load(hdfs://srwaishdc1nn0001/apps/sq/jianfezhang/mobius_outputs/hadoop-out49:PigStorage) - 1-13--------
>> Global sort: false
>> ----------------
>>
>> MapReduce node 1-25
>> Map Plan
>> Local Rearrange[tuple]{tuple}(false) - 1-29
>> |   |
>> |   Constant(all) - 1-28
>> |
>> |---New For Each(true)[tuple] - 1-27
>>     |   |
>>     |   Project[int][1] - 1-26
>>     |
>>
>>
>> |---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.builtin.BinStorage','100')) - 1-24--------
>> Reduce Plan
>>
>> Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855:org.apache.pig.builtin.BinStorage) - 1-38
>> |
>> |---New For Each(false)[tuple] - 1-37
>>     |   |
>>     |   POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - 1-36
>>     |   |
>>     |   |---Project[tuple][*] - 1-35
>>     |
>>     |---New For Each(false,false)[tuple] - 1-34
>>         |   |
>>         |   Constant(444) - 1-33
>>         |   |
>>         |   RelationToExpressionProject[bag][*] - 1-45
>>         |   |
>>         |   |---Project[tuple][1] - 1-31
>>         |
>>         |---Package[tuple]{chararray} - 1-30--------
>> Global sort: false
>> Secondary sort: true
>> ----------------
>>
>> MapReduce node 1-40
>> Map Plan
>> Local Rearrange[tuple]{int}(false) - 1-41
>> |   |
>> |   Project[int][1] - 1-19
>> |
>>
>> |---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-39--------
>> Reduce Plan
>> Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-21
>> |
>> |---New For Each(true)[tuple] - 1-44
>>     |   |
>>     |   Project[bag][1] - 1-43
>>     |
>>     |---Package[tuple]{int} - 1-42--------
>> Global sort: true
>> Quantile file: hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855
>> ----------------
>>
>>
>>
>


-- 
Best Regards

Jeff Zhang
