The first job processes any operators that come before "order by". If there is nothing before "order by", there will be only 2 jobs.

If your script is:

a = load '1.txt' as (a0:int, a1:int);
b = order a by $0;

Because the load declares typed fields, Pig will insert a "foreach" after "load" to cast them, so you will have 3 jobs; the first job processes the "foreach".

If your script is:

a = load '1.txt';
b = order a by $0;

Then you only have two jobs.
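
To check this yourself, you can ask Pig for the plan with "explain" and count the "MapReduce node" entries. This is only a minimal sketch: it assumes '1.txt' exists and has at least two columns, and the aliases c and d are made up for illustration.

```
-- Typed schema: Pig adds a foreach with casts after the load.
a = load '1.txt' as (a0:int, a1:int);
b = order a by $0;
explain b;

-- No schema: nothing sits between load and order.
c = load '1.txt';
d = order c by $0;
explain d;
```

If the behavior described above holds on your version, the first plan should list three MapReduce nodes and the second one two.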

Daniel

On 05/04/2011 12:14 AM, Jeff Zhang wrote:
Hi all,

I find that an order by operation is split into three MapReduce jobs in
Pig, as shown below.

As I understand it, two MapReduce jobs should be enough: the first is the
sampling job, and the second is the actual sort job.

But from the plan below I see that the first job is a trivial job that only
converts the data into Pig's intermediate data format, and the next two jobs
use its output as input.

I guess this may be a performance consideration (Pig's intermediate data
format is much more compact), but I doubt that three MapReduce jobs perform
better than two.

Has anyone done such a comparison?



#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node 1-22
Map Plan
Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-23
|
|---New For Each(false,false)[bag] - 1-18
     |   |
     |   Cast[chararray] - 1-15
     |   |
     |   |---Project[bytearray][0] - 1-14
     |   |
     |   Cast[int] - 1-17
     |   |
     |   |---Project[bytearray][1] - 1-16
     |
     |---Load(hdfs://srwaishdc1nn0001/apps/sq/jianfezhang/mobius_outputs/hadoop-out49:PigStorage) - 1-13--------
Global sort: false
----------------

MapReduce node 1-25
Map Plan
Local Rearrange[tuple]{tuple}(false) - 1-29
|   |
|   Constant(all) - 1-28
|
|---New For Each(true)[tuple] - 1-27
     |   |
     |   Project[int][1] - 1-26
     |
     |---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.builtin.BinStorage','100')) - 1-24--------
Reduce Plan
Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855:org.apache.pig.builtin.BinStorage) - 1-38
|
|---New For Each(false)[tuple] - 1-37
     |   |
     |   POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - 1-36
     |   |
     |   |---Project[tuple][*] - 1-35
     |
     |---New For Each(false,false)[tuple] - 1-34
         |   |
         |   Constant(444) - 1-33
         |   |
         |   RelationToExpressionProject[bag][*] - 1-45
         |   |
         |   |---Project[tuple][1] - 1-31
         |
         |---Package[tuple]{chararray} - 1-30--------
Global sort: false
Secondary sort: true
----------------

MapReduce node 1-40
Map Plan
Local Rearrange[tuple]{int}(false) - 1-41
|   |
|   Project[int][1] - 1-19
|
|---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-39--------
Reduce Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-21
|
|---New For Each(true)[tuple] - 1-44
     |   |
     |   Project[bag][1] - 1-43
     |
     |---Package[tuple]{int} - 1-42--------
Global sort: true
Quantile file: hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855
----------------


