The first job processes any operators that come before "order by". If there
is nothing before "order by", there will be only 2 jobs.
If your script is:
a = load '1.txt' as (a0:int, a1:int);
b = order a by $0;
Pig will insert a "foreach" after "load" to cast the fields to the declared
types, so you will have 3 jobs; the first job will process that "foreach".
If your script is:
a = load '1.txt';
b = order a by $0;
Then you only have two jobs.
Daniel
On 05/04/2011 12:14 AM, Jeff Zhang wrote:
Hi all,
I find that an order by operation is split into three MapReduce jobs in
Pig, as shown below.
As I understand it, two MapReduce jobs should be enough: the first is the
sampling job, and the second is the actual sort job.
But from the plan below I see that the first job is a trivial job that only
converts the data into Pig's intermediate data format, and the next two jobs
use its output as their input.
I guess this may be a performance consideration (Pig's intermediate data
format is much more compact), but I doubt that three MapReduce jobs perform
better than two.
Has anyone done such a comparison?
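
For readers unfamiliar with the sample-then-sort pattern described above, here is a minimal single-process sketch in Python. It is not Pig's implementation: every function name and default (sample size, partition count, seed) is hypothetical, and it ignores details such as how skewed keys are handled. It only shows why a sampling pass comes before the sort: the sample yields boundary keys, and those boundaries let each record be routed to a key range, so that sorting each range independently produces a globally ordered result.

```python
import bisect
import random


def sample_keys(records, key, sample_size, rng):
    """Sampling pass: draw a random sample of the sort keys."""
    keys = [key(r) for r in records]
    return rng.sample(keys, min(sample_size, len(keys)))


def find_quantiles(sampled_keys, num_partitions):
    """Pick num_partitions - 1 boundary keys from the sorted sample."""
    ordered = sorted(sampled_keys)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def range_partition_sort(records, key, boundaries):
    """Sort pass: route each record to a key range, then sort each range."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        # bisect_right sends equal keys to the same partition every time.
        partitions[bisect.bisect_right(boundaries, key(r))].append(r)
    return [sorted(p, key=key) for p in partitions]


def order_by(records, key, num_partitions=4, sample_size=100, seed=0):
    """Run the sampling pass, then the range-partitioned sort pass."""
    rng = random.Random(seed)
    sampled = sample_keys(records, key, sample_size, rng)
    boundaries = find_quantiles(sampled, num_partitions)
    parts = range_partition_sort(records, key, boundaries)
    # Partitions cover increasing key ranges, so concatenation is sorted.
    return [r for part in parts for r in part]
```

Calling `order_by(rows, key=lambda r: r[1])` on a list of tuples returns the rows ordered by their second field. The quality of the sample affects only how evenly the partitions are filled, not the correctness of the final order.
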
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node 1-22
Map Plan
Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-23
|
|---New For Each(false,false)[bag] - 1-18
| |
| Cast[chararray] - 1-15
| |
| |---Project[bytearray][0] - 1-14
| |
| Cast[int] - 1-17
| |
| |---Project[bytearray][1] - 1-16
|
|---Load(hdfs://srwaishdc1nn0001/apps/sq/jianfezhang/mobius_outputs/hadoop-out49:PigStorage) - 1-13--------
Global sort: false
----------------
MapReduce node 1-25
Map Plan
Local Rearrange[tuple]{tuple}(false) - 1-29
| |
| Constant(all) - 1-28
|
|---New For Each(true)[tuple] - 1-27
| |
| Project[int][1] - 1-26
|
|---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.builtin.BinStorage','100')) - 1-24--------
Reduce Plan
Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855:org.apache.pig.builtin.BinStorage) - 1-38
|
|---New For Each(false)[tuple] - 1-37
| |
| POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - 1-36
| |
| |---Project[tuple][*] - 1-35
|
|---New For Each(false,false)[tuple] - 1-34
| |
| Constant(444) - 1-33
| |
| RelationToExpressionProject[bag][*] - 1-45
| |
| |---Project[tuple][1] - 1-31
|
|---Package[tuple]{chararray} - 1-30--------
Global sort: false
Secondary sort: true
----------------
MapReduce node 1-40
Map Plan
Local Rearrange[tuple]{int}(false) - 1-41
| |
| Project[int][1] - 1-19
|
|---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-39--------
Reduce Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-21
|
|---New For Each(true)[tuple] - 1-44
| |
| Project[bag][1] - 1-43
|
|---Package[tuple]{int} - 1-42--------
Global sort: true
Quantile file: hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855
----------------