Hi all,

I find that an ORDER BY operation is split into three MapReduce jobs in
Pig, as shown in the plan below.

As I understand it, two MapReduce jobs should be enough: the first is the
sample job, and the second is the actual sort job.
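
To make the two-job idea concrete, here is a rough Python sketch of what I mean by "sample job" and "sort job". It is only an illustration under my own assumptions, not Pig's actual code: the function names (sample_job, find_quantiles, sort_job) and the in-memory lists are hypothetical stand-ins for the MapReduce jobs and the HDFS files.

```python
import bisect
import random


def sample_job(records, key, sample_size, seed=0):
    """Job 1 (sample): draw a random sample of the sort keys."""
    rng = random.Random(seed)
    keys = [key(r) for r in records]
    return rng.sample(keys, min(sample_size, len(keys)))


def find_quantiles(sampled_keys, num_reducers):
    """Pick num_reducers - 1 partition boundaries from the sorted sample."""
    ordered = sorted(sampled_keys)
    if not ordered or num_reducers <= 1:
        return []
    return [ordered[(i * len(ordered)) // num_reducers]
            for i in range(1, num_reducers)]


def sort_job(records, key, boundaries):
    """Job 2 (sort): range-partition by the boundaries, sort each partition."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        # Records with key < boundaries[i] land in partition i or earlier.
        partitions[bisect.bisect_right(boundaries, key(r))].append(r)
    # Concatenating the sorted partitions in order gives a globally sorted result.
    return [r for part in partitions for r in sorted(part, key=key)]
```

In this sketch both jobs read the same input, which is where the question of reading the original data twice versus reading a converted copy comes from.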

But from the plan below I see that the first job is a trivial job that only
converts the data into Pig's intermediate data format, and the next two jobs
use its output as their input.

I guess this may be a performance consideration (Pig's intermediate data
format is much more compact). But I doubt whether three MapReduce jobs
actually perform better than two.

Has anyone done such a comparison?



#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node 1-22
Map Plan
Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-23
|
|---New For Each(false,false)[bag] - 1-18
    |   |
    |   Cast[chararray] - 1-15
    |   |
    |   |---Project[bytearray][0] - 1-14
    |   |
    |   Cast[int] - 1-17
    |   |
    |   |---Project[bytearray][1] - 1-16
    |

|---Load(hdfs://srwaishdc1nn0001/apps/sq/jianfezhang/mobius_outputs/hadoop-out49:PigStorage) - 1-13--------
Global sort: false
----------------

MapReduce node 1-25
Map Plan
Local Rearrange[tuple]{tuple}(false) - 1-29
|   |
|   Constant(all) - 1-28
|
|---New For Each(true)[tuple] - 1-27
    |   |
    |   Project[int][1] - 1-26
    |

|---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.builtin.BinStorage','100')) - 1-24--------
Reduce Plan
Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855:org.apache.pig.builtin.BinStorage) - 1-38
|
|---New For Each(false)[tuple] - 1-37
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - 1-36
    |   |
    |   |---Project[tuple][*] - 1-35
    |
    |---New For Each(false,false)[tuple] - 1-34
        |   |
        |   Constant(444) - 1-33
        |   |
        |   RelationToExpressionProject[bag][*] - 1-45
        |   |
        |   |---Project[tuple][1] - 1-31
        |
        |---Package[tuple]{chararray} - 1-30--------
Global sort: false
Secondary sort: true
----------------

MapReduce node 1-40
Map Plan
Local Rearrange[tuple]{int}(false) - 1-41
|   |
|   Project[int][1] - 1-19
|
|---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-39--------
Reduce Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-21
|
|---New For Each(true)[tuple] - 1-44
    |   |
    |   Project[bag][1] - 1-43
    |
    |---Package[tuple]{int} - 1-42--------
Global sort: true
Quantile file: hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855
----------------


-- 
Best Regards

Jeff Zhang
