Hi all, I find that an ORDER BY operation is split into three MapReduce jobs in Pig, as shown in the plan below.
As I understand it, two MapReduce jobs should be enough: the first is the sampling job and the second is the actual sort job. But from the plan below, the first job is a trivial one that only converts the data into Pig's intermediate data format, and the next two jobs use its output as their input. I guess this may be a performance consideration (Pig's intermediate data format is much more compact), but I doubt whether three MapReduce jobs perform better than two. Has anyone done such a comparison?

#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node 1-22
Map Plan
Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-23
|
|---New For Each(false,false)[bag] - 1-18
    |   |
    |   Cast[chararray] - 1-15
    |   |
    |   |---Project[bytearray][0] - 1-14
    |   |
    |   Cast[int] - 1-17
    |   |
    |   |---Project[bytearray][1] - 1-16
    |
    |---Load(hdfs://srwaishdc1nn0001/apps/sq/jianfezhang/mobius_outputs/hadoop-out49:PigStorage) - 1-13--------
Global sort: false
----------------

MapReduce node 1-25
Map Plan
Local Rearrange[tuple]{tuple}(false) - 1-29
|   |
|   Constant(all) - 1-28
|
|---New For Each(true)[tuple] - 1-27
    |   |
    |   Project[int][1] - 1-26
    |
    |---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.builtin.BinStorage','100')) - 1-24--------
Reduce Plan
Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855:org.apache.pig.builtin.BinStorage) - 1-38
|
|---New For Each(false)[tuple] - 1-37
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - 1-36
    |   |
    |   |---Project[tuple][*] - 1-35
    |
    |---New For Each(false,false)[tuple] - 1-34
        |   |
        |   Constant(444) - 1-33
        |   |
        |   RelationToExpressionProject[bag][*] - 1-45
        |   |
        |   |---Project[tuple][1] - 1-31
        |
        |---Package[tuple]{chararray} - 1-30--------
Global sort: false
Secondary sort: true
----------------

MapReduce node 1-40
Map Plan
Local Rearrange[tuple]{int}(false) - 1-41
|   |
|   Project[int][1] - 1-19
|
|---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-39--------
Reduce Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-21
|
|---New For Each(true)[tuple] - 1-44
    |   |
    |   Project[bag][1] - 1-43
    |
    |---Package[tuple]{int} - 1-42--------
Global sort: true
Quantile file: hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855
----------------

--
Best Regards

Jeff Zhang
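
P.S. To make clear what I mean by the two-job version, here is a rough single-process sketch in Python of "sample, then range-partitioned sort". It is only an illustration of the idea and not Pig's actual implementation: the function names (find_quantiles, sample_job, sort_job, order_by) and the default partition count are made up for the example, and the sample size of 100 is only loosely borrowed from the RandomSampleLoader argument in the plan.

```python
import bisect
import random


def find_quantiles(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from the sampled keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def sample_job(keys, sample_size, num_partitions, seed=0):
    """Job 1 (hypothetical): sample the sort keys, compute partition boundaries."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    return find_quantiles(sample, num_partitions)


def sort_job(records, key, boundaries):
    """Job 2 (hypothetical): range-partition by key, sort each partition, concatenate."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        # A key equal to a boundary goes to the partition on its right,
        # so every key in partition i is <= every key in partition i + 1.
        partitions[bisect.bisect_right(boundaries, key(rec))].append(rec)
    result = []
    for part in partitions:  # partitions are already in key-range order
        result.extend(sorted(part, key=key))
    return result


def order_by(records, key, num_partitions=4, sample_size=100):
    """Run the two phases back to back on an in-memory list."""
    boundaries = sample_job([key(r) for r in records], sample_size, num_partitions)
    return sort_job(records, key, boundaries)
```

For example, order_by(rows, key=lambda r: r[1]) sorts (name, value) rows by their second field, the same column the plan sorts on. Note that both phases read the same input, which may be why Pig materializes it once beforehand.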