I think there's room to optimize this case. If the script is

a = load '1.txt' as (a0:int, a1:int);
b = order a by $0;

we can optimize it to only two jobs rather than three.

On Thu, May 5, 2011 at 7:47 AM, Daniel Dai <jiany...@yahoo-inc.com> wrote:

> The first job processes any operators before "order by". If there is nothing
> before "order by", there will be only 2 jobs.
>
> If your script is:
>
> a = load '1.txt' as (a0:int, a1:int);
> b = order a by $0;
>
> Pig will insert a "foreach" after "load", so you will have 3 jobs; the
> first job will process the "foreach".
>
> If your script is:
> a = load '1.txt';
> b = order a by $0;
>
> Then you only have two jobs.
>
> Daniel
>
>
> On 05/04/2011 12:14 AM, Jeff Zhang wrote:
>
>> Hi all,
>>
>> I find that an order by operation is split into three map reduce jobs in
>> Pig, as shown below.
>>
>> As I understand it, two mapreduce jobs are enough: the first is the sample
>> job, and the second is the real sort job.
>>
>> But from the plan below I see that the first job is a trivial job which only
>> converts the data into Pig's intermediate data format, and the next two jobs
>> use its output as their input.
>>
>> I guess this may be a performance consideration (Pig's intermediate data
>> format is much more compact). But I doubt whether three mapreduce jobs
>> perform better than two.
>>
>> Has anyone done such a comparison?
>>
>>
>> #--------------------------------------------------
>> # Map Reduce Plan
>> #--------------------------------------------------
>> MapReduce node 1-22
>> Map Plan
>> Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-23
>> |
>> |---New For Each(false,false)[bag] - 1-18
>>     |   |
>>     |   Cast[chararray] - 1-15
>>     |   |
>>     |   |---Project[bytearray][0] - 1-14
>>     |   |
>>     |   Cast[int] - 1-17
>>     |   |
>>     |   |---Project[bytearray][1] - 1-16
>>     |
>>     |---Load(hdfs://srwaishdc1nn0001/apps/sq/jianfezhang/mobius_outputs/hadoop-out49:PigStorage) - 1-13--------
>> Global sort: false
>> ----------------
>>
>> MapReduce node 1-25
>> Map Plan
>> Local Rearrange[tuple]{tuple}(false) - 1-29
>> |   |
>> |   Constant(all) - 1-28
>> |
>> |---New For Each(true)[tuple] - 1-27
>>     |   |
>>     |   Project[int][1] - 1-26
>>     |
>>     |---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.builtin.BinStorage','100')) - 1-24--------
>> Reduce Plan
>> Store(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855:org.apache.pig.builtin.BinStorage) - 1-38
>> |
>> |---New For Each(false)[tuple] - 1-37
>>     |   |
>>     |   POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - 1-36
>>     |   |
>>     |   |---Project[tuple][*] - 1-35
>>     |
>>     |---New For Each(false,false)[tuple] - 1-34
>>         |   |
>>         |   Constant(444) - 1-33
>>         |   |
>>         |   RelationToExpressionProject[bag][*] - 1-45
>>         |   |
>>         |   |---Project[tuple][1] - 1-31
>>         |
>>         |---Package[tuple]{chararray} - 1-30--------
>> Global sort: false
>> Secondary sort: true
>> ----------------
>>
>> MapReduce node 1-40
>> Map Plan
>> Local Rearrange[tuple]{int}(false) - 1-41
>> |   |
>> |   Project[int][1] - 1-19
>> |
>> |---Load(hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp-405426927:org.apache.pig.builtin.BinStorage) - 1-39--------
>> Reduce Plan
>> Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-21
>> |
>> |---New For Each(true)[tuple] - 1-44
>>     |   |
>>     |   Project[bag][1] - 1-43
>>     |
>>     |---Package[tuple]{int} - 1-42--------
>> Global sort: true
>> Quantile file: hdfs://srwaishdc1nn0001/tmp/temp-485053564/tmp1146107855
>> ----------------
>>
>>
>

--
Best Regards

Jeff Zhang
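
P.S. For anyone following the thread, here is a rough Python sketch of the two-job scheme discussed above: a sampling job that picks quantile boundaries, followed by a range-partitioned sort job. It is only an in-memory illustration of the idea, not Pig's implementation. The function names (sample_job, sort_job, order_by) and their parameters are made up for this example, and the default sample size of 100 is just an assumed value.

```python
import bisect
import random


def sample_job(records, key, num_partitions, sample_size=100, seed=0):
    """Job 1 (sampling): derive partition boundaries from a random sample of keys."""
    rng = random.Random(seed)
    keys = [key(r) for r in records]
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # One boundary between each pair of adjacent partitions.
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]


def sort_job(records, key, boundaries):
    """Job 2 (sort): route each record to a key range, then sort each range."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        # bisect_right maps a key to the index of the range it falls into.
        partitions[bisect.bisect_right(boundaries, key(r))].append(r)
    return [sorted(p, key=key) for p in partitions]


def order_by(records, key, num_partitions=3):
    """Run both jobs; concatenating the ranges in order gives a global sort."""
    boundaries = sample_job(records, key, num_partitions)
    partitions = sort_job(records, key, boundaries)
    return [r for p in partitions for r in p]
```

Usage would look like order_by(rows, key=lambda r: r[0]) for rows loaded as tuples. Because the ranges do not overlap and each one is sorted on its own, no final merge step is needed, which is why a sample pass plus a sort pass should be enough.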