Hi Gang,

Yes, that is what MultiQueryOptimizer addresses. SplitInserter splits the
script into smaller, combinable pieces, and MultiQueryOptimizer then combines
as many splitters and splittees as possible into the same map-reduce job. So
right after SplitInserter you might see more jobs, but you will end up with
fewer. The algorithm for MultiQueryOptimizer is: for every splitter, find as
many combinable splittees as possible and combine them into the same
map-reduce job. You can find more details at
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification
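
To make the "for every splitter, find the combinable splittees" step concrete, here is a minimal sketch in Python (not Pig's Java). It is an illustration only: the Job class, the combinable flag and merge_splittees are hypothetical names, not Pig's API, and the combinability test is a stand-in assumption for the richer rules in the real MultiQueryOptimizer.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    inputs: list                 # names of the jobs whose output this job reads
    combinable: bool = True      # hypothetical flag, not Pig's real criterion
    nested: list = field(default_factory=list)  # splittees folded into this job


def merge_splittees(jobs):
    """Greedily fold combinable splittees into their splitter."""
    removed = set()

    def readers(name):
        # Jobs that still exist and read the output of job `name`.
        return [j for j in jobs if name in j.inputs and j.name not in removed]

    for splitter in jobs:
        if splitter.name in removed:
            continue
        splittees = readers(splitter.name)
        if len(splittees) < 2:
            continue  # output has at most one reader, so this is not a splitter
        for s in splittees:
            # Assumed criterion: flagged combinable, reads only the splitter,
            # and has no readers of its own (keeps the sketch simple).
            if s.combinable and s.inputs == [splitter.name] and not readers(s.name):
                splitter.nested.append(s.name)
                removed.add(s.name)
    return [j for j in jobs if j.name not in removed]


# One splitter with three branches; b3 is marked as not combinable.
jobs = [
    Job("split", []),
    Job("b1", ["split"]),
    Job("b2", ["split"]),
    Job("b3", ["split"], combinable=False),
]
merged = merge_splittees(jobs)
```

In this toy input, four jobs go in; b1 and b2 are nested inside the splitter and b3 stays a separate job, which mirrors the "more jobs after SplitInserter, fewer at the end" behaviour.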

Daniel

Gang Luo wrote:
> Hi Daniel,
> This is a question from long ago, but I have suddenly come up with some
> more thoughts on it. In a query as simple as this:
>
> A = LOAD 'input';
> B = FILTER A BY $1 == 1;
> C = COGROUP A BY $0, B BY $0;
>
> the optimizer will insert a split operator to reuse A. According to the
> source code, a map-reduce job ends when it sees a split and outputs the
> result to A1 and A2, which are then used by two subsequent jobs to process
> B and C. In this case, the first job does nothing meaningful but copy the
> source 'input' twice. Is there some optimization applied here (like the
> MultiQueryOptimizer you mentioned previously)? How?
>
> Since I haven't taken a look at the MultiQueryOptimizer, it would be a
> great help if you could briefly describe how it works. Thanks a lot.
>
> -Gang
>
> ----- Original Message -----
> From: Daniel Dai <jiany...@yahoo-inc.com>
> To: "pig-dev@hadoop.apache.org" <pig-dev@hadoop.apache.org>
> Sent: 2010/7/26 (Monday) 4:58:49 PM
> Subject: Re: split operator
>
> Hi, Gang,
> It is about multiquery optimization. In MRCompiler, we create a
> map-reduce boundary for each split; later, in MultiQueryOptimizer, we
> merge several splits into one map-reduce job. In this map-reduce job, we
> nest several split plans.
>
> Daniel
>
> Gang Luo wrote:
>
>> Hi Daniel,
>> In 4.3.1, the example and figure 6 show this. The last paragraph of 5.1
>> says the split operator maintains a one-tuple buffer for each branch and
>> talks about how to synchronize multiple branches. I do think that is the
>> in-memory split.
>>
>> Here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf
>>
>> -Gang
>>
>> ----- Original Message -----
>> From: Daniel Dai <jiany...@yahoo-inc.com>
>> To: "pig-dev@hadoop.apache.org" <pig-dev@hadoop.apache.org>
>> Sent: 2010/7/26 (Monday) 2:09:25 PM
>> Subject: Re: split operator
>>
>> Hi, Gang,
>> Which part of the paper are you talking about? We don't do in-memory
>> split. We dump the split result to a temporary file and start a new
>> map-reduce job. Split does create a map-reduce boundary (though that is
>> not entirely true; the multiquery optimizer may combine some of these
>> jobs).
>>
>> Daniel
>>
>> Gang Luo wrote:
>>
>>> Hi all,
>>> According to the VLDB 09 paper, the split operator and all its
>>> successive operators reside in memory without any blocking in between.
>>> However, the source code (version 0.7) shows that an MR job is actually
>>> ended when it meets the split operator, and multiple new MR jobs are
>>> created, each representing one branch. This
>>> write-once-read-multiple-times method is different from the in-memory
>>> method mentioned in that paper. Did Pig change the strategy for split,
>>> or is there still an in-memory version of split I didn't discover?
>>>
>>> Thanks,
>>> -Gang