Hi Daniel, This question is from a while ago, but I have had some more thoughts on it. In a query as simple as this:
A = LOAD 'input';
B = FILTER A BY $1 == 1;
C = COGROUP A BY $0, B BY $0;

the optimizer will insert a split operator to reuse A. According to the source code, a map-reduce job ends when it reaches the split and writes its result to A1 and A2, which are then used by two subsequent jobs to process B and C. In this case, the first job does nothing meaningful except copy the source 'input' twice. Is some optimization applied here (like the MultiQueryOptimizer you mentioned previously)? If so, how? Since I haven't looked at the MultiQueryOptimizer, it would be a great help if you could briefly describe how it works.

Thanks a lot.
-Gang

----- Original Message ----
From: Daniel Dai <jiany...@yahoo-inc.com>
To: "pig-dev@hadoop.apache.org" <pig-dev@hadoop.apache.org>
Sent: 2010/7/26 (Mon) 4:58:49 PM
Subject: Re: split operator

Hi, Gang,
It is about multiquery optimization. In MRCompiler, we create a map-reduce boundary for split; later, in MultiQueryOptimizer, we merge several splits into one map-reduce job. In this map-reduce job, we nest several split plans.

Daniel

Gang Luo wrote:
> Hi Daniel,
> in 4.3.1, the example and figure 6 show this. The last paragraph of 5.1 says the
> split operator maintains a one-tuple buffer for each branch and talks about how
> to synchronize multiple branches. I do think that is the in-memory split.
>
> here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf
>
> -Gang
>
> ----- Original Message ----
> From: Daniel Dai <jiany...@yahoo-inc.com>
> To: "pig-dev@hadoop.apache.org" <pig-dev@hadoop.apache.org>
> Sent: 2010/7/26 (Mon) 2:09:25 PM
> Subject: Re: split operator
>
> Hi, Gang,
> Which part of the paper are you talking about? We don't do in-memory split. We
> dump the split result to a temporary file and start a new map-reduce job. Split
> does create a map-reduce boundary (though that is not entirely true: the
> multiquery optimizer may combine some of these jobs).
>
> Daniel
>
> Gang Luo wrote:
>
>> Hi all,
>> according to the VLDB '09 paper, the split operator and all its successive
>> operators reside in memory without any blocking in between. However, the source
>> code (version 0.7) shows that an MR job is actually ended when it meets the
>> split operator, and multiple new MR jobs are created, each representing one
>> branch. This write-once-read-multiple-times method is different from the
>> in-memory method mentioned in that paper. Did Pig change the strategy for
>> split, or is there still an in-memory version of split I didn't discover?
>>
>> Thanks,
>> -Gang
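
As a reader's aid to the thread above, here is a minimal Python sketch of the two behaviours being discussed for the example query: split treated as a map-reduce boundary (the input is copied to temporary outputs A1 and A2 and then re-read), and a merged job in which each row is handed to both nested branches during a single scan. This is a toy model, not Pig's implementation: plain lists stand in for files and jobs, the function names `cogroup`, `run_with_split_boundary` and `run_with_nested_split` are hypothetical, and the `reads` counter only counts rows scanned inside the model.

```python
# Toy model only. Lists stand in for HDFS files; nothing here is Pig code.

def cogroup(a_rows, b_rows):
    """Group both inputs by field $0, keeping one bag per input."""
    groups = {}
    for row in a_rows:
        groups.setdefault(row[0], ([], []))[0].append(row)
    for row in b_rows:
        groups.setdefault(row[0], ([], []))[1].append(row)
    return groups


def run_with_split_boundary(rows):
    """Split ends the first job: copy the input per branch, then re-read it."""
    reads = 0
    a1, a2 = [], []          # stand-ins for the temporary outputs A1 and A2
    for row in rows:         # first job: only copies the input twice
        reads += 1
        a1.append(row)
        a2.append(row)
    b = []
    for row in a2:           # later work: re-read A2 and apply the filter
        reads += 1
        if row[1] == 1:      # FILTER A BY $1 == 1
            b.append(row)
    reads += len(a1)         # later work: re-read A1 for the COGROUP
    return cogroup(a1, b), reads


def run_with_nested_split(rows):
    """Merged job: one scan, each row is passed to both nested branches."""
    reads = 0
    a, b = [], []
    for row in rows:
        reads += 1
        a.append(row)        # branch for A: pass the row through
        if row[1] == 1:      # branch for B: FILTER A BY $1 == 1
            b.append(row)
    return cogroup(a, b), reads
```

Under these assumptions both functions return the same groups, while the first scans every input row three times and the second scans it once, which is the redundant copy Gang asks about.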