Re: split operator

Gang Luo Mon, 23 Aug 2010 16:21:28 -0700

Hi Daniel,
This question is from a while ago, but I have had some more thoughts on it. 
In a query as simple as this:

A = LOAD 'input';
B = FILTER A BY $1 == 1;
C = COGROUP A BY $0, B BY $0;

the optimizer will insert a split operator to reuse A. According to the source 
code, a map-reduce job ends when it reaches the split and writes its result to 
A1 and A2, which are then read by the two subsequent jobs that process B and C. 
In this case, the first job does nothing meaningful except copy the source 
'input' twice. Is there some optimization applied here (like the 
MultiQueryOptimizer you mentioned previously)? If so, how?

Since I haven't looked at the MultiQueryOptimizer, it would be a great help if 
you could briefly describe how it works. Thanks a lot.

-Gang

----- Original Message ----
From: Daniel Dai <[email protected]>
To: "[email protected]" <[email protected]>
Sent: 2010/7/26 (Mon) 4:58:49 PM
Subject: Re: split operator

Hi, Gang,
It is about multiquery optimization. In MRCompiler, we create a
map-reduce boundary for each split; later, in MultiQueryOptimizer, we
merge several splits into one map-reduce job. In that map-reduce job, we
nest the several split plans.

Daniel
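
A minimal sketch of the merged plan described above, written in Python rather
than taken from Pig's Java source. It is a toy model, not Pig's implementation:
the function name run_merged_job and the sample rows are hypothetical. It
evaluates the three-statement query from the follow-up (LOAD, FILTER, COGROUP)
with a single scan of the input, with both branches of the split nested in one
map phase, instead of first copying 'input' to temporary files.

```python
from collections import defaultdict


def run_merged_job(rows):
    """Toy model of one map-reduce job with two nested split branches."""
    # key -> (bag of tuples from A, bag of tuples from B)
    groups = defaultdict(lambda: ([], []))

    # "Map" side: a single read of the input feeds both branches.
    for row in rows:
        key = row[0]
        groups[key][0].append(row)      # branch 1: A passes through unchanged
        if row[1] == 1:                 # branch 2: B = FILTER A BY $1 == 1
            groups[key][1].append(row)

    # "Reduce" side: C = COGROUP A BY $0, B BY $0
    return [(key, bag_a, bag_b) for key, (bag_a, bag_b) in sorted(groups.items())]


rows = [("x", 1), ("x", 2), ("y", 3)]
for key, bag_a, bag_b in run_merged_job(rows):
    print(key, bag_a, bag_b)
```

In this model nothing is written to A1 or A2; whether Pig actually merges a
given plan this way depends on the MultiQueryOptimizer.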

Gang Luo wrote:
> Hi Daniel,
> In 4.3.1, the example and figure 6 show this. The last paragraph of 5.1 says 
> the split operator maintains a one-tuple buffer for each branch and discusses 
> how to synchronize multiple branches. I do think that is the in-memory split.
>
> Here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf
>
>
> -Gang
>
>
>
> ----- Original Message ----
> From: Daniel Dai <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: 2010/7/26 (Mon) 2:09:25 PM
> Subject: Re: split operator
>
> Hi, Gang,
> Which part of the paper are you talking about? We don't do in-memory split. 
> We dump the split result to a temporary file and start a new map-reduce job. 
> Split does create a map-reduce boundary (though that is not entirely true: 
> the multiquery optimizer may combine some of these jobs).
>
> Daniel
>
> Gang Luo wrote:
>  
>> Hi all,
>> According to the VLDB 09 paper, the split operator and all its successive 
>> operators reside in memory without any blocking in between. However, the 
>> source code (version 0.7) shows that an MR job actually ends when it meets 
>> the split operator, and multiple new MR jobs are created, each representing 
>> one branch. This write-once-read-multiple-times method is different from the 
>> in-memory method mentioned in that paper. Has Pig changed the strategy for 
>> split, or is there still an in-memory version of split that I didn't 
>> discover?
>>
>> Thanks,
>> -Gang
