Re: split operator

Daniel Dai Mon, 23 Aug 2010 16:51:34 -0700

Hi, Gang,
Yes, that's what MultiQueryOptimizer addresses. Splitting breaks the
script into smaller, combinable pieces, and MultiQueryOptimizer then
combines as many splitters and splittees as possible into the same
map-reduce job. So after SplitInserter you might see more jobs, but you
will end up with fewer jobs. The algorithm for MultiQueryOptimizer is:
for every splitter, find as many combinable splittees as possible, and
combine them into the same map-reduce job. You can find more details at
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification
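
The loop described above can be sketched roughly as follows. This is a
toy model in Python, not Pig's actual code: the Job class, its map_only
flag, the merge_splittees function, and the rule that only map-only
splittees are combinable are all assumptions made for illustration. It
also handles a single level of splitting only.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Toy stand-in for one map-reduce job (hypothetical, not a Pig class)."""
    name: str
    map_only: bool = True                           # assumed flag: no reduce phase
    splittees: list = field(default_factory=list)   # jobs reading this job's split output
    nested: list = field(default_factory=list)      # names of plans folded into this job


def merge_splittees(jobs, can_combine):
    """For every splitter, fold each combinable splittee into the splitter.

    `can_combine(splitter, splittee)` is a caller-supplied predicate; the
    real combinability rules in Pig are more involved than anything here.
    Returns the jobs that remain after merging.
    """
    absorbed = set()
    for splitter in jobs:
        if splitter.name in absorbed:
            continue
        remaining = []
        for splittee in splitter.splittees:
            if can_combine(splitter, splittee):
                # The splittee's plan becomes a nested plan of the splitter.
                splitter.nested.append(splittee.name)
                absorbed.add(splittee.name)
            else:
                remaining.append(splittee)
        splitter.splittees = remaining
    return [job for job in jobs if job.name not in absorbed]


# One splitter feeding three branches; only the map-only ones are merged
# under the assumed rule.
splitter = Job("load_and_split", splittees=[
    Job("branch_filter"),
    Job("branch_project"),
    Job("branch_group", map_only=False),
])
all_jobs = [splitter] + splitter.splittees
merged = merge_splittees(all_jobs, lambda s, t: t.map_only)
print(len(all_jobs), "jobs before,", len(merged), "jobs after")
```

In this toy run four jobs go in and two come out: the two map-only
branches end up nested inside the splitter, and the branch that needs a
reduce phase stays a separate job.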


Daniel

Gang Luo wrote:
> Hi Daniel,
> I asked this question a while ago, but I have since had some more
> thoughts on it. In a query as simple as this:
>
> A = LOAD 'input';
> B = FILTER A BY $1 == 1;
> C = COGROUP A BY $0, B BY $0;
>
> the optimizer will insert a split operator to reuse A. According to the
> source code, a map-reduce job ends when it reaches the split and writes
> its output to A1 and A2, which are then used by two subsequent jobs to
> process B and C. In this case, the first job does nothing meaningful
> except copy the source 'input' twice. Is there some optimization applied
> here (like the MultiQueryOptimizer you mentioned previously)? How?
>
> Since I haven't looked at the MultiQueryOptimizer, it would be a great
> help if you could briefly describe how it works. Thanks a lot.
>
> -Gang
>
> ----- Original Message ----
> From: Daniel Dai <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: 2010/7/26 (Mon) 4:58:49 PM
> Subject: Re: split operator
>
> Hi, Gang,
> It is about multiquery optimization. In MRCompiler, we create a
> map-reduce boundary for each split; later, in MultiQueryOptimizer, we
> merge several splits into one map-reduce job. That map-reduce job then
> contains several nested split plans.
>
> Daniel
>
> Gang Luo wrote:
>> Hi Daniel,
>> In Section 4.3.1, the example and Figure 6 show this. The last
>> paragraph of Section 5.1 says the split operator maintains a one-tuple
>> buffer for each branch and discusses how to synchronize multiple
>> branches. I do think that is the in-memory split.
>>
>> Here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf
>>
>>
>> -Gang
>>
>> ----- Original Message ----
>> From: Daniel Dai <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: 2010/7/26 (Mon) 2:09:25 PM
>> Subject: Re: split operator
>>
>> Hi, Gang,
>> Which part of the paper are you talking about? We don't do in-memory
>> split. We dump the split result to a temporary file and start a new
>> map-reduce job. Split does create a map-reduce boundary (though that is
>> not entirely true: the multiquery optimizer may combine some of these
>> jobs).
>>
>> Daniel
>>
>> Gang Luo wrote:
>>> Hi all,
>>> According to the VLDB '09 paper, the split operator and all its
>>> successive operators reside in memory without any blocking in between.
>>> However, the source code (version 0.7) shows that an MR job actually
>>> ends when it meets the split operator, and multiple new MR jobs are
>>> created, each representing one branch. This
>>> write-once-read-multiple-times method is different from the in-memory
>>> method mentioned in that paper. Did Pig change the strategy for split,
>>> or is there still an in-memory version of split I didn't discover?
>>>
>>> Thanks,
>>> -Gang
