[jira] Updated: (PIG-983) PERFORMANCE: multi-query optimization on multiple group bys following a join or cogroup

Richard Ding (JIRA) Mon, 05 Oct 2009 12:31:00 -0700

     [ https://issues.apache.org/jira/browse/PIG-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Richard Ding updated PIG-983:
-----------------------------

    Attachment: PIG-983.patch

This patch handles the multi-query case in which both the splitter and its 
splittees have non-empty reducers. In this case we can't merge the splittees 
into the splitter, as we do in other cases. Instead, we merge the splittees 
into a new MR operator and connect that operator to the splitter, which 
reduces the total number of MR jobs for multi-query scripts.
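
As a rough illustration of the effect on the job count, here is a toy Python 
model of the merge described above. It is only a sketch: the MRJob class, the 
merge_splittees function, and the job names are hypothetical and are not 
taken from the patch or from Pig's code. It assumes a plan with one splitter 
(the join) and two splittees (the group bys), all with reducers.

```python
from dataclasses import dataclass, field


@dataclass
class MRJob:
    """Toy stand-in for a Map-Reduce operator (hypothetical, not Pig's class)."""
    name: str
    has_reducer: bool
    inputs: list = field(default_factory=list)  # names of upstream jobs


def merge_splittees(splitter, splittees):
    """Return the list of jobs after the merge.

    If the splitter and every splittee have a reducer, the splittees are
    combined into one new job that reads the splitter's output. Any other
    case is left unchanged, since it is out of scope for this sketch.
    """
    if not (splitter.has_reducer and all(s.has_reducer for s in splittees)):
        return [splitter, *splittees]
    merged = MRJob(
        name="+".join(s.name for s in splittees),
        has_reducer=True,
        inputs=[splitter.name],
    )
    return [splitter, merged]


join = MRJob("join", has_reducer=True)
group_b1 = MRJob("group_b1", has_reducer=True, inputs=["join"])
group_c2 = MRJob("group_c2", has_reducer=True, inputs=["join"])

before = [join, group_b1, group_c2]
after = merge_splittees(join, [group_b1, group_c2])
print(len(before), "->", len(after))  # prints "3 -> 2"
```

In this model the splitter stays a separate job, and the splittees collapse 
into a single job downstream of it.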

> PERFORMANCE: multi-query optimization on multiple group bys following a join 
> or cogroup
> ---------------------------------------------------------------------------------------
>
>                 Key: PIG-983
>                 URL: https://issues.apache.org/jira/browse/PIG-983
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>         Attachments: PIG-983.patch
>
>
> The current multi-query optimizer works well with Pig scripts like this one:
> {code}
> data = LOAD 'input' AS (a:chararray, b:int, c:int);
> A = GROUP data BY b;
> B = GROUP data BY c;
> C = FOREACH A GENERATE group, COUNT(data);
> D = FOREACH B GENERATE group, SUM(data.b);
> STORE C INTO 'output1';
> STORE D INTO 'output2';
> {code}
> In this case the original three Map-Reduce jobs are merged into one MR job by 
> the optimizer.
> The current optimizer, however, won't reduce the number of MR jobs for the 
> scripts in which multiple group bys follow a join or a cogroup, such as this 
> one:
> {code}
> data1 = LOAD 'input1' AS (a1:chararray, b1:int, c1:int);
> data2 = LOAD 'input2' AS (a2:chararray, b2:int, c2:int);
> A = JOIN data1 BY a1, data2 BY a2;
> B = GROUP A BY data1::b1;
> C = GROUP A BY data2::c2;
> D = FOREACH B GENERATE group, COUNT(A);
> E = FOREACH C GENERATE group, SUM(A.data2::b2);
> STORE D INTO 'output1';
> STORE E INTO 'output2';
> {code}
> Three MR jobs are still needed to run this script.
> The multi-query optimizer should handle this kind of script by merging the 
> group bys and reducing the overall number of MR jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
