[ https://issues.apache.org/jira/browse/PIG-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-983:
-----------------------------

    Attachment: PIG-983.patch

This patch deals with the multi-query case where both the splitter and the splittees have non-empty reducers. In this case we can't merge the splittees into the splitter as in other cases. Instead, we merge the splittees into a new MR operator and then connect this operator to the splitter, which reduces the total number of MR jobs for multi-query scripts.

> PERFORMANCE: multi-query optimization on multiple group bys following a join or cogroup
> ---------------------------------------------------------------------------------------
>
>                 Key: PIG-983
>                 URL: https://issues.apache.org/jira/browse/PIG-983
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>         Attachments: PIG-983.patch
>
>
> The current multi-query optimizer works well with Pig scripts like this one:
> {code}
> data = LOAD 'input' AS (a:chararray, b:int, c:int);
> A = GROUP data BY b;
> B = GROUP data BY c;
> C = FOREACH A GENERATE group, COUNT(data);
> D = FOREACH B GENERATE group, SUM(data.b);
> STORE C INTO 'output1';
> STORE D INTO 'output2';
> {code}
> In this case the original three Map-Reduce jobs are merged into one MR job by
> the optimizer.
> The current optimizer, however, won't reduce the number of MR jobs for
> scripts in which multiple group bys follow a join or a cogroup, such as this
> one:
> {code}
> data1 = LOAD 'input1' AS (a1:chararray, b1:int, c1:int);
> data2 = LOAD 'input2' AS (a2:chararray, b2:int, c2:int);
> A = JOIN data1 BY a1, data2 BY a2;
> B = GROUP A BY data1::b1;
> C = GROUP A BY data2::c2;
> D = FOREACH B GENERATE group, COUNT(A);
> E = FOREACH C GENERATE group, SUM(A.data2::b2);
> STORE D INTO 'output1';
> STORE E INTO 'output2';
> {code}
> Three MR jobs are still needed to run this script.
> The multi-query optimizer should handle this kind of script by merging the
> group bys and reducing the overall number of MR jobs.
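
For readers who want a picture of the merge described in the patch comment, here is a toy Python model of it. It is only a sketch: MRJob and plan_after_merge are hypothetical names, not classes or methods from the Pig code base, and the job counts follow from reading the comment (a splitter with a reducer plus N splittees with reducers become the splitter plus one merged job), not from running the patch.

```python
from dataclasses import dataclass, field


@dataclass
class MRJob:
    """Toy stand-in for a Map-Reduce job in a plan (hypothetical, not a Pig class)."""
    name: str
    has_reducer: bool
    inputs: list = field(default_factory=list)  # upstream MRJob objects


def plan_after_merge(splitter, splittees):
    """Return the list of jobs left after the merge, in this simplified model."""
    all_reduce = all(s.has_reducer for s in splittees)
    if splitter.has_reducer and all_reduce:
        # Case from the patch comment: the splittees cannot be folded into the
        # splitter, so they are combined into one new job fed by the splitter.
        merged = MRJob("+".join(s.name for s in splittees), True, [splitter])
        return [splitter, merged]
    if not splitter.has_reducer or not any(s.has_reducer for s in splittees):
        # "Other cases" (simplified assumption): everything folds into one job.
        names = [splitter.name] + [s.name for s in splittees]
        reducer = splitter.has_reducer or any(s.has_reducer for s in splittees)
        return [MRJob("+".join(names), reducer)]
    # Mixed splittees are deliberately left out of this toy model.
    raise NotImplementedError("mixed splittees are not modeled")


# Second script from the issue: a join followed by two group bys.
join = MRJob("join", True)
group_b1 = MRJob("group_b1", True, [join])
group_c2 = MRJob("group_c2", True, [join])
plan = plan_after_merge(join, [group_b1, group_c2])
print([job.name for job in plan])
```

Under these assumptions the three jobs of the join script become two: the join job and a single job that carries both group bys.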