[ 
https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915943#action_12915943
 ] 

Xuefu Zhang commented on PIG-1637:
----------------------------------

+1

Patch looks good, except that we don't have to require that all output 
expressions in the first foreach contain only simple projection. As long as the 
output expression in the first foreach that is referenced multiple times in the 
second foreach contains only simple projection, the merge can proceed. Doing 
this, the following two loops may be better merged to one.

@@ -93,14 +93,17 @@
             // Otherwise, we may do expression calculation more than once, 
defeat the benefit of this
             // optimization
             Set<Integer> inputs = new HashSet<Integer>();
+            boolean duplicateInputs = false;
             for (Operator op : foreach2.getInnerPlan().getSources()) {
                 // If the source is not LOInnerLoad, then it must be 
LOGenerate. This happens when 
                 // the 1st ForEach does not rely on any input of 2nd ForEach
                 if (op instanceof LOInnerLoad) {
                     LOInnerLoad innerLoad = (LOInnerLoad)op;
                     int input = innerLoad.getProjection().getColNum();
-                    if (inputs.contains(input))
-                        return false;
+                    if (inputs.contains(input)) {
+                        duplicateInputs = true;
+                        break;
+                    }
                     else
                         inputs.add(input);
                     
@@ -109,6 +112,27 @@
                 }
             }
             
+            // Duplicate inputs in the case first foreach only containing 
LOInnerLoad and
+            // LOGenerate is allowed, and output plan is simple projection
+            if (duplicateInputs) {
+                Iterator<Operator> it1 = 
foreach1.getInnerPlan().getOperators();
+                while( it1.hasNext() ) {
+                    Operator op = it1.next();
+                    if(!(op instanceof LOGenerate) && !(op instanceof 
LOInnerLoad))
+                        return false;
+                    if (op instanceof LOGenerate) {
+                        List<LogicalExpressionPlan> outputPlans = 
((LOGenerate)op).getOutputPlans();
+                        for (LogicalExpressionPlan outputPlan : outputPlans) {
+                            Iterator<Operator> iter = 
outputPlan.getOperators();
+                            while (iter.hasNext()) {
+                                if (!(iter.next() instanceof 
ProjectExpression))
+                                    return false;
+                            }
+                        }
+                    }
+                }
+            }


> Combiner not use because optimizor inserts a foreach between group and 
> algebric function
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-1637
>                 URL: https://issues.apache.org/jira/browse/PIG-1637
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: 0.8.0
>
>         Attachments: PIG-1637-1.patch, PIG-1637-2.patch
>
>
> The following script does not use combiner after new optimization change.
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
>     as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> This is because after group, optimizer detect group key is not used 
> afterward, it add a foreach statement after C. This is how it looks like 
> after optimization:
> {code}
> A = load ':INPATH:/pigmix/page_views' using 
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
>     as (user, action, timespent, query_term, ip_addr, timestamp, 
> estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, 
> (double)estimated_revenue as estimated_revenue;
> C = group B all; 
> C1 = foreach C generate B;
> D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> That cancel the combiner optimization for D. 
> The way to solve the issue is to merge the C1 we inserted and D. Currently, 
> we do not merge these two foreach. The reason is that one output of the first 
> foreach (B) is referred twice in D, and currently rule assume after merge, we 
> need to calculate B twice in D. Actually, C1 is only doing projection, no 
> calculation of B. Merging C1 and D will not result calculating B twice. So C1 
> and D should be merged.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to