[jira] Commented: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function

2010-09-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915880#action_12915880
 ] 

Daniel Dai commented on PIG-1637:
-

test-patch result for PIG-1637-2.patch:

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.


 Combiner not used because optimizer inserts a foreach between group and 
 algebraic function
 

                 Key: PIG-1637
                 URL: https://issues.apache.org/jira/browse/PIG-1637
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.8.0
            Reporter: Daniel Dai
            Assignee: Daniel Dai
             Fix For: 0.8.0

         Attachments: PIG-1637-1.patch, PIG-1637-2.patch


 The following script does not use the combiner after the new optimization change.
 {code}
 A = load ':INPATH:/pigmix/page_views' using 
 org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
 as (user, action, timespent, query_term, ip_addr, timestamp, 
 estimated_revenue, page_info, page_links);
 B = foreach A generate user, (int)timespent as timespent, 
 (double)estimated_revenue as estimated_revenue;
 C = group B all; 
 D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
 store D into ':OUTPATH:';
 {code}
 This is because, after the group, the optimizer detects that the group key is not 
 used afterward and adds a foreach statement after C. This is how the script 
 looks after optimization:
 {code}
 A = load ':INPATH:/pigmix/page_views' using 
 org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
 as (user, action, timespent, query_term, ip_addr, timestamp, 
 estimated_revenue, page_info, page_links);
 B = foreach A generate user, (int)timespent as timespent, 
 (double)estimated_revenue as estimated_revenue;
 C = group B all; 
 C1 = foreach C generate B;
 D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
 store D into ':OUTPATH:';
 {code}
 That cancels the combiner optimization for D. 
 The way to solve the issue is to merge the C1 we inserted with D. Currently, 
 we do not merge these two foreach statements. The reason is that one output of 
 the first foreach (B) is referenced twice in D, and the current rule assumes 
 that after the merge we would need to calculate B twice in D. In fact, C1 only 
 projects B; it does not calculate it. Merging C1 and D will not result in 
 calculating B twice, so C1 and D should be merged.
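
 The merge condition argued for above can be sketched roughly as follows. This 
 is an illustration in Python, not Pig's actual MergeForEach code (which is 
 Java); the names Output, is_plain_projection and can_merge are hypothetical 
 and chosen only for this example. The assumed rule: an output of the first 
 foreach that is referenced more than once by the second foreach blocks the 
 merge only if it is a computed expression, not if it is a plain projection.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    """One generated column of the first foreach (hypothetical model)."""
    name: str
    is_plain_projection: bool  # True when the column is only projected, not computed


def can_merge(first_outputs, second_refs):
    """Return True if merging cannot cause a computed expression to run twice.

    first_outputs: Output objects generated by the first foreach.
    second_refs:   names of first-foreach outputs used by the second foreach,
                   one entry per reference.
    """
    ref_counts = Counter(second_refs)
    for out in first_outputs:
        # A repeated reference is only a problem for computed columns.
        if ref_counts[out.name] > 1 and not out.is_plain_projection:
            return False
    return True


# C1 = foreach C generate B;  B is only projected here.
c1_outputs = [Output("B", is_plain_projection=True)]
# D uses B twice: SUM(B.timespent) and AVG(B.estimated_revenue).
d_refs = ["B", "B"]

print(can_merge(c1_outputs, d_refs))
```

 Under this assumed rule the C1/D pair from the script is mergeable, while a 
 first foreach that computes a value referenced twice downstream would still 
 be left unmerged.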

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function

2010-09-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915950#action_12915950
 ] 

Daniel Dai commented on PIG-1637:
-

Yes, it could be improved as per Xuefu's suggestion. In any case, the current 
patch solves the combiner-not-used issue, so I will commit this part first and 
open another Jira to improve it. Also, MergeForEach is a good example on which 
to practice the cloning framework 
[PIG-1587|https://issues.apache.org/jira/browse/PIG-1587], so it is better to 
improve it once PIG-1587 is available.
