[ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai updated PIG-1637:
----------------------------
    Attachment: PIG-1637-2.patch

Xuefu caught a bug. Reattaching the patch.

> Combiner not use because optimizor inserts a foreach between group and algebric function
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-1637
>                 URL: https://issues.apache.org/jira/browse/PIG-1637
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: 0.8.0
>
>         Attachments: PIG-1637-1.patch, PIG-1637-2.patch
>
>
> The following script does not use the combiner after the new optimization change.
> {code}
> A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
>     as (user, action, timespent, query_term, ip_addr, timestamp,
>         estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue;
> C = group B all;
> D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> This is because, after the group, the optimizer detects that the group key is not used
> afterward and adds a foreach statement after C. This is how the script looks after optimization:
> {code}
> A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
>     as (user, action, timespent, query_term, ip_addr, timestamp,
>         estimated_revenue, page_info, page_links);
> B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue;
> C = group B all;
> C1 = foreach C generate B;
> D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
> store D into ':OUTPATH:';
> {code}
> That cancels the combiner optimization for D.
>
> The way to solve the issue is to merge the C1 we inserted with D. Currently, we do not
> merge these two foreach statements. The reason is that one output of the first foreach (B)
> is referred to twice in D, and the current rule assumes that after the merge we would need
> to calculate B twice in D. Actually, C1 only does a projection, with no calculation of B.
> Merging C1 and D will not result in calculating B twice, so C1 and D should be merged.
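
The merge condition described in the issue can be illustrated with a small sketch. This is not Pig's optimizer code: the `Output` class, the `can_merge` function, and the `is_projection` flag are hypothetical names chosen for illustration. It assumes each foreach output can be classified as either a plain projection or a computed expression, and only blocks the merge when a computed output would be duplicated.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    """One generated column of a foreach (hypothetical model, not a Pig class)."""
    name: str
    is_projection: bool  # True if the column is passed through, not computed


def can_merge(first_outputs, second_refs):
    """Return True if merging two adjacent foreach statements repeats no work.

    first_outputs: columns generated by the first foreach.
    second_refs:   names of those columns as referenced by the second
                   foreach, one entry per reference.
    """
    ref_counts = Counter(second_refs)
    for out in first_outputs:
        # A column referenced more than once blocks the merge only if
        # inlining it would recompute an expression; a plain projection
        # costs nothing extra.
        if ref_counts[out.name] > 1 and not out.is_projection:
            return False
    return True


# C1 = foreach C generate B;  (B is only projected)
c1_outputs = [Output("B", is_projection=True)]
# D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
d_refs = ["B", "B"]

merge_c1_d = can_merge(c1_outputs, d_refs)

# Same reference pattern, but with B produced by a computed expression.
blocked = can_merge([Output("B", is_projection=False)], d_refs)
```

Under this model, C1 and D from the script above are mergeable because B is only projected in C1, while the same two references to a computed column would still prevent the merge.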