[ https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875481#action_12875481 ]

Ashutosh Chauhan commented on PIG-1437:
---------------------------------------

Since this is a logical transformation of the query plan, the logical optimizer is the ideal place for this optimization. However, I think it might be easier to do on the MR plan after it is generated.

> [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
> -----------------------------------------------------------------
>
>                 Key: PIG-1437
>                 URL: https://issues.apache.org/jira/browse/PIG-1437
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Ashutosh Chauhan
>            Priority: Minor
>
> It's possible to rewrite queries like this
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate group.name, group.age;
> dump C;
> {code}
> or
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate flatten(group);
> dump C;
> {code}
> to
> {code}
> A = load 'data' as (name,age);
> B = distinct A;
> dump B;
> {code}
> This can only be done if no columns within the bags are referenced
> subsequently in the script. Since in the Pig-Hadoop world DISTINCT is
> executed more efficiently than group-by, this will be a huge win.
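
To illustrate the restriction in the description, here is a hypothetical script (not taken from the issue; the alias cnt is made up for the example) where the rewrite would presumably not apply, because the foreach reads the grouped bag as well as the group key:
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
-- COUNT(A) references the bag produced by the group-by,
-- so B cannot be replaced with "distinct A" here.
C = foreach B generate flatten(group), COUNT(A) as cnt;
dump C;
{code}
A distinct over A would still yield the unique (name,age) pairs, but the per-group count would be lost, so a rule implementing this optimization would have to check that only the group key is projected.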