[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120893#comment-15120893 ]
liyunzhang_intel commented on PIG-4709:
---------------------------------------

[~pallavi.rao]: for PIG-4709-v3.patch: LGTM

> Improve performance of GROUPBY operator on Spark
> ------------------------------------------------
>
>                 Key: PIG-4709
>                 URL: https://issues.apache.org/jira/browse/PIG-4709
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: Pallavi Rao
>              Labels: spork
>             Fix For: spark-branch
>
>         Attachments: PIG-4709-v1.patch, PIG-4709-v2.patch, PIG-4709-v3.patch,
>                      PIG-4709.patch, TEST-org.apache.pig.test.TestCombiner.xml
>
>
> Currently, the GROUPBY operator of PIG is mapped to Spark's CoGroup. When the
> grouped data is consumed by subsequent algebraic operations, this is
> sub-optimal because every record is shuffled, producing a lot of shuffle
> traffic.
> The Spark Plan must be optimized to use reduceBy, where possible, so that a
> combiner is used.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
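
The reasoning in the issue description can be illustrated with a minimal plain-Python sketch. This is not Spark or Pig code and does not reflect what the attached patches do; the function names and the sample data are hypothetical, chosen only to show why aggregating inside each partition before the shuffle (the combiner) reduces the number of records that cross it when the downstream operation is algebraic, such as a sum.

```python
from collections import defaultdict


def shuffle_all_then_sum(partitions):
    """Group-style plan: every record crosses the shuffle, then is summed."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def combine_then_sum(partitions):
    """Reduce-style plan: sum within each partition first (the combiner)."""
    shuffled = []
    for part in partitions:
        partial = defaultdict(int)
        for key, value in part:
            partial[key] += value
        # Only one partial sum per key leaves each partition.
        shuffled.extend(partial.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Hypothetical input: two partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

grouped_totals, grouped_shuffled = shuffle_all_then_sum(partitions)
combined_totals, combined_shuffled = combine_then_sum(partitions)

print(grouped_totals, grouped_shuffled)
print(combined_totals, combined_shuffled)
```

Both plans produce the same per-key totals, but the second one shuffles at most one record per key per partition, so the saving grows with the number of duplicate keys in each partition.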