[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050306#comment-15050306 ]
liyunzhang_intel commented on PIG-4709:
---------------------------------------

[~pallavi.rao]: I used "ant -Dhadoopversion=23 -Dexectype=spark -Dtestcase=TestCombiner" to test the latest patch. There are 4 failures (see [attachment|https://issues.apache.org/jira/secure/attachment/12776475/TEST-org.apache.pig.test.TestCombiner.xml]).

> Improve performance of GROUPBY operator on Spark
> ------------------------------------------------
>
>                 Key: PIG-4709
>                 URL: https://issues.apache.org/jira/browse/PIG-4709
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: Pallavi Rao
>              Labels: spork
>             Fix For: spark-branch
>
>         Attachments: PIG-4709-v1.patch, PIG-4709.patch, TEST-org.apache.pig.test.TestCombiner.xml
>
>
> Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup. When the
> grouped data is consumed by subsequent operations to perform algebraic
> operations, this is sub-optimal as there is a lot of shuffle traffic.
> The Spark plan must be optimized to use reduceBy, where possible, so that a
> combiner is used.
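To illustrate why a combiner reduces shuffle traffic for algebraic operations such as SUM, here is a minimal plain-Python sketch. It is not taken from the patch and does not use the Spark or Pig APIs. The function names and the sample partitions are hypothetical, and it assumes the "reduceBy" in the description means a reduce-by-key style operation that pre-aggregates on each partition before the shuffle.

```python
from collections import defaultdict


def shuffle_without_combiner(partitions):
    """Group-then-aggregate: every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    totals = {key: sum(values) for key, values in groups.items()}
    return totals, len(shuffled)


def shuffle_with_combiner(partitions):
    """Pre-aggregate per partition: at most one record per key per partition is shuffled."""
    shuffled = []
    for part in partitions:
        partial = defaultdict(int)
        for key, value in part:
            partial[key] += value  # map-side combine (SUM is algebraic)
        shuffled.extend(partial.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value  # reduce-side merge of the partial sums
    return dict(totals), len(shuffled)


# Hypothetical input: two partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

totals_plain, records_plain = shuffle_without_combiner(partitions)
totals_combined, records_combined = shuffle_with_combiner(partitions)
print(totals_plain, records_plain)
print(totals_combined, records_combined)
```

Both functions compute the same per-key sums, but the combiner variant moves fewer records across the simulated shuffle. The saving grows with the number of repeated keys per partition.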