[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120893#comment-15120893 ]

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]: for PIG-4709-v3.patch: LGTM

> Improve performance of GROUPBY operator on Spark
>
> Key: PIG-4709
> URL: https://issues.apache.org/jira/browse/PIG-4709
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Pallavi Rao
> Assignee: Pallavi Rao
> Labels: spork
> Fix For: spark-branch
> Attachments: PIG-4709-v1.patch, PIG-4709-v2.patch, PIG-4709-v3.patch, PIG-4709.patch, TEST-org.apache.pig.test.TestCombiner.xml
>
> Currently, the GROUPBY operator of PIG is mapped to Spark's CoGroup. When the grouped data is consumed by subsequent operations to perform algebraic operations, this is sub-optimal as there is a lot of shuffle traffic.
> The Spark Plan must be optimized to use reduceBy, where possible, so that a combiner is used.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
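To illustrate the shuffle-traffic argument in the issue description, here is a minimal sketch in plain Python. It is not Pig or Spark code: the partitions, the `add` combiner, and all function names are hypothetical, and a list of records stands in for the shuffle. It only counts how many records would cross the shuffle under a group-style plan versus a reduceBy-style plan that combines on the map side.

```python
# Hypothetical input: two map-side partitions of (key, value) records.
PARTITIONS = [
    [("a", 1), ("b", 2), ("a", 3), ("a", 4)],
    [("b", 5), ("a", 6), ("b", 7), ("b", 8)],
]


def merge_by_key(records, combine):
    """Fold records that share a key into one value per key."""
    merged = {}
    for key, value in records:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def shuffle_group_style(partitions):
    """Group-style plan: every record crosses the shuffle unchanged."""
    return [record for part in partitions for record in part]


def shuffle_reduce_style(partitions, combine):
    """reduceBy-style plan: each partition pre-aggregates before the shuffle."""
    shuffled = []
    for part in partitions:
        shuffled.extend(merge_by_key(part, combine).items())
    return shuffled


def add(left, right):
    """Hypothetical algebraic operation (a SUM-like combiner)."""
    return left + right


group_shuffle = shuffle_group_style(PARTITIONS)
reduce_shuffle = shuffle_reduce_style(PARTITIONS, add)

# Both plans produce the same per-key sums on the reduce side.
group_totals = merge_by_key(group_shuffle, add)
reduce_totals = merge_by_key(reduce_shuffle, add)

print(len(group_shuffle), "records shuffled without a combiner")
print(len(reduce_shuffle), "records shuffled with a combiner")
```

With a combiner, the number of shuffled records is bounded by the number of distinct keys per partition rather than by the number of input records, which is where the saving comes from when the downstream operation is algebraic.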
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097718#comment-15097718 ]

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]: I left some comments on PIG-4709-v2.patch (mainly about the order of the package imports). I have no other suggestions.
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088689#comment-15088689 ]

Pallavi Rao commented on PIG-4709:
---

[~kellyzly], I have addressed your review comments and the patch is uploaded here and to the review board. Do you have any further comments?

[~mohitsabharwal], did you get a chance to review? PIG-4766 is blocked on this patch.
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059601#comment-15059601 ]

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]: I left some comments on the review board. Please take a look.
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050306#comment-15050306 ]

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]: I used "ant -Dhadoopversion=23 -Dexectype=spark -Dtestcase=TestCombiner" to test the latest patch. There are 4 failures (see [attachment|https://issues.apache.org/jira/secure/attachment/12776475/TEST-org.apache.pig.test.TestCombiner.xml]).
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050314#comment-15050314 ]

Pallavi Rao commented on PIG-4709:
---

That is right [~kellyzly], the patch does NOT address all cases. Once the basic design/impl is reviewed and committed, I will make further enhancements to ensure all test cases in TestCombiner pass. In fact, I'm working on that in parallel, while you guys review the patch.
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039710#comment-15039710 ]

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]: Can you give an example that explains the following from your review board request? What are algebraicOp.Final, algebraicOp.Intermediate, and algebraicOp.Initial?

{code}
// Checks for algebraic operations and if they exist.
// Replaces global rearrange (cogroup) with reduceBy as follows:
// Input:
// foreach (using algebraicOp)
// -> packager
// -> globalRearrange
// -> localRearrange
// Output:
// foreach (using algebraicOp.Final)
// -> reduceBy (uses algebraicOp.Intermediate)
// -> foreach (using algebraicOp.Initial)
// -> localRearrange
{code}
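As one possible reading of the three stages named in the plan comment, the sketch below walks an AVG-like aggregate through Initial, Intermediate, and Final in plain Python. It is an illustration under assumptions, not the patch's code and not Pig's AVG implementation: the function names `initial`, `intermediate`, and `final`, the `(sum, count)` partial state, and the sample partitions are all hypothetical.

```python
# Hypothetical input: (key, value) records as they leave localRearrange,
# split across two partitions.
PARTITIONS = [
    [("a", 2.0), ("a", 4.0), ("b", 10.0)],
    [("a", 6.0), ("b", 20.0)],
]


def initial(value):
    """Initial: turn one input value into a partial (sum, count) state."""
    return (value, 1)


def intermediate(left, right):
    """Intermediate: merge two partial states; associative and commutative."""
    return (left[0] + right[0], left[1] + right[1])


def final(state):
    """Final: turn the fully merged state into the result."""
    total, count = state
    return total / count


def merge_states(pairs):
    """Merge (key, state) pairs that share a key using intermediate()."""
    merged = {}
    for key, state in pairs:
        merged[key] = intermediate(merged[key], state) if key in merged else state
    return merged


def average_by_key(partitions):
    shuffled = []
    for part in partitions:
        # foreach (using Initial), then a map-side combine per partition.
        states = [(key, initial(value)) for key, value in part]
        shuffled.extend(merge_states(states).items())
    # reduceBy (uses Intermediate) across partitions.
    merged = merge_states(shuffled)
    # foreach (using Final).
    return {key: final(state) for key, state in merged.items()}


averages = average_by_key(PARTITIONS)
print(averages)
```

The point of the split is that only the small partial states cross the shuffle, and because `intermediate` can be applied in any order and any number of times, the same function serves as both the map-side combiner and the reduce-side merge.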
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039661#comment-15039661 ]

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]: Thanks for your work. I posted a few comments on RB. Can you provide more detailed documentation for this optimization?
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032504#comment-15032504 ]

Xuefu Zhang commented on PIG-4709:
---

Thanks, [~pallavi.rao]. Great work! I posted a few comments, mostly cosmetic, on RB. This is a complex optimization, and I hope others can also take a look.
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032245#comment-15032245 ]

Mohit Sabharwal commented on PIG-4709:
---

Thanks, [~pallavi.rao], will take a look. + [~kellyzly] as well.
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15031385#comment-15031385 ]

Pallavi Rao commented on PIG-4709:
---

[~mohitsabharwal], [~xuefuz], review please?
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966413#comment-14966413 ]

Pallavi Rao commented on PIG-4709:
---

I hacked around the code a bit and optimized one specific case of GROUPBY with algebraic operations on the grouped data. Here are the results:

Spork Local (Without Optimization):
2015-10-21 12:36:22,884 [main] INFO org.apache.pig.Main - Pig script completed in 55 seconds and 944 milliseconds (55944 ms)

Spork Local (With Optimization):
2015-10-21 12:26:25,145 [main] INFO org.apache.pig.Main - Pig script completed in 22 seconds and 377 milliseconds (22377 ms)

PIG Local:
2015-10-21 12:27:54,632 [main] INFO org.apache.pig.Main - Pig script completed in 19 seconds and 147 milliseconds (19147 ms)

Spork local reads from HDFS while Pig local reads from a local file. Given that, and the fact that Spark needs to be started and shut down, the performance seems more or less comparable.
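For a quick sense of scale, the three wall-clock times reported above can be compared directly. The snippet below is only arithmetic on the logged millisecond values; the dictionary keys are made-up labels, and a single run per configuration says nothing about variance.

```python
# Wall-clock times copied from the three log lines in the comment (ms).
timings_ms = {
    "spork_local_unoptimized": 55944,
    "spork_local_optimized": 22377,
    "pig_local": 19147,
}

# How much faster the optimized Spark plan is than the unoptimized one.
speedup = timings_ms["spork_local_unoptimized"] / timings_ms["spork_local_optimized"]

# How the optimized Spark run compares with Pig local mode.
ratio_to_pig = timings_ms["spork_local_optimized"] / timings_ms["pig_local"]

print(f"optimized vs. unoptimized Spork: {speedup:.2f}x faster")
print(f"optimized Spork vs. Pig local: {ratio_to_pig:.2f}x the runtime")
```

On these numbers the optimization cuts the Spork local runtime by roughly a factor of 2.5, leaving it about 17% slower than Pig local, which is consistent with the "more or less comparable" assessment once HDFS reads and Spark startup and shutdown are taken into account.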