[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2016-01-27 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120893#comment-15120893
 ] 

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]:
for PIG-4709-v3.patch: LGTM

> Improve performance of GROUPBY operator on Spark
> 
>
>         Key: PIG-4709
>         URL: https://issues.apache.org/jira/browse/PIG-4709
>     Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>    Reporter: Pallavi Rao
>    Assignee: Pallavi Rao
>      Labels: spork
>     Fix For: spark-branch
>
> Attachments: PIG-4709-v1.patch, PIG-4709-v2.patch, PIG-4709-v3.patch, 
> PIG-4709.patch, TEST-org.apache.pig.test.TestCombiner.xml
>
>
> Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup. When the 
> grouped data is consumed by subsequent algebraic operations, this is 
> sub-optimal because it generates a lot of shuffle traffic. 
> The Spark plan must be optimized to use reduceBy, where possible, so that a 
> combiner is used.
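
To make the shuffle argument in the description concrete, here is a minimal sketch in plain Python (not Pig or Spark code). It assumes a toy dataset split into two partitions and a simple sum as the algebraic operation; the function names are hypothetical and exist only for illustration. It counts how many records would cross the shuffle with and without a map-side combine step.

```python
def shuffle_without_combiner(partitions):
    """Cogroup-style: every (key, value) record crosses the shuffle as-is."""
    return [record for part in partitions for record in part]


def shuffle_with_combiner(partitions, combine):
    """reduceBy-style: pre-aggregate per key inside each partition first,
    so at most one record per key per partition crosses the shuffle."""
    shuffled = []
    for part in partitions:
        partial = {}
        for key, value in part:
            partial[key] = combine(partial[key], value) if key in partial else value
        shuffled.extend(partial.items())
    return shuffled


def reduce_side(records, combine):
    """Merge whatever arrived after the shuffle into one value per key."""
    result = {}
    for key, value in records:
        result[key] = combine(result[key], value) if key in result else value
    return result


def add(x, y):
    return x + y


# Toy input: two partitions of (key, value) pairs (illustrative data only).
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

plain = shuffle_without_combiner(partitions)
combined = shuffle_with_combiner(partitions, add)

print(len(plain), len(combined))
print(reduce_side(plain, add) == reduce_side(combined, add))
```

Both paths produce the same per-key sums, but the combined path shuffles fewer records. The saving grows with the number of values per key in each partition.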



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2016-01-13 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097718#comment-15097718
 ] 

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]:
   I left some comments on PIG-4709-v2.patch (mainly about the order of the 
package imports). I have no other suggestions.






[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2016-01-07 Thread Pallavi Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088689#comment-15088689
 ] 

Pallavi Rao commented on PIG-4709:
--

[~kellyzly], I have addressed your review comments and uploaded the patch 
here and to the review board. Do you have any further comments?

[~mohitsabharwal], did you get a chance to review? PIG-4766 is blocked on this 
patch. 






[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2015-12-15 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059601#comment-15059601
 ] 

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]: I left some comments on the review board. Please take a look.






[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2015-12-10 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050306#comment-15050306
 ] 

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]:
  I used "ant -Dhadoopversion=23 -Dexectype=spark -Dtestcase=TestCombiner" to 
test the latest patch. There are 4 failures (see 
[attachment|https://issues.apache.org/jira/secure/attachment/12776475/TEST-org.apache.pig.test.TestCombiner.xml]).






[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2015-12-10 Thread Pallavi Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050314#comment-15050314
 ] 

Pallavi Rao commented on PIG-4709:
--

That is right [~kellyzly], the patch does NOT address all cases. Once the basic 
design/impl is reviewed and committed, I will make further enhancements to 
ensure all test cases in TestCombiner pass. In fact, I'm working on that in 
parallel, while you guys review the patch.






[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2015-12-03 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039710#comment-15039710
 ] 

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]:
 Can you give an example on the review board to explain the following? What are 
algebraicOp.Final, algebraicOp.Intermediate, and algebraicOp.Initial?
{code} 
// Checks for algebraic operations and, if they exist,
// replaces global rearrange (cogroup) with reduceBy as follows:
// Input:
// foreach (using algebraicOp)
//   -> packager
//      -> globalRearrange
//         -> localRearrange
// Output:
// foreach (using algebraicOp.Final)
//   -> reduceBy (uses algebraicOp.Intermediate)
//      -> foreach (using algebraicOp.Initial)
//         -> localRearrange
{code}
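
One way to picture the three stages named in the comment above is an average computed per key. The sketch below is plain Python, not the Pig Algebraic interface or the Spark API, and the function names `initial`, `intermediate`, `final`, and `avg_by_key` are hypothetical. It assumes the Initial stage turns each value into a partial state, the Intermediate stage merges two partial states (the role reduceBy plays in the plan), and the Final stage turns the merged state into the result.

```python
def initial(value):
    """Initial stage: wrap a single value as a partial (sum, count) state."""
    return (value, 1)


def intermediate(left, right):
    """Intermediate stage: merge two partial states. It is associative and
    commutative, so it can run both before and after the shuffle."""
    return (left[0] + right[0], left[1] + right[1])


def final(partial):
    """Final stage: turn the fully merged state into the average."""
    total, count = partial
    return total / count


def avg_by_key(records):
    """Apply initial -> intermediate (per key) -> final over (key, value) pairs."""
    partials = {}
    for key, value in records:
        seed = initial(value)
        partials[key] = intermediate(partials[key], seed) if key in partials else seed
    return {key: final(state) for key, state in partials.items()}


# Illustrative data only.
records = [("a", 2.0), ("b", 10.0), ("a", 4.0), ("a", 6.0)]
print(avg_by_key(records))  # {'a': 4.0, 'b': 10.0}
```

An average cannot be merged directly, which is why the partial state carries a sum and a count. Only the Final stage divides.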
 






[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2015-12-03 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039661#comment-15039661
 ] 

liyunzhang_intel commented on PIG-4709:
---

[~pallavi.rao]: Thanks for your work. I posted a few comments on RB. Can you 
provide more detailed documentation for this optimization?






[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2015-11-30 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032504#comment-15032504
 ] 

Xuefu Zhang commented on PIG-4709:
--

Thanks, [~pallavi.rao]. Great work! I posted a few comments, mostly cosmetic, 
on RB. This is a complex optimization, and I hope others can also take a look.






[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2015-11-30 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032245#comment-15032245
 ] 

Mohit Sabharwal commented on PIG-4709:
--

Thanks, [~pallavi.rao], will take a look. + [~kellyzly] as well.






[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2015-11-29 Thread Pallavi Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15031385#comment-15031385
 ] 

Pallavi Rao commented on PIG-4709:
--

[~mohitsabharwal], [~xuefuz], review please?






[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2015-10-21 Thread Pallavi Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966413#comment-14966413
 ] 

Pallavi Rao commented on PIG-4709:
--

I hacked around the code a bit and optimized one specific case of GROUPBY with 
algebraic operations on the grouped data. Here are the results:
Spork Local (Without Optimization):
2015-10-21 12:36:22,884 [main] INFO  org.apache.pig.Main - Pig script completed 
in 55 seconds and 944 milliseconds (55944 ms)

Spork Local (With Optimization):
2015-10-21 12:26:25,145 [main] INFO  org.apache.pig.Main - Pig script completed 
in 22 seconds and 377 milliseconds (22377 ms)

PIG Local:
2015-10-21 12:27:54,632 [main] INFO  org.apache.pig.Main - Pig script completed 
in 19 seconds and 147 milliseconds (19147 ms)

Spork local reads from HDFS, while Pig local reads from a local file. Given 
that, and the fact that Spark needs to be started and shut down, the performance 
seems more or less comparable.
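
For reference, the ratios implied by the three timings above can be worked out as below. This is only arithmetic on the logged millisecond values from single runs, not an additional measurement, and the variable names are chosen for illustration.

```python
# Millisecond timings copied from the three log lines above.
spork_unoptimized_ms = 55944
spork_optimized_ms = 22377
pig_local_ms = 19147

# How much faster the optimized Spark run is than the unoptimized one.
speedup = spork_unoptimized_ms / spork_optimized_ms

# How the optimized Spark run compares with Pig local mode.
ratio_to_pig = spork_optimized_ms / pig_local_ms

print(f"speedup from optimization: {speedup:.2f}x")
print(f"optimized Spork vs Pig local: {ratio_to_pig:.2f}x")
```

On these numbers the optimization gives roughly a 2.5x speedup, and the optimized run takes about 1.17 times as long as Pig local mode.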



