[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-27 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/10960

[SPARK-12963] Improve performance of stddev/variance

As benchmarked and discussed here: 
https://github.com/apache/spark/pull/10786/files#r50038294, a declarative 
aggregate function benefits from codegen and can be much faster than an 
imperative one.

This PR is based on #10944 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark stddev

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10960.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10960


commit b4db00675bc3c51ddf8735cace522a5d771cf7e2
Author: Davies Liu 
Date:   2016-01-27T07:43:40Z

cleanup whole stage codegen

commit 70a7c7edd1988c7dd69bccc8e563c9943775bd2c
Author: Davies Liu 
Date:   2016-01-27T23:22:33Z

improve stddev and variance




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-27 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-175911728
  
cc @mengxr 





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-175914531
  
**[Test build #50240 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50240/consoleFull)** for PR 10960 at commit [`61edd5e`](https://github.com/apache/spark/commit/61edd5e3a2c030d7387db5283eee5ada13553505).





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-175915586
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50239/





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-175915582
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-27 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/10960#discussion_r51086288
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala ---
@@ -109,7 +109,7 @@ abstract class CentralMomentAgg(child: Expression) extends ImperativeAggregate w
    * Update the central moments buffer.
    */
   override def update(buffer: MutableRow, input: InternalRow): Unit = {
-    val v = Cast(child, DoubleType).eval(input)
--- End diff --

Creating a `Cast()` here on every `update` call is very expensive.
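
To illustrate the cost being pointed out, here is a minimal standalone sketch in plain Scala. `ToDouble` is a hypothetical stand-in for an expression object that gets built inside a per-row method; it is not Spark's `Cast`, and the construction counter exists only to make the difference visible.

```scala
object CastHoisting {
  // Hypothetical stand-in for an expression that converts a value to Double.
  // Not Spark's Cast; it only counts how many times it is constructed.
  object ToDouble { var constructed: Int = 0 }

  final class ToDouble {
    ToDouble.constructed += 1

    def eval(v: Any): Double = v match {
      case i: Int    => i.toDouble
      case l: Long   => l.toDouble
      case d: Double => d
      case other     => throw new IllegalArgumentException(s"unsupported value: $other")
    }
  }

  // Builds a new converter for every row, like the removed line in the diff.
  def sumPerRow(rows: Seq[Any]): Double =
    rows.map(r => new ToDouble().eval(r)).sum

  // Builds the converter once and reuses it for every row.
  def sumHoisted(rows: Seq[Any]): Double = {
    val toDouble = new ToDouble()
    rows.map(r => toDouble.eval(r)).sum
  }
}
```

Both functions return the same sum, but `sumPerRow` constructs one object per row while `sumHoisted` constructs exactly one in total.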





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176043209
  
**[Test build #50265 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50265/consoleFull)** for PR 10960 at commit [`3c8d737`](https://github.com/apache/spark/commit/3c8d737d5ee3ce34dee494dc3fac3090d983775a).





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176068170
  
**[Test build #50265 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50265/consoleFull)** for PR 10960 at commit [`3c8d737`](https://github.com/apache/spark/commit/3c8d737d5ee3ce34dee494dc3fac3090d983775a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176068656
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50265/





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176068652
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/10960#discussion_r51160846
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala ---
@@ -125,19 +125,15 @@ abstract class CentralMomentAgg(child: Expression) extends ImperativeAggregate w
   mean += deltaN
   buffer.setDouble(meanOffset, mean)
 
-  if (momentOrder >= 2) {
-    m2 = buffer.getDouble(secondMomentOffset)
-    m2 += delta * (delta - deltaN)
-    buffer.setDouble(secondMomentOffset, m2)
-  }
+  m2 = buffer.getDouble(secondMomentOffset)
+  m2 += delta * (delta - deltaN)
+  buffer.setDouble(secondMomentOffset, m2)
 
-  if (momentOrder >= 3) {
--- End diff --

Those `if` branches are important for saving computation on low-order 
statistics. Even if we won't use `CentralMomentAgg` for second-order 
statistics, it is still good to keep them.





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176315428
  
@davies Did you get a chance to test whole-stage codegen with higher-order 
statistics like skewness? If it works, the cleanest solution would be to 
change `CentralMomentAgg` to declarative and then have all existing 
univariate summary statistics call it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176382224
  
**[Test build #50294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50294/consoleFull)** for PR 10960 at commit [`ae83955`](https://github.com/apache/spark/commit/ae83955ea3a34e38ce55d99f741c99f1f8b2fa8f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176383086
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176383074
  
**[Test build #50294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50294/consoleFull)** for PR 10960 at commit [`ae83955`](https://github.com/apache/spark/commit/ae83955ea3a34e38ce55d99f741c99f1f8b2fa8f).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `abstract class CentralMomentAgg(child: Expression) extends DeclarativeAggregate`
  * `case class Kurtosis(child: Expression) extends CentralMomentAgg(child)`
  * `case class Skewness(child: Expression) extends CentralMomentAgg(child)`
  * `case class Echo(child: Expression) extends UnaryExpression`





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176383091
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50294/





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176436538
  
**[Test build #50297 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50297/consoleFull)** for PR 10960 at commit [`1481bb4`](https://github.com/apache/spark/commit/1481bb4b632ea2d37a703e93ee6b09ff5c9fa8dd).





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176443884
  
**[Test build #50297 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50297/consoleFull)** for PR 10960 at commit [`1481bb4`](https://github.com/apache/spark/commit/1481bb4b632ea2d37a703e93ee6b09ff5c9fa8dd).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176443931
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-12963] Improve performance of stddev/va...

2016-01-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10960#issuecomment-176443933
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50297/

