[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961396#comment-14961396
 ] 

Xiangrui Meng commented on SPARK-10641:
---

See attached PDF file.

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Seth Hendrickson
> Attachments: simpler-moments.pdf
>
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-16 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961359#comment-14961359
 ] 

Seth Hendrickson commented on SPARK-10641:
--

[~mengxr] I am interested. Do you mind providing it or a link to it?




[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-06 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945877#comment-14945877
 ] 

Seth Hendrickson commented on SPARK-10641:
--

[~mengxr] I submitted a PR as work in progress. I had written my implementation 
before stddev got merged in and so right now they are separate. The main 
difference is the way that the subclasses implement `evaluateExpression` (the 
lower order moments are computed the same with some syntax differences). I 
added in functionality to avoid computing higher order moments when they are 
not asked for.

The optimization you suggest for duplicate computation between skewness and 
kurtosis has not yet been addressed. I believe the same code duplication would 
occur for 

{{df.groupBy("key").agg(var("a"), avg("a"))}}

since both aggregates compute the average. We'll also have to keep an eye on 
the benchmark testing according to your comment below. Thanks for the feedback!
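
To make the point about skipping unrequested moments concrete, here is a minimal pure-Python sketch, not the code in the PR. The names `MomentBuffer`, `max_order`, `evaluate` and `aggregate` are hypothetical, and the update step assumes the incremental formulas from the Wikipedia page linked in the issue description. The subclasses share the update and differ only in how they evaluate the buffer.

```python
import math


class MomentBuffer:
    """Hypothetical single-pass buffer that tracks central moments up to max_order."""

    def __init__(self, max_order):
        if max_order not in (2, 3, 4):
            raise ValueError("max_order must be 2, 3 or 4")
        self.max_order = max_order
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean
        self.m3 = 0.0  # sum of cubed deviations
        self.m4 = 0.0  # sum of fourth-power deviations

    def update(self, x):
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Higher orders go first because each one reads the old lower-order sums.
        if self.max_order >= 4:
            self.m4 += (term1 * delta_n * delta_n * (n * n - 3 * n + 3)
                        + 6 * delta_n * delta_n * self.m2
                        - 4 * delta_n * self.m3)
        if self.max_order >= 3:
            self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1


class Variance(MomentBuffer):
    def __init__(self):
        super().__init__(2)

    def evaluate(self):
        return self.m2 / self.n  # population variance


class Skewness(MomentBuffer):
    def __init__(self):
        super().__init__(3)

    def evaluate(self):
        return math.sqrt(self.n) * self.m3 / self.m2 ** 1.5


class Kurtosis(MomentBuffer):
    def __init__(self):
        super().__init__(4)

    def evaluate(self):
        return self.n * self.m4 / (self.m2 * self.m2) - 3.0  # excess kurtosis


def aggregate(agg, values):
    """Feed every value into the buffer and return it."""
    for v in values:
        agg.update(v)
    return agg
```

With this layout, `aggregate(Variance(), data)` never touches the third and fourth moment sums, while `Kurtosis` maintains all of them because the fourth-moment update depends on the lower ones.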

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics






[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945821#comment-14945821
 ] 

Apache Spark commented on SPARK-10641:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/9003




[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-06 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945412#comment-14945412
 ] 

Xiangrui Meng commented on SPARK-10641:
---

Btw, I checked the implementation of StdDevAgg. I'm not sure we get any benefit 
from using expressions (and hence codegen). See 
https://issues.apache.org/jira/browse/SPARK-10953.




[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-06 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945309#comment-14945309
 ] 

Xiangrui Meng commented on SPARK-10641:
---

If we want to implement the numerically stable version, we should refactor the 
StdDevAgg implementation to add moving third and fourth moments. StdDevAgg 
should then be renamed to CentralMomentAgg.

In the future, we need to make sure that codegen doesn't include unnecessary 
branches if kurtosis and skewness are not requested by the user.

Btw, there will be some room for optimization, e.g.

{code}
df.groupBy("key").agg(skewness("a"), kurtosis("a"))
{code}

will have duplicate computation.
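
As a rough illustration of what sharing the work could look like, the pure-Python sketch below keeps one moment state per group, merges per-partition states, and derives both statistics from the merged state. It is not Spark code: the function names and the tuple layout `(n, mean, M2, M3, M4)` are assumptions made for the example, and the merge step assumes the pairwise combination formulas from the Wikipedia page linked in the issue description.

```python
import math


def partial_state(values):
    """Two-pass moments of one partition: (n, mean, M2, M3, M4)."""
    n = len(values)
    mean = sum(values) / n
    devs = [v - mean for v in values]
    return (n, mean,
            sum(d ** 2 for d in devs),
            sum(d ** 3 for d in devs),
            sum(d ** 4 for d in devs))


def merge(a, b):
    """Combine the moment states of two disjoint partitions."""
    na, mean_a, m2a, m3a, m4a = a
    nb, mean_b, m2b, m3b, m4b = b
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta ** 2 * na * nb / n
    m3 = (m3a + m3b
          + delta ** 3 * na * nb * (na - nb) / n ** 2
          + 3 * delta * (na * m2b - nb * m2a) / n)
    m4 = (m4a + m4b
          + delta ** 4 * na * nb * (na * na - na * nb + nb * nb) / n ** 3
          + 6 * delta ** 2 * (na * na * m2b + nb * nb * m2a) / n ** 2
          + 4 * delta * (na * m3b - nb * m3a) / n)
    return (n, mean, m2, m3, m4)


def skewness(state):
    n, _, m2, m3, _ = state
    return math.sqrt(n) * m3 / m2 ** 1.5


def kurtosis(state):
    n, _, m2, _, m4 = state
    return n * m4 / (m2 * m2) - 3.0  # excess kurtosis


# One shared state per group feeds both statistics.
merged = merge(partial_state([1.0, 2.0]), partial_state([3.0, 4.0, 10.0]))
both = (skewness(merged), kurtosis(merged))
```

Here the data is scanned once per partition and the merged state is evaluated twice, instead of running two independent aggregates over the same column.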




[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-05 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944485#comment-14944485
 ] 

Seth Hendrickson commented on SPARK-10641:
--

My apologies, I haven't been able to devote much time to this lately. To your 
point, one of the bigger decisions for this PR will be how to combine these 
functions with other aggregates, since online algorithms for higher order 
statistical moments require the calculation of all the lower order moments. I 
can have a WIP PR up by tomorrow, so we can get some discussion going. This PR 
will also be affected by several other ongoing PRs.




[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-05 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944155#comment-14944155
 ] 

Xiangrui Meng commented on SPARK-10641:
---

Any updates?




[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-09-17 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804724#comment-14804724
 ] 

Seth Hendrickson commented on SPARK-10641:
--

I'm working on this issue.

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Reporter: Jihong MA
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics


