[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961396#comment-14961396 ]

Xiangrui Meng commented on SPARK-10641:
---

See attached PDF file.

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
> Issue Type: Sub-task
> Components: ML, SQL
> Reporter: Jihong MA
> Assignee: Seth Hendrickson
> Attachments: simpler-moments.pdf
>
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961359#comment-14961359 ]

Seth Hendrickson commented on SPARK-10641:
--

[~mengxr] I am interested. Do you mind providing it or a link to it?

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
> Issue Type: Sub-task
> Components: ML, SQL
> Reporter: Jihong MA
> Assignee: Seth Hendrickson
> Attachments: simpler-moments.pdf
>
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945877#comment-14945877 ]

Seth Hendrickson commented on SPARK-10641:
--

[~mengxr] I submitted a PR as work in progress. I wrote my implementation before stddev was merged, so the two are currently separate. The main difference is the way the subclasses implement `evaluateExpression` (the lower-order moments are computed the same way, with some syntax differences). I added functionality to avoid computing higher-order moments when they are not asked for.

The optimization you suggest for the duplicate computation between skewness and kurtosis has not yet been addressed. I believe the same duplication would occur for {{df.groupBy("key").agg(var("a"), avg("a"))}}, since both aggregates compute the average. We'll also have to keep an eye on the benchmark testing, per your comment below. Thanks for the feedback!

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
> Issue Type: New Feature
> Components: ML, SQL
> Reporter: Jihong MA
> Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
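The idea of skipping higher-order moments when they are not requested can be sketched in plain Python. This is only an illustration, not the code from the PR: the class names, the `max_order` attribute, and the `evaluate` method (standing in for `evaluateExpression`) are assumptions, and the variance shown is the population variance.

```python
import math


class CentralMomentAgg:
    """Hypothetical base aggregate: subclasses pick max_order and implement evaluate()."""

    max_order = 2

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = self.m3 = self.m4 = 0.0

    def update(self, x):
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Higher orders are updated first because they read the old m2 and m3.
        if self.max_order >= 4:
            self.m4 += (term1 * delta_n * delta_n * (n * n - 3 * n + 3)
                        + 6 * delta_n * delta_n * self.m2
                        - 4 * delta_n * self.m3)
        if self.max_order >= 3:
            self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def evaluate(self):
        raise NotImplementedError


class Variance(CentralMomentAgg):
    max_order = 2  # never touches m3 or m4

    def evaluate(self):
        return self.m2 / self.n  # population variance (assumption)


class Skewness(CentralMomentAgg):
    max_order = 3

    def evaluate(self):
        return math.sqrt(self.n) * self.m3 / self.m2 ** 1.5


class Kurtosis(CentralMomentAgg):
    max_order = 4

    def evaluate(self):
        return self.n * self.m4 / (self.m2 * self.m2) - 3.0  # excess kurtosis


def run(agg, values):
    """Feed every value to the aggregate and return it."""
    for v in values:
        agg.update(v)
    return agg
```

For example, `run(Variance(), values).evaluate()` only ever updates the count, mean, and second moment, while `Kurtosis` pays for all four.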
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945821#comment-14945821 ]

Apache Spark commented on SPARK-10641:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/9003

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
> Issue Type: New Feature
> Components: ML, SQL
> Reporter: Jihong MA
> Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945412#comment-14945412 ]

Xiangrui Meng commented on SPARK-10641:
---

Btw, I checked the implementation of StdDevAgg. I'm not sure we can get benefit from using expressions (and hence codegen). See https://issues.apache.org/jira/browse/SPARK-10953.

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
> Issue Type: New Feature
> Components: ML, SQL
> Reporter: Jihong MA
> Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945309#comment-14945309 ]

Xiangrui Meng commented on SPARK-10641:
---

If we want to implement the numerically stable version, we should refactor the StdDevAgg implementation to add moving third and fourth moments. StdDevAgg should then be renamed to CentralMomentAgg. In the future, we need to make sure that codegen doesn't include unnecessary branches if kurtosis and skewness are not requested by the user. Btw, there will be some room for optimization, e.g.

{code}
df.groupBy("key").agg(skewness("a"), kurtosis("a"))
{code}

will have duplicate computation.

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
> Issue Type: New Feature
> Components: ML, SQL
> Reporter: Jihong MA
> Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
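To make the duplicate-computation point concrete, here is a minimal plain-Python sketch (not Spark code) of the single-pass update from the linked Wikipedia section. The `MomentState` class and the helper function names are hypothetical. Both skewness and kurtosis are read from one shared state, so the input is scanned once; two independent aggregates would each maintain the same count, mean, and lower-order sums.

```python
import math


class MomentState:
    """Hypothetical shared state: count, mean, and central moment sums M2..M4."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.m3 = 0.0
        self.m4 = 0.0

    def update(self, x):
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Update the highest order first: each line reads the previous m2 / m3.
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2
                    - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1


def skewness(s):
    """Population skewness from the shared state."""
    return math.sqrt(s.n) * s.m3 / s.m2 ** 1.5


def kurtosis(s):
    """Excess kurtosis from the shared state."""
    return s.n * s.m4 / (s.m2 * s.m2) - 3.0


# One pass over the data serves both statistics.
state = MomentState()
for v in [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 15.0]:
    state.update(v)
skew, kurt = skewness(state), kurtosis(state)
```

The update order matters: the fourth-moment line uses the old third and second sums, so it has to run before they are overwritten.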
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945308#comment-14945308 ]

Xiangrui Meng commented on SPARK-10641:
---

If we want to implement the numerically stable version, we should refactor the StdDevAgg implementation to add moving third and fourth moments. StdDevAgg should then be renamed to CentralMomentAgg. In the future, we need to make sure that codegen doesn't include unnecessary branches if kurtosis and skewness are not requested by the user. Btw, there will be some room for optimization, e.g.

{code}
df.groupBy("key").agg(skewness("a"), kurtosis("a"))
{code}

will have duplicate computation.

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
> Issue Type: New Feature
> Components: ML, SQL
> Reporter: Jihong MA
> Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944485#comment-14944485 ]

Seth Hendrickson commented on SPARK-10641:
--

My apologies, I haven't been able to devote much time to this lately. To your point, one of the bigger decisions for this PR will be how to combine these functions with other aggregates, since online algorithms for higher-order statistical moments require the calculation of all the lower-order moments. I can have a WIP PR up by tomorrow, so we can get some discussion going. This PR will also be affected by several other ongoing PRs.

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
> Issue Type: New Feature
> Components: ML, SQL
> Reporter: Jihong MA
> Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
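The dependence on lower-order moments is visible in the single-pass update from the linked Wikipedia section, written out below. The notation is an assumption made for this note: $n$ is the count after including the new value $x$, $\mu_{n-1}$ is the previous mean, $\delta = x - \mu_{n-1}$, and $M_{k,n} = \sum_{i=1}^{n} (x_i - \mu_n)^k$.

```latex
\begin{aligned}
\mu_n   &= \mu_{n-1} + \frac{\delta}{n} \\
M_{2,n} &= M_{2,n-1} + \delta^2 \,\frac{n-1}{n} \\
M_{3,n} &= M_{3,n-1} + \delta^3 \,\frac{(n-1)(n-2)}{n^2}
           - \frac{3\,\delta\, M_{2,n-1}}{n} \\
M_{4,n} &= M_{4,n-1} + \delta^4 \,\frac{(n-1)(n^2-3n+3)}{n^3}
           + \frac{6\,\delta^2 M_{2,n-1}}{n^2}
           - \frac{4\,\delta\, M_{3,n-1}}{n}
\end{aligned}
```

The $M_3$ update reads $M_2$, and the $M_4$ update reads both $M_2$ and $M_3$, so a kurtosis aggregate already carries everything a variance or skewness aggregate needs.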
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944155#comment-14944155 ]

Xiangrui Meng commented on SPARK-10641:
---

Any updates?

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
> Issue Type: New Feature
> Components: ML, SQL
> Reporter: Jihong MA
> Assignee: Seth Hendrickson
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804724#comment-14804724 ]

Seth Hendrickson commented on SPARK-10641:
--

I'm working on this issue.

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
> Issue Type: New Feature
> Components: ML, SQL
> Reporter: Jihong MA
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics