[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904437#comment-14904437 ] Sabyasachi Nayak commented on SPARK-10602: -- Thanks Seth. I wanted to implement skewness and kurtosis in scala.but you mentioned scala style will fail.Please let me know when it is ready. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900991#comment-14900991 ] Seth Hendrickson commented on SPARK-10602: -- My branch is here: [SPARK-10641|https://github.com/sethah/spark/tree/SPARK-10641], the implementation is mostly [here|https://github.com/sethah/spark/blob/SPARK-10641/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala] Note that I wrote a few tests but they aren't passing currently (also scalastyle will fail) as this is a WIP. The update rules are based on descriptions [here|https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics]. I also made a pass at implementing a numerically stable update for skewness and kurtosis following the [Kahan update|https://en.wikipedia.org/wiki/Kahan_summation_algorithm]. The implementation follows the algorithms described [here|http://researcher.watson.ibm.com/researcher/files/us-ytian/stability.pdf]. I am not sure that they are working and I haven't been able to find a great way to test for numerical stability but I'll try to get that worked out soon. Also note that the legacy way of implementing aggregates is still required, but I have simply issued placeholders for the {{AggregateFunction1}} implementations for now. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900166#comment-14900166 ] Sabyasachi Nayak commented on SPARK-10602: -- Hey Seth,Can you pls share the working versions of single pass algos for skewness and kurtosis what you have? Thanks, Sabs > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876644#comment-14876644 ] Seth Hendrickson commented on SPARK-10602: -- Hey [~josephkb], right now I only have bandwidth for skewness and kurtosis support [SPARK-10641|https://issues.apache.org/jira/browse/SPARK-10641]. Could you assign that one to me? Others should feel free to work on related Jiras that fall under this umbrella. Thanks! > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790859#comment-14790859 ] Joseph K. Bradley commented on SPARK-10602: --- Yeah, JIRA only allows 2 levels of subtasks (a long-time annoyance of mine!). I'd recommend linking here using "contains." I'll fix the link for now. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790816#comment-14790816 ] Jihong MA commented on SPARK-10602: --- I go ahead/ created SPARK-10641, since this JIRA is not listed as umbrella, couldn't link to this JIRA directly instead linked to SPARK-10384. @Joseph, can you assign SPARK-10641 to Seth? and help fix the link, Thanks! > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
[ https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790713#comment-14790713 ] Seth Hendrickson commented on SPARK-10602: -- Right now I have working versions of single pass algos for skewness and kurtosis. I'm still writing tests and need to work on performance profiling, but I just wanted to make sure we don't duplicate any work. I'll post back with more details soon. > Univariate statistics as UDAFs: single-pass continuous stats > > > Key: SPARK-10602 > URL: https://issues.apache.org/jira/browse/SPARK-10602 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Joseph K. Bradley > > See parent JIRA for more details. This subtask covers statistics for > continuous values requiring a single pass over the data, such as min and max. > This JIRA is an umbrella. For individual stats, please create and link a new > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org