[ https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10384:
----------------------------------
    Description:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation and tracks the process of subtasks. Univariate statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL can optimize the sequence to avoid duplicate computation.

Univariate statistics for continuous variables:

* -min-
* -max-
* range (SPARK-10861)
* -mean-
* sample variance
* population variance
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:

* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses general implementation and tracks the process of subtasks. Univariate statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and kurtosis
categorical: number of categories, mode

If we define them as UDAFs, it would be quite flexible to use them with DataFrames, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might depend on mean and count. It would be nice if SQL can optimize the sequence to avoid duplicate computation.

Univariate statistics for continuous variables:

* -min-
* -max-
* range
* -mean-
* sample variance
* population variance
* -sample standard deviation- (SPARK-6458)
* -population standard deviation- (SPARK-6458)
* skewness (SPARK-10641)
* kurtosis (SPARK-10641)
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:

* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> ------------------------------
>
>                 Key: SPARK-10384
>                 URL: https://issues.apache.org/jira/browse/SPARK-10384
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, SQL
>            Reporter: Xiangrui Meng
>            Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA
> discusses general implementation and tracks the process of subtasks.
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness,
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might
> depend on mean and count. It would be nice if SQL can optimize the sequence
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * range (SPARK-10861)
> * -mean-
> * sample variance
> * population variance
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories
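
The description notes that variance might depend on mean and count, and that duplicate computation should be avoided. One way to picture this is a single shared aggregation buffer from which several statistics are read. The following is a minimal sketch in plain Python, not the Spark UDAF API: the names MomentsBuffer and aggregate are hypothetical, and the choice of Welford's online update with a pairwise merge is an assumption made for illustration, not something the issue or its subtasks specify.

```python
import math
from dataclasses import dataclass


@dataclass
class MomentsBuffer:
    """Hypothetical shared buffer: count, mean, and M2
    (the sum of squared deviations from the running mean)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's online update for a single new value.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "MomentsBuffer") -> "MomentsBuffer":
        # Combine two partial buffers, e.g. from two partitions.
        n = self.count + other.count
        if n == 0:
            return MomentsBuffer()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return MomentsBuffer(n, mean, m2)

    # All statistics below are derived from the same three fields,
    # so the input is scanned only once.
    def population_variance(self) -> float:
        return self.m2 / self.count if self.count > 0 else float("nan")

    def sample_variance(self) -> float:
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")

    def population_stddev(self) -> float:
        return math.sqrt(self.population_variance())

    def sample_stddev(self) -> float:
        return math.sqrt(self.sample_variance())


def aggregate(values) -> MomentsBuffer:
    """Fold an iterable of numbers into one buffer."""
    buf = MomentsBuffer()
    for v in values:
        buf.update(v)
    return buf


if __name__ == "__main__":
    # Two "partitions" aggregated separately, then merged.
    merged = aggregate([1.0, 2.0]).merge(aggregate([3.0, 4.0]))
    print(merged.mean, merged.sample_variance(), merged.population_variance())
```

Because mean, both variances, and both standard deviations all read the same count, mean, and M2, asking for several of them at once needs only one pass over the data. Skewness and kurtosis would need additional higher-order fields in the buffer, which this sketch leaves out.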