[ https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng resolved SPARK-10384.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

> Univariate statistics as UDAFs
> ------------------------------
>
>                 Key: SPARK-10384
>                 URL: https://issues.apache.org/jira/browse/SPARK-10384
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, SQL
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>             Fix For: 1.6.0
>
>
> It would be nice to define univariate statistics as UDAFs. This JIRA
> discusses general implementation and tracks the process of subtasks.
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness,
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might
> depend on mean and count. It would be nice if SQL can optimize the sequence
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * -range- (SPARK-10861) - won't add
> * -mean-
> * sample variance (SPARK-9296)
> * population variance (SPARK-9296)
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median (SPARK-6761) -> 1.7.0
> * approximate quantiles (SPARK-6761) -> 1.7.0
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936) -> 1.7.0
> * -number of categories- (This is COUNT DISTINCT in SQL.)
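
The remark in the issue that variance depends on mean and count (and that duplicate computation could be avoided) can be illustrated with a single mergeable buffer of running moments. The sketch below is plain Python, not Spark code, and is not taken from the Spark implementation; the names `Moments` and `aggregate` are hypothetical. It assumes the standard one-pass update and pairwise merge formulas for central moment sums, and shows how mean, both variances, skewness, and excess kurtosis can all be read from one shared buffer, which is roughly the shape a UDAF takes (update per row, merge per partition, evaluate at the end).

```python
import math
from dataclasses import dataclass


@dataclass
class Moments:
    """Hypothetical aggregation buffer: count, mean, and the sums of the
    2nd, 3rd, and 4th powers of deviations from the mean (m2, m3, m4)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0
    m3: float = 0.0
    m4: float = 0.0

    def update(self, x: float) -> None:
        """Fold one value into the buffer (one-pass update)."""
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Higher moments must be updated before the lower ones they read.
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def merge(self, other: "Moments") -> "Moments":
        """Combine two partial buffers into a new one (pairwise merge)."""
        if self.n == 0:
            return Moments(other.n, other.mean, other.m2, other.m3, other.m4)
        if other.n == 0:
            return Moments(self.n, self.mean, self.m2, self.m3, self.m4)
        na, nb = self.n, other.n
        n = na + nb
        d = other.mean - self.mean
        mean = self.mean + d * nb / n
        m2 = self.m2 + other.m2 + d * d * na * nb / n
        m3 = (self.m3 + other.m3
              + d ** 3 * na * nb * (na - nb) / n ** 2
              + 3 * d * (na * other.m2 - nb * self.m2) / n)
        m4 = (self.m4 + other.m4
              + d ** 4 * na * nb * (na * na - na * nb + nb * nb) / n ** 3
              + 6 * d * d * (na * na * other.m2 + nb * nb * self.m2) / n ** 2
              + 4 * d * (na * other.m3 - nb * self.m3) / n)
        return Moments(n, mean, m2, m3, m4)

    # Every statistic below reads the same buffer; nothing is recomputed.
    def pop_variance(self) -> float:
        return self.m2 / self.n if self.n > 0 else float("nan")

    def sample_variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

    def skewness(self) -> float:
        if self.n == 0 or self.m2 == 0:
            return float("nan")
        return math.sqrt(self.n) * self.m3 / self.m2 ** 1.5

    def kurtosis(self) -> float:
        """Excess kurtosis (a normal distribution gives 0)."""
        if self.n == 0 or self.m2 == 0:
            return float("nan")
        return self.n * self.m4 / (self.m2 * self.m2) - 3.0


def aggregate(partitions):
    """Build one buffer per partition, then merge the partial buffers."""
    total = Moments()
    for part in partitions:
        partial = Moments()
        for x in part:
            partial.update(x)
        total = total.merge(partial)
    return total


if __name__ == "__main__":
    stats = aggregate([[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]])
    print(stats.mean, stats.sample_variance(), stats.skewness(), stats.kurtosis())
```

Keeping one buffer per group means a query that asks for several of these statistics on the same column only needs one pass over the data. Whether a SQL optimizer actually shares the buffer across separate aggregate expressions is a separate question that this sketch does not address.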