[ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10384:
----------------------------------
    Description: 
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
the general implementation and tracks the progress of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, they can be combined freely in DataFrame 
aggregations, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL could optimize the sequence 
to avoid duplicate computation.
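
As a rough illustration of the dependency noted above, here is a minimal sketch in plain Python (not Spark code, and not the implementation proposed in this JIRA). It assumes a hypothetical aggregation buffer, `MomentBuffer`, that keeps count, mean, and the sum of squared deviations, so that mean, both variances, and both standard deviations can be read from one pass over the data. The update step uses Welford's online algorithm and the merge step uses the pairwise formula of Chan et al.; the class and method names are made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class MomentBuffer:
    """Hypothetical shared buffer: count, running mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's online update for a single new value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "MomentBuffer") -> None:
        # Combine two partial buffers, e.g. from two partitions.
        if other.n == 0:
            return
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total

    def variance(self, sample: bool = True) -> float:
        # Sample variance divides by n - 1, population variance by n.
        denom = self.n - 1 if sample else self.n
        return self.m2 / denom if denom > 0 else float("nan")

    def stddev(self, sample: bool = True) -> float:
        return math.sqrt(self.variance(sample))


# Two partial aggregates over [1, 2] and [3, 4], merged into one buffer.
left, right = MomentBuffer(), MomentBuffer()
for value in [1.0, 2.0]:
    left.update(value)
for value in [3.0, 4.0]:
    right.update(value)
left.merge(right)
print(left.n, left.mean, left.variance(sample=False))
```

Every statistic here is derived from the same three fields, so requesting several of them for one column needs only a single buffer. That is the kind of sharing the optimizer would ideally discover on its own.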

Univariate statistics for continuous variables:
* ~~min~~
* max
* range
* sample variance
* population variance
* sample standard deviation
* population standard deviation
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories

  was:
It would be nice to define univariate statistics as UDAFs. This JIRA discusses 
the general implementation and tracks the progress of subtasks. Univariate 
statistics include:

continuous: min, max, range, variance, stddev, median, quantiles, skewness, and 
kurtosis
categorical: number of categories, mode

If we define them as UDAFs, they can be combined freely in DataFrame 
aggregations, e.g.,

{code}
df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
{code}

Note that some univariate statistics depend on others, e.g., variance might 
depend on mean and count. It would be nice if SQL could optimize the sequence 
to avoid duplicate computation.

Univariate statistics for continuous variables:
* min
* max
* range
* sample variance
* population variance
* sample standard deviation
* population standard deviation
* skewness
* kurtosis
* approximate median
* approximate quantiles

Univariate statistics for categorical variables:
* mode: https://en.wikipedia.org/wiki/Mode_(statistics)
* number of categories


> Univariate statistics as UDAFs
> ------------------------------
>
>                 Key: SPARK-10384
>                 URL: https://issues.apache.org/jira/browse/SPARK-10384
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, SQL
>            Reporter: Xiangrui Meng
>            Assignee: Burak Yavuz
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses the general implementation and tracks the progress of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, they can be combined freely in DataFrame 
> aggregations, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might 
> depend on mean and count. It would be nice if SQL could optimize the sequence 
> to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * ~~min~~
> * max
> * range
> * sample variance
> * population variance
> * sample standard deviation
> * population standard deviation
> * skewness
> * kurtosis
> * approximate median
> * approximate quantiles
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics)
> * number of categories



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

