[ 
https://issues.apache.org/jira/browse/SPARK-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6117:
-----------------------------------

    Assignee:     (was: Apache Spark)

> describe function for summary statistics
> ----------------------------------------
>
>                 Key: SPARK-6117
>                 URL: https://issues.apache.org/jira/browse/SPARK-6117
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: starter
>
> DataFrame.describe should return a DataFrame with summary statistics. 
> {code}
> def describe(cols: String*): DataFrame
> {code}
> If cols is empty, then run describe on all numeric columns.
> The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and 
> n + 1 columns. The 1st column is the name of the aggregate function, and the 
> next n columns are the numeric columns of interest in the input DataFrame.
> This is similar to pandas' describe(), but without the percentile rows, since
> accurate percentiles are too expensive to compute for Big Data:
> {code}
> In [19]: df.describe()
> Out[19]: 
>               A         B         C         D
> count  6.000000  6.000000  6.000000  6.000000
> mean   0.073711 -0.431125 -0.687758 -0.233103
> std    0.843157  0.922818  0.779887  0.973118
> min   -0.861849 -2.104569 -1.509059 -1.135632
> max    1.212112  0.567020  0.276232  1.071804
> {code}
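
As an illustration of the layout requested above (5 rows, n + 1 columns), here is a minimal plain-Python sketch. It is not Spark code: the dict-of-lists input, the helper name `describe`, and the `summary` label for the first column are assumptions made for the example. It also assumes the sample standard deviation (n - 1 denominator), as pandas uses by default, since the issue does not say which variant is intended, and it assumes every described column holds at least two values.

```python
import statistics
from numbers import Number


def describe(columns, *cols):
    """Summarise numeric columns of a dict mapping column name -> list of values.

    The dict is a hypothetical stand-in for a DataFrame.
    Returns (header, rows): header has n + 1 names, rows has 5 entries.
    """
    # If no columns are named, fall back to every column whose values are all numeric.
    names = list(cols) or [
        name
        for name, values in columns.items()
        if all(isinstance(v, Number) and not isinstance(v, bool) for v in values)
    ]
    aggregates = [
        ("count", lambda xs: float(len(xs))),
        ("mean", statistics.mean),
        ("stddev", statistics.stdev),  # assumption: sample standard deviation (n - 1)
        ("min", min),
        ("max", max),
    ]
    # First column holds the aggregate name; "summary" is an assumed label.
    header = ["summary"] + names
    rows = [[label] + [fn(columns[n]) for n in names] for label, fn in aggregates]
    return header, rows


if __name__ == "__main__":
    data = {
        "A": [1.0, 2.0, 3.0, 4.0],
        "B": [10, 20, 30, 40],
        "name": ["w", "x", "y", "z"],  # non-numeric, skipped when cols is empty
    }
    header, rows = describe(data)
    print(header)
    for row in rows:
        print(row)
```

Calling `describe(data, "B")` restricts the result to the named column, mirroring the `cols: String*` parameter in the proposed signature.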



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

