Reynold Xin created SPARK-6117:
----------------------------------

             Summary: describe function for summary statistics
                 Key: SPARK-6117
                 URL: https://issues.apache.org/jira/browse/SPARK-6117
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
            Reporter: Reynold Xin


DataFrame.describe should return a DataFrame with summary statistics. 

{code}
  def describe(cols: String*): DataFrame
{code}

If cols is empty, describe runs on all numeric columns.

The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and
n + 1 columns, where n is the number of columns being described. The first
column holds the name of the aggregate function, and the remaining n columns
hold the statistics for the numeric columns of interest in the input DataFrame.
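
The layout described above can be sketched in plain Python, without Spark. This is only an illustration of the proposed output shape, not the implementation. The list-of-dicts input, the "summary" label for the first column, and the use of the sample standard deviation (n - 1 denominator, as in Pandas) are all assumptions that the issue does not specify.

```python
import math
from numbers import Real


def describe(rows, *cols):
    """Return 5 summary rows (count, mean, stddev, min, max) for `cols`.

    `rows` is a non-empty list of dicts standing in for a DataFrame
    (an assumption made for this sketch only).
    """
    # With no columns named, fall back to every column whose values are all numeric.
    if not cols:
        cols = [
            c for c in rows[0]
            if all(isinstance(r[c], Real) and not isinstance(r[c], bool) for r in rows)
        ]

    stats = {}
    for c in cols:
        values = [r[c] for r in rows]
        n = len(values)
        mean = sum(values) / n
        # Assumption: sample variance (divide by n - 1); NaN for a single row.
        var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else float("nan")
        stats[c] = [n, mean, math.sqrt(var), min(values), max(values)]

    # One output row per aggregate; the first column names the aggregate
    # ("summary" is a hypothetical column name).
    names = ["count", "mean", "stddev", "min", "max"]
    return [
        {"summary": name, **{c: stats[c][i] for c in cols}}
        for i, name in enumerate(names)
    ]


if __name__ == "__main__":
    data = [
        {"name": "a", "x": 1, "y": 2.0},
        {"name": "b", "x": 2, "y": 4.0},
        {"name": "c", "x": 3, "y": 6.0},
    ]
    for row in describe(data):
        print(row)
```

Calling describe(data) skips the non-numeric "name" column and yields n + 1 = 3 columns ("summary", "x", "y"), while describe(data, "y") restricts the result to the single named column.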

This is similar to describe in Pandas, but without percentiles, since accurate
percentiles are too expensive to compute for Big Data:
{code}
In [19]: df.describe()
Out[19]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
max    1.212112  0.567020  0.276232  1.071804
{code}




