[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cold gin updated SPARK-22201:
-----------------------------
    Description: 
According to the API documentation, the default no-arg DataFrame describe() 
function should only include numeric column types, but it includes String 
columns as well. This produces unusable statistical results (for example, max 
returns "V8903" for one of the string columns in my dataset), and it also leads 
to stack traces when you run show() on the DataFrame returned from 
describe().

There also appear to be several issues related to this:

https://issues.apache.org/jira/browse/SPARK-16468

https://issues.apache.org/jira/browse/SPARK-16429

However, SPARK-16429 is inconsistent with what the default API documentation 
says: only numeric columns (Int, Double, etc.) should be included when 
generating the statistics.

Perhaps this reveals the need for a new function to produce stats that make 
sense only for string columns, or else an additional parameter to describe() to 
filter in/out certain column types? 

In summary, the *default* describe() API behavior (the no-arg behavior) should 
not include string columns. Note that boolean columns are correctly excluded by 
describe().
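
As a possible workaround until the default changes, the numeric columns can be 
picked out by hand and passed to describe() explicitly. The sketch below is 
plain Python and assumes the list of (column name, type string) pairs that 
PySpark's DataFrame.dtypes returns; numeric_columns is a hypothetical helper 
written for illustration, not part of Spark, and the sample schema is made up.

```python
# Sketch only: choose numeric columns from a PySpark-style dtypes list.
# `numeric_columns` is a hypothetical helper, not a Spark API.

# Simple type strings Spark SQL uses for its fixed-size numeric types.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def is_numeric(type_string):
    """True for numeric type strings, including decimal(p,s)."""
    return type_string in NUMERIC_TYPES or type_string.startswith("decimal")


def numeric_columns(dtypes):
    """Return the names of columns whose type string is numeric.

    `dtypes` is a list of (column_name, type_string) pairs.
    """
    return [name for name, type_string in dtypes if is_numeric(type_string)]


# Made-up schema standing in for df.dtypes.
example_dtypes = [
    ("id", "bigint"),
    ("code", "string"),
    ("price", "double"),
    ("active", "boolean"),
    ("amount", "decimal(10,2)"),
]

selected = numeric_columns(example_dtypes)
print(selected)

# With a real PySpark DataFrame `df` (assumed, not created here):
#   df.describe(*numeric_columns(df.dtypes)).show()
```

String and boolean columns are dropped here, so max would no longer be 
computed over values such as "V8903".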

  was:
As per the api documentation, the default no-arg Dataframe describe() function 
should only include numerical column types, but it is including String types as 
well. This creates unusable statistical results (for example, max returns 
"V8903" for one of the string columns in my dataset).

There also appears to be several related issues to this:

https://issues.apache.org/jira/browse/SPARK-16468

https://issues.apache.org/jira/browse/SPARK-16429

But SPARK-16429 does not make sense with what the default api says, and only 
Int, Double, etc (numeric) columns should be included when generating the 
statistics. 

Perhaps this reveals the need for a new function to produce stats that make 
sense only for string columns, or else an additional parameter to describe() to 
filter in/out certain column types? 

In summary, the *default* describe api behavior (no arg behavior) should not 
include string columns.


> Dataframe describe includes string columns
> ------------------------------------------
>
>                 Key: SPARK-22201
>                 URL: https://issues.apache.org/jira/browse/SPARK-22201
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: cold gin
>
> According to the API documentation, the default no-arg DataFrame describe() 
> function should only include numeric column types, but it includes String 
> columns as well. This produces unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and it 
> also leads to stack traces when you run show() on the DataFrame returned 
> from describe().
> There also appear to be several issues related to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> However, SPARK-16429 is inconsistent with what the default API documentation 
> says: only numeric columns (Int, Double, etc.) should be included when 
> generating the statistics.
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe() API behavior (the no-arg behavior) 
> should not include string columns. Note that boolean columns are correctly 
> excluded by describe().



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

