[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191973#comment-16191973
 ] 

cold gin commented on SPARK-22201:
----------------------------------

OK, I see what you mean: it is Dataset, not DataFrame. Thank you for pointing 
out the Scala doc. But I still don't think it makes sense that the two are 
commingled (that the describe() API selects *both* numeric and string columns). 
It seems like only the numeric columns should be selected for the statistical 
values returned. I know that some statistics, like counts, are useful for 
strings, but "mean" and "stddev" are not meaningful for strings. That is my 
point about separating the APIs, or perhaps changing the default no-arg 
behavior and adding string columns only if requested. Also, there does not 
seem to be a straightforward API to select just the numeric or just the string 
columns (i.e., to filter by column *type*), which would make things a lot 
easier. I am having to use drop() on the string columns, which is messy IMO.
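
To illustrate the kind of type-based filter meant here, below is a minimal 
sketch in Python. It assumes the (name, type) pairs that PySpark exposes as 
DataFrame.dtypes; the helper numeric_columns, the set NUMERIC_TYPE_NAMES, and 
the sample column names are hypothetical and not part of Spark. It is only one 
possible workaround, not a built-in API.

```python
# Simple type names that PySpark reports in DataFrame.dtypes for numeric
# columns. Decimal types carry precision and scale, e.g. "decimal(10,2)",
# so they are matched by prefix below.
NUMERIC_TYPE_NAMES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Return the names of numeric columns.

    `dtypes` is a list of (column_name, type_name) pairs, in the shape of
    PySpark's DataFrame.dtypes. This helper is hypothetical, not a Spark API.
    """
    return [
        name
        for name, type_name in dtypes
        if type_name in NUMERIC_TYPE_NAMES or type_name.startswith("decimal")
    ]


# Stand-in for df.dtypes so the sketch runs without a Spark session.
example_dtypes = [
    ("id", "bigint"),
    ("code", "string"),
    ("price", "decimal(10,2)"),
    ("active", "boolean"),
    ("score", "double"),
]

print(numeric_columns(example_dtypes))  # ['id', 'price', 'score']

# With a real DataFrame `df` (assumed to exist), the result could be passed
# to describe() so that only numeric columns are summarized:
#   df.describe(*numeric_columns(df.dtypes)).show()
```

Passing explicit column names to describe() avoids dropping the string 
columns one by one, though it still has to be done by hand for each dataset.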
 

> Dataframe describe includes string columns
> ------------------------------------------
>
>                 Key: SPARK-22201
>                 URL: https://issues.apache.org/jira/browse/SPARK-22201
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: cold gin
>
> As per the API documentation, the default no-arg DataFrame describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and it 
> also leads to stack traces when you run show() on the DataFrame returned 
> from describe().
> There also appear to be several issues related to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not agree with what the default API says: only 
> Int, Double, etc. (numeric) columns should be included when generating the 
> statistics. 
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter certain column types in or out? 
> In summary, the *default* describe API behavior (no-arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe().



