[ https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192938#comment-16192938 ]
cold gin commented on SPARK-22201:
----------------------------------

Ok, and thank you, I appreciate your time and feedback. Having the numeric columns automatically pre-selected by default would make for a more robust API, in my opinion (i.e. no column list to supply by default). What you said about pandas having a parameter to opt in to string columns also seems to support defaulting to numeric columns only.

> Dataframe describe includes string columns
> ------------------------------------------
>
>                 Key: SPARK-22201
>                 URL: https://issues.apache.org/jira/browse/SPARK-22201
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: cold gin
>            Priority: Minor
>
> As per the API documentation, the default no-arg DataFrame describe()
> function should only include numeric column types, but it includes
> String types as well. This produces unusable statistical results (for example,
> max returns "V8903" for one of the string columns in my dataset), and it
> also leads to stack traces when you run show() on the DataFrame
> returned from describe().
> There also appear to be several issues related to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not match what the default API documentation says: only
> Int, Double, and other numeric columns should be included when generating the
> statistics.
> Perhaps this reveals the need for a new function that produces statistics
> meaningful for string columns, or else an additional parameter to describe()
> to filter certain column types in or out.
> In summary, the *default* describe() API behavior (no-arg behavior) should not
> include string columns.
> Note that boolean columns are correctly excluded by describe().

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
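For reference, the pandas behavior mentioned in the comment above can be illustrated with a small sketch: on a mixed-dtype DataFrame, pandas' describe() keeps only numeric columns by default, and strings must be opted in via the include parameter (column names here are made up for the example):

```python
import pandas as pd

# A mixed-dtype frame: one numeric column, one string column.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "code": ["V8903", "A1", "B2"],  # string codes, like the ones in the report
})

# Default: only the numeric column is summarized.
numeric_only = df.describe()
print(numeric_only.columns.tolist())  # ['price']

# Strings are included only when asked for explicitly.
everything = df.describe(include="all")
print(everything.columns.tolist())  # ['price', 'code']
```

This is the default the comment argues Spark's no-arg describe() should mirror: numeric columns pre-selected, with an explicit opt-in for other types.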