[ https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan reassigned SPARK-40474:
-----------------------------------

    Assignee: Xiaonan Yang

> Correct CSV schema inference and data parsing behavior on columns with mixed
> dates and timestamps
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40474
>                 URL: https://issues.apache.org/jira/browse/SPARK-40474
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Xiaonan Yang
>            Assignee: Xiaonan Yang
>            Priority: Major
>             Fix For: 3.4.0
>
>
> In https://issues.apache.org/jira/browse/SPARK-39469, we introduced support
> for the date type in CSV schema inference. The schema inference behavior on
> date-time columns is now:
> * For a column containing only dates, we infer it as Date type
> * For a column containing only timestamps, we infer it as Timestamp type
> * For a column containing a mixture of dates and timestamps, we infer it as
> Timestamp type
> However, we found that we were too ambitious in the last scenario: supporting
> it introduced a lot of complexity in the code and raised many performance
> concerns. Thus, we want to simplify and correct the behavior for the last
> scenario as follows:
> * For a column containing a mixture of dates and timestamps
> ** If the user specifies a timestamp format, it will always be inferred as
> `StringType`
> ** If the user specifies no timestamp format, we will try inferring it as
> `TimestampType` if possible; otherwise it will be inferred as `StringType`
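
The proposed rules for a single column can be sketched in plain Python. This is only an illustrative model under stated assumptions, not Spark's implementation: the function name `infer_datetime_column`, the fixed `DATE_FORMAT`, and the use of `datetime.fromisoformat` as a stand-in for lenient timestamp parsing when no format is given are all hypothetical choices made for this example.

```python
from datetime import datetime
from typing import Iterable, Optional

# Hypothetical date pattern for this sketch; Spark uses its own patterns and parsers.
DATE_FORMAT = "%Y-%m-%d"


def _parses(value: str, fmt: str) -> bool:
    """Return True if `value` matches `fmt` exactly."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def _parses_leniently(value: str) -> bool:
    """Stand-in for lenient timestamp parsing when the user gave no format."""
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def infer_datetime_column(values: Iterable[str],
                          timestamp_format: Optional[str] = None) -> str:
    """Toy model of the inference rules described in the ticket."""
    values = list(values)
    if not values:
        # Nothing to infer from; this sketch simply falls back to string.
        return "StringType"

    # A column containing only dates is inferred as a date column.
    if all(_parses(v, DATE_FORMAT) for v in values):
        return "DateType"

    if timestamp_format is not None:
        # With a user-specified format, every value must match it.
        # A mixed column contains bare dates that do not, so it becomes a string.
        if all(_parses(v, timestamp_format) for v in values):
            return "TimestampType"
        return "StringType"

    # Without a user-specified format, try timestamps, else fall back to string.
    if all(_parses_leniently(v) for v in values):
        return "TimestampType"
    return "StringType"


mixed = ["2022-09-19", "2022-09-19T10:15:30"]
print(infer_datetime_column(mixed))
print(infer_datetime_column(mixed, timestamp_format="%Y-%m-%dT%H:%M:%S"))
```

In this model the only thing that changes the result for a mixed column is whether a timestamp format was supplied, which mirrors the two sub-cases listed in the ticket.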