[ https://issues.apache.org/jira/browse/SPARK-45424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772365#comment-17772365 ]

Andy Grove commented on SPARK-45424:
------------------------------------

The regression seems to have been introduced in
https://issues.apache.org/jira/browse/SPARK-39280 and/or
https://issues.apache.org/jira/browse/SPARK-39281.

Commits:

[https://github.com/apache/spark/commit/b1c0d599ba32a4562ae1697e3f488264f1d03c76]

[https://github.com/apache/spark/commit/3192bbd29585607d43d0819c6c2d3ac00180261a]

 

[~fanjia] Do you understand why this behavior has changed?
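For context, the read-time failure quoted below ("unparsed text found at index 19") is exactly what a strict, whole-string java.time parse produces for this value, while a prefix-style parse of the same value succeeds. The sketch below is plain java.time run in the spark-shell, not Spark's actual inference code path, but it may illustrate the kind of lenient-vs-strict parsing difference that could explain why inference now accepts a value that fails at read time:

{code:java}
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.text.ParsePosition

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
val value = "2884-06-24T02:45:51.138"

// Strict, whole-string parse: throws DateTimeParseException,
// "unparsed text found at index 19" -- the same error seen at read time.
// LocalDateTime.parse(value, fmt)

// Prefix-style parse: stops after the first 19 characters and "succeeds",
// leaving the ".138" suffix unconsumed.
val pos = new ParsePosition(0)
fmt.parseUnresolved(value, pos)
println(pos.getIndex)  // 19
{code}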

> Regression in CSV schema inference when timestamps do not match specified 
> timestampFormat
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-45424
>                 URL: https://issues.apache.org/jira/browse/SPARK-45424
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Andy Grove
>            Priority: Major
>
> There is a regression in Spark 3.5.0 when inferring the schema of CSV files 
> containing timestamps, where a column will be inferred as a timestamp even if 
> the contents do not match the specified timestampFormat.
> *Test Data*
> I have the following CSV file:
> {code:java}
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> {code}
> *Spark 3.4.0 Behavior (correct)*
> In Spark 3.4.0, if I specify the correct timestamp format, then the schema is 
> inferred as timestamp:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> If I specify an incompatible timestampFormat, then the schema is inferred as 
> string:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("inferSchema", true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
> *Spark 3.5.0*
> In Spark 3.5.0, the column will be inferred as timestamp even if the data 
> does not match the specified timestampFormat.
> {code:java}
> scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("inferSchema", true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> Reading the DataFrame then results in an error:
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
> {code}
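One possible mitigation in the meantime (not from the original report, just a suggestion): skip inference for the affected file by supplying an explicit schema, then parse the column with the known format afterwards:

{code:java}
// Workaround sketch: declare the column as STRING so no timestamp type is
// inferred, then convert it explicitly once the real format is known.
import org.apache.spark.sql.functions.to_timestamp

val raw = spark.read
  .schema("_c0 STRING")
  .csv("/tmp/timestamps.csv")

val parsed = raw.withColumn("_c0", to_timestamp(raw("_c0"), "yyyy-MM-dd'T'HH:mm:ss.SSS"))
parsed.printSchema()  // _c0: timestamp
{code}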


