[ https://issues.apache.org/jira/browse/SPARK-45424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-45424:
-----------------------------------
    Labels: pull-request-available  (was: )

> Regression in CSV schema inference when timestamps do not match specified timestampFormat
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-45424
>                 URL: https://issues.apache.org/jira/browse/SPARK-45424
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Andy Grove
>            Priority: Major
>              Labels: pull-request-available
>
> There is a regression in Spark 3.5.0 when inferring the schema of CSV files
> containing timestamps, where a column will be inferred as a timestamp even if
> the contents do not match the specified timestampFormat.
> *Test Data*
> I have the following CSV file:
> {code:java}
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> {code}
> *Spark 3.4.0 Behavior (correct)*
> In Spark 3.4.0, if I specify the correct timestamp format, then the schema is
> inferred as timestamp:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> If I specify an incompatible timestampFormat, then the schema is inferred as
> string:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("inferSchema", true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
> *Spark 3.5.0*
> In Spark 3.5.0, the column will be inferred as timestamp even if the data
> does not match the specified timestampFormat.
> {code:java}
> scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("inferSchema", true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> Reading the DataFrame then results in an error:
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
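
The parse error quoted in the report can be reproduced outside Spark with plain java.time. The sketch below is only an illustration, not Spark's inference or parsing code: `TimestampFormatCheck`, `errorIndex`, and `matches` are hypothetical names, and it assumes the timestampFormat pattern is handed to `DateTimeFormatter.ofPattern` unchanged. It shows why the sample value is rejected by the pattern without fractional seconds: the first 19 characters match, and the trailing `.138` is left unparsed.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of Spark: checks a value against a pattern
// using plain java.time, which requires the whole input to be consumed.
class TimestampFormatCheck {

    // Returns -1 if `text` fully matches `pattern`; otherwise returns the
    // error index reported by DateTimeParseException.
    static int errorIndex(String pattern, String text) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);
        try {
            LocalDateTime.parse(text, formatter);
            return -1;
        } catch (DateTimeParseException e) {
            return e.getErrorIndex();
        }
    }

    // True only when the whole string parses under the pattern.
    static boolean matches(String pattern, String text) {
        return errorIndex(pattern, text) == -1;
    }

    public static void main(String[] args) {
        String value = "2884-06-24T02:45:51.138";
        // Pattern with fractional seconds consumes the full value.
        System.out.println(matches("yyyy-MM-dd'T'HH:mm:ss.SSS", value)); // true
        // Pattern without fractional seconds stops before ".138".
        System.out.println(errorIndex("yyyy-MM-dd'T'HH:mm:ss", value));  // 19
    }
}
```

A strict check of this kind is consistent with the Spark 3.4.0 behavior described in the report, where the column falls back to string when the pattern does not cover the whole value.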