[ https://issues.apache.org/jira/browse/SPARK-39731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-39731.
---------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37147
[https://github.com/apache/spark/pull/37147]

> Correctness issue when parsing dates with yyyyMMdd format in CSV and JSON
> -------------------------------------------------------------------------
>
>                 Key: SPARK-39731
>                 URL: https://issues.apache.org/jira/browse/SPARK-39731
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Ivan Sadikov
>            Assignee: Ivan Sadikov
>            Priority: Major
>             Fix For: 3.4.0
>
>
> In Spark 3.x, when reading CSV data like this:
> {code:java}
> name,mydate
> 1,2020011
> 2,20201203{code}
> and specifying the date pattern as "yyyyMMdd", dates are not parsed correctly
> with the CORRECTED time parser policy.
> For example,
> {code:java}
> val df = spark.read.schema("name string, mydate date").option("dateFormat", "yyyyMMdd").option("header", "true").csv("file:/tmp/test.csv")
> df.show(false){code}
> Returns:
> {code:java}
> +----+--------------+
> |name|mydate        |
> +----+--------------+
> |1   |+2020011-01-01|
> |2   |2020-12-03    |
> +----+--------------+ {code}
> and it used to return null instead of the invalid date in Spark 3.2 or below.
>
> The issue appears to be caused by this PR:
> [https://github.com/apache/spark/pull/32959].
>
> A similar issue can be observed in the JSON data source.
> test.json
> {code:java}
> {"date": "2020011"}
> {"date": "20201203"} {code}
>
> Running commands
> {code:java}
> val df = spark.read.schema("date date").option("dateFormat", "yyyyMMdd").json("file:/tmp/test.json")
> df.show(false) {code}
> returns
> {code:java}
> +--------------+
> |date          |
> +--------------+
> |+2020011-01-01|
> |2020-12-03    |
> +--------------+{code}
> but before the patch linked in the description it used to show:
> {code:java}
> +----------+
> |date      |
> +----------+
> |7500-08-09|
> |2020-12-03|
> +----------+{code}
> which is strange either way.
> I will try to address it in the PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
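
For readers who want to see why null is the expected result for the 7-digit value, here is a minimal sketch outside Spark. It assumes plain java.time with the same "yyyyMMdd" pattern. The object name StrictDateCheck and the helper parseWithPattern are hypothetical, and this is not Spark's internal formatter or the code from either pull request. It only shows that the pattern itself rejects "2020011" and accepts "20201203".

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical illustration only: plain java.time, not Spark's date formatter.
object StrictDateCheck {
  // Same pattern string as the dateFormat option in the report.
  private val formatter = DateTimeFormatter.ofPattern("yyyyMMdd")

  // Returns Some(date) when the whole input matches the pattern, None otherwise.
  // None plays the role of the null that a date column would hold.
  def parseWithPattern(s: String): Option[LocalDate] =
    Try(LocalDate.parse(s, formatter)).toOption

  def main(args: Array[String]): Unit = {
    // The two values from test.csv / test.json in the report.
    Seq("2020011", "20201203").foreach { s =>
      println(s"$s -> ${parseWithPattern(s)}")
    }
    // 2020011 -> None
    // 20201203 -> Some(2020-12-03)
  }
}
```

Under this assumption, the 7-digit input fails because the pattern needs four year digits followed by two month digits and two day digits, so a reader that honors only the user-supplied pattern would produce null for that row rather than +2020011-01-01.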