[ https://issues.apache.org/jira/browse/SPARK-39731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Sadikov updated SPARK-39731:
---------------------------------
Description:

In Spark 3.x, when reading CSV data like this:
{code:java}
name,mydate
1,2020011
2,20201203{code}
and specifying the date pattern "yyyyMMdd", dates are not parsed correctly under the CORRECTED time parser policy.

For example,
{code:java}
val df = spark.read
  .schema("name string, mydate date")
  .option("dateFormat", "yyyyMMdd")
  .option("header", "true")
  .csv("file:/tmp/test.csv")
df.show(false){code}
returns:
{code:java}
+----+--------------+
|name|mydate        |
+----+--------------+
|1   |+2020011-01-01|
|2   |2020-12-03    |
+----+--------------+ {code}
whereas Spark 3.2 and below returned null instead of the invalid date. The issue appears to be caused by this PR: [https://github.com/apache/spark/pull/32959].

A similar issue can be observed in the JSON data source.

test.json
{code:java}
{"date": "2020011"}
{"date": "20201203"}
{code}
Running
{code:java}
val df = spark.read
  .schema("date date")
  .option("dateFormat", "yyyyMMdd")
  .json("file:/tmp/test.json")
df.show(false){code}
returns
{code:java}
+--------------+
|date          |
+--------------+
|+2020011-01-01|
|2020-12-03    |
+--------------+{code}
but before the patch linked above it used to show:
{code:java}
+----------+
|date      |
+----------+
|7500-08-09|
|2020-12-03|
+----------+{code}
which is strange either way. I will try to address it in the PR.
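For context, the null results in Spark 3.2 and below match what strict java.time parsing does with a basic yyyyMMdd-style format. The sketch below is independent of Spark and does not reproduce Spark's actual formatter construction; it uses java.time's built-in strict `DateTimeFormatter.BASIC_ISO_DATE`, and the `parseOrNull` helper is a made-up name that only mirrors the permissive-mode behavior of turning unparseable values into null:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class StrictDateParse {
    // BASIC_ISO_DATE is java.time's strict basic-format date parser:
    // four-digit year, two-digit month, two-digit day, no leniency —
    // effectively what a strict "yyyyMMdd" pattern amounts to.
    static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE;

    // Hypothetical helper mirroring Spark's permissive behavior of
    // returning null for values that fail to parse.
    static LocalDate parseOrNull(String s) {
        try {
            return LocalDate.parse(s, FMT);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrNull("20201203")); // 2020-12-03
        System.out.println(parseOrNull("2020011"));  // null: 7 digits cannot fill yyyy + MM + dd
    }
}
```

The `+2020011-01-01` output above appears consistent with a variable-width year field greedily consuming all seven digits, with month and day then falling back to defaults of 1.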
> Correctness issue when parsing dates with yyyyMMdd format in CSV and JSON
> -------------------------------------------------------------------------
>
>                 Key: SPARK-39731
>                 URL: https://issues.apache.org/jira/browse/SPARK-39731
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Ivan Sadikov
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)