[ https://issues.apache.org/jira/browse/SPARK-40289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jianbang Xian updated SPARK-40289:
----------------------------------
Description:

I created an ORC file with the following code.
{code:java}
val data = Seq(
  ("", "2022-01-32"), // invalid date, read back as null
  ("", "9808-02-30"), // invalid date, read back as 9808-02-29
  ("", "2022-06-31")  // invalid date, read back as 2022-06-30
)
val cols = Seq("str", "date_str")
val df = spark.createDataFrame(data).toDF(cols: _*).repartition(1)
df.printSchema()
df.show(100)
df.write.mode("overwrite").orc("/tmp/orc/data.orc")
{code}
Note that all three strings are invalid dates.

Then I read the file back with a date schema:
{code:java}
scala> val df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc")
scala> df.show()
+----------+
|  date_str|
+----------+
|      null|
|9808-02-29|
|2022-06-30|
+----------+
{code}
Why is `2022-01-32` converted to `null`, while `9808-02-30` is converted to `9808-02-29`? Intuitively, all three are invalid dates, so the reader should return three nulls. Is this a bug or a feature?

*Background*
* I am working on the project [https://github.com/NVIDIA/spark-rapids].
* I am adding a feature to read ORC files into cuDF (a GPU in-memory DataFrame format).
* I need to match the CPU ORC-reading behavior; otherwise, spark-rapids users will be surprised by differing results.
* That is why I want to understand why this happens.
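For what it's worth, the clamping pattern in the output matches java.time's SMART resolver style (the JDK default for {{DateTimeFormatter.ofPattern}}): a day-of-month outside 1..31 fails to resolve at all, while a day within 1..31 that exceeds the month's length is adjusted down to the last valid day of that month (9808 is a leap year, so February clamps to the 29th). I have not traced this through Spark's or ORC's actual conversion code, so treat the snippet below (with a hypothetical helper name {{smartParse}}) only as a plain-Java sketch of the JDK behavior that reproduces the three observed results, not as the confirmed code path:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

public class SmartDateParse {
    // Hypothetical helper for this sketch: parse with SMART resolution and
    // return the ISO date string, or null when the value cannot be resolved.
    static String smartParse(String s) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("uuuu-MM-dd")
                .withResolverStyle(ResolverStyle.SMART); // SMART is also the JDK default
        try {
            return LocalDate.parse(s, fmt).toString();
        } catch (DateTimeParseException e) {
            return null; // day-of-month outside 1..31, or otherwise unresolvable
        }
    }

    public static void main(String[] args) {
        // 32 is outside the hard 1..31 range -> resolution fails -> null
        System.out.println(smartParse("2022-01-32"));
        // 9808 is a leap year; day 30 is clamped to February's last day, the 29th
        System.out.println(smartParse("9808-02-30"));
        // June has 30 days; day 31 is clamped to the 30th
        System.out.println(smartParse("2022-06-31"));
    }
}
```

Running this prints null, 9808-02-29, and 2022-06-30, i.e. the same three results the ORC read produced, which suggests the reader resolves the strings leniently rather than rejecting every invalid date.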
> The result is strange when casting string to date in ORC reading via Schema Evolution
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-40289
>                 URL: https://issues.apache.org/jira/browse/SPARK-40289
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>    Affects Versions: 3.1.1
>         Environment: * Ubuntu 18.04 LTS
> * Spark 3.1.1
>            Reporter: Jianbang Xian
>            Priority: Minor
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org