[jira] [Created] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

Simon (Jira) Thu, 26 Nov 2020 02:55:35 -0800

Simon created SPARK-33571:
-----------------------------

             Summary: Handling of hybrid to proleptic calendar when reading and 
writing Parquet data not working correctly
                 Key: SPARK-33571
                 URL: https://issues.apache.org/jira/browse/SPARK-33571
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.1, 3.0.0
            Reporter: Simon



The handling of old dates written with older Spark versions (<2.4.6) using the 
hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
correctly.

>From what I understand it should work like this:
 * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
1900-01-01T00:00:00Z
 * Only applies when reading or writing parquet files
 * When reading parquet files written with Spark < 2.4.6 which contain dates or 
timestamps before the above mentioned moments in time a `SparkUpgradeException` 
should be raised informing the user to choose either `LEGACY` or `CORRECTED` 
for the `datetimeRebaseModeInRead`
 * When reading parquet files written with Spark < 2.4.6 which contain dates or 
timestamps before the above mentioned moments in time and 
`datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
show the same values in Spark 3.0.1. with for example `df.show()` as they did 
in Spark 2.4.5
 * When reading parquet files written with Spark < 2.4.6 which contain dates or 
timestamps before the above mentioned moments in time and 
`datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
should show different values in Spark 3.0.1. with for example `df.show()` as 
they did in Spark 2.4.5
 * When writing parqet files with Spark > 3.0.0 which contain dates or 
timestamps before the above mentioned moment in time a `SparkUpgradeException` 
should be raised informing the user to choose either `LEGACY` or `CORRECTED` 
for the `datetimeRebaseModeInWrite`

First of all I'm not 100% sure all of this is correct. I've been unable to find 
any clear documentation on the expected behavior. The understanding I have was 
pieced together from the mailing list 
([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)]
 the blog post linked there and looking at the Spark code.

>From our testing we're seeing several issues:
 * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
that contains fields of type `TimestampType` which contain timestamps before 
the above mentioned moments in time without `datetimeRebaseModeInRead` set 
doesn't raise the `SparkUpgradeException`, it succeeds without any changes to 
the resulting dataframe compares to that dataframe in Spark 2.4.5
 * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5. 
that contains fields of type `TimestampType` or `DateType` which contain dates 
or timestamps before the above mentioned moments in time with 
`datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
dataframe as when using `CORRECTED`, so it seems like no rebasing is happening.

I've made some scripts to help with testing/show the behavior, it uses pyspark 
2.4.5, 2.4.6 and 3.0.1. You can find them here 
[https://github.com/simonvanderveldt/spark3-rebasemode-issue.] I'll post the 
outputs in a comment below as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

Reply via email to