[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083595#comment-17083595 ]

Maxim Gekk commented on SPARK-31423:
------------------------------------

I have debugged this a little on Spark 2.4. While parsing from UTF8String, 
'1582-10-14' falls into this case:
https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2762-L2768
{code:java}
// The date is in a "missing" period.
if (!isLenient()) {
    throw new IllegalArgumentException("the specified date doesn't exist");
}
// Take the Julian date for compatibility, which
// will produce a Gregorian date.
fixedDate = jfd;
{code}
In strict (non-lenient) mode, the code at 
https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L517
would throw the exception:
{code}
throw new IllegalArgumentException("the specified date doesn't exist")
{code}
but we are in lenient mode, and in that case the Java 7 GregorianCalendar 
interprets the date specially:
{code}
// Take the Julian date for compatibility, which
// will produce a Gregorian date.
{code}
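
Here is a minimal sketch (mine, not from the ticket) that reproduces both modes with plain java.util.GregorianCalendar, assuming UTC:
{code:scala}
import java.text.SimpleDateFormat
import java.util.{GregorianCalendar, TimeZone}

val utc = TimeZone.getTimeZone("UTC")
val fmt = new SimpleDateFormat("yyyy-MM-dd")
fmt.setTimeZone(utc)

// Lenient (the default): the non-existent 1582-10-14 is taken as a Julian
// date, which maps to Gregorian 1582-10-24 (the 10-day shift in the report).
val lenient = new GregorianCalendar(utc)
lenient.clear()
lenient.set(1582, 9, 14) // Calendar months are 0-based: 9 = October
println(fmt.format(lenient.getTime)) // prints 1582-10-24

// Non-lenient: the same field values fail to resolve instead.
val strict = new GregorianCalendar(utc)
strict.clear()
strict.setLenient(false)
strict.set(1582, 9, 14)
// strict.getTime // throws IllegalArgumentException: the specified date doesn't exist
{code}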

The date '1582-10-14' doesn't exist in the hybrid calendar used by the Java 7 
time API, so it is questionable how such a date should be handled in that 
calendar.
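
To make that concrete, another small sketch (also mine, assuming UTC) shows that the proleptic Gregorian calendar used by Spark 3.x and the hybrid calendar put different labels on the same epoch day:
{code:scala}
import java.time.LocalDate

// Proleptic Gregorian calendar (java.time, used by Spark 3.x):
// 1582-10-14 is an ordinary, valid day.
val proleptic = LocalDate.of(1582, 10, 14)
println(proleptic.toEpochDay) // -141428 days since 1970-01-01

// Hybrid Julian/Gregorian calendar (java.util, used by Spark 2.4):
// the very same instant is labeled with its Julian date, 1582-10-04.
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd")
fmt.setTimeZone(java.util.TimeZone.getTimeZone("UTC"))
val millis = proleptic.toEpochDay * 86400000L // UTC midnight of that day
println(fmt.format(new java.util.Date(millis))) // prints 1582-10-04
{code}
The same epoch day thus carries two different labels depending on the calendar, which is the ambiguity behind the 10-day shifts described in this ticket.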

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-31423
>                 URL: https://issues.apache.org/jira/browse/SPARK-31423
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +-------------------+
> |                 ts|
> +-------------------+
> |1582-10-14 00:00:00|
> +-------------------+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off by 10 days
> +-------------------+
> |ts                 |
> +-------------------+
> |1582-10-24 00:00:00|
> +-------------------+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // reflects original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done in Spark 2.4
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and reality) 
> that this drift had already reached, the date was advanced so that 4 October 
> 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPs in this range consistently, 
> probably based on {{spark.sql.legacy.timeParserPolicy}} (or some other 
> config) rather than on the file format.


