[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082498#comment-17082498 ]
Bruce Robbins commented on SPARK-31423:
---------------------------------------

[~cloud_fan]
{quote}FYI this is the behavior of Spark 2.4{quote}
Yes, I noted that in my description. What I mean is that in Spark 3.x (with no legacy config touched), only ORC demonstrates this behavior. CAST, and the Parquet and Avro file formats, do not demonstrate this behavior.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-31423
>                 URL: https://issues.apache.org/jira/browse/SPARK-31423
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and
> TIMESTAMPs are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
>
> scala> df.show // seems fine
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
>
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
>
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> {noformat}
> ORC has the same issue with TIMESTAMPs:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
>
> scala> df.show // seems fine
> +-------------------+
> |                 ts|
> +-------------------+
> |1582-10-14 00:00:00|
> +-------------------+
>
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
>
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off by 10 days
> +-------------------+
> |ts                 |
> +-------------------+
> |1582-10-24 00:00:00|
> +-------------------+
> {noformat}
> However, when written to Parquet or
> Avro, DATEs and TIMESTAMPs in this range do not change:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
>
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
>
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
>
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
>
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
>
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // reflects original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x
> works with DATEs and TIMESTAMPs in general when
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4,
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done in Spark 2.4
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and
> reality)[Note 2] that this drift had already reached, the date was advanced
> so that 4 October 1582 was followed by 15 October 1582{quote}
> Spark 3.x should treat DATEs and TIMESTAMPs in this range consistently,
> probably based on {{spark.sql.legacy.timeParserPolicy}} (or some other
> config) rather than on file format.
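As background for the 10-day shift: it can be reproduced on the JVM without Spark at all. The sketch below (plain Java, not Spark's actual code path) resolves the field values 1582-10-14 in the hybrid Julian/Gregorian calendar used by {{java.util.GregorianCalendar}} and {{java.sql.Date}} (the calendar Spark 2.4 relied on), then reads the resulting instant back in the proleptic Gregorian calendar used by {{java.time}} and Spark 3.x. Because 1582-10-05 through 1582-10-14 fall in the Gregorian cutover gap, the hybrid calendar resolves them leniently as Julian dates, which land 10 days later in proleptic Gregorian; the class and method names are illustrative only.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CutoverShift {
    // Resolve (year, 0-based month, day) in the hybrid Julian/Gregorian
    // calendar, then reinterpret the resulting instant in the proleptic
    // Gregorian calendar that java.time uses.
    static LocalDate hybridToProleptic(int year, int month0, int day) {
        GregorianCalendar hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        hybrid.clear();
        hybrid.set(year, month0, day);
        return hybrid.toInstant().atZone(ZoneOffset.UTC).toLocalDate();
    }

    public static void main(String[] args) {
        // 1582-10-14 is in the cutover gap: the hybrid calendar resolves it
        // leniently as Julian Oct 14, which is proleptic Gregorian Oct 24 --
        // the same value Spark 2.4's cast produces above.
        System.out.println(hybridToProleptic(1582, Calendar.OCTOBER, 14)); // 1582-10-24
        // The cutover day itself is a valid Gregorian date and round-trips.
        System.out.println(hybridToProleptic(1582, Calendar.OCTOBER, 15)); // 1582-10-15
    }
}
```

This is consistent with the ORC read-back above: a date written through one calendar system and read back through the other picks up the 10-day offset for this range.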
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org