[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085585#comment-17085585 ] Wenchen Fan commented on SPARK-31423:

I'm closing this as "not a bug". This happens because we "rebase" the datetime values for the calendar change. Some dates exist in one calendar but not in the other. What we can do is move to the next valid date. Eventually every place should use the standard calendar, so we won't have this problem.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-31423
>                 URL: https://issues.apache.org/jira/browse/SPARK-31423
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and TIMESTAMPs are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
>
> scala> df.show // seems fine
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
>
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
>
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> {noformat}
> ORC has the same issue with TIMESTAMPs:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
>
> scala> df.show // seems fine
> +-------------------+
> |                 ts|
> +-------------------+
> |1582-10-14 00:00:00|
> +-------------------+
>
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
>
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off by 10 days
> +-------------------+
> |ts                 |
> +-------------------+
> |1582-10-24 00:00:00|
> +-------------------+
> {noformat}
> However, when written to Parquet or Avro, DATEs and TIMESTAMPs in this range do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
>
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
>
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
>
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
>
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
>
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // reflects original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how Spark 2.4 handles DATEs and TIMESTAMPs in general (and also how Spark 3.x handles them when {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done in Spark 2.4
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and reality)[Note 2] that this drift had already reached, the date was advanced so that 4 October 1582 was followed by 15 October 1582{quote}
> Spark 3.x should treat DATEs and TIMESTAMPs in this range consistently, and probably based on spark.sql.legacy.timeParserPolicy (or some other config) rather than file format.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
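The "rebase" behavior described above comes from the gap between the proleptic Gregorian calendar (used by java.time and Spark 3.x) and the hybrid Julian/Gregorian calendar (used by java.util.GregorianCalendar and Spark 2.4). A minimal JDK-only sketch of the 10-day shift; the class and method names here are illustrative, not Spark code:

```java
import java.text.SimpleDateFormat;
import java.util.GregorianCalendar;

public class RebaseDemo {
    // What the legacy hybrid Julian/Gregorian calendar makes of a date that
    // falls in the 1582-10-05..1582-10-14 cutover gap. GregorianCalendar is
    // lenient by default: the nonexistent date is reinterpreted as the Julian
    // date, which lands 10 days later in Gregorian terms.
    static String hybridInterpretation(int year, int month, int day) {
        GregorianCalendar hybrid = new GregorianCalendar();
        hybrid.clear();
        hybrid.set(year, month - 1, day); // Calendar months are 0-based
        return new SimpleDateFormat("yyyy-MM-dd").format(hybrid.getTime());
    }

    public static void main(String[] args) {
        // java.time uses the proleptic Gregorian calendar, so the date exists:
        System.out.println(java.time.LocalDate.of(1582, 10, 14)); // 1582-10-14
        // The hybrid calendar shifts it by 10 days, matching the ORC round trip:
        System.out.println(hybridInterpretation(1582, 10, 14));   // 1582-10-24
    }
}
```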
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084308#comment-17084308 ] Maxim Gekk commented on SPARK-31423:

[~bersprockets] I think we should take the next valid date for any non-existent dates, see the linked PR.
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084288#comment-17084288 ] Bruce Robbins commented on SPARK-31423:

Thanks. It seems we can either:
- close this as "not a bug", or
- I can file an ORC Jira (later this week, after I fiddle with the library a bit) and mark this as blocked on the ORC Jira.
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084234#comment-17084234 ] Wenchen Fan commented on SPARK-31423:

I hope the ORC community can figure this out and switch to/support the standard proleptic Gregorian calendar. One thing we can do is to check the behavior of Hive 3.x, as Hive also switches to the proleptic Gregorian calendar and should have the same issue.
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084218#comment-17084218 ] Bruce Robbins commented on SPARK-31423:

OK, so this is a case of a limitation of the ORC library, and there is not much we can do about it in Spark (other than throw a user-overridable exception when this happens). Therefore, some users might see a case like this:
{noformat}
val df = sql("select cast('1582-10-14' as DATE) dt")
df.write.mode("overwrite").parquet("/tmp/dateparquet")
df.write.mode("overwrite").format("avro").save("/tmp/dateavro")
df.write.mode("overwrite").orc("/tmp/dateorc")
val dfParquet = spark.read.parquet("/tmp/dateparquet")
val dfAvro = spark.read.format("avro").load("/tmp/dateavro")
val dfOrc = spark.read.orc("/tmp/dateorc")

scala> dfParquet.join(dfAvro, "dt").count // can join to avro
res4: Long = 1

scala> dfParquet.join(dfOrc, "dt").count // doesn't find it in orc
res5: Long = 0
{noformat}
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083828#comment-17083828 ] Wenchen Fan commented on SPARK-31423:

We probably have a bug in picking the next valid date, but the behavior itself is expected: Spark always picks the next valid date.
{code}
scala> sql("select date'1990-9-31'").show
+-----------------+
|DATE '1990-10-01'|
+-----------------+
|       1990-10-01|
+-----------------+
{code}
It's a bit weird that this date is valid in Spark but invalid in ORC because the calendars are different. I'm OK with failing in this case, with a config that allows users to write it anyway by picking the next valid date.
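The next-valid-date normalization described in the comment above mirrors what a lenient java.util.GregorianCalendar does with an overflowing day-of-month field. A small sketch (the class and method names are hypothetical, for illustration only):

```java
import java.text.SimpleDateFormat;
import java.util.GregorianCalendar;

public class NextValidDateDemo {
    // Lenient field resolution rolls an out-of-range day-of-month forward:
    // September 31 becomes October 1, the same "pick the next valid date"
    // behavior the Spark SQL parser shows for date'1990-9-31'.
    static String normalize(int year, int month, int day) {
        GregorianCalendar cal = new GregorianCalendar(); // lenient by default
        cal.clear();
        cal.set(year, month - 1, day); // Calendar months are 0-based
        return new SimpleDateFormat("yyyy-MM-dd").format(cal.getTime());
    }

    public static void main(String[] args) {
        System.out.println(normalize(1990, 9, 31)); // 1990-10-01
        System.out.println(normalize(1990, 9, 30)); // 1990-09-30
    }
}
```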
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083822#comment-17083822 ] Wenchen Fan commented on SPARK-31423:

The ORC file format spec doesn't specify the calendar, but the ORC library (reader and writer) uses the Java 7 Date/Timestamp classes, which imply the hybrid Julian calendar.
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083611#comment-17083611 ] Bruce Robbins commented on SPARK-31423:

{quote}It is questionable how to handle the date in such calendar.{quote}
Sorry if I am asking a dumb question. Are you saying that ORC stores dates only in the hybrid Julian calendar?
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083595#comment-17083595 ] Maxim Gekk commented on SPARK-31423: I have debugged this slightly on Spark 2.4, so, '1582-10-14' falls to the case while parsing from UTF8String: https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2762-L2768 {code:java} // The date is in a "missing" period. if (!isLenient()) { throw new IllegalArgumentException("the specified date doesn't exist"); } // Take the Julian date for compatibility, which // will produce a Gregorian date. fixedDate = jfd; {code} In the strong mode, the code https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L517 would throw the exception: {code} throw new IllegalArgumentException("the specified date doesn't exist") {code} but we are in the "weak" mode, in this way Java 7 GregorianCalendar interprets the date especially: {code} // Take the Julian date for compatibility, which // will produce a Gregorian date. {code} The date '1582-10-14' doesn't exist in the hybrid calendar used by Java 7 time API. It is questionable how to handle the date in such calendar. > DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC > -- > > Key: SPARK-31423 > URL: https://issues.apache.org/jira/browse/SPARK-31423 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Priority: Major > > There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and > TIMESTAMPS are changed when stored in ORC. The value is off by 10 days. 
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083314#comment-17083314 ] Maxim Gekk commented on SPARK-31423:

I am working on the issue.
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082498#comment-17082498 ] Bruce Robbins commented on SPARK-31423:

[~cloud_fan]
{quote}FYI this is the behavior of Spark 2.4
{quote}
Yes, I noted that in my description. What I mean is that in Spark 3.x (and without any legacy config touched), only ORC demonstrates this behavior. CAST, and the Parquet and Avro file formats, do not demonstrate this behavior.
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082228#comment-17082228 ] Wenchen Fan commented on SPARK-31423:

FYI this is the behavior of Spark 2.4:
{noformat}
scala> val df = sql("select cast('1582-10-14' as DATE) dt")
df: org.apache.spark.sql.DataFrame = [dt: date]

scala> df.show
+----------+
|        dt|
+----------+
|1582-10-24|
+----------+

scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")

scala> spark.read.orc("/tmp/funny_orc_date").show
+----------+
|        dt|
+----------+
|1582-10-24|
+----------+
{noformat}
The result is wrong at the very beginning.
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082051#comment-17082051 ] Maxim Gekk commented on SPARK-31423:

This is intentional behavior: the ORC data source assumes the hybrid calendar (Julian + Gregorian), but Parquet and Avro assume the Proleptic Gregorian calendar. See https://issues.apache.org/jira/browse/SPARK-30951
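The disagreement between the two calendars can be seen in plain Java, with no file format involved: {{java.time}} uses the Proleptic Gregorian calendar, while {{java.util.GregorianCalendar}} defaults to the hybrid Julian/Gregorian one. A minimal sketch (the 10-day offset assumes the default cutover of 1582-10-15):

```java
import java.time.LocalDate;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CalendarDiff {
    public static void main(String[] args) {
        // Proleptic Gregorian (java.time): 1582-10-14 is a valid date.
        LocalDate proleptic = LocalDate.of(1582, 10, 14);

        // Hybrid Julian/Gregorian (java.util): the same field values
        // resolve to an instant 10 days later.
        GregorianCalendar hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        hybrid.clear();
        hybrid.set(1582, Calendar.OCTOBER, 14);
        long hybridEpochDay = Math.floorDiv(hybrid.getTimeInMillis(), 86_400_000L);

        System.out.println(hybridEpochDay - proleptic.toEpochDay()); // 10
    }
}
```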
[ https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082039#comment-17082039 ] Hyukjin Kwon commented on SPARK-31423:

cc [~maxgekk], [~cloud_fan] FYI.