[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-17 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085585#comment-17085585
 ] 

Wenchen Fan commented on SPARK-31423:
-

I'm closing it as "not a bug". This happens because we "rebase" the datetime 
values for the calendar changes. There are some dates exist in one calendar but 
not another. What we can do is to move to the next valid date. Eventually every 
place should use the standard calendar so we won't have this problem.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-15 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084308#comment-17084308
 ] 

Maxim Gekk commented on SPARK-31423:


[~bersprockets] I think we should take the next valid date for any not-existed 
dates, see the linked PR.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-15 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084288#comment-17084288
 ] 

Bruce Robbins commented on SPARK-31423:
---

Thanks. It seems we can either

- close this as "not a bug", or
- I can file an ORC Jira (later this week, after I fiddle with the library a 
bit) and mark this as blocked on the ORC Jira.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-15 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084234#comment-17084234
 ] 

Wenchen Fan commented on SPARK-31423:
-

I hope the ORC community can figure this out and switch to/support the standard 
proleptic Gregorian calendar. One thing we can do is to check the behavior of 
Hive 3.x, as Hive also switches to the proleptic Gregorian calendar and should 
have the same issue.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-15 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084218#comment-17084218
 ] 

Bruce Robbins commented on SPARK-31423:
---

OK, so this is a case of a limitation of the ORC library, and there is not much 
we can do about it in Spark (other than throw an user-overridable exception 
when this happens).

Therefore, some users might see a case like this:
{noformat}
val df = sql("select cast('1582-10-14' as DATE) dt")
df.write.mode("overwrite").parquet("/tmp/dateparquet")
df.write.mode("overwrite").format("avro").save("/tmp/dateavro")
df.write.mode("overwrite").orc("/tmp/dateorc")

val dfParquet = spark.read.parquet("/tmp/dateparquet")
val dfAvro = spark.read.format("avro").load("/tmp/dateavro")
val dfOrc = spark.read.orc("/tmp/dateorc")

scala> 

scala> dfParquet.join(dfAvro, "dt").count // can join to avro
res4: Long = 1

scala> dfParquet.join(dfOrc, "dt").count // doesn't find it in orc
res5: Long = 0

scala> 
{noformat}

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-15 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083828#comment-17083828
 ] 

Wenchen Fan commented on SPARK-31423:
-

We probably have a bug about picking the next valid date, but the behavior 
itself is expected: Spark always pick the next valid date
{code}
scala> sql("select date'1990-9-31'").show
+-+
|DATE '1990-10-01'|
+-+
|   1990-10-01|
+-+
{code}

It's a bit weird that this date is valid in Spark but invalid in ORC because 
the calendar is different. I'm OK to fail for this case, with a config to allow 
users to write it anyway by picking the next valid date.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-15 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083822#comment-17083822
 ] 

Wenchen Fan commented on SPARK-31423:
-

The ORC file format spec doesn't specify the calendar, but the ORC library 
(reader and writer) uses java 7 Date/Timestamp which indicates the hybrid 
Julian calendar.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-14 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083611#comment-17083611
 ] 

Bruce Robbins commented on SPARK-31423:
---

{quote}It is questionable how to handle the date in such calendar.{quote}
Sorry if I am asking a dumb question. Are you saying that ORC stores only in 
the hybrid Julian calendar?

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-14 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083595#comment-17083595
 ] 

Maxim Gekk commented on SPARK-31423:


I have debugged this slightly on Spark 2.4, so, '1582-10-14' falls to the case 
while parsing from UTF8String:
https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2762-L2768
{code:java}
// The date is in a "missing" period.
if (!isLenient()) {
throw new IllegalArgumentException("the specified date 
doesn't exist");
}
// Take the Julian date for compatibility, which
// will produce a Gregorian date.
fixedDate = jfd;
{code}
In the strong mode, the code 
https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L517
 would throw the exception:
{code}
throw new IllegalArgumentException("the specified date doesn't exist")
{code}
but we are in the "weak" mode, in this way Java 7 GregorianCalendar interprets 
the date especially:
{code}
// Take the Julian date for compatibility, which
 // will produce a Gregorian date.
{code}

The date '1582-10-14' doesn't exist in the hybrid calendar used by Java 7 time 
API. It is questionable how to handle the date in such calendar. 

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and 

[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-14 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083314#comment-17083314
 ] 

Maxim Gekk commented on SPARK-31423:


I am working on the issue.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-13 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082498#comment-17082498
 ] 

Bruce Robbins commented on SPARK-31423:
---

[~cloud_fan] 
{quote}FYI this is the behavior of Spark 2.4
{quote}
Yes, I noted that in my description. What I mean is that in Spark 3.x (and 
without any legacy config touched), only ORC demonstrates this behavior. CAST, 
and the Parquet and Avro file formats do not demonstrate this behavior.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-13 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082228#comment-17082228
 ] 

Wenchen Fan commented on SPARK-31423:
-

FYI this is the behavior of Spark 2.4:
```
scala> val df = sql("select cast('1582-10-14' as DATE) dt")
df: org.apache.spark.sql.DataFrame = [dt: date]

scala> df.show
+--+
|dt|
+--+
|1582-10-24|
+--+


scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")

scala> spark.read.orc("/tmp/funny_orc_date").show
+--+
|dt|
+--+
|1582-10-24|
+--+
```

The result is wrong at the very beginning.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-12 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082051#comment-17082051
 ] 

Maxim Gekk commented on SPARK-31423:


This is intentional behavior because ORC format assumes the hybrid calendar 
(Julian + Gregorian) but Parquet and Avro assume Proleptic Gregorian calendar. 
See https://issues.apache.org/jira/browse/SPARK-30951

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082039#comment-17082039
 ] 

Hyukjin Kwon commented on SPARK-31423:
--

cc [~maxgekk], [~cloud_fan] FYI.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +---+
> | ts|
> +---+
> |1582-10-14 00:00:00|
> +---+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +---+
> |ts |
> +---+
> |1582-10-24 00:00:00|
> +---+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +--+
> |dt|
> +--+
> |1582-10-14|
> +--+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +--+
> |dt|
> +--+
> |1582-10-24|
> +--+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org