[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chenhao Li updated SPARK-42635:
-------------------------------
Description:
1. When the time is close to a daylight saving time transition, the result may be discontinuous and not monotonic.

We currently have:
{code:java}
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------------+
|timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------------+
| 2011-03-13 03:59:59|
+------------------------------------------------------------------------+

scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------+
|timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------+
| 2011-03-13 03:00:00|
+------------------------------------------------------------------+
{code}
In the second query, adding one more second sets the time back one hour instead. Also, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to 2011-03-13 03:00:00, not 24 * 3600 seconds, due to the daylight saving time transition.

The root cause of the problem is that the Spark code at [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L797] wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day and time-in-day split before looking at the time zone.

2. Adding month, quarter, and year silently ignores Int overflow during unit conversion.

The root cause is here: [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L1254]. quantity is multiplied by 3 or MONTHS_PER_YEAR without checking for overflow. Note that we do have overflow checking when adding the amount to the timestamp, so the behavior is inconsistent.

This can cause counter-intuitive results like this:
{code:java}
scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
+------------------------------------------------------------------+
|timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
+------------------------------------------------------------------+
|1969-09-01 00:00:00|
+------------------------------------------------------------------+
{code}
3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond) silently ignores Long overflow during unit conversion.

This is similar to the previous problem:
{code:java}
scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
+-------------------------------------------------------------+
|timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
+-------------------------------------------------------------+
|-290308-12-22 15:58:10.448384|
+-------------------------------------------------------------+
{code}
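To make the daylight saving time issue concrete, here is a minimal standalone sketch using plain java.time (illustrative only; this is not Spark's DateTimeUtils code). It contrasts adding the amount as an absolute duration on the instant timeline, which stays monotonic across the 2011-03-13 America/Los_Angeles transition, with a calendar-day jump, which is the shape of result a "split into days first, apply the time zone later" approach produces.
{code:scala}
import java.time.{Duration, LocalDateTime, ZoneId}

// Illustration only (not Spark's code path): compare duration-based and
// calendar-day-based addition around the 2011-03-13 US DST transition.
object DstSketch {
  def main(args: Array[String]): Unit = {
    val zone  = ZoneId.of("America/Los_Angeles")
    val start = LocalDateTime.parse("2011-03-12T03:00:00").atZone(zone).toInstant

    // Duration-based: 24 * 3600 - 1 and 24 * 3600 seconds differ by exactly one second.
    println(start.plus(Duration.ofSeconds(24L * 3600 - 1)).atZone(zone)) // 03:59:59 PDT on 2011-03-13
    println(start.plus(Duration.ofSeconds(24L * 3600)).atZone(zone))     // 04:00:00 PDT, still monotonic

    // Calendar-day jump: "one day" here is only 23 * 3600 real seconds,
    // which matches the non-monotonic result shown above.
    println(start.atZone(zone).plusDays(1))                              // 03:00:00 PDT on 2011-03-13
  }
}
{code}
For the two overflow problems (items 2 and 3), here is a minimal sketch of what overflow-checked unit conversion could look like, using java.lang.Math exact arithmetic. The constant names and helper functions are hypothetical, not Spark's actual code, but the quantities are the ones from the examples above.
{code:scala}
// Illustrative only: the constants and helpers below are hypothetical.
object OverflowSketch {
  private val MonthsPerQuarter = 3
  private val MicrosPerDay     = 24L * 60 * 60 * 1000 * 1000 // 86_400_000_000

  // Int overflow check for the quarter -> month conversion (item 2).
  def quartersToMonths(quantity: Int): Int =
    Math.multiplyExact(quantity, MonthsPerQuarter)

  // Widen to Long first, then check Long overflow for the day -> microsecond
  // conversion (item 3).
  def daysToMicros(quantity: Int): Long =
    Math.multiplyExact(quantity.toLong, MicrosPerDay)

  def main(args: Array[String]): Unit = {
    // The quantities from the report; both conversions currently wrap around silently.
    try println(quartersToMonths(1431655764))
    catch { case e: ArithmeticException => println(s"quarter -> month: $e") }

    try println(daysToMicros(106751992))
    catch { case e: ArithmeticException => println(s"day -> micros: $e") }
  }
}
{code}
With checks of this kind, both example queries would presumably raise an overflow error, consistent with the overflow check that already exists when adding the converted amount to the timestamp, instead of silently wrapping to 1969-09-01 and to the year -290308.
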
was:
1. When the time is close to a daylight saving time transition, the result may be discontinuous and not monotonic.

As pointed out by me and @Utkarsh Agarwal in [https://github.com/databricks/runtime/pull/54936/files#r1118047445], the result is counter-intuitive when the time is close to a daylight saving time transition and the added amount is close to a multiple of days.

We currently have:
```
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------------+
|timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------------+
| 2011-03-13 03:59:59|
+------------------------------------------------------------------------+

scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------+
|timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------+
| 2011-03-13 03:00:00|
+------------------------------------------------------------------+
```
In the second query, adding one more second sets the time back one hour instead. Also, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to 2011-03-13 03:00:00, not 24 * 3600 seconds, due to the daylight saving time transition.

The root cause of the problem is that the Spark code at [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L797] wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day and time-in-day split before looking at the time zone.

2. Adding month, quarter, and year silently ignores Int overflow during unit conversion.

The root cause is here: [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L1254]. quantity is multiplied by 3 or MONTHS_PER_YEAR without checking for overflow. Note that we do have overflow checking when adding the amount to the timestamp, so the behavior is inconsistent.

This can cause counter-intuitive results like this:
```
scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
+------------------------------------------------------------------+
|timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
+------------------------------------------------------------------+
|1969-09-01 00:00:00|
+------------------------------------------------------------------+
```
3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond) silently ignores Long overflow during unit conversion.

This is similar to the previous problem:
```
scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
+-------------------------------------------------------------+
|timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
+-------------------------------------------------------------+
|-290308-12-22 15:58:10.448384|
+-------------------------------------------------------------+
```

> Several counter-intuitive behaviours in the TimestampAdd expression
> -------------------------------------------------------------------
>
>                 Key: SPARK-42635
>                 URL: https://issues.apache.org/jira/browse/SPARK-42635
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.3.0, 3.3.1, 3.3.2
>            Reporter: Chenhao Li
>            Priority: Major
>
> 1. When the time is close to a daylight saving time transition, the result may
> be discontinuous and not monotonic.
>
> We currently have:
>
> {code:java}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>
> scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------------+
> |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------------+
> | 2011-03-13 03:59:59|
> +------------------------------------------------------------------------+
>
> scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------+
> |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------+
> | 2011-03-13 03:00:00|
> +------------------------------------------------------------------+
> {code}
>
> In the second query, adding one more second sets the time back one hour instead.
> Also, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to 2011-03-13 03:00:00,
> not 24 * 3600 seconds, due to the daylight saving time transition.
>
> The root cause of the problem is that the Spark code at
> [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L797]
> wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day and
> time-in-day split before looking at the time zone.
>
> 2. Adding month, quarter, and year silently ignores Int overflow during unit
> conversion.
>
> The root cause is here:
> [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L1254].
> quantity is multiplied by 3 or MONTHS_PER_YEAR without checking for overflow.
> Note that we do have overflow checking when adding the amount to the timestamp,
> so the behavior is inconsistent.
>
> This can cause counter-intuitive results like this:
>
> {code:java}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
>
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
>
> This is similar to the previous problem:
>
> {code:java}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384|
> +-------------------------------------------------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org