[ https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chenhao Li updated SPARK-42635:
-------------------------------
Description:
1. When the time is close to a daylight saving time transition, the result may be discontinuous and not monotonic.

We currently have:
{code:java}
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------------+
|timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------------+
| 2011-03-13 03:59:59|
+------------------------------------------------------------------------+

scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------+
|timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------+
| 2011-03-13 03:00:00|
+------------------------------------------------------------------+
{code}
In the second query, adding one more second sets the time back one hour instead. Also, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to 2011-03-13 03:00:00, not 24 * 3600 seconds, due to the daylight saving time transition.

The root cause of the problem is that the Spark code at [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L797] wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day and time-in-day split before looking at the time zone.

2. Adding month, quarter, and year silently ignores Int overflow during unit conversion.

The root cause is here: [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L1254]. quantity is multiplied by 3 or MONTHS_PER_YEAR without checking for overflow. Note that we do have overflow checking when adding the amount to the timestamp, so the behavior is inconsistent.

This can cause counter-intuitive results like this:
{code:java}
scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
+------------------------------------------------------------------+
|timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
+------------------------------------------------------------------+
|1969-09-01 00:00:00|
+------------------------------------------------------------------+
{code}
3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond) silently ignores Long overflow during unit conversion.

This is similar to the previous problem:
{code:java}
scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
+-------------------------------------------------------------+
|timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
+-------------------------------------------------------------+
|-290308-12-22 15:58:10.448384|
+-------------------------------------------------------------+
{code}
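To make the daylight saving time issue concrete, here is a minimal standalone sketch using plain java.time (illustrative only; this is not Spark's DateTimeUtils code). It contrasts adding the amount as an absolute duration on the instant timeline, which stays monotonic across the 2011-03-13 America/Los_Angeles transition, with a calendar-day jump, which is the shape of result a "split into days first, apply the time zone later" approach produces.
{code:scala}
import java.time.{Duration, LocalDateTime, ZoneId}

// Illustration only (not Spark's code path): compare duration-based and
// calendar-day-based addition around the 2011-03-13 US DST transition.
object DstSketch {
  def main(args: Array[String]): Unit = {
    val zone  = ZoneId.of("America/Los_Angeles")
    val start = LocalDateTime.parse("2011-03-12T03:00:00").atZone(zone).toInstant

    // Duration-based: 24 * 3600 - 1 and 24 * 3600 seconds differ by exactly one second.
    println(start.plus(Duration.ofSeconds(24L * 3600 - 1)).atZone(zone)) // 03:59:59 PDT on 2011-03-13
    println(start.plus(Duration.ofSeconds(24L * 3600)).atZone(zone))     // 04:00:00 PDT, still monotonic

    // Calendar-day jump: "one day" here is only 23 * 3600 real seconds,
    // which matches the non-monotonic result shown above.
    println(start.atZone(zone).plusDays(1))                              // 03:00:00 PDT on 2011-03-13
  }
}
{code}
For the two overflow problems (items 2 and 3), here is a minimal sketch of what overflow-checked unit conversion could look like, using java.lang.Math exact arithmetic. The constant names and helper functions are hypothetical, not Spark's actual code, but the quantities are the ones from the examples above.
{code:scala}
// Illustrative only: the constants and helpers below are hypothetical.
object OverflowSketch {
  private val MonthsPerQuarter = 3
  private val MicrosPerDay     = 24L * 60 * 60 * 1000 * 1000 // 86_400_000_000

  // Int overflow check for the quarter -> month conversion (item 2).
  def quartersToMonths(quantity: Int): Int =
    Math.multiplyExact(quantity, MonthsPerQuarter)

  // Widen to Long first, then check Long overflow for the day -> microsecond
  // conversion (item 3).
  def daysToMicros(quantity: Int): Long =
    Math.multiplyExact(quantity.toLong, MicrosPerDay)

  def main(args: Array[String]): Unit = {
    // The quantities from the report; both conversions currently wrap around silently.
    try println(quartersToMonths(1431655764))
    catch { case e: ArithmeticException => println(s"quarter -> month: $e") }

    try println(daysToMicros(106751992))
    catch { case e: ArithmeticException => println(s"day -> micros: $e") }
  }
}
{code}
With checks of this kind, both example queries would presumably raise an overflow error, consistent with the overflow check that already exists when adding the converted amount to the timestamp, instead of silently wrapping to 1969-09-01 and to the year -290308.
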
was:
1. When the time is close to a daylight saving time transition, the result may be discontinuous and not monotonic.

As pointed out by me and @Utkarsh Agarwal in [https://github.com/databricks/runtime/pull/54936/files#r1118047445], the result is counter-intuitive when the time is close to a daylight saving time transition and the added amount is close to a multiple of days.

We currently have:
```
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------------+
|timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------------+
| 2011-03-13 03:59:59|
+------------------------------------------------------------------------+

scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
+------------------------------------------------------------------+
|timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
+------------------------------------------------------------------+
| 2011-03-13 03:00:00|
+------------------------------------------------------------------+
```
In the second query, adding one more second sets the time back one hour instead. Also, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to 2011-03-13 03:00:00, not 24 * 3600 seconds, due to the daylight saving time transition.

The root cause of the problem is that the Spark code at [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L797] wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day and time-in-day split before looking at the time zone.

2. Adding month, quarter, and year silently ignores Int overflow during unit conversion.

The root cause is here: [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L1254]. quantity is multiplied by 3 or MONTHS_PER_YEAR without checking for overflow. Note that we do have overflow checking when adding the amount to the timestamp, so the behavior is inconsistent.

This can cause counter-intuitive results like this:
```
scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
+------------------------------------------------------------------+
|timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
+------------------------------------------------------------------+
|1969-09-01 00:00:00|
+------------------------------------------------------------------+
```
3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond) silently ignores Long overflow during unit conversion.

This is similar to the previous problem:
```
scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
+-------------------------------------------------------------+
|timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
+-------------------------------------------------------------+
|-290308-12-22 15:58:10.448384|
+-------------------------------------------------------------+
```

> Several counter-intuitive behaviours in the TimestampAdd expression
> -------------------------------------------------------------------
>
>                 Key: SPARK-42635
>                 URL: https://issues.apache.org/jira/browse/SPARK-42635
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.3.0, 3.3.1, 3.3.2
>            Reporter: Chenhao Li
>            Priority: Major
>
> 1. When the time is close to a daylight saving time transition, the result may
> be discontinuous and not monotonic.
>
> We currently have:
>
> {code:java}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>
> scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------------+
> |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------------+
> | 2011-03-13 03:59:59|
> +------------------------------------------------------------------------+
>
> scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------+
> |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------+
> | 2011-03-13 03:00:00|
> +------------------------------------------------------------------+
> {code}
>
> In the second query, adding one more second sets the time back one hour instead.
> Also, there are only 23 * 3600 seconds from 2011-03-12 03:00:00 to 2011-03-13 03:00:00,
> not 24 * 3600 seconds, due to the daylight saving time transition.
>
> The root cause of the problem is that the Spark code at
> [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L797]
> wrongly assumes every day has MICROS_PER_DAY microseconds, and does the day and
> time-in-day split before looking at the time zone.
>
> 2. Adding month, quarter, and year silently ignores Int overflow during unit
> conversion.
>
> The root cause is here:
> [https://src.dev.databricks.com/databricks/runtime/-/blob/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala?L1254].
> quantity is multiplied by 3 or MONTHS_PER_YEAR without checking for overflow.
> Note that we do have overflow checking when adding the amount to the timestamp,
> so the behavior is inconsistent.
>
> This can cause counter-intuitive results like this:
>
> {code:java}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
>
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
>
> This is similar to the previous problem:
>
> {code:java}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384|
> +-------------------------------------------------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org