[ https://issues.apache.org/jira/browse/SPARK-37544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536865#comment-17536865 ]

Bruce Robbins commented on SPARK-37544:
---------------------------------------

Reproduction of the bug depends on your time zone. It all works as expected if 
you are running in the UTC time zone. The master branch will still produce 
incorrect results if your local time zone is, say, {{America/Los_Angeles}}. 
For example:
{noformat}
spark-sql> select sequence(date '2021-01-01', date '2022-01-01', interval '3' month) x;
[2021-01-01,2021-03-31,2021-06-30,2021-09-30,2022-01-01]
Time taken: 0.664 seconds, Fetched 1 row(s)
spark-sql> 
{noformat}
InternalSequenceBase converts the date to micros by multiplying days by micros 
per day, which turns the date into a time-zone-agnostic timestamp. However, 
InternalSequenceBase then performs the arithmetic with a function 
(DateTimeUtils#timestampAddInterval) that assumes a _time zone aware_ timestamp.

If your time zone is America/Los_Angeles, and you specify a start date of 
'2021-01-01', InternalSequenceBase converts that to '2021-01-01 00:00:00 UTC'. 
However, what we should pass to DateTimeUtils#timestampAddInterval is 
'2021-01-01 00:00:00 _PST_'. As a result, the sequence ends up adding 3 months 
to '2020-12-31 16:00 PST' (the same instant as '2021-01-01 00:00:00 UTC'), 
yielding '2021-03-31 16:00 PDT'.

This is why the second value in the sequence is 2021-03-31 and not 2021-04-01.
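
Here's a minimal java.time sketch of the mismatch (my own illustration, not 
Spark's actual code):
{noformat}
import java.time._

val microsPerDay = 24L * 60 * 60 * 1000 * 1000   // 86,400,000,000
val startDate = LocalDate.of(2021, 1, 1)

// What InternalSequenceBase effectively does: days * micros-per-day,
// i.e. the date becomes the micros for 2021-01-01 00:00:00 UTC.
val micros = startDate.toEpochDay * microsPerDay

// But the add is then performed as if those micros were zone-aware:
val zone = ZoneId.of("America/Los_Angeles")
val asLocal = Instant.ofEpochSecond(micros / 1000000L).atZone(zone)
println(asLocal)               // 2020-12-31T16:00-08:00[America/Los_Angeles]
val plus3Months = asLocal.plusMonths(3)
println(plus3Months)           // 2021-03-31T16:00-07:00[America/Los_Angeles]

// Truncating back to whole days reproduces the wrong sequence element:
val resultDays = plus3Months.toInstant.getEpochSecond * 1000000L / microsPerDay
println(LocalDate.ofEpochDay(resultDays))   // 2021-03-31, not 2021-04-01
{noformat}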

One fix is to change InternalSequenceBase to use timestampNTZAddInterval for 
dates. However, that would put sequences out of sync with Spark's date 
arithmetic, which _is_ time zone aware. Take, for example, the following Spark 
date arithmetic:
{noformat}
select cast(date'2022-03-09' + interval '4' days '23' hour as date) as x;
{noformat}
In the {{America/Los_Angeles}} time zone, it returns {{2022-03-14}}: clocks 
spring forward on 2022-03-13 in that zone, so the 23 hours of elapsed time 
carry the result into the next calendar day.

However, in the {{UTC}} time zone, it instead returns {{2022-03-13}}.
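
You can see the daylight saving effect with plain java.time (my own 
illustration; ZonedDateTime adds days in local time and hours on the instant 
time-line, which mirrors the zone-aware arithmetic described above):
{noformat}
import java.time._

val la = ZoneId.of("America/Los_Angeles")

// Days are added in local time, hours as physical elapsed time.
val start = LocalDate.of(2022, 3, 9).atStartOfDay(la)
val end = start.plusDays(4).plusHours(23)
println(end.toLocalDate)   // 2022-03-14 (clocks sprang forward on 2022-03-13)

// In UTC there is no transition, so the same interval stays on 2022-03-13:
val startUtc = LocalDate.of(2022, 3, 9).atStartOfDay(ZoneOffset.UTC)
println(startUtc.plusDays(4).plusHours(23).toLocalDate)   // 2022-03-13
{noformat}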

So to be consistent with Spark date arithmetic, InternalSequenceBase should use 
daysToMicros and microsToDays for dates rather than simply multiplying days by 
a scale value. I will put up a PR that does exactly that in the next day or so.
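
For illustration, here is a rough java.time equivalent of that approach 
(hypothetical stand-ins for the DateTimeUtils methods, not the actual Spark 
internals):
{noformat}
import java.time._

// java.time stand-ins for DateTimeUtils#daysToMicros / #microsToDays:
// both conversions go through the session time zone.
def daysToMicros(days: Long, zone: ZoneId): Long =
  LocalDate.ofEpochDay(days).atStartOfDay(zone).toInstant.getEpochSecond * 1000000L

def microsToDays(micros: Long, zone: ZoneId): Long =
  Instant.ofEpochSecond(micros / 1000000L).atZone(zone).toLocalDate.toEpochDay

val zone = ZoneId.of("America/Los_Angeles")
val start = daysToMicros(LocalDate.of(2021, 1, 1).toEpochDay, zone)
// start is now 2021-01-01 00:00 PST, not UTC midnight.

// The zone-aware add therefore starts from local midnight:
val next = Instant.ofEpochSecond(start / 1000000L).atZone(zone).plusMonths(3)
println(LocalDate.ofEpochDay(microsToDays(
  next.toInstant.getEpochSecond * 1000000L, zone)))   // 2021-04-01
{noformat}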

> sequence over dates with month interval is producing incorrect results
> ----------------------------------------------------------------------
>
>                 Key: SPARK-37544
>                 URL: https://issues.apache.org/jira/browse/SPARK-37544
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1, 3.2.0
>         Environment: Ubuntu 20, OSX 11.6
> OpenJDK 11, Spark 3.2
>            Reporter: Vsevolod Ostapenko
>            Priority: Major
>
> The sequence function over dates with a step interval in months produces 
> unexpected results.
> Here is a sample using Spark 3.2 (though the behavior is the same in 3.1.1 
> and presumably earlier):
> {{scala> spark.sql("select sequence(date '2021-01-01', date '2022-01-01', interval '3' month) x, date '2021-01-01' + interval '3' month y").collect()}}
> {{res1: Array[org.apache.spark.sql.Row] = Array([WrappedArray(2021-01-01, *2021-03-31, 2021-06-30, 2021-09-30,* 2022-01-01),2021-04-01])}}
> The expected result of adding 3 months to 2021-01-01 is 2021-04-01, while 
> sequence returns 2021-03-31.
> At the same time, sequence over timestamps works as expected:
> {{scala> spark.sql("select sequence(timestamp '2021-01-01 00:00', timestamp '2022-01-01 00:00', interval '3' month) x").collect()}}
> {{res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(2021-01-01 00:00:00.0, *2021-04-01* 00:00:00.0, *2021-07-01* 00:00:00.0, *2021-10-01* 00:00:00.0, 2022-01-01 00:00:00.0)])}}
>  
> A similar issue was reported in the past: SPARK-31654 (sequence producing 
> inconsistent intervals for month step).
> It's marked as resolved, but the problem has either resurfaced or was never 
> actually fixed.


