[ https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751962#comment-17751962 ]
Attila Zsolt Piros commented on SPARK-44717:
--------------------------------------------

The TIMESTAMP_NTZ type would work for sure. Here is the test:

{noformat}
$ TZ="America/New_York"
$ ./bin/pyspark
....
>>> sql("select TIMESTAMP_NTZ '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
+------------------------------------------------------------------------+
|TIMESTAMP_NTZ '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
+------------------------------------------------------------------------+
|                                                     2011-03-13 02:00:00|
+------------------------------------------------------------------------+

>>> sql("select TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
+--------------------------------------------------------------------+
|TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
+--------------------------------------------------------------------+
|                                                 2011-03-13 03:00:00|
+--------------------------------------------------------------------+
{noformat}

> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-44717
>                 URL: https://issues.apache.org/jira/browse/SPARK-44717
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 3.4.0, 3.4.1, 4.0.0
>            Reporter: Attila Zsolt Piros
>            Priority: Major
>
> Use one of the existing tests:
> - the "11H" case of test_dataframe_resample (pyspark.pandas.tests.test_resample.ResampleTests)
> - the "1001H" case of test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests)
>
> After setting the TZ, for example to New York, by using the following Python code in a "setUpClass":
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
> you will get the following error for the latter mentioned test:
> {noformat}
> ======================================================================
> FAIL [4.219s]: test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 276, in test_series_resample
>     self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 259, in _test_resample
>     self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in assert_eq
>     _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in _assert_pandas_almost_equal
>     raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
>
> The problem is that the pyspark resample produces extra resampled rows in the result. The DST change causes those extra rows, as the computed __tmp_resample_bin_col__ will be something like:
> {noformat}
> | __index_level_0__ | __tmp_resample_bin_col__ | A                  |
> .....
> |2011-03-08 00:00:00|2011-03-26 11:00:00      |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00      |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00      |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00      |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00      |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00      |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00      |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00      |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00      |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00      |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00      |0.9086816462089563  |
> {noformat}
>
> You can see the extra lines (the bin boundary shifting from 11:00 to 10:00) right around when DST kicked in on 2011-03-13 in New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not help.
> You can see my tests here: https://github.com/attilapiros/spark/pull/5
>
> Pandas timestamps are TZ-less:
> {noformat}
> >>> import pandas as pd
> >>> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> >>> b = pd.Timedelta(hours=1)
> >>> a
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> >>> a+b+b
> Timestamp('2011-03-13 03:00:00')
> {noformat}
> But pyspark TimestampType uses the TZ and DST:
> {noformat}
> >>> sql("select TIMESTAMP '2011-03-13 01:00:00'").show()
> +-------------------------------+
> |TIMESTAMP '2011-03-13 01:00:00'|
> +-------------------------------+
> |            2011-03-13 01:00:00|
> +-------------------------------+
>
> >>> sql("select TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
> +--------------------------------------------------------------------+
> |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
> +--------------------------------------------------------------------+
> |                                                 2011-03-13 03:00:00|
> +--------------------------------------------------------------------+
> {noformat}
> The current resample code uses the above interval-based calculation.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
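[Editorial note] The two arithmetic models discussed above can be reproduced with the Python standard library alone, without Spark. This is only an illustrative sketch: `add_wall_clock` mimics what pandas Timestamp and TIMESTAMP_NTZ do, `add_absolute` mimics what TimestampType plus make_interval does, and it assumes America/New_York as the session time zone, matching the report. The function names are hypothetical, not Spark APIs.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")


def add_wall_clock(naive: datetime, delta: timedelta) -> datetime:
    # Wall-clock (TZ-less) arithmetic: just move the calendar fields,
    # as pandas Timestamp and TIMESTAMP_NTZ do.
    return naive + delta


def add_absolute(naive: datetime, delta: timedelta) -> datetime:
    # Absolute-duration arithmetic: interpret the naive timestamp in the
    # session TZ, add the duration on the UTC timeline, and read the
    # result back as local wall-clock time (TimestampType behavior).
    aware = naive.replace(tzinfo=NY)
    return (aware.astimezone(timezone.utc) + delta).astimezone(NY).replace(tzinfo=None)


t = datetime(2011, 3, 13, 1, 0)  # 01:00, one hour before the 2011 DST jump
print(add_wall_clock(t, timedelta(hours=1)))  # 2011-03-13 02:00:00
print(add_absolute(t, timedelta(hours=1)))    # 2011-03-13 03:00:00 (02:00 never existed in New York)

# A large step crossing the DST jump lands one wall-clock hour later under
# absolute-duration arithmetic, which is why the 1001H resample bins differ:
start = datetime(2011, 3, 1, 0, 0)
print(add_wall_clock(start, timedelta(hours=1001)))  # 2011-04-11 17:00:00
print(add_absolute(start, timedelta(hours=1001)))    # 2011-04-11 18:00:00
```

The one-hour divergence after 2011-03-13 matches the __tmp_resample_bin_col__ shift from 11:00 to 10:00 shown in the table above.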