jorisvandenbossche commented on PR #12528: URL: https://github.com/apache/arrow/pull/12528#issuecomment-1131361871
Something else, coming back to the previous example (https://github.com/apache/arrow/pull/12528#issuecomment-1124954128) and the logic how to resolve nonexistent times (https://github.com/apache/arrow/pull/12528#issuecomment-1125068996), there is still something wrong in that logic, I think. Let's consider the DST jump in 2015 for CE(S)T, and a time 10min before, just before (1s) and at the jump, and 10min after the jump (the jump is from 02:00 to 03:00): ``` arr = pa.array(["2015-03-29 01:50:00", "2015-03-29 01:59:59", "2015-03-29 03:00:00", "2015-03-29 03:10:00"]).cast(pa.timestamp("us")) arr = pc.assume_timezone(arr, "Europe/Brussels") In [35]: arr Out[35]: <pyarrow.lib.TimestampArray object at 0x7fddd9b1ab80> [ 2015-03-29 00:50:00.000000, 2015-03-29 00:59:59.000000, 2015-03-29 01:00:00.000000, 2015-03-29 01:10:00.000000 ] In [36]: arr.to_pandas() Out[36]: 0 2015-03-29 01:50:00+01:00 1 2015-03-29 01:59:59+01:00 2 2015-03-29 03:00:00+02:00 3 2015-03-29 03:10:00+02:00 dtype: datetime64[ns, Europe/Brussels] ``` Now, if we round this to 16min: ``` In [37]: pc.round_temporal(arr, 16, "minute") Out[37]: <pyarrow.lib.TimestampArray object at 0x7fddd9b24fa0> [ 2015-03-29 00:52:00.000000, 2015-03-29 00:08:00.000000, 2015-03-29 00:56:00.000000, 2015-03-29 01:12:00.000000 ] In [38]: pc.round_temporal(arr, 16, "minute").to_pandas() Out[38]: 0 2015-03-29 01:52:00+01:00 1 2015-03-29 01:08:00+01:00 2 2015-03-29 01:56:00+01:00 3 2015-03-29 03:12:00+02:00 dtype: datetime64[ns, Europe/Brussels] ``` So the second value in the result is clearly wrong since the absolute difference between original and rounded should never be more than the rounding unit (and in case of rounding and not ceil/floor, actually never more than half of the rounding unit, so 8min in this case): ``` In [43]: pc.subtract(arr, pc.round_temporal(arr, 16, "minute")).to_pandas().abs() Out[43]: 0 0 days 00:02:00 1 0 days 00:51:59 2 0 days 00:04:00 3 0 days 00:02:00 dtype: timedelta64[ns] ``` This time of "01:59:59+01:00" was rounded in local time to the non-existent 02:08, which was then corrected to 01:08. So that's a flaw in the logic that is implemented (that I proposed?). But also in general, the question is: _what is actually the expected / desired result in the first place?_ Because the first and third value end up as "01:52:00+01:00" and "01:56:00+01:00" after rounding. But those two values are only 4 min apart, while the rounding was by 16min. So you would expect that every possible value in the result is 16min apart from each other. I think this comes back to: do we want to result to be rounded in "physical time" (so the results are actually 16 min apart), or in "wall clock time"? Generally we round in local wall clock time, but when there are DST jumps, that will then cause times that are not actually the rounding unit apart from each other (only in case the rounding unit is not an exact divider of the DST jump, like the 16min here). (it might be worth checking how other software does this, if there is software that actually handles those cases .. Eg what does lubridate do?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org