jorisvandenbossche commented on PR #12528:
URL: https://github.com/apache/arrow/pull/12528#issuecomment-1131361871

   Something else, coming back to the previous example 
(https://github.com/apache/arrow/pull/12528#issuecomment-1124954128) and the 
logic how to resolve nonexistent times 
(https://github.com/apache/arrow/pull/12528#issuecomment-1125068996), there is 
still something wrong in that logic, I think. 
   
   Let's consider the DST jump in 2015 for CE(S)T, and a time 10min before, 
just before (1s) and at the jump, and 10min after the jump (the jump is from 
02:00 to 03:00):
   
   ```
   arr = pa.array(["2015-03-29 01:50:00", "2015-03-29 01:59:59", "2015-03-29 
03:00:00", "2015-03-29 03:10:00"]).cast(pa.timestamp("us"))
   arr = pc.assume_timezone(arr, "Europe/Brussels")
   
   In [35]: arr
   Out[35]: 
   <pyarrow.lib.TimestampArray object at 0x7fddd9b1ab80>
   [
     2015-03-29 00:50:00.000000,
     2015-03-29 00:59:59.000000,
     2015-03-29 01:00:00.000000,
     2015-03-29 01:10:00.000000
   ]
   
   In [36]: arr.to_pandas()
   Out[36]: 
   0   2015-03-29 01:50:00+01:00
   1   2015-03-29 01:59:59+01:00
   2   2015-03-29 03:00:00+02:00
   3   2015-03-29 03:10:00+02:00
   dtype: datetime64[ns, Europe/Brussels]
   ```
   
   Now, if we round this to 16min:
   
   ```
   In [37]: pc.round_temporal(arr, 16, "minute")
   Out[37]: 
   <pyarrow.lib.TimestampArray object at 0x7fddd9b24fa0>
   [
     2015-03-29 00:52:00.000000,
     2015-03-29 00:08:00.000000,
     2015-03-29 00:56:00.000000,
     2015-03-29 01:12:00.000000
   ]
   
   In [38]: pc.round_temporal(arr, 16, "minute").to_pandas()
   Out[38]: 
   0   2015-03-29 01:52:00+01:00
   1   2015-03-29 01:08:00+01:00
   2   2015-03-29 01:56:00+01:00
   3   2015-03-29 03:12:00+02:00
   dtype: datetime64[ns, Europe/Brussels]
   ```
   
   So the second value in the result is clearly wrong since the absolute 
difference between original and rounded should never be more than the rounding 
unit (and in case of rounding and not ceil/floor, actually never more than half 
of the rounding unit, so 8min in this case):
   
   ```
   In [43]: pc.subtract(arr, pc.round_temporal(arr, 16, 
"minute")).to_pandas().abs()
   Out[43]: 
   0   0 days 00:02:00
   1   0 days 00:51:59
   2   0 days 00:04:00
   3   0 days 00:02:00
   dtype: timedelta64[ns]
   ```
   
   This time of "01:59:59+01:00" was rounded in local time to the non-existent 
02:08, which was then corrected to 01:08. So that's a flaw in the logic that is 
implemented (that I proposed?).
   
   But also in general, the question is: _what is actually the expected / 
desired result in the first place?_ 
   Because the first and third value end up as "01:52:00+01:00" and 
"01:56:00+01:00" after rounding. But those two values are only 4 min apart, 
while the rounding was by 16min. So you would expect that every possible value 
in the result is 16min apart from each other.
   
   I think this comes back to: do we want to result to be rounded in "physical 
time" (so the results are actually 16 min apart), or in "wall clock time"? 
   Generally we round in local wall clock time, but when there are DST jumps, 
that will then cause times that are not actually the rounding unit apart from 
each other (only in case the rounding unit is not an exact divider of the DST 
jump, like the 16min here).
   
   (it might be worth checking how other software does this, if there is 
software that actually handles those cases .. Eg what does lubridate do?)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to