alamb opened a new issue, #10602: URL: https://github.com/apache/datafusion/issues/10602
### Is your feature request related to a problem or challenge? Broken out from @Abdullahsab3's great ticket https://github.com/apache/datafusion/issues/10368 We would like to apply date_bin and have it bin the dates in local time, properly including daylight savings time calculations For example, given the following data at 12:01AM in the morning on the first day of each month,I would like to bin it into the first day of each month. ```sql create or replace table t AS VALUES ('2024-01-01T00:00:01'), ('2024-02-01T00:00:01'), ('2024-03-01T00:00:01'), ('2024-04-01T00:00:01'), ('2024-05-01T00:00:01'), ('2024-06-01T00:00:01'), ('2024-07-01T00:00:01'), ('2024-08-01T00:00:01'), ('2024-09-01T00:00:01'), ('2024-10-01T00:00:01'), ('2024-11-01T00:00:01'), ('2024-12-01T00:00:01') ; ``` This works as expected when there is no timestamp (all timestamps end up in the correct day) ```sql > select arrow_typeof(column1::timestamp) as arrow_type, column1::timestamp, date_bin(interval '1 day', column1::timestamp) as "date_bin" from t; +-----------------------------+---------------------+---------------------+ | arrow_type | t.column1 | date_bin | +-----------------------------+---------------------+---------------------+ | Timestamp(Nanosecond, None) | 2024-01-01T23:59:00 | 2024-01-01T00:00:00 | <-- binned correctly | Timestamp(Nanosecond, None) | 2024-02-01T23:59:00 | 2024-02-01T00:00:00 | | Timestamp(Nanosecond, None) | 2024-03-01T23:59:00 | 2024-03-01T00:00:00 | | Timestamp(Nanosecond, None) | 2024-04-01T23:59:00 | 2024-04-01T00:00:00 | | Timestamp(Nanosecond, None) | 2024-05-01T23:59:00 | 2024-05-01T00:00:00 | | Timestamp(Nanosecond, None) | 2024-06-01T23:59:00 | 2024-06-01T00:00:00 | | Timestamp(Nanosecond, None) | 2024-07-01T23:59:00 | 2024-07-01T00:00:00 | | Timestamp(Nanosecond, None) | 2024-08-01T23:59:00 | 2024-08-01T00:00:00 | | Timestamp(Nanosecond, None) | 2024-09-01T23:59:00 | 2024-09-01T00:00:00 | | Timestamp(Nanosecond, None) | 2024-10-01T23:59:00 | 2024-10-01T00:00:00 | | Timestamp(Nanosecond, None) | 2024-11-01T23:59:00 | 2024-11-01T00:00:00 | | Timestamp(Nanosecond, None) | 2024-12-01T23:59:00 | 2024-12-01T00:00:00 | +-----------------------------+---------------------+---------------------+ 12 row(s) fetched. Elapsed 0.004 seconds. ``` However, once the timestamp is in a timezone it has problems: ```sql > create or replace view t_timezone as select column1::timestamp AT TIME ZONE 'Europe/Brussels' as "column1" from t; 0 row(s) fetched. Elapsed 0.001 seconds. ``` The timestamps now binned at incorrectly (the are binned based on UTC midnight, not midnight in Brussels) so for example a time in `2024-01-01` goes into a `2023-12-31` bucket: ```sql > select arrow_typeof(column1) as arrow_type, column1, date_bin(interval '1 day', column1) as "date_bin" from t_timezone; +------------------------------------------------+---------------------------+---------------------------+ | arrow_type | column1 | date_bin | +------------------------------------------------+---------------------------+---------------------------+ | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-01-01T00:00:01+01:00 | 2023-12-31T01:00:00+01:00 | <-- binned in previous day | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-02-01T00:00:01+01:00 | 2024-01-31T01:00:00+01:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-03-01T00:00:01+01:00 | 2024-02-29T01:00:00+01:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-04-01T00:00:01+02:00 | 2024-03-31T01:00:00+01:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-05-01T00:00:01+02:00 | 2024-04-30T02:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-06-01T00:00:01+02:00 | 2024-05-31T02:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-07-01T00:00:01+02:00 | 2024-06-30T02:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-08-01T00:00:01+02:00 | 2024-07-31T02:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-09-01T00:00:01+02:00 | 2024-08-31T02:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-10-01T00:00:01+02:00 | 2024-09-30T02:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-11-01T00:00:01+01:00 | 2024-10-31T01:00:00+01:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-12-01T00:00:01+01:00 | 2024-11-30T01:00:00+01:00 | +------------------------------------------------+---------------------------+---------------------------+ ``` The third argument to `date_bin` allows changing the starting offset for a day, so let's set it to midnight in Europe/Brussels time. This almost works but does not take into account daylight savings time so the timestamps are binned incorrectly during DST: ```sql > select arrow_typeof(column1) as arrow_type, column1, date_bin(interval '1 day', column1, '2020-01-01T00:00:00Z'::timestamp AT TIME ZONE 'Europe/Brussels') as "date_bin" from t_timezone; +------------------------------------------------+---------------------------+---------------------------+ | arrow_type | column1 | date_bin | +------------------------------------------------+---------------------------+---------------------------+ | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-01-01T00:00:01+01:00 | 2024-01-01T00:00:00+01:00 | <-- correct! | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-02-01T00:00:01+01:00 | 2024-02-01T00:00:00+01:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-03-01T00:00:01+01:00 | 2024-03-01T00:00:00+01:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-04-01T00:00:01+02:00 | 2024-03-31T00:00:00+01:00 | <-- DST! Binned into previous day | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-05-01T00:00:01+02:00 | 2024-04-30T01:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-06-01T00:00:01+02:00 | 2024-05-31T01:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-07-01T00:00:01+02:00 | 2024-06-30T01:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-08-01T00:00:01+02:00 | 2024-07-31T01:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-09-01T00:00:01+02:00 | 2024-08-31T01:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-10-01T00:00:01+02:00 | 2024-09-30T01:00:00+02:00 | | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-11-01T00:00:01+01:00 | 2024-11-01T00:00:00+01:00 | <-- DST over, now correct again | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-12-01T00:00:01+01:00 | 2024-12-01T00:00:00+01:00 | +------------------------------------------------+---------------------------+---------------------------+ 12 row(s) fetched. Elapsed 0.003 seconds. ``` ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org