alamb opened a new issue, #10602:
URL: https://github.com/apache/datafusion/issues/10602

   ### Is your feature request related to a problem or challenge?
   
   
   Broken out from @Abdullahsab3's great ticket 
https://github.com/apache/datafusion/issues/10368
   
   We would like to apply date_bin and have it bin the dates in local time, 
properly including daylight savings time calculations
   
   For example, given the following data at 12:01AM in the morning on the first 
day of each month,I would like to bin it into the first day of each month.
   
   ```sql
   create or replace table t AS
   VALUES
     ('2024-01-01T00:00:01'),
     ('2024-02-01T00:00:01'),
     ('2024-03-01T00:00:01'),
     ('2024-04-01T00:00:01'),
     ('2024-05-01T00:00:01'),
     ('2024-06-01T00:00:01'),
     ('2024-07-01T00:00:01'),
     ('2024-08-01T00:00:01'),
     ('2024-09-01T00:00:01'),
     ('2024-10-01T00:00:01'),
     ('2024-11-01T00:00:01'),
     ('2024-12-01T00:00:01')
   ;
   ```
   
   This works as expected when there is no timestamp (all timestamps end up in 
the correct day)
   
   ```sql
   > select arrow_typeof(column1::timestamp) as arrow_type, column1::timestamp, 
date_bin(interval '1 day', column1::timestamp) as "date_bin" from t;
   +-----------------------------+---------------------+---------------------+
   | arrow_type                  | t.column1           | date_bin            |
   +-----------------------------+---------------------+---------------------+
   | Timestamp(Nanosecond, None) | 2024-01-01T23:59:00 | 2024-01-01T00:00:00 | 
<-- binned correctly
   | Timestamp(Nanosecond, None) | 2024-02-01T23:59:00 | 2024-02-01T00:00:00 |
   | Timestamp(Nanosecond, None) | 2024-03-01T23:59:00 | 2024-03-01T00:00:00 |
   | Timestamp(Nanosecond, None) | 2024-04-01T23:59:00 | 2024-04-01T00:00:00 |
   | Timestamp(Nanosecond, None) | 2024-05-01T23:59:00 | 2024-05-01T00:00:00 |
   | Timestamp(Nanosecond, None) | 2024-06-01T23:59:00 | 2024-06-01T00:00:00 |
   | Timestamp(Nanosecond, None) | 2024-07-01T23:59:00 | 2024-07-01T00:00:00 |
   | Timestamp(Nanosecond, None) | 2024-08-01T23:59:00 | 2024-08-01T00:00:00 |
   | Timestamp(Nanosecond, None) | 2024-09-01T23:59:00 | 2024-09-01T00:00:00 |
   | Timestamp(Nanosecond, None) | 2024-10-01T23:59:00 | 2024-10-01T00:00:00 |
   | Timestamp(Nanosecond, None) | 2024-11-01T23:59:00 | 2024-11-01T00:00:00 |
   | Timestamp(Nanosecond, None) | 2024-12-01T23:59:00 | 2024-12-01T00:00:00 |
   +-----------------------------+---------------------+---------------------+
   12 row(s) fetched.
   Elapsed 0.004 seconds.
   ```
   
   
   
   However, once the timestamp is in a timezone it has problems:
   
   ```sql
   > create or replace view t_timezone 
   as 
   select column1::timestamp  AT TIME ZONE 'Europe/Brussels' as "column1" 
   from t;
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   ```
   
   The timestamps now binned at incorrectly (the are binned based on UTC 
midnight, not midnight in Brussels) so for example a time in `2024-01-01` goes 
into a `2023-12-31` bucket:
   
   ```sql
   > select
      arrow_typeof(column1) as arrow_type,
      column1,
      date_bin(interval '1 day', column1) as "date_bin"
     from t_timezone;
   
+------------------------------------------------+---------------------------+---------------------------+
   | arrow_type                                     | column1                   
| date_bin                  |
   
+------------------------------------------------+---------------------------+---------------------------+
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-01-01T00:00:01+01:00 
| 2023-12-31T01:00:00+01:00 | <-- binned in previous day
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-02-01T00:00:01+01:00 
| 2024-01-31T01:00:00+01:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-03-01T00:00:01+01:00 
| 2024-02-29T01:00:00+01:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-04-01T00:00:01+02:00 
| 2024-03-31T01:00:00+01:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-05-01T00:00:01+02:00 
| 2024-04-30T02:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-06-01T00:00:01+02:00 
| 2024-05-31T02:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-07-01T00:00:01+02:00 
| 2024-06-30T02:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-08-01T00:00:01+02:00 
| 2024-07-31T02:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-09-01T00:00:01+02:00 
| 2024-08-31T02:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-10-01T00:00:01+02:00 
| 2024-09-30T02:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-11-01T00:00:01+01:00 
| 2024-10-31T01:00:00+01:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-12-01T00:00:01+01:00 
| 2024-11-30T01:00:00+01:00 |
   
+------------------------------------------------+---------------------------+---------------------------+
   ```
   
   The third argument to `date_bin` allows  changing the starting offset for a 
day, so let's set it to midnight in Europe/Brussels time. This almost works but 
does not take into account daylight savings time so the timestamps are binned 
incorrectly during DST:
   
   ```sql
   > select
      arrow_typeof(column1) as arrow_type,
      column1,
      date_bin(interval '1 day', column1, '2020-01-01T00:00:00Z'::timestamp AT 
TIME ZONE 'Europe/Brussels') as "date_bin"
     from t_timezone;
   
+------------------------------------------------+---------------------------+---------------------------+
   | arrow_type                                     | column1                   
| date_bin                  |
   
+------------------------------------------------+---------------------------+---------------------------+
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-01-01T00:00:01+01:00 
| 2024-01-01T00:00:00+01:00 | <-- correct!
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-02-01T00:00:01+01:00 
| 2024-02-01T00:00:00+01:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-03-01T00:00:01+01:00 
| 2024-03-01T00:00:00+01:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-04-01T00:00:01+02:00 
| 2024-03-31T00:00:00+01:00 | <-- DST! Binned into previous day
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-05-01T00:00:01+02:00 
| 2024-04-30T01:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-06-01T00:00:01+02:00 
| 2024-05-31T01:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-07-01T00:00:01+02:00 
| 2024-06-30T01:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-08-01T00:00:01+02:00 
| 2024-07-31T01:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-09-01T00:00:01+02:00 
| 2024-08-31T01:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-10-01T00:00:01+02:00 
| 2024-09-30T01:00:00+02:00 |
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-11-01T00:00:01+01:00 
| 2024-11-01T00:00:00+01:00 | <-- DST over, now correct again
   | Timestamp(Nanosecond, Some("Europe/Brussels")) | 2024-12-01T00:00:01+01:00 
| 2024-12-01T00:00:00+01:00 |
   
+------------------------------------------------+---------------------------+---------------------------+
   12 row(s) fetched.
   Elapsed 0.003 seconds.
   ```
   
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to