[ 
https://issues.apache.org/jira/browse/SPARK-54507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-54507:
-----------------------------
    Description: 
Add a new `time_bucket()` SQL function that buckets TIME values into 
fixed-width intervals, returning the start time of each bucket. This enables 
histogram generation and time-of-day pattern analysis for TIME columns.

Currently, Spark SQL users cannot easily group TIME values by intervals (e.g., 
15-minute or 1-hour buckets) without complex manual calculations.
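To illustrate, the manual calculation users must hand-roll today amounts to flooring the time to the bucket width. A minimal Python sketch of that arithmetic (function name and whole-minute bucket widths are illustrative assumptions, not part of the proposal):

{code:python}
from datetime import time

def manual_time_bucket(bucket_minutes: int, t: time) -> time:
    """Floor a TIME value to the start of its fixed-width bucket.

    Illustrative sketch only; assumes whole-minute bucket widths.
    """
    total_minutes = t.hour * 60 + t.minute
    # Integer division floors the value to the start of its bucket
    bucket_start = (total_minutes // bucket_minutes) * bucket_minutes
    return time(hour=bucket_start // 60, minute=bucket_start % 60)

print(manual_time_bucket(15, time(9, 37, 22)))  # 09:30:00
{code}

The proposed `time_bucket()` would replace this boilerplate with a single, type-aware call.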

Common use cases that require this function:
 - Retail: Analyze customer traffic by 30-minute slots to optimize staffing
 - Healthcare: Group appointments by 15-minute intervals for scheduling 
optimization
 - Manufacturing: Aggregate sensor readings by hourly buckets to detect 
production patterns
 - DevOps: Bucket system events by 5-minute intervals for performance monitoring

Industry precedent:
 - SQL Server 2022: `DATE_BUCKET()` supports TIME type bucketing
 - TimescaleDB: `time_bucket()` is one of their most popular functions

Proposed Spark SQL examples:

{code:sql}
-- Bucket a single time value by 15 minutes
SELECT time_bucket(INTERVAL '15' MINUTE, TIME'09:37:22');  -- Result: 09:30:00

-- Bucket by 30 minutes
SELECT time_bucket(INTERVAL '30' MINUTE, TIME'14:47:00');  -- Result: 14:30:00

-- Bucket by 1 hour
SELECT time_bucket(INTERVAL '1' HOUR, TIME'16:35:00');     -- Result: 16:00:00

-- Bucket by 2 hours
SELECT time_bucket(INTERVAL '2' HOUR, TIME'15:20:00');     -- Result: 14:00:00
{code}

GROUP BY example:

{code:sql}
WITH time_bucket_data AS (
  SELECT * FROM VALUES
    (1, TIME'09:15:30.123456'),
    (2, TIME'09:37:45.654321'),
    (3, TIME'10:05:12.987654'),
    (4, TIME'14:42:00.111222'),
    (5, TIME'14:55:30.333444')
  AS t(id, event_time)
)
SELECT time_bucket(INTERVAL '30' MINUTE, event_time) AS time_slot,
       COUNT(*) AS cnt
FROM time_bucket_data
GROUP BY time_slot
ORDER BY time_slot;

+---------+---+
|time_slot|cnt|
+---------+---+
| 09:00:00|  1|
| 09:30:00|  1|
| 10:00:00|  1|
| 14:30:00|  2|
+---------+---+
{code}
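For reference, the expected grouping above can be cross-checked with a small Python sketch of the same 30-minute bucketing over the sample rows (again purely illustrative, not Spark's implementation):

{code:python}
from collections import Counter
from datetime import time

def bucket_30min(t: time) -> time:
    """Floor a time to the start of its 30-minute bucket."""
    minutes = ((t.hour * 60 + t.minute) // 30) * 30
    return time(hour=minutes // 60, minute=minutes % 60)

events = [time(9, 15, 30), time(9, 37, 45), time(10, 5, 12),
          time(14, 42, 0), time(14, 55, 30)]

# Count events per bucket, mirroring the GROUP BY above
counts = Counter(bucket_30min(t) for t in events)
for slot in sorted(counts):
    print(slot, counts[slot])
# 09:00:00 1
# 09:30:00 1
# 10:00:00 1
# 14:30:00 2
{code}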


> Add time_bucket function for TIME type bucketing
> ------------------------------------------------
>
>                 Key: SPARK-54507
>                 URL: https://issues.apache.org/jira/browse/SPARK-54507
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.0
>            Reporter: Vinod KC
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
