[ https://issues.apache.org/jira/browse/SPARK-54507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod KC updated SPARK-54507:
-----------------------------
Description:
Add a new `time_bucket()` SQL function that buckets TIME values into
fixed-width intervals, returning the start time of each bucket. This enables
histogram generation and time-of-day pattern analysis for TIME columns.

Currently, Spark SQL users cannot easily group TIME values by intervals
(e.g., 15-minute or 1-hour buckets) without complex manual calculations.
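As an illustration of the manual workaround, bucketing by 30 minutes today
means extracting fields and rebuilding a value by hand. A rough sketch, using
the sample data from the example further below (it assumes `hour()`/`minute()`
accept TIME arguments, and it yields a STRING rather than a TIME):
{code:sql}
-- Manual 30-minute bucketing without time_bucket(): rebuild the bucket start
-- as a string from the extracted hour and the minute rounded down to a
-- multiple of 30.
SELECT concat(
         lpad(cast(hour(event_time) AS STRING), 2, '0'), ':',
         lpad(cast(minute(event_time) div 30 * 30 AS STRING), 2, '0'),
         ':00') AS time_slot,
       COUNT(*) AS cnt
FROM time_bucket_data
GROUP BY time_slot
ORDER BY time_slot;
{code}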
Common use cases that require this function:
- Retail: Analyze customer traffic by 30-minute slots to optimize staffing
- Healthcare: Group appointments by 15-minute intervals for scheduling
  optimization
- Manufacturing: Aggregate sensor readings by hourly buckets to detect
  production patterns
- DevOps: Bucket system events by 5-minute intervals for performance monitoring
Industry precedent:
- SQL Server 2022: `DATE_BUCKET()` supports TIME type bucketing
- TimescaleDB: `time_bucket()` is one of their most popular functions
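For comparison, the equivalent calls in those systems look roughly like this
(reproduced from memory of their public documentation, so treat the exact
signatures as approximate; `events`/`event_time` are placeholder names):
{code:sql}
-- SQL Server 2022: DATE_BUCKET(datepart, number, date [, origin])
SELECT DATE_BUCKET(MINUTE, 30, event_time) AS time_slot FROM events;

-- TimescaleDB: time_bucket(bucket_width, ts)
SELECT time_bucket(INTERVAL '30 minutes', event_time) AS time_slot FROM events;
{code}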
Proposed Spark SQL example:
{code:sql}
WITH time_bucket_data AS (
  SELECT * FROM VALUES
    (1, TIME'09:15:30.123456'),
    (2, TIME'09:37:45.654321'),
    (3, TIME'10:05:12.987654'),
    (4, TIME'14:42:00.111222'),
    (5, TIME'14:55:30.333444')
  AS t(id, event_time)
)
SELECT time_bucket(INTERVAL '30' MINUTE, event_time) AS time_slot,
       COUNT(*) AS cnt
FROM time_bucket_data
GROUP BY time_slot
ORDER BY time_slot;
{code}
Output:
{code}
+---------+---+
|time_slot|cnt|
+---------+---+
| 09:00:00| 1|
| 09:30:00| 1|
| 10:00:00| 1|
| 14:30:00| 2|
+---------+---+
{code}
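The implied semantics (my reading of the example output above; the ticket does
not spell this out) is that the bucket start is the input truncated down to a
whole multiple of the bucket width, measured from midnight:
{code}
-- Assumed rule: time_bucket(width, t) = (micros(t) div micros(width)) * micros(width)
-- Worked example for TIME'14:42:00.111222' with a 30-minute bucket:
--   micros since midnight: 52,920,111,222
--   bucket width:           1,800,000,000 micros (30 minutes)
--   52,920,111,222 div 1,800,000,000 = 29 whole buckets
--   29 * 1,800,000,000 micros = 52,200 s = TIME'14:30:00'
{code}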
> Add time_bucket function for TIME type bucketing
> ------------------------------------------------
>
> Key: SPARK-54507
> URL: https://issues.apache.org/jira/browse/SPARK-54507
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.0
> Reporter: Vinod KC
> Priority: Major