vranes opened a new pull request, #55535:
URL: https://github.com/apache/spark/pull/55535

   ### What changes were proposed in this pull request?
   
   This PR adds a new scalar SQL function `time_bucket(bucket_size, ts[, origin])`
that aligns a timestamp to the start of a fixed-size interval bucket.
Given a bucket size (a day-time or year-month interval), a timestamp, and an
optional origin, it returns the start of the half-open bucket
`[start, start + bucket_size)` containing the timestamp. Buckets are anchored at
`origin` (default `1970-01-01 00:00:00 UTC` for TIMESTAMP) and the grid extends
infinitely in both directions. All bucketing is performed on UTC micros; the
session time zone does not affect bucket alignment. For local wall-clock
alignment in a DST zone, users can cast the TIMESTAMP to TIMESTAMP_NTZ before
bucketing.
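
The day-time bucketing rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on UTC microseconds, not the code in `DateTimeUtils.scala`; the names `to_micros`, `from_micros`, and `bucket_start_micros` are hypothetical, and the overflow checks the PR mentions are omitted because Python integers do not overflow.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(ts: datetime) -> int:
    """Microseconds since the Unix epoch for a timezone-aware datetime."""
    return (ts - EPOCH) // timedelta(microseconds=1)


def from_micros(micros: int) -> datetime:
    """Inverse of to_micros; returns a UTC datetime."""
    return EPOCH + timedelta(microseconds=micros)


def bucket_start_micros(bucket_micros: int, ts_micros: int,
                        origin_micros: int = 0) -> int:
    """Start of the half-open bucket [start, start + bucket) containing ts.

    Hypothetical helper: assumes buckets are anchored at origin_micros and
    that the bucket size must be strictly positive.
    """
    if bucket_micros <= 0:
        raise ValueError("bucket size must be positive")
    # Python's % is a floor modulo, so the offset is always in
    # [0, bucket_micros), even when ts lies before the origin.
    offset = (ts_micros - origin_micros) % bucket_micros
    return ts_micros - offset


if __name__ == "__main__":
    fifteen_min = 15 * 60 * 1_000_000
    ts = to_micros(datetime(2024, 1, 1, 11, 27, tzinfo=timezone.utc))
    origin = to_micros(datetime(1970, 1, 1, 0, 5, tzinfo=timezone.utc))
    print(from_micros(bucket_start_micros(fifteen_min, ts)))
    print(from_micros(bucket_start_micros(fifteen_min, ts, origin)))
```

Using a floor modulo rather than truncating division is what lets the grid extend consistently to timestamps before the origin.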
   
   Changes:
   - New `TimeBucket` expression in 
`sql/catalyst/.../expressions/datetimeExpressions.scala` with an 
`ExpressionBuilder` that dispatches to two- or three-argument forms.
   - Bucketing helpers `timeBucketDTInterval` / `timeBucketYMInterval` in 
`DateTimeUtils.scala`, with overflow checks (`Math.subtractExact`, 
`Math.multiplyExact`) on extreme timestamps and origins.
   - Registered in `FunctionRegistry`.
   - Scala API: `functions.time_bucket(...)`.
   - PySpark API: `pyspark.sql.functions.time_bucket` + Connect variant.
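
For readers unfamiliar with year-month bucketing, here is a rough Python sketch of the idea behind a helper like `timeBucketYMInterval`. It is an assumption-laden illustration rather than the PR's implementation: the function name `month_bucket_start` is hypothetical, the origin is assumed to fall on the first instant of a month, and any day or time-of-day component is ignored.

```python
def month_bucket_start(bucket_months: int, year: int, month: int,
                       origin_year: int = 1970, origin_month: int = 1):
    """Return (year, month) of the bucket start containing (year, month).

    Hypothetical sketch: buckets are bucket_months wide and anchored at
    the start of (origin_year, origin_month).
    """
    if bucket_months <= 0:
        raise ValueError("bucket size must be positive")
    # Whole months between the origin and the input month (may be negative).
    months_since_origin = (year - origin_year) * 12 + (month - origin_month)
    # Floor division keeps the grid consistent for months before the origin.
    bucket_index = months_since_origin // bucket_months
    # Convert back to an absolute month count, then to (year, month).
    absolute = origin_year * 12 + (origin_month - 1) + bucket_index * bucket_months
    return absolute // 12, absolute % 12 + 1


if __name__ == "__main__":
    print(month_bucket_start(3, 2024, 5))    # quarterly buckets
    print(month_bucket_start(12, 2024, 5))   # yearly buckets
    print(month_bucket_start(3, 1969, 12))   # month before the origin
```

Working in whole months sidesteps the fact that months have different lengths, which is why a year-month interval cannot simply be converted to a fixed number of microseconds.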
   
   ### Why are the changes needed?
   
   Aligning timestamps to fixed-size buckets (15 minutes, 1 hour, 1 month, 
etc.) is a common time-series pattern, but today users must assemble it 
manually with `from_unixtime(unix_timestamp(ts) DIV N * N)`-style arithmetic. 
That pattern is error-prone, doesn't support year-month intervals cleanly, and 
has no way to express an alignment origin. A first-class `time_bucket` matches 
the idiom popularized by PostgreSQL / TimescaleDB and makes the operation safe, 
concise, and composable.
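
One way the manual pattern can go wrong is with timestamps before the epoch. The sketch below assumes that the integer division in the manual idiom truncates toward zero (as Java `long` division does); that assumption and the helper names `trunc_div`, `manual_bucket`, and `floor_bucket` are illustrative and not taken from the PR.

```python
def trunc_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero (Java-style)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def manual_bucket(ts_seconds: int, n: int) -> int:
    """Mimics `ts DIV n * n` under the truncation assumption."""
    return trunc_div(ts_seconds, n) * n


def floor_bucket(ts_seconds: int, n: int) -> int:
    """Floor-based bucketing: the result is never later than ts_seconds."""
    return (ts_seconds // n) * n


if __name__ == "__main__":
    n = 15 * 60  # 15-minute buckets, in seconds
    for ts in (1000, -1):
        print(ts, manual_bucket(ts, n), floor_bucket(ts, n))
```

For a timestamp one second before the epoch, the truncating version yields a bucket start that lies after the timestamp itself, while the floor-based version yields the bucket that actually contains it.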
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes — a new function `time_bucket` is available in SQL, Scala, and PySpark.
   
   Example:
   ```sql
   SELECT time_bucket(INTERVAL '15' MINUTE, TIMESTAMP '2024-01-01 11:27:00');
   -- 2024-01-01 11:15:00
   
   SELECT time_bucket(
     INTERVAL '15' MINUTE,
     TIMESTAMP '2024-01-01 11:27:00',
     TIMESTAMP '1970-01-01 00:05:00');
   -- 2024-01-01 11:20:00
   ```
   
   ### How was this patch tested?
   
   - New unit tests in `DateExpressionsSuite` (codegen + interpreted paths, DT 
and YM intervals, `TIMESTAMP`/`TIMESTAMP_NTZ`, NULL propagation, negative/zero 
bucket-size validation, `ExpressionBuilder`).
   - New unit tests in `DateTimeUtilsSuite` for `timeBucketDTInterval` / 
`timeBucketYMInterval` including boundary values, negative timestamps, and 
extreme-origin overflow paths.
   - New SQL golden file `sql-tests/inputs/time-bucket.sql` covering: DT + YM 
interval buckets, TIMESTAMP + TIMESTAMP_NTZ, explicit origins, DST-safe 
NTZ-cast pattern (America/Los_Angeles), NULL propagation, invalid inputs 
(non-foldable, wrong types, non-positive).
   - PySpark doctest.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


