kosiew opened a new issue, #22987:
URL: https://github.com/apache/datafusion/issues/22987

   ## Summary
   
   `date_bin_impl` has separate scalar and array execution paths that each 
implement timestamp/time scaling, binning, and per-row error handling. This 
duplication has already allowed scalar and array behavior to diverge: a scalar 
negative sub-second timestamp with a month interval returned `NULL` / later the 
expected value, while the equivalent array path returned an execution error.
   
   Refactor the implementation so scalar and array inputs share the same 
per-row mapping logic wherever possible.
   
   ## Motivation
   
   The core invariant for `date_bin` is:
   
   > For a given stride, source value, origin, and data type, scalar and array 
inputs should produce the same per-row result. Recoverable per-row binning 
errors should become `NULL` consistently in both paths, while invalid shared 
inputs such as unsupported stride/origin should remain query errors.
   
   
[Today](https://github.com/apache/datafusion/tree/96a6096c6f4b924e8cab4bc1629759a948e12939)
 this invariant is enforced by duplicated code and tests rather than by 
structure. The array branch has helper logic for timestamp arrays, while scalar 
timestamp arms repeat the same pattern per timestamp unit. Time inputs also 
duplicate scale/bin/modulo behavior between scalar and array branches.
   
   This makes future fixes risky: a change to overflow handling, modulo 
behavior, timezone preservation, or error-to-NULL semantics can easily be 
applied to one path and missed in the other.
   
   ## Current behavior / context
   
   Recent work fixed issues around:
   
   - negative sub-second timestamps before the epoch with month intervals
   - scalar vs array error behavior for per-row binning failures
   - checked scaling from second/millisecond/microsecond timestamp units to 
nanoseconds
   - distinguishing per-row source errors from invalid shared origin errors
   
   The code is now correct for the covered regressions, but `date_bin_impl` 
still contains separate scalar and array branches with repeated logic for:
   
   - timestamp unit scale lookup
   - checked scale-to-nanoseconds conversion
   - applying the stride function
   - converting binned nanoseconds back to the original time unit
   - converting per-row failures to `NULL`
   - preserving timestamp timezone metadata for array outputs
   - time-of-day modulo behavior for `TIME` inputs
   
   ## Proposed refactor
   
   Introduce a small internal abstraction that centralizes the per-row mapping 
rules. Possible shape:
   
   1. A helper for timestamp values:
   
   ```rust
   fn date_bin_timestamp_value<T: ArrowTimestampType>(
       value: i64,
       origin: i64,
       stride: i64,
       stride_fn: BinFunction,
   ) -> Option<i64> {
       let scale = timestamp_scale(T::UNIT);
       checked_scale_to_nanos(value, scale)
           .and_then(|scaled| stride_fn(stride, scaled, origin))
           .map(|binned| binned / scale)
           .ok()
   }
   ```
   
   Then use this helper in both scalar and array paths.
   
   2. Similar helpers for `TIME` values, or a generic helper parameterized by 
scale and output conversion:
   
   ```rust
   fn date_bin_time_value(
       value: i64,
       scale: i64,
       origin: i64,
       stride: i64,
       stride_fn: BinFunction,
   ) -> Option<i64> {
       checked_scale_to_nanos(value, scale)
           .and_then(|scaled| stride_fn(stride, scaled, origin))
           .map(|binned| (binned % NANOSECONDS_IN_DAY) / scale)
           .ok()
   }
   ```
   
   3. Keep shared-input validation outside the per-row mapper:
   
   - stride type validation
   - non-zero stride check
   - TIME stride `< 1 day` validation
   - origin scaling / origin overflow handling
   - source/origin type compatibility errors
   
   This preserves the intended distinction:
   
   - bad shared inputs -> error
   - bad per-row source value -> `NULL`
   
   ## Acceptance criteria
   
   - Scalar and array timestamp inputs call the same helper for per-row 
timestamp binning.
   - Scalar and array time inputs call the same helper or clearly equivalent 
shared helper for per-row time binning.
   - Existing behavior is preserved for:
     - `Timestamp(Nanosecond|Microsecond|Millisecond|Second)`
     - timestamp timezones
     - `Time32(Second|Millisecond)`
     - `Time64(Microsecond|Nanosecond)`
     - `NULL` inputs
     - per-row overflow / out-of-range values -> `NULL`
     - invalid shared origin scaling -> query error
     - unsupported stride/origin/source types -> existing errors
   - Existing regression tests continue to pass.
   - Add or keep tests that explicitly compare scalar vs array results for at 
least:
     - negative sub-second timestamp with month interval
     - source scaling overflow
     - month interval out-of-range binning
   
   ## Suggested tests
   
   Run:
   
   ```bash
   cargo test -p datafusion-functions test_date_bin --lib
   cargo test -p datafusion-sqllogictest --test sqllogictests -- date_bin_errors
   ```
   
   Consider adding a compact unit test table that invokes `date_bin` once with 
a scalar source and once with a one-row array source, then asserts the same 
result/nullness for each regression case.
   
   ## Related PR
   #22610
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to