alamb opened a new issue, #22577: URL: https://github.com/apache/datafusion/issues/22577
## Is your feature request related to a problem or challenge?

DataFusion currently has (at least) three overlapping implementations of "cast a scalar / literal to a target type", with different contracts and duplicated per-type logic:

1. **`ScalarValue::cast_to` / `cast_to_with_options`** in `datafusion/common/src/scalar/mod.rs`. The general-purpose cast. Builds a single-row array and runs the arrow cast kernel; returns `Result`. (Recently gained array-free fast paths for identity and string↔string casts in #22576.)
2. **`try_cast_literal_to_type`** in `datafusion/expr-common/src/casts.rs`. An array-free, hand-rolled cast used by the unwrap-cast optimizations. Returns `Option`, with its own per-type helpers (`try_cast_numeric_literal`, `try_cast_string_literal`, `try_cast_dictionary`, `try_cast_binary`, `cast_between_timestamp`). It is deliberately **more restrictive** than `cast_to`: it only performs value-preserving casts and returns `None` for out-of-range numeric, precision-losing decimal, lossy date↔timestamp, timestamp→string, and string→numeric conversions. These restrictions are load-bearing for optimizer correctness (you may only unwrap `CAST(col AS T) = lit` when `lit` converts back to its original type exactly).
3. **`cast_literal_to_type_with_op`** in `datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs`. Yet another special case (for `Utf8 = Int`-style comparisons), implemented via `cast_to` plus a manual round-trip check.

Because the logic is duplicated, casting behavior and optimizations have to be implemented in multiple places, and the implementations can (and do) diverge in subtle, correctness-relevant ways.

## Describe the solution you'd like

Consolidate the cast implementations so there is **a single place to implement and optimize scalar casting**.

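
To make the contract difference between the implementations above concrete, here is a minimal, dependency-free sketch. It is a toy model only: `Scalar`, `Ty`, `cast_permissive`, and `cast_value_preserving` are hypothetical names invented for illustration and are not DataFusion APIs. `cast_permissive` stands in for a `Result`-returning cast that also parses strings, and `cast_value_preserving` stands in for an `Option`-returning wrapper that adds a type-family guard and a round-trip check. The exact guards in the real code may differ.

```rust
// Toy model: none of these names exist in DataFusion.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int32(i32),
    Int64(i64),
    Utf8(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Int32,
    Int64,
    Utf8,
}

impl Scalar {
    fn ty(&self) -> Ty {
        match self {
            Scalar::Int32(_) => Ty::Int32,
            Scalar::Int64(_) => Ty::Int64,
            Scalar::Utf8(_) => Ty::Utf8,
        }
    }
}

/// Permissive cast: converts whenever a conversion exists (including
/// parsing strings) and reports failures as `Err`.
fn cast_permissive(v: &Scalar, to: Ty) -> Result<Scalar, String> {
    match (v, to) {
        (Scalar::Int32(x), Ty::Int32) => Ok(Scalar::Int32(*x)),
        (Scalar::Int32(x), Ty::Int64) => Ok(Scalar::Int64(i64::from(*x))),
        (Scalar::Int32(x), Ty::Utf8) => Ok(Scalar::Utf8(x.to_string())),
        (Scalar::Int64(x), Ty::Int32) => i32::try_from(*x)
            .map(Scalar::Int32)
            .map_err(|e| e.to_string()),
        (Scalar::Int64(x), Ty::Int64) => Ok(Scalar::Int64(*x)),
        (Scalar::Int64(x), Ty::Utf8) => Ok(Scalar::Utf8(x.to_string())),
        (Scalar::Utf8(s), Ty::Int32) => s
            .parse::<i32>()
            .map(Scalar::Int32)
            .map_err(|e| e.to_string()),
        (Scalar::Utf8(s), Ty::Int64) => s
            .parse::<i64>()
            .map(Scalar::Int64)
            .map_err(|e| e.to_string()),
        (Scalar::Utf8(s), Ty::Utf8) => Ok(Scalar::Utf8(s.clone())),
    }
}

/// Restrictive wrapper: `Some` only when the cast is value-preserving.
/// Assumption of this toy: string<->numeric casts are always refused.
fn cast_value_preserving(v: &Scalar, to: Ty) -> Option<Scalar> {
    // Type-family guard: do not cross between strings and numbers.
    let is_string = |t: Ty| t == Ty::Utf8;
    if is_string(v.ty()) != is_string(to) {
        return None;
    }
    // Delegate the conversion itself to the single permissive cast.
    let out = cast_permissive(v, to).ok()?;
    // Round-trip check: the result must convert back to the original value.
    let back = cast_permissive(&out, v.ty()).ok()?;
    (back == *v).then_some(out)
}

fn main() {
    let s = Scalar::Utf8("42".to_string());
    // The permissive cast parses the string...
    assert_eq!(cast_permissive(&s, Ty::Int32), Ok(Scalar::Int32(42)));
    // ...but the restrictive wrapper refuses it.
    assert_eq!(cast_value_preserving(&s, Ty::Int32), None);

    // In-range narrowing is value-preserving; out-of-range is not.
    assert_eq!(
        cast_value_preserving(&Scalar::Int64(7), Ty::Int32),
        Some(Scalar::Int32(7))
    );
    assert_eq!(cast_value_preserving(&Scalar::Int64(i64::MAX), Ty::Int32), None);
}
```

In this sketch the per-type conversion logic lives only in `cast_permissive`; the wrapper contributes nothing but the guards, which is the shape a consolidation would aim for.
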
The overall goal is to be able to add a **non-copying** version of `cast_to`: today `cast_to(&self)` forces a clone/allocation even on the fast paths, and `try_cast_literal_to_type` re-allocates strings via `to_string()`. We want that optimized in exactly one location.

**First step (this issue): consolidate the implementations, without changing behavior.**

- Make `ScalarValue::cast_to` the single canonical scalar-cast implementation (it already has the array-free fast paths).
- Re-express `try_cast_literal_to_type` in terms of `cast_to`, keeping its restrictive `Option` semantics as a thin "value-preserving only" wrapper (the range / lossy-temporal / type-family guards) so the unwrap-cast optimizer behavior is unchanged.
- Fold `cast_literal_to_type_with_op` into the same path where possible.

Once consolidated, a follow-up can add the non-copying (owned / `Cow`) cast variant and optimize it in that one place.

## Describe alternatives you've considered

- Adding an owned / non-copying API to `try_cast_literal_to_type` directly (explored in #22574). This works, but optimizes only one of the three implementations and leaves the duplication in place.

## Additional context

- Background: review discussion on #22562 (LIKE `'prefix%'` pruning), which surfaced a redundant string allocation in the literal-cast path.
- #22576 added the identity / string↔string array-free fast paths to `cast_to`, and made the `cast_round_trip` test cross-check `cast_to` against the arrow cast kernel.
- ⚠️ **Constraint when consolidating:** `cast_to` (arrow kernel, `safe: false`) is *less* restrictive than `try_cast_literal_to_type`. Naively routing the optimizer through `cast_to` would unwrap casts it must not (e.g. lossy date↔timestamp, or parsing strings to numbers), producing wrong results. The restrictive guards must be preserved.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
