kosiew opened a new pull request, #20814:
URL: https://github.com/apache/datafusion/pull/20814

   
   ## Which issue does this PR close?
   
   * Part of #20164
   
   ## Rationale for this change
   
   Physical `CastExpr` previously stored only a target `DataType`. This caused 
field-level semantics (name, nullability, and metadata) to be lost when casts 
were represented in the physical layer. In contrast, logical expressions 
already carry this information through `FieldRef`.
   
   This mismatch created several issues:
   
   * Physical and logical cast representations diverged in how they preserve 
schema semantics.
   * Struct casting logic behaved differently depending on whether the cast was 
represented as `CastExpr` or `CastColumnExpr`.
   * Downstream components (such as schema rewriting and ordering equivalence 
analysis) required additional branching and duplicated logic.
   
   Making `CastExpr` field-aware aligns the physical representation with 
logical semantics and enables consistent schema propagation across execution 
planning and expression evaluation.
   
   ## What changes are included in this PR?
   
   This PR introduces field-aware semantics to `CastExpr` and simplifies 
several areas that previously relied on type-only casting.
   
   Key changes include:
   
   1. **Field-aware CastExpr**
   
      * Replace the `cast_type: DataType` field with `target_field: FieldRef`.
      * Add `new_with_target_field` constructor to explicitly construct 
field-aware casts.
      * Keep the existing `new(expr, DataType)` constructor as a compatibility 
shim that creates a canonical field.
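
   A minimal sketch of the constructor shape described above. `DataType`, `Field`, and `Column` are simplified stand-ins for the real Arrow/DataFusion types, and the contents of the "canonical" field (empty name, nullable, no metadata) are an assumption of this sketch, not something this description specifies.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-ins for Arrow's `DataType`, `Field`, and `FieldRef`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

type FieldRef = Arc<Field>;

// Placeholder for the child expression (hypothetical; the real cast wraps
// a physical expression, not a bare column name).
#[derive(Debug, Clone, PartialEq)]
struct Column(String);

#[derive(Debug, Clone)]
struct CastExpr {
    expr: Column,
    // Replaces the former `cast_type: DataType`.
    target_field: FieldRef,
}

impl CastExpr {
    /// Field-aware constructor: the caller supplies name, nullability,
    /// and metadata through the target field.
    fn new_with_target_field(expr: Column, target_field: FieldRef) -> Self {
        Self { expr, target_field }
    }

    /// Type-only compatibility shim. The canonical field built here is an
    /// assumption of this sketch.
    fn new(expr: Column, cast_type: DataType) -> Self {
        let canonical = Field {
            name: String::new(),
            data_type: cast_type,
            nullable: true,
            metadata: HashMap::new(),
        };
        Self::new_with_target_field(expr, Arc::new(canonical))
    }

    fn expr(&self) -> &Column {
        &self.expr
    }

    fn target_field(&self) -> &FieldRef {
        &self.target_field
    }

    /// The target type is still available, now read from the field.
    fn cast_type(&self) -> &DataType {
        &self.target_field.data_type
    }
}
```

   Routing `new` through `new_with_target_field` keeps a single storage representation, so callers that only know a `DataType` do not need to change.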
   
   2. **Return-field and nullability behavior**
   
      * `return_field` now returns the full `target_field`, preserving name, 
nullability, and metadata.
      * `nullable()` now derives its result from the resolved target field 
rather than the input expression.
      * Add compatibility logic so that legacy type-only casts (built with 
`new(expr, DataType)`) preserve their previous behavior.
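
   The intended behavior can be sketched roughly as follows. The types are simplified stand-ins, `input_nullable` stands in for asking the child expression, and the `type_only` flag is a hypothetical way to mark legacy casts; this description does not say how the real code distinguishes them.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-ins for Arrow's `DataType`, `Field`, and `FieldRef`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

type FieldRef = Arc<Field>;

struct CastExpr {
    // Stand-in for evaluating the child expression's nullability.
    input_nullable: bool,
    target_field: FieldRef,
    // Hypothetical marker for casts built from a bare `DataType`.
    type_only: bool,
}

impl CastExpr {
    /// Returns the full target field, so name, nullability, and metadata
    /// all survive the cast.
    fn return_field(&self) -> FieldRef {
        Arc::clone(&self.target_field)
    }

    /// Field-aware casts answer from the target field; legacy type-only
    /// casts keep following the input (assumed previous behavior).
    fn nullable(&self) -> bool {
        if self.type_only {
            self.input_nullable
        } else {
            self.target_field.nullable
        }
    }
}
```

   With this shape, a field-aware cast to a non-nullable target reports `nullable() == false` even when its input is nullable, while a legacy cast continues to mirror its input.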
   
   3. **Struct cast validation improvements**
   
      * Struct-to-struct casting now validates compatibility using field 
information before execution.
      * Planning-time validation prevents unsupported casts from reaching 
execution.
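
   As an illustration of planning-time validation, here is a small self-contained sketch. The specific rules (match fields by name, reject nullable-to-non-nullable, require unmatched target fields to be nullable) are assumptions chosen for the example, and `validate_struct_cast` and `field` are hypothetical names, not the functions added by this PR.

```rust
// Simplified stand-ins for Arrow's `DataType` and `Field`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Utf8,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

// Small helper to keep examples short.
fn field(name: &str, data_type: DataType, nullable: bool) -> Field {
    Field { name: name.to_string(), data_type, nullable }
}

// Checks whether `from` can be cast to `to`, recursing into structs.
fn can_cast(from: &DataType, to: &DataType) -> Result<(), String> {
    match (from, to) {
        (DataType::Struct(src), DataType::Struct(dst)) => validate_struct_cast(src, dst),
        (DataType::Struct(_), _) | (_, DataType::Struct(_)) => {
            Err(format!("cannot cast {:?} to {:?}", from, to))
        }
        // Assumption: every scalar-to-scalar pair is castable in this sketch.
        _ => Ok(()),
    }
}

// Validates a struct-to-struct cast using field information, before any
// data is touched.
fn validate_struct_cast(source: &[Field], target: &[Field]) -> Result<(), String> {
    for t in target {
        match source.iter().find(|s| s.name == t.name) {
            Some(s) => {
                if s.nullable && !t.nullable {
                    return Err(format!(
                        "field '{}': nullable source cannot feed a non-nullable target",
                        t.name
                    ));
                }
                can_cast(&s.data_type, &t.data_type)
                    .map_err(|e| format!("field '{}': {}", t.name, e))?;
            }
            // A target field absent from the source can only be null-filled.
            None if t.nullable => {}
            None => {
                return Err(format!(
                    "field '{}' is missing from the source and is not nullable",
                    t.name
                ));
            }
        }
    }
    Ok(())
}
```

   Returning an error from the planner, rather than failing inside the cast kernel, means an unsupported struct cast is reported before any batches are processed.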
   
   4. **Shared cast property logic**
   
      * Introduce shared helper functions (`cast_expr_properties`, 
`is_order_preserving_cast_family`) for determining ordering preservation.
      * Reuse this logic in both `CastExpr` and `CastColumnExpr` to avoid 
duplicated implementations.
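
   A rough sketch of how one shared helper can serve both expression types. The helper names come from this PR, but their signatures and the rule used here (numeric or boolean to numeric, temporal to temporal, or identical types) are assumptions of the sketch; `DataType` and `SortProperties` are simplified stand-ins.

```rust
// Simplified stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Boolean,
    Int32,
    Int64,
    Float64,
    Date32,
    Timestamp,
    Utf8,
}

impl DataType {
    fn is_numeric(self) -> bool {
        matches!(self, DataType::Int32 | DataType::Int64 | DataType::Float64)
    }

    fn is_temporal(self) -> bool {
        matches!(self, DataType::Date32 | DataType::Timestamp)
    }
}

// Simplified stand-in for the ordering part of expression properties.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SortProperties {
    Ordered { descending: bool },
    Unordered,
}

// Assumed rule: casts inside these families keep the input's ordering.
fn is_order_preserving_cast_family(source: DataType, target: DataType) -> bool {
    ((source.is_numeric() || source == DataType::Boolean) && target.is_numeric())
        || (source.is_temporal() && target.is_temporal())
        || source == target
}

// Shared logic: forward the child's ordering only when the cast preserves it.
fn cast_expr_properties(
    child: SortProperties,
    source: DataType,
    target: DataType,
) -> SortProperties {
    if is_order_preserving_cast_family(source, target) {
        child
    } else {
        SortProperties::Unordered
    }
}

struct CastExpr {
    source: DataType,
    target: DataType,
}

struct CastColumnExpr {
    source: DataType,
    target: DataType,
}

// Both expression types delegate to the same helper.
impl CastExpr {
    fn properties(&self, child: SortProperties) -> SortProperties {
        cast_expr_properties(child, self.source, self.target)
    }
}

impl CastColumnExpr {
    fn properties(&self, child: SortProperties) -> SortProperties {
        cast_expr_properties(child, self.source, self.target)
    }
}
```

   Because both types call the same function, they cannot disagree about whether a given cast preserves ordering.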
   
   5. **Schema rewriter improvements**
   
      * Refactor physical column resolution into `resolve_physical_column`.
      * Simplify cast insertion logic when logical and physical fields differ.
      * Pass explicit physical and logical fields to cast creation for improved 
correctness.
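
   To make the rewriter flow concrete, here is a hedged sketch. Only the name `resolve_physical_column` comes from this PR; its signature, the `rewrite_column` wrapper, the `Expr` enum, and the string-typed `data_type` are all hypothetical simplifications.

```rust
// Simplified stand-in for an Arrow field; the type is just a name here.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

fn field(name: &str, data_type: &str, nullable: bool) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
        nullable,
    }
}

// Hypothetical, minimal expression tree.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column { name: String, index: usize },
    Cast {
        input: Box<Expr>,
        physical_field: Field,
        logical_field: Field,
    },
}

// Looks up a column by name in the physical (file) schema.
fn resolve_physical_column<'a>(
    name: &str,
    physical_schema: &'a [Field],
) -> Option<(usize, &'a Field)> {
    physical_schema
        .iter()
        .enumerate()
        .find(|(_, f)| f.name == name)
}

// Rewrites a logical column reference against the physical schema,
// inserting a cast only when the two fields differ.
fn rewrite_column(logical: &Field, physical_schema: &[Field]) -> Result<Expr, String> {
    let (index, physical) = resolve_physical_column(&logical.name, physical_schema)
        .ok_or_else(|| format!("column '{}' not found in physical schema", logical.name))?;

    let column = Expr::Column {
        name: logical.name.clone(),
        index,
    };

    if physical == logical {
        Ok(column)
    } else {
        // Both fields are passed explicitly so the cast sees full semantics.
        Ok(Expr::Cast {
            input: Box::new(column),
            physical_field: physical.clone(),
            logical_field: logical.clone(),
        })
    }
}
```

   This sketch treats a missing physical column as an error for simplicity; the real rewriter may handle that case differently.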
   
   6. **Ordering equivalence simplification**
   
      * Introduce `substitute_cast_like_ordering` helper to unify handling of 
`CastExpr` and `CastColumnExpr` in ordering equivalence analysis.
   
   7. **Additional unit tests**
   
      * Validate metadata propagation through `return_field`.
      * Verify nullability behavior for field-aware casts.
      * Ensure legacy type-only casts preserve existing semantics.
      * Test struct-cast validation with nested field semantics.
   
   ## Are these changes tested?
   
   Yes.
   
   New unit tests were added in `physical-expr/src/expressions/cast.rs` to 
verify:
   
   * Metadata propagation through field-aware casts
   * Correct nullability behavior derived from the target field
   * Backward compatibility with legacy type-only constructors
   * Struct cast compatibility validation using nested fields
   
   Existing tests continue to pass, confirming compatibility with the previous 
API behavior.
   
   ## Are there any user-facing changes?
   
   There are no direct user-facing behavior changes.
   
   This change primarily improves internal schema semantics and consistency in 
the physical expression layer. Existing APIs remain compatible through the 
legacy constructor that accepts only a `DataType`.
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

