discord9 opened a new issue, #22142:
URL: https://github.com/apache/datafusion/issues/22142

   ### Describe the bug
   
   The `unwrap_cast` optimization in `ExprSimplifier` (both logical and 
physical) rewrites patterns like `CAST(column AS target_type) <op> 
literal_target` into `column <op> CAST(literal_target AS source_type)` without 
verifying that the rewrite preserves query semantics. This default-allow 
behavior silently produces incorrect results for many common cast patterns.
   
   ### Specific broken scenarios
   
   **Timestamp precision downcast (many→one rewriting):**
   ```sql
   -- ts_ns has values 1704067200001000000 and 1704067200001234567
   -- Both cast to the same timestamp(ms)
   SELECT * FROM t WHERE CAST(ts_ns AS timestamp(ms)) = timestamp_ms_value;
   -- Rewrites to: WHERE ts_ns = <nanosecond-literal>
   -- Result: only matches ONE nanosecond value, not both → WRONG row count
   ```
   
   **Source domain not preserved (overflow/null loss):**
   ```sql
   -- column1 is Int64, literal is 9999999999 (doesn't fit in Int32)
   SELECT * FROM t WHERE CAST(column1 AS Int32) < 9999999999;
   -- Rewrites to: WHERE column1 < CAST(9999999999 AS Int32)
   -- The literal cast silently truncates → WRONG predicate
   ```
   
   **Type semantics changed (string round-trip, float truncation):**
   ```sql
   SELECT * FROM t WHERE CAST(int_col AS Utf8) = '0123';
   -- Rewrites to: WHERE int_col = 0123
   -- '0123' is NOT equivalent to 123 → WRONG comparison
   ```
   
   **Timezone silently dropped:**
   ```sql
   SELECT * FROM t WHERE CAST(ts AS timestamp(ms, 'UTC')) = timestamp_literal;
   -- Rewrites without preserving the timezone → semantics changed
   ```
   
   ### Additional affected patterns
   
   - Int→UInt (negative values become NULL, rewrite doesn't account)
   - UInt64→Int64 (large UInt64 values above Int64::MAX)
   - Decimal scale downcast (fractional digits truncated)
   - Decimal→Int (truncation not preserved)
   - Float→Int / Float64→Float32 (precision loss)
   - Date↔Timestamp (many→one)
   - Dictionary↔Utf8 (dictionary key semantics)
   
   ### Impact
   
   The optimizer is silently producing wrong query results. There is no error 
or warning — the plan looks correct and the query returns results that are 
simply semantically wrong. This has existed since the feature was introduced.
   
   ### Proposed fix
   
   Replace the default-allow logic with a closed-by-default allowlist. Only 
cast patterns proven safe (with literal round-trip verification) should be 
eligible. Implementation in https://github.com/apache/datafusion/pull/21908.
   
   - Closed-by-default: no cast is unwrapped unless explicitly allowlisted
   - Current allowlist: same-timezone timestamp precision widening only
   - Require exact literal round-trip: lit to source_type and back must equal 
the original
   - All other casts (int widenings, string to int, decimals, dictionaries) 
intentionally kept in plan
   
   ### To Reproduce
   
   See `datafusion/sqllogictest/test_files/unwrap_cast.slt` or the unit tests 
in `datafusion/expr-common/src/casts.rs`, 
`datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs`, 
`datafusion/physical-expr/src/simplifier/unwrap_cast.rs`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to