adriangb opened a new issue, #22028:
URL: https://github.com/apache/datafusion/issues/22028
## Background
Interval and statistics propagation in DataFusion is a best-effort
optimization: it refines bounds on expressions to enable better pruning, filter
pushdown, and join planning, but it is never *required* for query correctness.
If propagation can't compute precise bounds for an expression, the optimizer
should fall back to an unbounded interval (or unknown distribution) and
continue planning.
Today, several propagation paths return `Result<Interval>` and `?`-propagate
errors all the way out to the user as `Internal error: ...`. A single
arithmetic combination that hasn't been wired through the type-coercion
plumbing — or any other unexpected failure inside `Interval` arithmetic — turns
into a hard query failure rather than degraded statistics.
A concrete recent example: #22027 fixed five `assert_eq_or_internal_err!`
calls in `Interval::mul`/`div`/`intersect`/`union`/`contains` that were firing
for `Decimal128(38, 10) / Decimal128(20, 0)` and similar mismatched-type cases.
The underlying bug was real (missing type coercion), but the way it surfaced —
as an internal error from a customer's `numeric / count(*)` alert query — was
disproportionate to the actual problem (we couldn't refine bounds for one
expression).
## Proposal
Treat interval/stats propagation as *advisory*: if any error escapes
`Interval` arithmetic during propagation, log a debug warning and substitute an
unbounded interval (or `Distribution::new_generic` with an unknown range), then
continue optimizing.
### Natural seams
The two main entry points where errors should be caught and converted into
"no information":
1. **`BinaryExpr::evaluate_bounds`** in
`datafusion/physical-expr/src/expressions/binary.rs` (around line 411).
Currently:
```rust
fn evaluate_bounds(&self, children: &[&Interval]) -> Result<Interval> {
    let left_interval = children[0];
    let right_interval = children[1];
    apply_operator(&self.op, left_interval, right_interval)
}
```
Wrap the `apply_operator` call so that on `Err`, we log and return
`Interval::make_unbounded(&self.data_type(...))`.
2. **`apply_operator` callers in `statistics.rs`**
(`datafusion/expr-common/src/statistics.rs` lines 738 and 763). Both compute a
`range_evaluation` from operand ranges; on error they should degrade to a
`Bernoulli(unknown)` (line 738) or unbounded `Distribution::new_generic` (line
763) rather than failing the entire stats computation.
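As a self-contained sketch of the fallback pattern at seam 1 (toy `Interval` and `apply_operator` stand-ins, not the real DataFusion types — the real call would use `Interval::make_unbounded(&self.data_type(...))` and `log::debug!`):

```rust
// Toy interval: `None` on a side means unbounded.
#[derive(Debug, Clone, PartialEq)]
struct Interval {
    lower: Option<i64>,
    upper: Option<i64>,
}

impl Interval {
    fn make_unbounded() -> Self {
        Interval { lower: None, upper: None }
    }
}

// Stand-in for `apply_operator`: errors on an "unwired" operator to
// simulate an internal failure deep inside interval arithmetic.
fn apply_operator(op: &str, l: &Interval, r: &Interval) -> Result<Interval, String> {
    match op {
        "+" => Ok(Interval {
            lower: l.lower.zip(r.lower).map(|(a, b)| a + b),
            upper: l.upper.zip(r.upper).map(|(a, b)| a + b),
        }),
        other => Err(format!("Internal error: unsupported operator {other}")),
    }
}

// The proposed seam: catch the error, log it, and degrade to "no information"
// instead of failing the query.
fn evaluate_bounds(op: &str, children: &[&Interval]) -> Interval {
    apply_operator(op, children[0], children[1]).unwrap_or_else(|e| {
        eprintln!("interval propagation failed ({e}); falling back to unbounded");
        Interval::make_unbounded()
    })
}
```

The same shape applies at seam 2, with `Distribution::new_generic` over an unbounded range (or `Bernoulli(unknown)`) as the degraded value.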
`propagate_constraints` in `BinaryExpr` and the CP solver paths in
`cp_solver.rs` (`update_ranges`, `propagate_arithmetic`, etc.) likely deserve
the same treatment — failure to propagate should yield "no refinement"
(return `Ok(Some(vec![]))` or skip the refinement), not bubble out.
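A sketch of that "no refinement" contract, again with toy stand-ins (in the real `propagate_constraints`, `Ok(None)` signals an infeasible constraint, so the error fallback must be `Ok(Some(vec![]))`, not `Ok(None)`):

```rust
#[derive(Debug, Clone, PartialEq)]
struct Interval(i64, i64); // (lower, upper)

// Stand-in for the fallible inner arithmetic (e.g. `propagate_arithmetic`).
fn try_refine(parent: &Interval, child: &Interval) -> Result<Interval, String> {
    if parent.0 > parent.1 {
        return Err("Internal error: malformed parent interval".into());
    }
    // Refine the child to its intersection with the parent's range.
    Ok(Interval(child.0.max(parent.0), child.1.min(parent.1)))
}

// Proposed contract: an internal error degrades to "no refinement"
// (`Ok(Some(vec![]))`), while a genuinely empty intersection still
// reports infeasibility (`Ok(None)`).
fn propagate_constraints(
    parent: &Interval,
    child: &Interval,
) -> Result<Option<Vec<Interval>>, String> {
    match try_refine(parent, child) {
        Ok(refined) if refined.0 > refined.1 => Ok(None), // infeasible
        Ok(refined) => Ok(Some(vec![refined])),
        Err(_e) => Ok(Some(vec![])), // advisory: skip the refinement
    }
}
```

The key design point is that the two "nothing useful happened" outcomes stay distinguishable: infeasibility can still prune a plan, while a propagation error merely leaves the child intervals untouched.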
### Considerations
- **Visibility**: silently swallowing internal errors masks real bugs in
interval arithmetic. Mitigations:
- Always log at `warn!` or `debug!` with the operator and operand types.
- Gate the fallback on `DataFusionError::Internal` specifically — let
`External`, `IoError`, `ResourcesExhausted`, etc. continue to propagate.
- Add a session config like `optimizer.fail_on_interval_error` (default
`false`) so tests and CI can opt into the strict behavior and catch regressions.
- **Test coverage**: existing tests rely on errors propagating in some cases
(e.g. CP-solver edge cases). Each call site needs a careful audit to make sure
\"no information\" is the right fallback for that specific function's contract.
- **Scope**: this is a behavioral change, not a bug fix. Worth landing on
its own with explicit reviewer sign-off, separate from #22027.
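The error-kind gating and the proposed strict-mode config could combine into one small helper, sketched here with a two-variant stand-in for `DataFusionError` (the real enum has more variants — `IoError`, `ResourcesExhausted`, etc. — which should all keep propagating; `fail_on_interval_error` is the hypothetical session option named above):

```rust
#[derive(Debug)]
enum DataFusionError {
    Internal(String),
    External(String),
}

#[derive(Debug, Clone, PartialEq)]
struct Interval; // placeholder for the real interval type

// Swallow only `Internal` errors, and only when the strict mode
// (`optimizer.fail_on_interval_error`, default false) is off.
fn fallback_to_unbounded(
    result: Result<Interval, DataFusionError>,
    fail_on_interval_error: bool,
) -> Result<Interval, DataFusionError> {
    match result {
        Err(DataFusionError::Internal(msg)) if !fail_on_interval_error => {
            eprintln!("interval propagation failed, degrading to unbounded: {msg}");
            Ok(Interval) // stand-in for `Interval::make_unbounded(...)`
        }
        other => other, // Ok values and non-Internal errors pass through
    }
}
```

With strict mode enabled in CI, any swallowed `Internal` error becomes a test failure again, preserving the regression signal.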
## Alternatives
- Make every `Interval` method infallible by always returning an unbounded
interval on internal failure — pushes the policy down into the data structure
but loses signal for legitimately fallible cases like cast errors during
coercion.
- Status quo + fix bugs as they're reported. This is what we've been doing;
the cost is that each new operator/type combination that wasn't anticipated
turns into a customer-visible regression (cf. #22027 / pydantic/platform#21153).
## Reproducer (current behavior, prior to #22027)
```sql
CREATE TABLE t(c1 BIGINT) AS VALUES (1::bigint), (2::bigint), (10::bigint);
SELECT
  c1,
  CASE WHEN c1 = 0 THEN 100.0
       ELSE ROUND((1.0 - (c1::numeric / c1)) * 100, 2)
  END AS rate
FROM t
WHERE (1.0 - (c1::numeric / c1)) * 100 < 95.0;
```
fails with:
```
Internal error: Assertion failed: ... Intervals must have the same data type
for division
```
After #22027 this specific path returns rows. But any future operator/type
pair that isn't fully coerced will reintroduce the same shape of failure —
that's what this issue is about preventing structurally.