adriangb opened a new issue, #22028:
URL: https://github.com/apache/datafusion/issues/22028
## Background
Interval and statistics propagation in DataFusion is a best-effort
optimization: it refines bounds on expressions to enable better pruning, filter
pushdown, and join planning, but it is never *required* for query correctness.
If propagation can't compute precise bounds for an expression, the optimizer
should fall back to an unbounded interval (or unknown distribution) and
continue planning.
Today, several propagation paths return `Result<Interval>` and `?`-propagate
errors all the way out to the user as `Internal error: ...`. A single
arithmetic combination that hasn't been wired through the type-coercion
plumbing — or any other unexpected failure inside `Interval` arithmetic — turns
into a hard query failure rather than degraded statistics.
A concrete recent example: #22027 fixed five `assert_eq_or_internal_err!`
calls in `Interval::mul`/`div`/`intersect`/`union`/`contains` that were firing
for `Decimal128(38, 10) / Decimal128(20, 0)` and similar mismatched-type cases.
The underlying bug was real (missing type coercion), but the way it surfaced —
as an internal error from a customer's `numeric / count(*)` alert query — was
disproportionate to the actual problem (we couldn't refine bounds for one
expression).
## Proposal
Treat interval/stats propagation as *advisory*: if any error escapes
`Interval` arithmetic during propagation, log a debug warning and substitute an
unbounded interval (or `Distribution::new_generic` with an unknown range), then
continue optimizing.
### Natural seams
The two main entry points where errors should be caught and converted into
"no information":
1. **`BinaryExpr::evaluate_bounds`** in
`datafusion/physical-expr/src/expressions/binary.rs` (around line 411).
Currently:
```rust
fn evaluate_bounds(&self, children: &[&Interval]) -> Result<Interval> {
    let left_interval = children[0];
    let right_interval = children[1];
    apply_operator(&self.op, left_interval, right_interval)
}
```
Wrap the `apply_operator` call so that on `Err`, we log and return
`Interval::make_unbounded(&self.data_type(...))`.
2. **`apply_operator` callers in `statistics.rs`**
(`datafusion/expr-common/src/statistics.rs` lines 738 and 763). Both compute a
`range_evaluation` from operand ranges; on error they should degrade to a
`Bernoulli(unknown)` (line 738) or unbounded `Distribution::new_generic` (line
763) rather than failing the entire stats computation.
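As a self-contained sketch of the fallback pattern at seam 1 (toy `Interval` and `apply_operator` stand-ins, not the real DataFusion types — the real call would use `Interval::make_unbounded(&self.data_type(...))` and `log::debug!`):

```rust
// Toy interval: `None` on a side means unbounded.
#[derive(Debug, Clone, PartialEq)]
struct Interval {
    lower: Option<i64>,
    upper: Option<i64>,
}

impl Interval {
    fn make_unbounded() -> Self {
        Interval { lower: None, upper: None }
    }
}

// Stand-in for `apply_operator`: errors on an "unwired" operator to
// simulate an internal failure deep inside interval arithmetic.
fn apply_operator(op: &str, l: &Interval, r: &Interval) -> Result<Interval, String> {
    match op {
        "+" => Ok(Interval {
            lower: l.lower.zip(r.lower).map(|(a, b)| a + b),
            upper: l.upper.zip(r.upper).map(|(a, b)| a + b),
        }),
        other => Err(format!("Internal error: unsupported operator {other}")),
    }
}

// The proposed seam: catch the error, log it, and degrade to "no information"
// instead of failing the query.
fn evaluate_bounds(op: &str, children: &[&Interval]) -> Interval {
    apply_operator(op, children[0], children[1]).unwrap_or_else(|e| {
        eprintln!("interval propagation failed ({e}); falling back to unbounded");
        Interval::make_unbounded()
    })
}
```

The same shape applies at seam 2, with `Distribution::new_generic` over an unbounded range (or `Bernoulli(unknown)`) as the degraded value.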
`propagate_constraints` in `BinaryExpr` and the CP solver paths in
`cp_solver.rs` (`update_ranges`, `propagate_arithmetic`, etc.) likely deserve
the same treatment — failure to propagate should yield "no refinement"
(return `Ok(Some(vec![]))` or skip the refinement), not bubble out.
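A sketch of that "no refinement" contract, again with toy stand-ins (in the real `propagate_constraints`, `Ok(None)` signals an infeasible constraint, so the error fallback must be `Ok(Some(vec![]))`, not `Ok(None)`):

```rust
#[derive(Debug, Clone, PartialEq)]
struct Interval(i64, i64); // (lower, upper)

// Stand-in for the fallible inner arithmetic (e.g. `propagate_arithmetic`).
fn try_refine(parent: &Interval, child: &Interval) -> Result<Interval, String> {
    if parent.0 > parent.1 {
        return Err("Internal error: malformed parent interval".into());
    }
    // Refine the child to its intersection with the parent's range.
    Ok(Interval(child.0.max(parent.0), child.1.min(parent.1)))
}

// Proposed contract: an internal error degrades to "no refinement"
// (`Ok(Some(vec![]))`), while a genuinely empty intersection still
// reports infeasibility (`Ok(None)`).
fn propagate_constraints(
    parent: &Interval,
    child: &Interval,
) -> Result<Option<Vec<Interval>>, String> {
    match try_refine(parent, child) {
        Ok(refined) if refined.0 > refined.1 => Ok(None), // infeasible
        Ok(refined) => Ok(Some(vec![refined])),
        Err(_e) => Ok(Some(vec![])), // advisory: skip the refinement
    }
}
```

The key design point is that the two "nothing useful happened" outcomes stay distinguishable: infeasibility can still prune a plan, while a propagation error merely leaves the child intervals untouched.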
### Considerations
- **Visibility**: silently swallowing internal errors masks real bugs in
interval arithmetic. Mitigations:
- Always log at `warn!` or `debug!` with the operator and operand types.
- Gate the fallback on `DataFusionError::Internal` specifically — let
`External`, `IoError`, `ResourcesExhausted`, etc. continue to propagate.
- Add a session config like `optimizer.fail_on_interval_error` (default
`false`) so tests and CI can opt into the strict behavior and catch regressions.
- **Test coverage**: existing tests rely on errors propagating in some cases
(e.g. CP-solver edge cases). Each call site needs a careful audit to make sure
\"no information\" is the right fallback for that specific function's contract.
- **Scope**: this is a behavioral change, not a bug fix. Worth landing on
its own with explicit reviewer sign-off, separate from #22027.
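The error-kind gating and the proposed strict-mode config could combine into one small helper, sketched here with a two-variant stand-in for `DataFusionError` (the real enum has more variants — `IoError`, `ResourcesExhausted`, etc. — which should all keep propagating; `fail_on_interval_error` is the hypothetical session option named above):

```rust
#[derive(Debug)]
enum DataFusionError {
    Internal(String),
    External(String),
}

#[derive(Debug, Clone, PartialEq)]
struct Interval; // placeholder for the real interval type

// Swallow only `Internal` errors, and only when the strict mode
// (`optimizer.fail_on_interval_error`, default false) is off.
fn fallback_to_unbounded(
    result: Result<Interval, DataFusionError>,
    fail_on_interval_error: bool,
) -> Result<Interval, DataFusionError> {
    match result {
        Err(DataFusionError::Internal(msg)) if !fail_on_interval_error => {
            eprintln!("interval propagation failed, degrading to unbounded: {msg}");
            Ok(Interval) // stand-in for `Interval::make_unbounded(...)`
        }
        other => other, // Ok values and non-Internal errors pass through
    }
}
```

With strict mode enabled in CI, any swallowed `Internal` error becomes a test failure again, preserving the regression signal.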
## Alternatives
- Make every `Interval` method infallible by always returning an unbounded
interval on internal failure — pushes the policy down into the data structure
but loses signal for legitimately fallible cases like cast errors during
coercion.
- Status quo + fix bugs as they're reported. This is what we've been doing;
the cost is that each new operator/type combination that wasn't anticipated
turns into a customer-visible regression (cf. #22027 / pydantic/platform#21153).
## Reproducer (current behavior, prior to #22027)
```sql
CREATE TABLE t(c1 BIGINT) AS VALUES (1::bigint), (2::bigint), (10::bigint);
SELECT
  c1,
  CASE WHEN c1 = 0 THEN 100.0
       ELSE ROUND((1.0 - (c1::numeric / c1)) * 100, 2)
  END AS rate
FROM t
WHERE (1.0 - (c1::numeric / c1)) * 100 < 95.0;
```
fails with:
```
Internal error: Assertion failed: ... Intervals must have the same data type
for division
```
After #22027 this specific path returns rows. But any future operator/type
pair that isn't fully coerced will reintroduce the same shape of failure —
that's what this issue is about preventing structurally.