KARTIK64-rgb opened a new pull request, #20799:
URL: https://github.com/apache/datafusion/pull/20799
## Which issue does this PR close?
Closes #20779.
## Rationale for this change
In `max_distinct_count` (inside
`datafusion/physical-plan/src/joins/utils.rs`), the
`Precision::Exact` branch computes the number of non-null rows by doing:
```rust
let count = count - stats.null_count.get_value().unwrap_or(&0);
```
Before #20228 this subtraction was always safe because `num_rows` was never
smaller
than `null_count`. But #20228 added `fetch` (limit push-down) support to
`HashJoinExec`, and when a limit is applied, `partition_statistics()` caps
`num_rows` to `Exact(fetch_value)` without also capping the per-column
`null_count`. This means `null_count` can legally exceed `num_rows`, causing
a
panic with *"attempt to subtract with overflow"*.
## What changes are included in this PR?
- **Bug fix** in `max_distinct_count` (`utils.rs` ~line 725): replaced the
bare
subtraction with a saturating subtraction so that when `null_count` exceeds
`num_rows` the result is clamped to `0` instead of panicking.
```rust
// Before
let count = count - stats.null_count.get_value().unwrap_or(&0);
// After
let count =
count.saturating_sub(*stats.null_count.get_value().unwrap_or(&0));
```
- **Regression test** added at the bottom of the `mod tests` block in the
same
file. The test deliberately constructs a scenario where `null_count (5) >
num_rows (2)` and asserts that `max_distinct_count` returns `Exact(0)`
without
panicking.
## Are these changes tested?
Yes. A new unit test
`test_max_distinct_count_no_overflow_when_null_count_exceeds_num_rows` is
added
directly in `datafusion/physical-plan/src/joins/utils.rs`. It covers the
exact
edge-case from the bug report (null_count > num_rows after a fetch/limit
push-down) and would have caught the panic before the fix.
## Are there any user-facing changes?
No user-facing or API changes. This is a purely internal arithmetic fix in
the
statistics estimation logic. Queries that previously panicked when a limit
was
pushed down into a `HashJoinExec` will now complete successfully.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]