KARTIK64-rgb opened a new pull request, #20799:
URL: https://github.com/apache/datafusion/pull/20799

   ## Which issue does this PR close?
   
   Closes #20779.
   
   ## Rationale for this change
   
   In `max_distinct_count` (inside 
`datafusion/physical-plan/src/joins/utils.rs`), the
   `Precision::Exact` branch computes the number of non-null rows by doing:
   
   ```rust
   let count = count - stats.null_count.get_value().unwrap_or(&0);
   ```
   
   Before #20228 this subtraction was always safe because `num_rows` was never 
smaller
   than `null_count`. But #20228 added `fetch` (limit push-down) support to
   `HashJoinExec`, and when a limit is applied, `partition_statistics()` caps
   `num_rows` to `Exact(fetch_value)` without also capping the per-column
   `null_count`. This means `null_count` can legally exceed `num_rows`, causing 
a
   panic with *"attempt to subtract with overflow"*.
   
   ## What changes are included in this PR?
   
   - **Bug fix** in `max_distinct_count` (`utils.rs` ~line 725): replaced the 
bare
     subtraction with a saturating subtraction so that when `null_count` exceeds
     `num_rows` the result is clamped to `0` instead of panicking.
   
     ```rust
     // Before
     let count = count - stats.null_count.get_value().unwrap_or(&0);
   
     // After
     let count = 
count.saturating_sub(*stats.null_count.get_value().unwrap_or(&0));
     ```
   
   - **Regression test** added at the bottom of the `mod tests` block in the 
same
     file. The test deliberately constructs a scenario where `null_count (5) >
     num_rows (2)` and asserts that `max_distinct_count` returns `Exact(0)` 
without
     panicking.
   
   ## Are these changes tested?
   
   Yes. A new unit test
   `test_max_distinct_count_no_overflow_when_null_count_exceeds_num_rows` is 
added
   directly in `datafusion/physical-plan/src/joins/utils.rs`. It covers the 
exact
   edge-case from the bug report (null_count > num_rows after a fetch/limit
   push-down) and would have caught the panic before the fix.
   
   ## Are there any user-facing changes?
   
   No user-facing or API changes. This is a purely internal arithmetic fix in 
the
   statistics estimation logic. Queries that previously panicked when a limit 
was
   pushed down into a `HashJoinExec` will now complete successfully.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to