neilconway commented on code in PR #20391:
URL: https://github.com/apache/datafusion/pull/20391#discussion_r2814581399
##########
datafusion/physical-plan/src/filter.rs:
##########
@@ -771,10 +771,21 @@ fn collect_new_statistics(
};
};
let (lower, upper) = interval.into_bounds();
- let (min_value, max_value) = if lower.eq(&upper) {
- (Precision::Exact(lower), Precision::Exact(upper))
+ let bounds_equal =
+ !lower.is_null() && !upper.is_null() && lower.eq(&upper);
+ let min_value = if lower.is_null() {
+ Precision::Absent
+ } else if bounds_equal {
+ Precision::Exact(lower)
} else {
- (Precision::Inexact(lower), Precision::Inexact(upper))
+ Precision::Inexact(lower)
+ };
+ let max_value = if upper.is_null() {
+ Precision::Absent
+ } else if bounds_equal {
+ Precision::Exact(upper)
+ } else {
+ Precision::Inexact(upper)
Review Comment:
I found the revised logic a little hard to read. What about something like:
```suggestion
let is_exact =
!lower.is_null() && !upper.is_null() && lower == upper;
let min_value = interval_bound_to_precision(lower,
is_exact);
let max_value = interval_bound_to_precision(upper,
is_exact);
```
Where we define a helper func:
```rust
/// Converts an interval bound to a [`Precision`] value. NULL bounds (which
/// represent "unbounded" in the [`Interval`] type) map to
[`Precision::Absent`].
fn interval_bound_to_precision(
bound: ScalarValue,
is_exact: bool,
) -> Precision<ScalarValue> {
if bound.is_null() {
Precision::Absent
} else if is_exact {
Precision::Exact(bound)
} else {
Precision::Inexact(bound)
}
}
```
##########
datafusion/physical-plan/src/filter.rs:
##########
@@ -2053,4 +2064,43 @@ mod tests {
Ok(())
}
+
+ /// Regression test: columns with Absent min/max statistics must remain
+ /// Absent after FilterExec, not be converted to Exact(NULL). The latter
+ /// causes `estimate_disjoint_inputs` to incorrectly conclude join inputs
+ /// are disjoint (ScalarValue's PartialOrd sorts NULLs last), producing
+ /// zero cardinality and forcing Partitioned join mode.
Review Comment:
Some parts of this comment overfit to the exact scenario you ran into and
other implementation details. What about something simpler like:
```suggestion
/// Columns with Absent min/max statistics should remain Absent after
/// FilterExec.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]