isidentical commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1002132576
##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -771,32 +745,47 @@ fn compare_left_boundaries(
// it is actually smaller than what we have right now) and it is a valid
// value (e.g. [0, 100] < -100 would update the boundaries to [0, -100] if
// there weren't the selectivity check).
- if rhs_scalar_value < variadic_max && selectivity > 0.0 {
- (variadic_min, rhs_scalar_value)
+ if right < left_max && selectivity > 0.0 {
+ (left_min, right)
} else {
- (variadic_min, variadic_max)
+ (left_min, left_max)
}
}
Operator::Gt | Operator::GtEq => {
// Same as above, but this time we want to limit the lower bound.
- if rhs_scalar_value > variadic_min && selectivity > 0.0 {
- (rhs_scalar_value, variadic_max)
+ if right > left_min && selectivity > 0.0 {
+ (right, left_max)
} else {
- (variadic_min, variadic_max)
+ (left_min, left_max)
}
}
// For equality, we don't have the range problem so even if the selectivity
// is 0.0, we can still update the boundaries.
- Operator::Eq => (rhs_scalar_value.clone(), rhs_scalar_value),
+ Operator::Eq => (right.clone(), right),
_ => unreachable!(),
};
- Some(ExprBoundaries {
- min_value: new_min,
- max_value: new_max,
- distinct_count,
- selectivity: Some(selectivity),
- })
+ // The context represents all the knowledge we have gathered during the
+ // analysis process, which we can now add more since the expression's upper
+ // and lower boundaries might have changed.
+ let left_bounds = ExprBoundaries::new(left_min, left_max, left_bounds.distinct_count);
+ left.apply(context, &left_bounds);
Review Comment:
> 🤔 don't the rules about combining information about the left and right
> side depend on the operators?

Exactly! And that is certainly possible with the current implementation (all
of the control is delegated to the expression's own `analyze()`
implementation). This part is only about a single comparison, but the same
system (albeit used a bit differently) also works for logical operators such
as AND and OR, like in the following example (originally from #3898):
> Each expression can either decide to share all the context with its
> children (e.g. in the case of AND, they'll all share a single context, so
> all the following expressions would know that `a`, for example, can only be
> greater than 6 at this point, etc.) or fork the context and then merge it
> back together using its own logic (OR is the most basic example, where you
> have to take the union of the ranges since it could be either of them).
If we were to implement `AND` and `OR`, then `AND` would pretty much pass the
same context throughout the conjunction. But for `OR`, when diverging to the
left and right sides, we would pass separate contexts (clones of the current
one), and once both sides have been processed we could merge the boundaries
of the columns in those contexts back into the existing context (basically a
union of left and right; `(a < 5) OR (a > 10)` would yield
`bounds(a) = [min(a), 5] U [10, max(a)]`). This allows us to robustly handle
chains of ANDs/ORs without worrying about it at the call sites (all the
responsibility is delegated to the parent expression, which knows how to
handle its children better than anyone 🙂).
If it sounds interesting, I can try to give implementing `AND`/`OR` a shot in
this PR (to have an actual use case). WDYT?