adriangb commented on code in PR #18393:
URL: https://github.com/apache/datafusion/pull/18393#discussion_r2546317584
##########
datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs:
##########
@@ -333,81 +402,154 @@ impl SharedBuildAccumulator {
// CollectLeft: Simple conjunction of bounds and membership check
AccumulatedBuildData::CollectLeft { data } => {
if let Some(partition_data) = data {
+ // Create membership predicate (InList for small build sides, hash lookup otherwise)
+ let membership_expr = create_membership_predicate(
+ &self.on_right,
+ partition_data.pushdown.clone(),
+ &HASH_JOIN_SEED,
+ self.probe_schema.as_ref(),
+ )?;
+
// Create bounds check expression (if bounds available)
- let Some(filter_expr) = create_bounds_predicate(
+ let bounds_expr = create_bounds_predicate(
&self.on_right,
&partition_data.bounds,
- ) else {
- // No bounds available, nothing to update
- return Ok(());
+ );
+
+ // Combine membership and bounds expressions
+ let filter_expr = match (membership_expr, bounds_expr) {
Review Comment:
I think so, because:
1. You have to calculate the bounds anyway in case you need to fall back to
them.
2. Downstream operators may be able to do things with bounds that they can't
with InListExpr (e.g. stats pruning).
3. The bounds are going to be cheaper to evaluate and thus may short-circuit
the InListExpr if they are false.
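
To make the three points concrete, here is a minimal standalone sketch in
plain Rust. It does not use the DataFusion API: `Bounds`, `DynamicFilter`,
`from_build_keys`, `matches`, `can_skip_container` and the `max_in_list`
threshold are all hypothetical names invented for illustration, and keys are
simplified to `i64`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the min/max bounds collected on the build side.
#[derive(Debug, Clone, Copy)]
struct Bounds {
    min: i64,
    max: i64,
}

/// Hypothetical stand-in for the combined filter: bounds AND membership.
struct DynamicFilter {
    bounds: Option<Bounds>,
    membership: Option<HashSet<i64>>,
}

impl DynamicFilter {
    /// Bounds are always computed (point 1); the exact membership set is
    /// only kept when the build side is small enough (assumed threshold).
    fn from_build_keys(keys: &[i64], max_in_list: usize) -> Self {
        let bounds = match (keys.iter().min(), keys.iter().max()) {
            (Some(&min), Some(&max)) => Some(Bounds { min, max }),
            _ => None,
        };
        let membership =
            (keys.len() <= max_in_list).then(|| keys.iter().copied().collect());
        Self { bounds, membership }
    }

    /// Row-level check: the cheap range test runs first, and `&&` skips
    /// the set lookup whenever the range test is false (point 3).
    fn matches(&self, key: i64) -> bool {
        let in_bounds = self
            .bounds
            .map_or(true, |b| b.min <= key && key <= b.max);
        in_bounds
            && self
                .membership
                .as_ref()
                .map_or(true, |set| set.contains(&key))
    }

    /// Container-level check using only min/max statistics (point 2):
    /// a container whose key range does not overlap the bounds can be
    /// skipped without looking at any row.
    fn can_skip_container(&self, container_min: i64, container_max: i64) -> bool {
        self.bounds
            .map_or(false, |b| container_max < b.min || container_min > b.max)
    }
}
```

With build keys `[10, 20, 30]` and a threshold of 8, `matches(15)` is false
(in range but not a member) and `can_skip_container(31, 50)` is true. With a
threshold of 2 the sketch keeps only the bounds, so `matches(15)` becomes
true, which is the fallback case from point 1.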
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]