2010YOUY01 commented on PR #18392: URL: https://github.com/apache/datafusion/pull/18392#issuecomment-4277614451
I think this PR is a legit improvement. If someone has the bandwidth to review and maintain the implementation, it would be great to move it forward. Personally, I feel this solution isn’t general enough, so I’d prefer to explore the alternative approaches outlined below. ## Issues This solution only supports semi join with one in-equality join condition, if we add additional join conditions, this fast-path is not applicable, we still have to implement semi/anti join in the classic join path in https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/joins/piecewise_merge_join/classic_join.rs Example: ``` -- This works, in the left-side build we can gather min/max, and use that to directly filter the probe side select * from t1 right semi join t2 on t1.v1 < t2.v1 -- This does not work, we still have to enumerate the joined pair first, to evaluate the residual expr select * from t1 right semi join t2 on (t1.v1 < t2.v1) AND (t1.v2 < t2.v2) ``` ## Alternatives Here are two more general implementation ideas: 1. Optimizer rewriting to a aggregate/filter pattern Example: `SELECT t1.* FROM t1 LEFT SEMI JOIN t2 ON t1.value < t2.value` becomes `SELECT t1.* FROM t1 WHERE t1.value < (SELECT max(t2.value) FROM t2)` This implementation is the most efficient way, and it's also extendible to other join conditions that can be rewritten to this aggregate + filter pattern, we can build a general interface to support a new family of optimizations. For example ```sql SELECT t1.* FROM t1 LEFT SEMI JOIN t2 ON t1.v1 > t2.v1 AND t1.v2 > t2.v2 ``` can be similarly rewritten to ```sql SELECT t1.* FROM t1 WHERE t1.v1 > ( SELECT min(t2.v1) FROM t2 ) AND t1.v2 > ( SELECT min(t2.v2) FROM t2 ) ``` 2. Implement semi/anti join in the existing piecewisemerge join path in https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/joins/piecewise_merge_join/classic_join.rs For the above example workload this is not optimal, but this is still needed for more general use cases, so I think we can first maintain this version for simplicity. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
