2010YOUY01 commented on PR #18392:
URL: https://github.com/apache/datafusion/pull/18392#issuecomment-4277614451

   I think this PR is a legit improvement. If someone has the bandwidth to 
review and maintain the implementation, it would be great to move it forward.
   Personally, I feel this solution isn’t general enough, so I’d prefer to 
explore the alternative approaches outlined below.
   
   ## Issues
   This solution only supports semi join with one in-equality join condition, 
if we add additional join conditions, this fast-path is not applicable, we 
still have to implement semi/anti join in the classic join path in 
https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/joins/piecewise_merge_join/classic_join.rs
   
   Example:
   ```
   -- This works, in the left-side build we can gather min/max, and use that to 
directly filter the probe side
   select *
   from t1
   right semi join t2
   on t1.v1 < t2.v1
   
   -- This does not work, we still have to enumerate the joined pair first, to 
evaluate the residual expr
   select *
   from t1
   right semi join t2
   on (t1.v1 < t2.v1) AND (t1.v2 < t2.v2)
   ```
   
   ## Alternatives
   Here are two more general implementation ideas:
   1. Optimizer rewriting to a aggregate/filter pattern
   Example:
   `SELECT t1.* FROM t1 LEFT SEMI JOIN t2 ON t1.value < t2.value`
   becomes
   `SELECT t1.* FROM t1 WHERE t1.value < (SELECT max(t2.value) FROM t2)`
   
   This implementation is the most efficient way, and it's also extendible to 
other join conditions that can be rewritten to this aggregate + filter pattern, 
we can build a general interface to support a new family of optimizations.
   For example
   ```sql
   SELECT t1.*
   FROM t1
   LEFT SEMI JOIN t2
   ON t1.v1 > t2.v1
   AND t1.v2 > t2.v2
   ```
   can be similarly rewritten to
   ```sql
   SELECT t1.*
   FROM t1
   WHERE t1.v1 > (
     SELECT min(t2.v1)
     FROM t2
   )
   AND t1.v2 > (
     SELECT min(t2.v2)
     FROM t2
   )
   ```
   
   2. Implement semi/anti join in the existing piecewisemerge join path in 
https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/joins/piecewise_merge_join/classic_join.rs
   For the above example workload this is not optimal, but this is still needed 
for more general use cases, so I think we can first maintain this version for 
simplicity.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to