alamb commented on PR #103:
URL: https://github.com/apache/datafusion-site/pull/103#issuecomment-3266091514

   > The different partitions must not have scanned data which included both 
extremes, resulting in an efficient dynamic filter.
   > 
   > Would it be feasible to have 
[`ColumnBounds`](https://github.com/apache/datafusion/blob/baf6f602879030dea741322d6f219d401983bb78/datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs#L39)
 include multiple ranges (which would then be combined with `OR`) instead of a 
single min/max? I think this could solve the problem in these types of queries. 
The potential issue might be having queries whose build side would return many 
rows, causing the dynamic filter to be very large, but in that case we could 
merge the ranges to not exceed some N.
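
A minimal sketch of the multi-range idea quoted above, assuming integer join keys. The `Range` type and the `coalesce` and `matches` functions are hypothetical names for illustration only; they are not DataFusion's `ColumnBounds` API. Overlapping ranges are merged first, then the pair separated by the smallest gap is merged repeatedly until at most `max_ranges` remain, which keeps the `OR` filter bounded at the cost of admitting some extra rows.

```rust
/// Hypothetical closed interval [min, max] over an integer key column.
/// This is an illustration, not DataFusion's `ColumnBounds`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Range {
    min: i64,
    max: i64,
}

/// Merge overlapping ranges, then keep merging the two neighbours with the
/// smallest gap until at most `max_ranges` ranges remain.
fn coalesce(mut ranges: Vec<Range>, max_ranges: usize) -> Vec<Range> {
    let cap = max_ranges.max(1); // always allow at least one range
    ranges.sort_by_key(|r| r.min);

    // Pass 1: fold overlapping ranges together.
    let mut merged: Vec<Range> = Vec::new();
    for r in ranges {
        match merged.last_mut() {
            Some(last) if r.min <= last.max => last.max = last.max.max(r.max),
            _ => merged.push(r),
        }
    }

    // Pass 2: close the smallest gap until the cap is respected.
    while merged.len() > cap {
        let i = (0..merged.len() - 1)
            .min_by_key(|&i| merged[i + 1].min.saturating_sub(merged[i].max))
            .unwrap(); // safe: len > cap >= 1, so there is at least one gap
        merged[i].max = merged[i + 1].max;
        merged.remove(i + 1);
    }
    merged
}

/// Equivalent of `(v BETWEEN min1 AND max1) OR (v BETWEEN min2 AND max2) ...`.
fn matches(ranges: &[Range], v: i64) -> bool {
    ranges.iter().any(|r| r.min <= v && v <= r.max)
}
```

Merging by smallest gap is only one possible policy; it was picked here because it adds the fewest extra key values to the filter per merge.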
   
   Another possibility is to use something like a Bloom filter, which I think is 
what Spark does. 
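
For illustration, a rough sketch of what a Bloom filter built from the build-side keys might look like, using only the Rust standard library. The `BloomFilter` type, its sizing parameters, and the double-hashing scheme are assumptions made for this example; this is not how DataFusion or Spark implement it. The key property is that a probe-side key that was inserted is never rejected, while keys that were not inserted are rejected with high probability.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical Bloom filter over build-side join keys (illustration only).
struct BloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        Self {
            bits: vec![false; num_bits.max(1)],
            num_hashes: num_hashes.max(1),
        }
    }

    /// Derive `num_hashes` bit positions from two base hashes
    /// (double hashing: position_i = a + i * b mod m).
    fn positions<T: Hash>(&self, item: &T) -> Vec<usize> {
        let mut h1 = DefaultHasher::new();
        item.hash(&mut h1);
        let a = h1.finish();

        let mut h2 = DefaultHasher::new();
        (item, 0x9e37_79b9_u64).hash(&mut h2); // salt for a second hash
        let b = h2.finish() | 1; // force odd so the step is never zero

        let m = self.bits.len() as u64;
        (0..self.num_hashes)
            .map(|i| (a.wrapping_add(i.wrapping_mul(b)) % m) as usize)
            .collect()
    }

    /// Record a build-side key.
    fn insert<T: Hash>(&mut self, item: &T) {
        for p in self.positions(item) {
            self.bits[p] = true;
        }
    }

    /// Returns false only if the key was definitely never inserted;
    /// true means "possibly present" (false positives are allowed).
    fn might_contain<T: Hash>(&self, item: &T) -> bool {
        self.positions(item).into_iter().all(|p| self.bits[p])
    }
}
```

Unlike a min/max range, the filter size here is fixed up front and does not depend on how the build-side values are distributed, so widely separated extremes do not make it less selective.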


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

