alamb commented on PR #103: URL: https://github.com/apache/datafusion-site/pull/103#issuecomment-3266091514
> The different partitions must not have scanned data which included both extremes, resulting in an efficient dynamic filter.
>
> Would it be feasible to have [`ColumnBounds`](https://github.com/apache/datafusion/blob/baf6f602879030dea741322d6f219d401983bb78/datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs#L39) include multiple ranges (which would then be combined with `OR`) instead of a single min/max? I think this could solve the problem in these types of queries. The potential issue might be having queries whose build side would return many rows, causing the dynamic filter to be very large, but in that case we could merge the ranges to not exceed some N.

Another possibility is to use something like a Bloom filter, which I think is what Spark does.
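To make the "multiple ranges, merged down to at most N" idea concrete, here is a rough standalone sketch. It is not DataFusion code: `KeyRange`, `merge_ranges`, and `passes_filter` are hypothetical names, the keys are assumed to be `i64`, and closing the smallest gap first is just one possible merge policy.

```rust
/// A closed interval [min, max] of join-key values seen on the build side.
/// Hypothetical type for illustration; this is not DataFusion's `ColumnBounds`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct KeyRange {
    min: i64,
    max: i64,
}

/// Sort the ranges, coalesce overlapping ones, then keep merging the two
/// neighbours separated by the smallest gap until at most `max_ranges` remain.
fn merge_ranges(mut ranges: Vec<KeyRange>, max_ranges: usize) -> Vec<KeyRange> {
    assert!(max_ranges >= 1, "need at least one output range");
    ranges.sort_by_key(|r| r.min);

    // Pass 1: coalesce ranges that overlap.
    let mut merged: Vec<KeyRange> = Vec::with_capacity(ranges.len());
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            if r.min <= last.max {
                last.max = last.max.max(r.max);
                continue;
            }
        }
        merged.push(r);
    }

    // Pass 2: while there are too many ranges, close the smallest gap.
    // Gaps are computed in i128 so extreme i64 bounds cannot overflow.
    while merged.len() > max_ranges {
        let i = (0..merged.len() - 1)
            .min_by_key(|&i| merged[i + 1].min as i128 - merged[i].max as i128)
            .unwrap();
        merged[i].max = merged[i + 1].max;
        merged.remove(i + 1);
    }
    merged
}

/// Evaluate the filter as an OR over all ranges.
fn passes_filter(ranges: &[KeyRange], value: i64) -> bool {
    ranges.iter().any(|r| r.min <= value && value <= r.max)
}

fn main() {
    // One range per build-side partition (made-up numbers).
    let per_partition = vec![
        KeyRange { min: 1, max: 10 },
        KeyRange { min: 990, max: 1000 },
        KeyRange { min: 8, max: 20 },
        KeyRange { min: 900, max: 910 },
    ];
    let filter = merge_ranges(per_partition, 2);
    println!("{:?}", filter);
    println!("500 passes: {}", passes_filter(&filter, 500));
}
```

With these inputs a single min/max would be `[1, 1000]` and prune nothing, while two ranges still reject keys such as 500. Merging can only widen the filter, so it may let extra probe rows through but never drops a row that would match.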
