adriangb commented on PR #103: URL: https://github.com/apache/datafusion-site/pull/103#issuecomment-3263847495
> Would it be feasible to have [`ColumnBounds`](https://github.com/apache/datafusion/blob/baf6f602879030dea741322d6f219d401983bb78/datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs#L39) include multiple ranges (which would then be combined with `OR`) instead of a single min/max?

How would we represent those ranges? Would you just create a fixed number of ranges (e.g. 32)? If you have a concrete idea please feel free to make a PR!

FWIW in my mind the min/max stats approach is mostly helpful when the filters on your build side mean that the range describes only a small part of the probe side. For example, with temporal data this is going to be super common. For cases where this doesn't hold we should probably just push down the entire hash table as a filter, either as an actual reference to the hash table, as an `IN <list>` expression, or as a bloom filter (possibly several of the above, or choosing based on the size of the build side; they each have pros/cons/properties). This is tracked in #17171.
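To make the multi-range question concrete, here is a minimal sketch of one possible representation, assuming integer join keys and a fixed cap on the number of ranges. `Range`, `MultiRangeBounds`, `from_values`, and `contains` are hypothetical names for illustration only, not DataFusion's `ColumnBounds` API. The sketch splits the sorted build-side values at their widest gaps, which is just one heuristic among several.

```rust
/// Hypothetical stand-in for one min/max bound on a join key column.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Range {
    pub min: i64,
    pub max: i64,
}

/// Hypothetical multi-range bounds: a probe value passes if it falls
/// inside ANY of the ranges (the ranges are combined with OR).
#[derive(Debug, Clone, PartialEq)]
pub struct MultiRangeBounds {
    ranges: Vec<Range>,
}

impl MultiRangeBounds {
    /// Build at most `max_ranges` ranges from the build-side values by
    /// cutting the sorted values at the widest gaps between neighbors.
    pub fn from_values(values: &[i64], max_ranges: usize) -> Self {
        let max_ranges = max_ranges.max(1); // always allow at least one range
        let mut sorted = values.to_vec();
        sorted.sort_unstable();
        sorted.dedup();
        if sorted.is_empty() {
            // Empty build side: no ranges, so no probe value can match.
            return Self { ranges: Vec::new() };
        }

        // (gap size, index of the value that follows the gap)
        let mut gaps: Vec<(u64, usize)> = sorted
            .windows(2)
            .enumerate()
            .map(|(i, w)| (w[1].abs_diff(w[0]), i + 1))
            .collect();
        // Widest gaps first; keep the top `max_ranges - 1` as cut points.
        gaps.sort_unstable_by(|a, b| b.0.cmp(&a.0));
        let mut cuts: Vec<usize> =
            gaps.iter().take(max_ranges - 1).map(|&(_, i)| i).collect();
        cuts.sort_unstable();

        let mut ranges = Vec::with_capacity(cuts.len() + 1);
        let mut start = 0;
        for cut in cuts {
            ranges.push(Range { min: sorted[start], max: sorted[cut - 1] });
            start = cut;
        }
        ranges.push(Range { min: sorted[start], max: sorted[sorted.len() - 1] });
        Self { ranges }
    }

    /// True if `v` lies inside at least one range (logical OR).
    pub fn contains(&self, v: i64) -> bool {
        self.ranges.iter().any(|r| r.min <= v && v <= r.max)
    }

    pub fn ranges(&self) -> &[Range] {
        &self.ranges
    }
}
```

With build-side keys clustered around 1 to 3, 100 to 101, and 1000, a cap of three ranges rejects a probe value of 50, while a single min/max range over the same keys would let it through. The cost is a longer `OR` predicate per probe row, so the cap matters.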
