adriangb commented on PR #103:
URL: https://github.com/apache/datafusion-site/pull/103#issuecomment-3263847495

   > Would it be feasible to have 
[`ColumnBounds`](https://github.com/apache/datafusion/blob/baf6f602879030dea741322d6f219d401983bb78/datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs#L39)
 include multiple ranges (which would then be combined with `OR`) instead of a 
single min/max?
   
   How would we represent those ranges? Would you just create a fixed bucket of 
ranges (e.g. 32 ranges)? If you have a concrete idea please feel free to make a 
PR!
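
To make the question concrete, here is a minimal sketch of one possible representation: a fixed cap on the number of ranges, built by splitting the sorted build-side keys at their largest gaps. Everything here (`KeyRange`, `MultiRangeBounds`, `from_values`, `may_contain`, and the restriction to `i64` keys) is hypothetical and not part of DataFusion's `ColumnBounds`.

```rust
/// Hypothetical: a closed interval [min, max] over i64 join keys.
#[derive(Debug, Clone, PartialEq)]
struct KeyRange {
    min: i64,
    max: i64,
}

/// Hypothetical: at most `max_ranges` disjoint ranges, combined with OR.
#[derive(Debug, Clone, PartialEq)]
struct MultiRangeBounds {
    ranges: Vec<KeyRange>,
}

impl MultiRangeBounds {
    /// Build bounds from build-side key values, splitting at the largest gaps.
    fn from_values(mut values: Vec<i64>, max_ranges: usize) -> Self {
        // Always keep at least one range so a non-empty build side is covered.
        let max_ranges = max_ranges.max(1);
        values.sort_unstable();
        values.dedup();
        if values.is_empty() {
            return Self { ranges: Vec::new() };
        }

        // (gap size, index of the value on the left side of the gap)
        let mut gaps: Vec<(u64, usize)> = values
            .windows(2)
            .enumerate()
            .map(|(i, w)| (w[1].abs_diff(w[0]), i))
            .collect();

        // Largest gaps first; keep at most `max_ranges - 1` split points.
        gaps.sort_unstable_by(|a, b| b.0.cmp(&a.0));
        let mut splits: Vec<usize> =
            gaps.iter().take(max_ranges - 1).map(|g| g.1).collect();
        splits.sort_unstable();

        let mut ranges = Vec::with_capacity(splits.len() + 1);
        let mut start = 0;
        for split in splits {
            ranges.push(KeyRange { min: values[start], max: values[split] });
            start = split + 1;
        }
        ranges.push(KeyRange { min: values[start], max: values[values.len() - 1] });
        Self { ranges }
    }

    /// Equivalent to `(v BETWEEN min1 AND max1) OR (v BETWEEN min2 AND max2) ...`.
    fn may_contain(&self, v: i64) -> bool {
        self.ranges.iter().any(|r| r.min <= v && v <= r.max)
    }
}
```

With keys `[1, 2, 3, 100, 101, 1000]` and a cap of 3, this yields the ranges `[1, 3]`, `[100, 101]` and `[1000, 1000]`, so a probe value such as 50 is rejected where a single min/max of `[1, 1000]` would let it through. The trade-off is the extra sort on the build side and a predicate that grows with the number of ranges.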
   
   FWIW in my mind the min/max stats approach is mostly helpful when the 
filters on your build side mean that the range describes only a small part of 
the probe side. For example, with temporal data this is going to be super 
common. For cases where this doesn't hold we should probably just push down the 
entire hash table as a filter, either as an actual reference to the hash table, 
as an `IN <list>` expression, or as a bloom filter (possibly several of the 
above, or choosing based on the size of the build side; they each have 
pros/cons/properties). This is tracked in #17171.
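
As a rough illustration of "choosing based on the size of the build side", the selection could look something like the sketch below. The enum, the function, both thresholds, and the ordering of the choices are assumptions made purely for illustration; none of this is an existing DataFusion API or a decided design.

```rust
/// Hypothetical filter kinds, mirroring the three options listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PushdownFilter {
    /// Emit the build-side keys as an `IN <list>` expression.
    InList,
    /// Emit a bloom filter built from the build-side keys.
    BloomFilter,
    /// Share a reference to the join's hash table itself.
    HashTableRef,
}

/// Pick a filter kind from the build-side row count.
/// `in_list_max` and `bloom_max` are made-up tuning knobs, not real settings.
fn choose_filter(build_rows: usize, in_list_max: usize, bloom_max: usize) -> PushdownFilter {
    if build_rows <= in_list_max {
        // Small build side: an explicit list is exact and cheap to evaluate.
        PushdownFilter::InList
    } else if build_rows <= bloom_max {
        // Medium build side: compact, at the cost of false positives.
        PushdownFilter::BloomFilter
    } else {
        // Large build side: reuse the table instead of building a second structure.
        PushdownFilter::HashTableRef
    }
}
```

For example, `choose_filter(10, 20, 1_000_000)` picks `InList`, while a build side of a few thousand rows under the same thresholds picks `BloomFilter`.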


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

