Re: [I] Push down entire hash table from HashJoinExec into scans [datafusion]

via GitHub Sun, 07 Sep 2025 09:28:35 -0700


rkrishn7 commented on issue #17171:
URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3257197579


   Hello! This is quite interesting.
   
   > Should we always push down both the hash table and bounds?
   
   > Should we push down the entire hash table as is or should we build bloom 
filters, an IN LIST expression or other?
   
   My initial thought is that the number of distinct values (NDV) on the build 
side could serve as a useful signal for which filter to push down. For example:
   
   - If the join column has high cardinality, the values are likely fairly 
random, in which case pushing down an `IN LIST` may be more effective. I 
_think_ the existing pruning machinery already handles `IN LIST` quite well, 
e.g. by enabling Parquet row group pruning via Bloom filters.
   
   - On the other hand, if the column shows low variance (i.e. the values 
cluster in a narrow range), then min/max statistics may provide better pruning.
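
   A rough sketch of that heuristic, to make the idea concrete. Everything 
here is hypothetical: `BuildSideStats`, `PushdownFilter`, `choose_filter` and 
the `min_density` threshold are illustrative names, not existing DataFusion 
APIs. It also assumes integer join keys and reads "low variance" as "NDV is 
large relative to the width of the `[min, max]` range", which is only one 
possible interpretation.

```rust
/// Hypothetical summary of the build-side join keys (not a DataFusion type).
#[derive(Debug, Clone, Copy)]
pub struct BuildSideStats {
    /// Number of distinct values observed on the build side.
    pub ndv: u64,
    /// Smallest join key on the build side.
    pub min: i64,
    /// Largest join key on the build side.
    pub max: i64,
}

/// Which kind of dynamic filter to push down to the probe-side scan.
#[derive(Debug, PartialEq, Eq)]
pub enum PushdownFilter {
    /// `col >= min AND col <= max`, pruned via min/max statistics.
    Bounds,
    /// `col IN (v1, v2, ...)`, which may be pruned via Bloom filters.
    InList,
}

/// Pick a filter from how densely the distinct values fill `[min, max]`.
///
/// `min_density` is an assumed tuning knob in `(0, 1]`: at or above it the
/// bounds already describe the key set well; below it the keys are sparse
/// and an explicit list is likely to prune more.
pub fn choose_filter(stats: &BuildSideStats, min_density: f64) -> PushdownFilter {
    // Width of the closed interval, computed in i128 so that
    // `max - min + 1` cannot overflow for extreme i64 bounds.
    let width = (stats.max as i128 - stats.min as i128 + 1).max(1) as f64;
    let density = stats.ndv as f64 / width;

    if density >= min_density {
        PushdownFilter::Bounds
    } else {
        PushdownFilter::InList
    }
}
```

   With a threshold of `0.5`, 100 distinct keys spanning `1..=100` would pick 
`Bounds`, while 1,000 distinct keys spread over `0..=1_000_000_000` would pick 
`InList`. A real implementation would probably also cap the list size and fall 
back to something else beyond that cap.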
   
   Curious to hear your thoughts on this approach!

