rkrishn7 commented on issue #17171:
URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3264482482

   I've been experimenting with pushing down literal guarantees built from the 
values on the left-hand side. Some initial thoughts and observations:
   
   As @adriangb pointed out, the real question is when to push down the left 
column values, since there are likely diminishing returns to doing so. For 
example, we probably wouldn't want to push down a list of 1000 values, since:
   
   - The effectiveness of pruning groups of data decreases as the number of 
distinct values increases, I imagine.
   - There is likely a point where the cost of evaluating the bloom filter 
outweighs the cost of unnecessary data scanning.
   
   I think these two points are directly correlated: both are driven by the 
size of the literal guarantees we build.
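
   To make the trade-off concrete, here is a minimal sketch of a size-capped choice between the two filter shapes. It is not DataFusion code: `MAX_IN_LIST_VALUES`, `InListFilter`, `MinMaxFilter`, and `build_dynamic_filter` are hypothetical names, and the cap of 20 is an arbitrary placeholder, not a measured threshold.

```python
from dataclasses import dataclass
from typing import Iterable, Union

# Hypothetical cap on the number of distinct build-side values we are
# willing to push down as literals. The value 20 is only a placeholder.
MAX_IN_LIST_VALUES = 20


@dataclass(frozen=True)
class InListFilter:
    """Exact set of join-key values seen on the build side."""
    values: frozenset


@dataclass(frozen=True)
class MinMaxFilter:
    """Range bounds covering every build-side join-key value."""
    lo: int
    hi: int


def build_dynamic_filter(
    build_side_keys: Iterable[int],
    max_values: int = MAX_IN_LIST_VALUES,
) -> Union[InListFilter, MinMaxFilter]:
    """Prefer an IN list for small key sets, fall back to min/max bounds."""
    distinct = set(build_side_keys)
    if not distinct:
        raise ValueError("build side produced no join keys")
    if len(distinct) <= max_values:
        return InListFilter(frozenset(distinct))
    return MinMaxFilter(min(distinct), max(distinct))
```

   With this shape, a point lookup yields a one-element `InListFilter`, while a build side with 1000 distinct keys falls back to `MinMaxFilter`. Where the cap should actually sit is exactly the open question above.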
   
   However, for point-lookup queries the advantage over min/max stats pruning 
is clearly observable. Take the following query:
   
   ```sql
   SELECT *
   FROM customer
   JOIN orders ON c_custkey = o_custkey
   WHERE c_name = 'Customer#000077000';
   ```
   
   I observed roughly a 3x speedup on TPC-H SF20 (from ~3s down to ~1s) when 
building the dynamic filter as an `IN LIST` rather than as min/max bounds. Note 
that this required building the TPC-H data with bloom filters included.
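
   A toy model of why the `IN LIST` form can prune more than min/max bounds for a point lookup, assuming the probe-side key is not clustered so that most row groups span a wide key range. Everything here is illustrative: `RowGroup` is a hypothetical stand-in for a Parquet row group, a Python `set` plays the role of the bloom filter (so it has no false positives, unlike a real one), and the key values are made up.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set


@dataclass
class RowGroup:
    # Exact key set, standing in for a per-row-group bloom filter.
    keys: Set[int]

    @property
    def min_key(self) -> int:
        return min(self.keys)

    @property
    def max_key(self) -> int:
        return max(self.keys)


def survivors_min_max(groups: Sequence[RowGroup], lo: int, hi: int) -> List[RowGroup]:
    """Keep row groups whose [min, max] stats overlap the range [lo, hi]."""
    return [g for g in groups if g.max_key >= lo and g.min_key <= hi]


def survivors_in_list(groups: Sequence[RowGroup], values: Sequence[int]) -> List[RowGroup]:
    """Keep row groups that may contain at least one of the listed values."""
    return [g for g in groups if any(v in g.keys for v in values)]


# Unclustered keys: every row group spans nearly the whole key domain.
groups = [
    RowGroup({1, 500, 90000}),
    RowGroup({2, 77000, 95000}),
    RowGroup({10, 40000, 99999}),
]

# A point lookup leaves a single build-side key (illustrative value).
build_keys = [77000]

kept_by_range = survivors_min_max(groups, min(build_keys), max(build_keys))
kept_by_list = survivors_in_list(groups, build_keys)

print(len(kept_by_range), len(kept_by_list))  # prints "3 1"
```

   In this model the range check keeps all three row groups because the key falls inside each group's min/max span, while the membership check keeps only the one group that actually holds it.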


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

