rkrishn7 commented on issue #17171: URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3264482482
I've been playing around with pushing down literal guarantees from the values on the left-hand side. Some initial thoughts/observations:

As @adriangb pointed out, the question is really when to push down the left column values, since there are likely diminishing returns to doing so. For example, we probably wouldn't want to push down a list of 1000 values, since:

- The effectiveness of pruning groups of data decreases as the number of distinct values increases, I imagine.
- There is likely a point where the cost of evaluating the bloom filter outweighs the cost of unnecessary data scanning.

I think the above two are directly correlated. That is, they are both influenced by the size of the literal guarantees we build.

However, with point lookup queries, for example, the effectiveness over min/max stats pruning can clearly be observed. Take the following query for example:

```
SELECT *
FROM customer JOIN orders ON c_custkey = o_custkey
WHERE c_name = 'Customer#000077000';
```

I observed around a 3x speedup on TPCH SF20 (down from ~3s to ~1s) when building the dynamic filter as an `IN LIST` vs. min/max bounds. Note that this required me to build the TPCH data with bloom filters included.
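To make the trade-off concrete, here is a minimal, self-contained sketch in Rust. It is not DataFusion code: `DynamicFilter`, `RowGroup`, `build_filter`, `may_match`, and the `max_in_list` cutoff are all hypothetical names chosen for illustration, and an exact key set stands in for a real bloom filter (which would also report occasional false positives). The idea is only to show why an `IN LIST` can prune row groups that min/max bounds cannot, and where a size cutoff might fall back to bounds.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a dynamic filter derived from the build side of a join.
#[derive(Debug, PartialEq)]
enum DynamicFilter {
    /// `col IN (v1, v2, ...)`: exact membership, usable with bloom filters.
    InList(BTreeSet<i64>),
    /// `col >= min AND col <= max`: only usable with min/max statistics.
    Bounds { min: i64, max: i64 },
}

/// Build an `InList` filter when the number of distinct build-side keys is at
/// most `max_in_list` (an assumed cutoff), otherwise fall back to min/max bounds.
/// Returns `None` when the build side is empty.
fn build_filter(build_side_keys: &[i64], max_in_list: usize) -> Option<DynamicFilter> {
    let distinct: BTreeSet<i64> = build_side_keys.iter().copied().collect();
    let min = *distinct.iter().next()?;
    let max = *distinct.iter().next_back()?;
    if distinct.len() <= max_in_list {
        Some(DynamicFilter::InList(distinct))
    } else {
        Some(DynamicFilter::Bounds { min, max })
    }
}

/// Hypothetical probe-side row group: min/max statistics plus an exact key set
/// that plays the role of a bloom filter in this sketch.
struct RowGroup {
    min: i64,
    max: i64,
    keys: BTreeSet<i64>,
}

/// Returns `true` if the row group might contain matching rows and must be scanned.
fn may_match(filter: &DynamicFilter, rg: &RowGroup) -> bool {
    match filter {
        // Bounds can only be compared against the row group's min/max range.
        DynamicFilter::Bounds { min, max } => rg.max >= *min && rg.min <= *max,
        // Each literal is checked individually, so the cost grows with the list size.
        DynamicFilter::InList(values) => values
            .iter()
            .any(|v| *v >= rg.min && *v <= rg.max && rg.keys.contains(v)),
    }
}
```

With unsorted probe-side data, every row group's min/max range tends to cover a single looked-up key, so bounds prune nothing while the membership check keeps only the row groups that actually contain the key. The per-literal loop in `may_match` also shows why the check gets more expensive, and less selective, as the list grows.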
