Re: [PR] Dynamic filters blog post (rev 2) [datafusion-site]

via GitHub Wed, 10 Sep 2025 05:50:26 -0700


alamb commented on PR #103:
URL: https://github.com/apache/datafusion-site/pull/103#issuecomment-3274751989


   > However to get the filter value (it doesn't have to be super accurate, 
just close to reduce the reading scope) it is possible to scan `select min(ts) 
from t1` first, and this refers to a single column which might be cheap, and 
even cheaper if min/max can be derived from the footer, and then apply the 
value for TopK filter.
   
   I think one major idea is to *reuse state / information that is already 
present in the operators* -- so for example the TopK operator already has a 
topK heap, and the dynamic filter concept allows this information to be passed 
down to the scan. 
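
As a rough illustration of that idea (not DataFusion's actual implementation), here is a minimal Python sketch for a query shaped like `ORDER BY ts LIMIT k`. The names `DynamicFilter` and `top_k_smallest` are hypothetical and exist only for this example. The TopK side keeps its usual heap and, after each batch, publishes the current k-th smallest value as a threshold that the scan side uses to drop rows early.

```python
import heapq


class DynamicFilter:
    """Hypothetical shared threshold; a stand-in, not a DataFusion type."""

    def __init__(self):
        self.threshold = None  # None means "no bound yet, keep every row"

    def keep(self, value):
        return self.threshold is None or value < self.threshold


def top_k_smallest(batches, k):
    """Return the k smallest values and how many rows survived the scan."""
    filt = DynamicFilter()
    heap = []  # max-heap (values negated) holding the k smallest seen so far
    rows_passed = 0
    for batch in batches:
        # Scan side: apply whatever threshold TopK has published so far.
        survivors = [v for v in batch if filt.keep(v)]
        rows_passed += len(survivors)
        # TopK side: maintain the heap exactly as it would without the filter.
        for v in survivors:
            if len(heap) < k:
                heapq.heappush(heap, -v)
            elif v < -heap[0]:
                heapq.heapreplace(heap, -v)
        # Reuse the heap's state: once it is full, its largest entry bounds
        # every row that could still enter the result.
        if len(heap) == k:
            filt.threshold = -heap[0]
    return sorted(-v for v in heap), rows_passed


result, rows_passed = top_k_smallest([[5, 1, 9], [7, 2, 8], [0, 10, 3]], k=2)
```

No extra pass over the data is needed to find a bound: the threshold comes from state the TopK operator already maintains.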
   
   > How it makes sure we dont need to scan 100M rows as before, is it for any 
scenario, or when underlying files data are sorted?
   
   I don't think the dynamic filter guarantees that it will filter any rows 
-- for example, in the pathological case where the data is scanned in reverse 
order, it will not filter anything.
   
   However, the idea is that updating the dynamic filter is cheap and it does 
help in many real world settings, so it is overall a good optimization to do.
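
To make the "no guarantees" point concrete, here is a small Python sketch under simplified assumptions: a single integer column, fixed-size batches, and a query that wants the k smallest values. The function name `rows_reaching_topk` is made up for this example. It counts how many rows get past a threshold taken from the TopK heap when the same data arrives in a favorable order and in the reverse order.

```python
import heapq


def rows_reaching_topk(values, k, batch_size):
    """Count rows that pass the heap-derived threshold for 'k smallest'."""
    heap = []         # max-heap (values negated) of the k smallest seen so far
    threshold = None  # no bound until the heap is full
    passed = 0
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        survivors = [v for v in batch if threshold is None or v < threshold]
        passed += len(survivors)
        for v in survivors:
            if len(heap) < k:
                heapq.heappush(heap, -v)
            elif v < -heap[0]:
                heapq.heapreplace(heap, -v)
        if len(heap) == k:
            threshold = -heap[0]  # cheap update: just read the heap's top
    return passed


ascending = list(range(1000))
descending = ascending[::-1]

best_case = rows_reaching_topk(ascending, k=10, batch_size=100)
worst_case = rows_reaching_topk(descending, k=10, batch_size=100)
```

In this toy setup only the first batch (100 rows) gets through for the ascending input, while all 1000 rows get through for the descending input, because every new row there beats the current threshold. Real data usually falls somewhere in between.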


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


