Re: [I] [EPIC] Fix performance regressions when enabling parquet filter pushdown (late materialization) [datafusion]

via GitHub Thu, 19 Feb 2026 08:37:13 -0800


adriangb commented on issue #20324:
URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3928403492


   @Dandandan after much experimenting I've come to the conclusion that:
   - We do need to be able to tweak the strategy within a single file
   - The statistics should be shared across files (i.e. we make a decision 
globally but each file is able to change it's strategy within a scan). This 
involves assigning stable identifiers to each filter which can survive 
simplifiers / adapting to the physical file schema / replacing partition values.
   - The right metric for effectiveness is bytes filtered / seconds of compute
   - We should implement this by starting all filters as row filters (similar 
to current main) but "demoting" ineffective ones to be run as a conjunction 
after all of the others. E.g. we start with `row_filters = [id = 123, url ilike 
'%google%', url ilike '%.com'%]` and then convert to `row_filters = [id], 
scan_filters = [url ilike '%google%' and url ilike '`%.com%']`
   
   I think one easy way to get this would be to add an `disabled() -> bool` API 
to `ArrowPredicate`. Then the DataFusion implementation can dynamically disable 
ineffective row filters and move them to post-scan.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] [EPIC] Fix performance regressions when enabling parquet filter pushdown (late materialization) [datafusion]

Reply via email to