adriangb commented on issue #20324: URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3928403492
@Dandandan after much experimenting I've come to the conclusion that: - We do need to be able to tweak the strategy within a single file - The statistics should be shared across files (i.e. we make a decision globally but each file is able to change it's strategy within a scan). This involves assigning stable identifiers to each filter which can survive simplifiers / adapting to the physical file schema / replacing partition values. - The right metric for effectiveness is bytes filtered / seconds of compute - We should implement this by starting all filters as row filters (similar to current main) but "demoting" ineffective ones to be run as a conjunction after all of the others. E.g. we start with `row_filters = [id = 123, url ilike '%google%', url ilike '%.com'%]` and then convert to `row_filters = [id], scan_filters = [url ilike '%google%' and url ilike '`%.com%']` I think one easy way to get this would be to add an `disabled() -> bool` API to `ArrowPredicate`. Then the DataFusion implementation can dynamically disable ineffective row filters and move them to post-scan. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
