sdf-jkl commented on PR #19639: URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3716517667
> The ClickHouse resources seem to be more in line with parquet row group pruning using statistics, which happens before this process. What we are talking about here is more so how to process the filtering during the scan, which would be after the `PREWHERE` / row group stats. But their approach of using statistics to plan the order of applying filters _is_ relevant to this work.

@adriangb, upon further reading, `PREWHERE` does not perform row group pruning; it evaluates predicate expressions right after that, at the row filter stage. However, I don't think it decides which columns to push to scan and which to demote to post-scan, so it's not that relevant here.

Originally, I hadn't read your comments and PR carefully enough, so I was under the impression that you were trying to use row group statistics to improve filter ordering (which is what `PREWHERE` in ClickHouse now does). I now understand that your approach is more like dynamic programming, where we estimate selectivity at runtime per predicate per file, which is indeed very different from the ClickHouse implementation.
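
To make the "selectivity at runtime per predicate per file" idea concrete, here is a rough sketch of how I picture it. This is not the code in the PR: `SelectivityTracker`, its method names, and the 0.5 demotion threshold are all hypothetical, chosen only to illustrate the contrast with planning from row group statistics up front.

```python
from collections import defaultdict


class SelectivityTracker:
    """Hypothetical illustration: observed selectivity per (file, predicate)."""

    def __init__(self, demote_threshold=0.5):
        # Assumed policy: a predicate that keeps more than this fraction of
        # rows is not worth evaluating during the scan and is demoted.
        self.demote_threshold = demote_threshold
        self._rows_seen = defaultdict(int)
        self._rows_kept = defaultdict(int)

    def record(self, file, predicate, rows_in, rows_out):
        """Accumulate what one batch told us about a predicate on a file."""
        key = (file, predicate)
        self._rows_seen[key] += rows_in
        self._rows_kept[key] += rows_out

    def selectivity(self, file, predicate):
        """Fraction of rows kept so far, or None if nothing was observed."""
        key = (file, predicate)
        seen = self._rows_seen.get(key, 0)
        if seen == 0:
            return None
        return self._rows_kept[key] / seen

    def plan(self, file, predicates):
        """Split predicates into (pushed_to_scan, post_scan) for one file.

        Pushed predicates are ordered most selective first. Predicates with
        no observations yet get the threshold as a neutral estimate, so they
        stay pushed until real measurements say otherwise.
        """
        pushed, post_scan = [], []
        for predicate in predicates:
            observed = self.selectivity(file, predicate)
            estimate = self.demote_threshold if observed is None else observed
            if estimate > self.demote_threshold:
                post_scan.append(predicate)
            else:
                pushed.append((estimate, predicate))
        # sort() is stable, so ties keep their original relative order.
        pushed.sort(key=lambda pair: pair[0])
        return [predicate for _, predicate in pushed], post_scan


tracker = SelectivityTracker()
tracker.record("a.parquet", "x > 10", rows_in=1000, rows_out=10)
tracker.record("a.parquet", "y = 'foo'", rows_in=1000, rows_out=900)
tracker.record("a.parquet", "z < 5", rows_in=1000, rows_out=300)

all_predicates = ["y = 'foo'", "z < 5", "x > 10", "w IS NULL"]
pushed, post_scan = tracker.plan("a.parquet", all_predicates)
```

In this sketch the decision is driven only by what earlier batches of the same file actually returned, so a different file with no observations yet keeps every predicate pushed down, in its original order.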
