Re: [PR] feat: adaptive filter selectivity tracking for Parquet row filters [datafusion]

via GitHub Tue, 06 Jan 2026 13:59:41 -0800


sdf-jkl commented on PR #19639:
URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3716517667


   > The ClickHouse resources seem to be more in line with parquet row group 
pruning using statistics, which happens before this process. What we are 
talking about here is more so how to process the filtering during the scan, 
which would be after the `PREWHERE` / row group stats. But their approach of 
using statistics to plan the order of applying filters _is_ relevant to this 
work.
   
   @adriangb, upon further reading, PREWHERE does not perform row group 
pruning; it evaluates predicate expressions immediately after pruning, at the 
row filter stage. 
   
   However, I don't think it decides which columns to push to scan and which to 
demote to post-scan, so it's not that relevant here.
   
   Originally, I hadn't read your comments and PR carefully enough, so I was 
under the impression that you were trying to use row group statistics to 
improve filter ordering (which is what PREWHERE in ClickHouse now does). 
   
   I now understand that your approach is more like dynamic programming, where 
we estimate selectivity at runtime per predicate per file, which is indeed very 
different from the ClickHouse implementation.
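   
   To make sure I'm describing the same idea, here is a rough sketch of what I 
mean by per-predicate, per-file runtime selectivity. This is not the PR's code: 
`FilterStats`, `SelectivityTracker`, and the `max_pass_fraction` threshold are 
hypothetical names I made up for illustration, and keeping unmeasured 
predicates as row filters is my own assumption.

```rust
use std::collections::HashMap;

/// Running counts for one predicate within one file.
#[derive(Debug, Default, Clone, Copy)]
struct FilterStats {
    rows_seen: u64,
    rows_passed: u64,
}

impl FilterStats {
    /// Fraction of rows that passed; `None` until at least one row was seen.
    fn selectivity(&self) -> Option<f64> {
        if self.rows_seen == 0 {
            None
        } else {
            Some(self.rows_passed as f64 / self.rows_seen as f64)
        }
    }
}

/// Hypothetical tracker keyed by (file, predicate). Illustrative only.
#[derive(Debug, Default)]
struct SelectivityTracker {
    stats: HashMap<(String, String), FilterStats>,
}

impl SelectivityTracker {
    /// Accumulate the outcome of evaluating `predicate` on one batch of `file`.
    fn record(&mut self, file: &str, predicate: &str, rows_seen: u64, rows_passed: u64) {
        let entry = self
            .stats
            .entry((file.to_string(), predicate.to_string()))
            .or_default();
        entry.rows_seen += rows_seen;
        entry.rows_passed += rows_passed;
    }

    fn selectivity(&self, file: &str, predicate: &str) -> Option<f64> {
        self.stats
            .get(&(file.to_string(), predicate.to_string()))
            .and_then(|s| s.selectivity())
    }

    /// Split predicates into (row filters, post-scan filters).
    /// A predicate that passes more than `max_pass_fraction` of rows is
    /// demoted to post-scan. Row filters are ordered most selective first;
    /// unmeasured predicates stay as row filters and sort last (assumption).
    fn partition<'a>(
        &self,
        file: &str,
        predicates: &[&'a str],
        max_pass_fraction: f64,
    ) -> (Vec<&'a str>, Vec<&'a str>) {
        let mut pushed: Vec<(&'a str, f64)> = Vec::new();
        let mut demoted: Vec<&'a str> = Vec::new();
        for &p in predicates {
            match self.selectivity(file, p) {
                Some(s) if s > max_pass_fraction => demoted.push(p),
                Some(s) => pushed.push((p, s)),
                None => pushed.push((p, 1.0)),
            }
        }
        // Stable sort by ascending pass fraction.
        pushed.sort_by(|a, b| a.1.total_cmp(&b.1));
        (pushed.into_iter().map(|(p, _)| p).collect(), demoted)
    }
}
```

   With something like this, a predicate that passes 95% of the rows in a file 
would be demoted to post-scan for that file, while one that passes 2% would be 
evaluated first as a row filter. ClickHouse's PREWHERE, by contrast, orders 
filters from statistics rather than from what was observed during the scan.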


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


