Re: [PR] Optimize parquet row filter auto strategy with adaptive fallback [arrow-rs]

via GitHub Mon, 22 Jun 2026 07:56:40 -0700


hhhizzz commented on PR #9956:
URL: https://github.com/apache/arrow-rs/pull/9956#issuecomment-4769646397


   I think the best model is **multi-level adaptivity**.
   
   DataFusion has more **high-level context**, such as query semantics, file / 
row-group statistics, projected columns across the whole scan, and cross-file 
predicate selectivity. That makes it a good place to decide whether a 
`RowFilter` should be used at all, or to override the default behavior when it 
has stronger information.
   
   However, the part this PR is trying to handle is **lower-level**. The actual 
shape of the final `RowSelection` is only known after the Parquet reader has 
decoded the predicate columns for a specific row group and evaluated the 
`RowFilter`. Whether that result is clustered, fragmented, mostly selected, or 
sparse cannot be known to DataFusion at logical or physical planning time 
without duplicating Parquet reader internals.
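
To make "selection shape" concrete, here is a rough, std-only sketch of how a reader could classify a selection once the predicate has been evaluated. The `Run`, `SelectionShape`, and `classify` names and all thresholds are hypothetical, chosen for illustration; they are not the arrow-rs API or the values used in this PR.

```rust
/// Hypothetical stand-in for one run of consecutive rows that are
/// either all selected or all skipped.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    selected: bool,
    rows: usize,
}

/// Hypothetical coarse description of a row-group selection.
#[derive(Debug, PartialEq, Eq)]
enum SelectionShape {
    MostlySelected,
    Sparse,
    Fragmented,
    Clustered,
}

/// Classify a selection from its runs. All thresholds are
/// illustrative assumptions, not tuned values.
fn classify(runs: &[Run]) -> SelectionShape {
    let total: usize = runs.iter().map(|r| r.rows).sum();
    if total == 0 {
        // Nothing to read at all; treat as sparse.
        return SelectionShape::Sparse;
    }
    let selected: usize = runs
        .iter()
        .filter(|r| r.selected)
        .map(|r| r.rows)
        .sum();

    let selectivity = selected as f64 / total as f64;
    // Short average runs mean the reader alternates between
    // skipping and decoding very often.
    let avg_run_len = total as f64 / runs.len() as f64;

    if selectivity >= 0.9 {
        SelectionShape::MostlySelected
    } else if selectivity <= 0.01 {
        SelectionShape::Sparse
    } else if avg_run_len < 32.0 {
        SelectionShape::Fragmented
    } else {
        SelectionShape::Clustered
    }
}
```

The inputs here (the runs) only exist after the predicate columns of a given row group have been decoded, which is why this kind of decision is hard to make at planning time.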
   
   So I see the responsibilities as:
   
   - **DataFusion** decides the higher-level scan / filtering strategy.
   - **arrow-rs** decides how to execute a row filter once it sees the actual 
row-group-level selection shape and loaded page ranges.
   
   The explicit policy APIs are important because they still allow DataFusion 
to override the automatic choice. But I think the default `Auto` behavior 
belongs in arrow-rs, because the information it uses is produced inside the 
Parquet reader during execution.
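
A minimal sketch of that split, assuming a hypothetical `Policy` / `Strategy` pair (these names, the `Observed` struct, and the thresholds are illustrative only and are not taken from this PR): an explicit policy from the caller always wins, and `Auto` falls back to statistics that only the reader has observed.

```rust
/// Hypothetical execution strategies for the remaining columns.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    /// Skip unselected rows while decoding.
    SkipRows,
    /// Decode everything, then apply the filter mask.
    DecodeThenFilter,
}

/// Hypothetical policy: the caller (e.g. DataFusion) can force a
/// strategy, or leave the decision to the reader.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Policy {
    Auto,
    Force(Strategy),
}

/// Statistics that only exist after the predicate was evaluated
/// for a specific row group.
struct Observed {
    selected_rows: usize,
    total_rows: usize,
    runs: usize,
}

fn choose(policy: Policy, obs: &Observed) -> Strategy {
    match policy {
        // Higher-level context overrides the automatic choice.
        Policy::Force(strategy) => strategy,
        Policy::Auto => {
            if obs.total_rows == 0 || obs.runs == 0 {
                return Strategy::SkipRows;
            }
            let selectivity = obs.selected_rows as f64 / obs.total_rows as f64;
            let avg_run_len = obs.total_rows as f64 / obs.runs as f64;
            // Illustrative thresholds: dense or highly fragmented
            // selections make skipping less attractive.
            if selectivity > 0.8 || avg_run_len < 32.0 {
                Strategy::DecodeThenFilter
            } else {
                Strategy::SkipRows
            }
        }
    }
}
```

With this shape, the engine keeps the last word through `Policy::Force(..)`, while the default path uses only information produced during execution.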


