[PR] Pipeline Parquet row group reads [datafusion]

via GitHub Sun, 10 May 2026 11:20:46 -0700


Dandandan opened a new pull request, #22099:
URL: https://github.com/apache/datafusion/pull/22099


   ## Which issue does this PR close?
   
   N/A
   
   ## Rationale for this change
   
   Parquet scans currently fetch and decode row groups serially after planning. 
Arrow-rs exposes `ParquetRecordBatchStream::next_row_group`, which can fetch 
the next row group while the current row group is being decoded.
   
   ## What changes are included in this PR?
   
   This changes the Parquet opener to build a file-level morsel that uses 
`next_row_group()` with a single-row-group lookahead. The stream starts 
fetching row group N+1 while row group N is being decoded. It preserves the 
existing projection, row filtering, row selection, limit, reverse row-group 
ordering, dynamic early-stop pruning, and metrics behavior.
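
The single-row-group lookahead described above can be illustrated with a 
minimal, std-only sketch. This is not the code in this PR: it uses a thread and 
a rendezvous channel instead of the async `next_row_group()` API, and 
`EncodedRowGroup`, `fetch_row_group`, `decode_row_group`, and `pipelined_scan` 
are hypothetical names invented for illustration.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for the fetched bytes of one row group
/// (not a DataFusion or arrow-rs type).
struct EncodedRowGroup {
    index: usize,
    bytes: Vec<u8>,
}

/// Hypothetical stand-in for I/O: pretends to fetch row group `index`,
/// which holds `rows` one-byte rows.
fn fetch_row_group(index: usize, rows: usize) -> EncodedRowGroup {
    EncodedRowGroup { index, bytes: vec![0u8; rows] }
}

/// Hypothetical stand-in for decoding: returns the number of rows decoded.
fn decode_row_group(group: &EncodedRowGroup) -> usize {
    group.bytes.len()
}

/// Scans every row group in order, fetching row group N+1 on a background
/// thread while row group N is decoded on the calling thread.
/// Returns `(row_group_index, rows_decoded)` pairs in scan order.
fn pipelined_scan(rows_per_group: Vec<usize>) -> Vec<(usize, usize)> {
    // Capacity 0 makes this a rendezvous channel: the fetcher can hold at
    // most one fetched-but-unconsumed row group, i.e. a lookahead of one.
    let (tx, rx) = sync_channel::<EncodedRowGroup>(0);

    let fetcher = thread::spawn(move || {
        for (index, rows) in rows_per_group.into_iter().enumerate() {
            let group = fetch_row_group(index, rows);
            // `send` blocks until the decoder asks for the next row group.
            // An error means the decoder stopped early, so stop fetching.
            if tx.send(group).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the receiver's iteration.
    });

    let mut decoded = Vec::new();
    for group in rx {
        // While this row group is decoded, the fetcher is already working
        // on the next one.
        decoded.push((group.index, decode_row_group(&group)));
    }

    fetcher.join().expect("fetcher thread panicked");
    decoded
}
```

Calling `pipelined_scan(vec![3, 5, 2])` scans three row groups in order. The 
rendezvous channel is what bounds the lookahead in this sketch: a larger 
channel capacity would let the fetcher run further ahead at the cost of holding 
more undecoded data in memory.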
   
   The PR also adds a regression test that verifies multiple row groups are 
read from a single file-level morsel.
   
   ## Are these changes tested?
   
   - `cargo fmt --all`
   - `cargo check -p datafusion-datasource-parquet`
   - `cargo test -p datafusion-datasource-parquet`
   - `cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings`
   
   Also ran `cargo clippy --all-targets --all-features -- -D warnings`; it 
currently fails outside this PR in 
`datafusion/physical-plan/benches/aggregate_vectorized.rs`, where the bench 
still passes `Vec<bool>` to `vectorized_equal_to` after the API changed to 
`BooleanBufferBuilder` in fa03a4c8718 / #21886.
   
   ## Are there any user-facing changes?
   
   No API or documented behavior changes. This is an execution-path performance 
improvement for Parquet scans.
   



