Dandandan opened a new pull request, #22099: URL: https://github.com/apache/datafusion/pull/22099
## Which issue does this PR close?

N/A

## Rationale for this change

Parquet scans currently fetch and decode row groups serially after planning. Arrow-rs exposes `ParquetRecordBatchStream::next_row_group`, which can fetch the next row group while the current row group is being decoded.

## What changes are included in this PR?

This changes the Parquet opener to build a file-level morsel that uses `next_row_group()` with a single-row-group lookahead. The stream starts fetching row group N+1 while decoding row group N, while preserving the existing projection, row filtering, row selection, limit, reverse row-group ordering, dynamic early-stop pruning, and metrics behavior.

The PR also adds a regression test that verifies multiple row groups are read from a single file-level morsel.

## Are these changes tested?

- `cargo fmt --all`
- `cargo check -p datafusion-datasource-parquet`
- `cargo test -p datafusion-datasource-parquet`
- `cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings`

Also ran `cargo clippy --all-targets --all-features -- -D warnings`; it currently fails outside this PR in `datafusion/physical-plan/benches/aggregate_vectorized.rs`, where the bench still passes `Vec<bool>` to `vectorized_equal_to` after the API changed to `BooleanBufferBuilder` in fa03a4c8718 / #21886.

## Are there any user-facing changes?

No API or documented behavior changes. This is an execution-path performance improvement for Parquet scans.
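For readers unfamiliar with the pattern, the single-row-group lookahead described above can be sketched with the standard library alone. This is not the PR's code and does not use the arrow-rs or DataFusion APIs: `fetch_row_group`, `decode_row_group`, and `scan_with_lookahead` are hypothetical stand-ins, and a plain thread takes the place of the async fetch that `next_row_group()` would drive. The only point illustrated is the ordering: the fetch for row group N+1 is started before row group N is decoded, and at most one fetch is in flight ahead of the decoder.

```rust
use std::thread;

/// Hypothetical stand-in for the I/O-bound fetch of one row group.
/// Returns fake "raw" values derived from the row-group index.
fn fetch_row_group(index: usize) -> Vec<u64> {
    (0..4).map(|i| (index * 10 + i) as u64).collect()
}

/// Hypothetical stand-in for the CPU-bound decode of a fetched row group.
/// Here "decoding" just sums the raw values.
fn decode_row_group(raw: Vec<u64>) -> u64 {
    raw.iter().sum()
}

/// Processes row groups in order with a lookahead of exactly one:
/// the fetch of row group N+1 starts before row group N is decoded.
fn scan_with_lookahead(num_row_groups: usize) -> Vec<u64> {
    let mut decoded = Vec::with_capacity(num_row_groups);
    if num_row_groups == 0 {
        return decoded;
    }

    // Start fetching the first row group; nothing is decoded yet.
    let mut pending = Some(thread::spawn(|| fetch_row_group(0)));
    let mut next = 1;

    while let Some(handle) = pending.take() {
        // Wait for the row group that is currently due.
        let raw = handle.join().expect("fetch thread panicked");

        // Kick off the next fetch before decoding, so the two overlap.
        if next < num_row_groups {
            let index = next;
            pending = Some(thread::spawn(move || fetch_row_group(index)));
            next += 1;
        }

        // Decode row group N while row group N+1 is being fetched.
        decoded.push(decode_row_group(raw));
    }

    decoded
}
```

Calling `scan_with_lookahead(3)` in this sketch yields the per-row-group sums in their original order, since each fetch is joined before the next decode begins. Limiting the lookahead to one pending fetch bounds the extra memory to a single undecoded row group.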
