This is an automated email from the ASF dual-hosted git repository.
dheres pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new 44f5dfc607 perf: Coalesce page fetches when RowSelection selects all rows (#9578)
44f5dfc607 is described below
commit 44f5dfc607892bab849a4dba008b6ee8966c1461
Author: Daniël Heres <[email protected]>
AuthorDate: Thu Mar 19 19:49:12 2026 +0100
perf: Coalesce page fetches when RowSelection selects all rows (#9578)
## Summary
- When a `RowSelection` selects every row in a row group, `fetch_ranges`
now treats it as no selection, producing a single whole-column-chunk I/O
request instead of N individual page requests
- This reduces the number of I/O requests for subsequent filter
predicates when an earlier predicate passes all rows
## Details
In `InMemoryRowGroup::fetch_ranges`, when both a `RowSelection` and an
`OffsetIndex` are present, the code enters a page-level fetch path that
uses `scan_ranges()` to produce individual page ranges. Even when the
selection covers all rows, this produces N separate ranges (one per
page).
The fix: before entering the page-level path, check if the selection's
`row_count()` equals the row group's total row count. If so, drop the
selection and take the simpler whole-column-chunk path.
This commonly happens when a multi-predicate `RowFilter` has an early
predicate that passes all rows in a row group (e.g., `CounterID = 62` on
a row group where all rows have `CounterID = 62`).
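The check described above can be sketched in isolation. This is a minimal illustration, not the arrow-rs implementation: `Run`, `Page`, `selected_rows`, and this standalone `fetch_ranges` are hypothetical stand-ins for `RowSelection`, the `OffsetIndex` page locations, and the real fetch path, and the page-overlap test is simplified.

```rust
use std::ops::Range;

/// Hypothetical stand-in for one run of a row selection:
/// either select or skip `row_count` consecutive rows.
#[derive(Debug, Clone, Copy)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Hypothetical page descriptor: byte range in the file and number of rows.
struct Page {
    bytes: Range<u64>,
    rows: usize,
}

/// Total number of rows the selection keeps.
fn selected_rows(runs: &[Run]) -> usize {
    runs.iter().filter(|r| !r.skip).map(|r| r.row_count).sum()
}

/// Returns the byte ranges to fetch for one column chunk.
fn fetch_ranges(pages: &[Page], selection: Option<&[Run]>) -> Vec<Range<u64>> {
    let (first, last) = match (pages.first(), pages.last()) {
        (Some(f), Some(l)) => (f, l),
        _ => return Vec::new(),
    };
    let total_rows: usize = pages.iter().map(|p| p.rows).sum();

    // A selection that keeps every row is treated as no selection at all.
    let selection = selection.filter(|runs| selected_rows(runs) != total_rows);

    // Whole-column-chunk path: one range covering every page.
    let Some(runs) = selection else {
        return vec![first.bytes.start..last.bytes.end];
    };

    // Page-level path: turn the runs into selected row intervals...
    let mut selected: Vec<Range<usize>> = Vec::new();
    let mut row = 0;
    for run in runs {
        if !run.skip && run.row_count > 0 {
            selected.push(row..row + run.row_count);
        }
        row += run.row_count;
    }

    // ...and emit one range per page that overlaps a selected interval.
    let mut out = Vec::new();
    let mut page_start = 0;
    for page in pages {
        let page_end = page_start + page.rows;
        if selected.iter().any(|s| s.start < page_end && page_start < s.end) {
            out.push(page.bytes.clone());
        }
        page_start = page_end;
    }
    out
}
```

With three pages, a selection that keeps all rows yields a single range spanning the chunk, while a selection that skips the middle page yields two separate ranges.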
## Test plan
- [x] Existing tests pass (snapshot updated to reflect fewer I/O
requests)
- [x] `test_read_multiple_row_filter` verifies the coalesced fetch
pattern
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
---
parquet/src/arrow/arrow_reader/read_plan.rs | 7 +++++++
parquet/tests/arrow_reader/io/async_reader.rs | 4 +---
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/parquet/src/arrow/arrow_reader/read_plan.rs b/parquet/src/arrow/arrow_reader/read_plan.rs
index 7c9eb36bef..99ffe0febc 100644
--- a/parquet/src/arrow/arrow_reader/read_plan.rs
+++ b/parquet/src/arrow/arrow_reader/read_plan.rs
@@ -167,6 +167,13 @@ impl ReadPlanBuilder {
};
}
+ // If the predicate selected all rows and there is no prior selection,
+ // skip creating a RowSelection entirely — this avoids the allocation
+ // and keeps selection as None which enables coalesced page fetches.
+ let all_selected = filters.iter().all(|f| f.true_count() == f.len());
+ if all_selected && self.selection.is_none() {
+ return Ok(self);
+ }
let raw = RowSelection::from_filters(&filters);
self.selection = match self.selection.take() {
Some(selection) => Some(selection.and_then(&raw)),
diff --git a/parquet/tests/arrow_reader/io/async_reader.rs b/parquet/tests/arrow_reader/io/async_reader.rs
index 8022335da0..db06dda8ee 100644
--- a/parquet/tests/arrow_reader/io/async_reader.rs
+++ b/parquet/tests/arrow_reader/io/async_reader.rs
@@ -275,9 +275,7 @@ async fn test_read_multiple_row_filter() {
"Read Multi:",
" Row Group 1, column 'a': MultiPage(dictionary_page: true, data_pages: [0, 1]) (1856 bytes, 1 requests) [data]",
"Read Multi:",
- " Row Group 1, column 'b': DictionaryPage (1617 bytes, 1 requests) [data]",
- " Row Group 1, column 'b': DataPage(0) (113 bytes , 1 requests) [data]",
- " Row Group 1, column 'b': DataPage(1) (126 bytes , 1 requests) [data]",
+ " Row Group 1, column 'b': MultiPage(dictionary_page: true, data_pages: [0, 1]) (1856 bytes, 1 requests) [data]",
"Read Multi:",
" Row Group 1, column 'c': DictionaryPage (7217 bytes, 1 requests) [data]",
" Row Group 1, column 'c': DataPage(0) (113 bytes , 1 requests) [data]",