This is an automated email from the ASF dual-hosted git repository.

dheres pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git


The following commit(s) were added to refs/heads/main by this push:
     new 44f5dfc607 perf: Coalesce page fetches when RowSelection selects all rows (#9578)
44f5dfc607 is described below

commit 44f5dfc607892bab849a4dba008b6ee8966c1461
Author: Daniël Heres <[email protected]>
AuthorDate: Thu Mar 19 19:49:12 2026 +0100

    perf: Coalesce page fetches when RowSelection selects all rows (#9578)
    
    ## Summary
    
    - When a `RowSelection` selects every row in a row group, `fetch_ranges`
    now treats it as no selection, producing a single whole-column-chunk I/O
    request instead of N individual page requests
    - This reduces the number of I/O requests for subsequent filter
    predicates when an earlier predicate passes all rows
    
    ## Details
    
    In `InMemoryRowGroup::fetch_ranges`, when both a `RowSelection` and an
    `OffsetIndex` are present, the code enters a page-level fetch path that
    uses `scan_ranges()` to produce individual page ranges. Even when the
    selection covers all rows, this produces N separate ranges (one per
    page).
    
    The fix: before entering the page-level path, check if the selection's
    `row_count()` equals the row group's total row count. If so, drop the
    selection and take the simpler whole-column-chunk path.
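    
    The check described above can be sketched as follows. This is a
    minimal, self-contained illustration, not the crate's code: `Selector`,
    `Page`, `selected_rows`, and this `fetch_ranges` signature are
    hypothetical stand-ins, and the dictionary page is left out for brevity.
    
    ```rust
    use std::ops::Range;

    /// Hypothetical stand-in for one run of a row selection: `row_count`
    /// consecutive rows that are either skipped or selected.
    #[derive(Debug, Clone, Copy)]
    struct Selector {
        row_count: usize,
        skip: bool,
    }

    /// Hypothetical page descriptor: byte range plus the rows the page holds.
    #[derive(Debug, Clone)]
    struct Page {
        bytes: Range<u64>,
        first_row: usize,
        num_rows: usize,
    }

    /// Number of rows the selection keeps.
    fn selected_rows(selection: &[Selector]) -> usize {
        selection.iter().filter(|s| !s.skip).map(|s| s.row_count).sum()
    }

    /// Byte ranges to request for one column chunk (illustrative only).
    fn fetch_ranges(
        chunk: Range<u64>,
        pages: &[Page],
        selection: Option<&[Selector]>,
        total_rows: usize,
    ) -> Vec<Range<u64>> {
        // A selection that keeps every row is treated as no selection at all.
        let selection = selection.filter(|s| selected_rows(*s) != total_rows);
        let Some(selection) = selection else {
            // Whole-column-chunk path: a single request.
            return vec![chunk];
        };

        // Page-level path: collect the selected row runs...
        let mut runs: Vec<Range<usize>> = Vec::new();
        let mut row = 0;
        for s in selection {
            if !s.skip && s.row_count > 0 {
                runs.push(row..row + s.row_count);
            }
            row += s.row_count;
        }

        // ...and request only the pages that overlap at least one run.
        pages
            .iter()
            .filter(|p| {
                let rows = p.first_row..p.first_row + p.num_rows;
                runs.iter().any(|r| r.start < rows.end && rows.start < r.end)
            })
            .map(|p| p.bytes.clone())
            .collect()
    }
    ```
    
    In this sketch, a selection that covers all rows yields one range for
    the whole chunk, while a partial selection yields one range per
    overlapping page.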
    
    This commonly happens when a multi-predicate `RowFilter` has an early
    predicate that passes all rows in a row group (e.g., `CounterID = 62` on
    a row group where all rows have `CounterID = 62`).
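    
    One way to picture the multi-predicate case is shown below. It is a
    sketch under stated assumptions: plain `Vec<bool>` masks stand in for
    the predicate's boolean output, `Selection` and `apply_predicate` are
    hypothetical names, and `None` means "no selection, read everything".
    
    ```rust
    /// Hypothetical stand-in for a row selection: one flag per row.
    type Selection = Vec<bool>;

    /// Fold one predicate's output (one mask per evaluated batch) into the
    /// selection carried over from earlier predicates.
    fn apply_predicate(prior: Option<Selection>, batches: &[Vec<bool>]) -> Option<Selection> {
        // If every row passed and nothing was filtered before, keep `None`
        // so later fetches can take the whole-column-chunk path.
        let all_selected = batches.iter().all(|b| b.iter().all(|&keep| keep));
        if all_selected && prior.is_none() {
            return None;
        }

        let raw: Selection = batches.iter().flatten().copied().collect();
        match prior {
            None => Some(raw),
            Some(prev) => {
                // `raw` only covers rows that `prev` kept, so consume one
                // flag from `raw` for each previously selected row.
                let mut it = raw.into_iter();
                Some(
                    prev.into_iter()
                        .map(|kept| kept && it.next().unwrap_or(false))
                        .collect(),
                )
            }
        }
    }
    ```
    
    With this shape, a first predicate that passes every row leaves the
    selection as `None`, so the next predicate's columns are not fetched
    page by page.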
    
    ## Test plan
    
    - [x] Existing tests pass (snapshot updated to reflect fewer I/O
    requests)
    - [x] `test_read_multiple_row_filter` verifies the coalesced fetch
    pattern
    
    🤖 Generated with [Claude Code](https://claude.com/claude-code)
    
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
---
 parquet/src/arrow/arrow_reader/read_plan.rs   | 7 +++++++
 parquet/tests/arrow_reader/io/async_reader.rs | 4 +---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/parquet/src/arrow/arrow_reader/read_plan.rs b/parquet/src/arrow/arrow_reader/read_plan.rs
index 7c9eb36bef..99ffe0febc 100644
--- a/parquet/src/arrow/arrow_reader/read_plan.rs
+++ b/parquet/src/arrow/arrow_reader/read_plan.rs
@@ -167,6 +167,13 @@ impl ReadPlanBuilder {
             };
         }
 
+        // If the predicate selected all rows and there is no prior selection,
+        // skip creating a RowSelection entirely — this avoids the allocation
+        // and keeps selection as None which enables coalesced page fetches.
+        let all_selected = filters.iter().all(|f| f.true_count() == f.len());
+        if all_selected && self.selection.is_none() {
+            return Ok(self);
+        }
         let raw = RowSelection::from_filters(&filters);
         self.selection = match self.selection.take() {
             Some(selection) => Some(selection.and_then(&raw)),
diff --git a/parquet/tests/arrow_reader/io/async_reader.rs b/parquet/tests/arrow_reader/io/async_reader.rs
index 8022335da0..db06dda8ee 100644
--- a/parquet/tests/arrow_reader/io/async_reader.rs
+++ b/parquet/tests/arrow_reader/io/async_reader.rs
@@ -275,9 +275,7 @@ async fn test_read_multiple_row_filter() {
             "Read Multi:",
             "  Row Group 1, column 'a': MultiPage(dictionary_page: true, data_pages: [0, 1])  (1856 bytes, 1 requests) [data]",
             "Read Multi:",
-            "  Row Group 1, column 'b': DictionaryPage   (1617 bytes, 1 requests) [data]",
-            "  Row Group 1, column 'b': DataPage(0)      (113 bytes , 1 requests) [data]",
-            "  Row Group 1, column 'b': DataPage(1)      (126 bytes , 1 requests) [data]",
+            "  Row Group 1, column 'b': MultiPage(dictionary_page: true, data_pages: [0, 1])  (1856 bytes, 1 requests) [data]",
             "Read Multi:",
             "  Row Group 1, column 'c': DictionaryPage   (7217 bytes, 1 requests) [data]",
             "  Row Group 1, column 'c': DataPage(0)      (113 bytes , 1 requests) [data]",
