alamb commented on code in PR #9118:
URL: https://github.com/apache/arrow-rs/pull/9118#discussion_r2678607392
##########
parquet/src/arrow/push_decoder/reader_builder/mod.rs:
##########
@@ -654,57 +660,6 @@ impl RowGroupReaderBuilder {
}
}
-/// Override the selection strategy if needed.
-///
-/// Some pages can be skipped during row-group construction if they are not read
-/// by the selections. This means that the data pages for those rows are never
-/// loaded and definition/repetition levels are never read. When using
-/// `RowSelections` selection works because `skip_records()` handles this
-/// case and skips the page accordingly.
-///
-/// However, with the current mask design, all values must be read and decoded
-/// and then a mask filter is applied. Thus if any pages are skipped during
-/// row-group construction, the data pages are missing and cannot be decoded.
-///
-/// A simple example:
-/// * the page size is 2, the mask is 100001, row selection should be read(1) skip(4) read(1)
-/// * the `ColumnChunkData` would be page1(10), page2(skipped), page3(01)
-///
-/// Using the row selection to skip(4), page2 won't be read at all, so in this
-/// case we can't decode all the rows and apply a mask. To correctly apply the
-/// bit mask, we need all 6 values be read, but page2 is not in memory.
-fn override_selector_strategy_if_needed(
Review Comment:
nice -- the idea is to avoid this function 👍
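For context, the removed comment's example maps directly onto the existing `RowSelection` API (a minimal illustration, not code from this PR):

```rust
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

fn main() {
    // The example from the removed doc comment: 6 rows stored in pages of
    // 2, with only the first and last row selected (mask 100001).
    let selection = RowSelection::from(vec![
        RowSelector::select(1), // row 0 (page 1)
        RowSelector::skip(4),   // rows 1..=4 (covers all of page 2)
        RowSelector::select(1), // row 5 (page 3)
    ]);

    // With selectors, `skip_records(4)` lets the decoder step over page 2
    // without loading it; a boolean mask instead requires decoding all six
    // values before filtering, so page 2 must be present in memory.
    assert_eq!(selection.row_count(), 2);
}
```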
##########
parquet/src/arrow/arrow_reader/selection.rs:
##########
@@ -800,14 +806,26 @@ impl MaskCursor {
let mut chunk_rows = 0;
let mut selected_rows = 0;
- // Advance until enough rows have been selected to satisfy the batch size,
- // or until the mask is exhausted. This mirrors the behaviour of the legacy
- // `RowSelector` queue-based iteration.
- while cursor < mask.len() && selected_rows < batch_size {
+ let mut page_end = mask.len();
+ if let Some(pages) = page_locations {
+ for loc in pages {
Review Comment:
I am also a little worried that this loop will take too long: it is O(N^2)
in the number of pages, since each call scans the full page list from the
start. Maybe we could add a `PageLocationIterator` to the cursor itself, so
each call knows where to pick up.
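A rough sketch of what I mean (struct and method names here are hypothetical, not the PR's actual types):

```rust
use std::iter::Peekable;

use parquet::format::PageLocation;

/// Hypothetical helper: owns a peekable iterator over page locations so
/// each lookup resumes where the previous one stopped, making the whole
/// scan O(N) in the number of pages instead of O(N^2).
struct PageBoundaryCursor<I: Iterator<Item = PageLocation>> {
    pages: Peekable<I>,
}

impl<I: Iterator<Item = PageLocation>> PageBoundaryCursor<I> {
    /// Returns the exclusive end (in rows) of the page containing `cursor`:
    /// the start of the first page beginning after `cursor`, falling back
    /// to `mask_len` for the last page.
    fn page_end(&mut self, cursor: usize, mask_len: usize) -> usize {
        while let Some(next) = self.pages.peek() {
            let start = usize::try_from(next.first_row_index).unwrap();
            if start <= cursor {
                // This page starts at or before the cursor: consume it and
                // keep looking for the first page that starts after it.
                self.pages.next();
            } else {
                return start;
            }
        }
        mask_len
    }
}
```

Since the cursor only ever moves forward, each page is consumed at most once across all calls, so the cost is amortized O(1) per call.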
##########
parquet/src/arrow/arrow_reader/selection.rs:
##########
@@ -800,14 +806,26 @@ impl MaskCursor {
let mut chunk_rows = 0;
let mut selected_rows = 0;
- // Advance until enough rows have been selected to satisfy the batch size,
- // or until the mask is exhausted. This mirrors the behaviour of the legacy
- // `RowSelector` queue-based iteration.
- while cursor < mask.len() && selected_rows < batch_size {
+ let mut page_end = mask.len();
+ if let Some(pages) = page_locations {
+ for loc in pages {
+ let page_start = loc.first_row_index.try_into().unwrap_or(usize::MAX);
Review Comment:
I think this could just be an `unwrap` -- if the page index location is
negative or doesn't fit in a `usize`, the file is invalid and we should have
stopped reading it earlier.
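Something like:

```rust
// Treat an out-of-range `first_row_index` as an invalid file rather than
// silently saturating to `usize::MAX`:
let page_start = usize::try_from(loc.first_row_index)
    .expect("first_row_index in the offset index must fit in usize");
```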
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -1407,6 +1420,7 @@ impl ParquetRecordBatchReader {
array_reader,
schema: Arc::new(schema),
read_plan,
+ page_offsets: page_offsets.map(|slice| Arc::new(slice.to_vec())),
Review Comment:
So I think this will effectively copy the entire `OffsetIndexMetaData`
structure, which I worry could be quite large.
I wonder if we need to find a way to avoid this (e.g. making the entire
thing Arc'd in
https://github.com/apache/arrow-rs/blob/67e04e758f1e62ec3d78d2f678daf433a4c54e30/parquet/src/file/metadata/mod.rs#L197-L196
somehow 🤔)
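A hedged sketch of the idea (type names approximate, not a concrete proposal): if the metadata stored the offset index behind an `Arc`, readers could share it with a pointer bump instead of copying:

```rust
use std::sync::Arc;

use parquet::file::metadata::ParquetOffsetIndex;

/// Hypothetical: metadata stores the offset index behind an Arc...
struct MetadataSketch {
    offset_index: Option<Arc<ParquetOffsetIndex>>,
}

/// ...so a reader holds a cheap `Arc::clone` (O(1)) instead of copying the
/// whole structure with `slice.to_vec()` (O(number of pages)).
fn share_offset_index(meta: &MetadataSketch) -> Option<Arc<ParquetOffsetIndex>> {
    meta.offset_index.as_ref().map(Arc::clone)
}
```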