[PR] Page index partial missing behavior [arrow-rs]

via GitHub Thu, 20 Nov 2025 11:02:22 -0800


joe-ucp opened a new pull request, #8892:
URL: https://github.com/apache/arrow-rs/pull/8892


   # Page index behavior: handle partial / missing indexes without panics
   
   ---
   
   ## Goal
   
   Ensure Parquet page indexes behave correctly when some or all columns lack 
indexes by:
   - Handling `None` at the leaf of `ParquetColumnIndex` and 
`ParquetOffsetIndex` as “no index for this column chunk”
   - Avoiding panics in normal usage when indexes are legitimately missing
   - Adding tests that lock in the expected behavior for partial and fully 
missing page indexes
   
   ---
   
   ## Which issue does this PR close?
   
   Closes [#8818](https://github.com/apache/arrow-rs/issues/8818).
   
   ---
   
   ## Rationale for this change
   
   With the earlier PRs:
   - Page index types now model per-column optionality as 
`Vec<Vec<Option<...>>>`
   - Column index semantics are normalized so missing indexes use `None`
   
   However, several call sites and higher-level APIs still effectively assumed 
that page indexes were present for all requested columns once page index 
reading was enabled. This could lead to:
   - `unwrap()` / `as_ref().unwrap()` panics when a specific column chunk had 
no index
   - Confusing behavior when some columns had page indexes and others did not
   - No explicit tests covering these partial- and no-index cases
   
   This PR focuses on making the runtime behavior robust for these scenarios 
and documenting it via tests.
   
   ---
   
   ## What changes are included in this PR?
   
   ### Safer handling of per-column `Option` page indexes
   
   - Update page-iterator, selection, and in-memory row-group code paths to:
     - Treat `None` in `ParquetOffsetIndex[row_group][column]` or 
`ParquetColumnIndex[row_group][column]` as “no page index for this column chunk”
     - Fall back to non-indexed behavior (e.g., scanning full columns or 
skipping page-level pruning) instead of panicking
   
   - Where `expect(...)` is still used, it is reserved for true internal 
invariants such as:
     - “page index should be present when page index reading is explicitly 
enabled and this code path is only constructed under that assumption”
     - These invariants are now spelled out in error messages to make 
assumptions explicit.
   
   ### Row selection and read-plan behavior
   
   - `RowSelection` and related selection helpers are updated to:
     - Take `&[Option<OffsetIndexMetaData>]` where appropriate
     - Skip page-level pruning for columns whose offset index is `None`
     - Respect the current `RowSelectionPolicy` while avoiding panics for 
missing indexes
   
   ### In-memory row group and async/arrow readers
   
   - `InMemoryRowGroup` now:
     - Uses `Option<&[Option<OffsetIndexMetaData>]>` for offset indexes
     - Uses page indexes where present to compute sparse read ranges
     - Falls back to reading the entire column when no offset index is present, 
instead of unconditionally indexing into `offset_index[idx]`
   
   - Async and synchronous readers are updated to:
     - Assume offset/column indexes may be missing on a per-column basis
     - Still assert the invariants they depend on when page index usage is 
explicitly requested (e.g., in tests that are specifically about index behavior)
   
   ### `PageIndexPolicy` behavior
   
   - With per-column `Option` modeling, `parse_offset_index` now:
     - Builds a `Vec<Vec<Option<OffsetIndexMetaData>>>` so missing offset 
indexes are represented as `None`
     - Avoids invalidating all page indexes when some columns are missing 
indexes under non-`Required` policies
   
   - The practical effect is:
     - For optional page index use, partial presence no longer forces 
“invalidate everything”
     - Call sites must and do handle per-column `None` in a defined, 
non-panicking way
   
   ### Tests for partial and missing page indexes
   
   - Add focused tests (e.g., in `parquet/tests/arrow_reader/statistics.rs`) 
that cover:
     1. All columns have page indexes (baseline behavior)
     2. Some columns have column indexes while others do not:
        - Verify that metadata reading and page-index-based logic do not panic
        - Verify that the “missing” columns are represented as `None` at the 
page index leaf
     3. No page indexes at all (page indexes disabled):
        - Verify that metadata reports `None` for top-level page indexes
        - Verify that reading still works without panics and falls back to 
non-indexed behavior
   - These tests document expected behavior and guard against future 
regressions.
   
   ---
   
   ## Are these changes tested?
   
   **Yes.**
   
   - All existing parquet tests continue to pass.
   - New and updated tests are included to explicitly cover:
     - Partial column index presence
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] Page index partial missing behavior [arrow-rs]

Reply via email to