joe-ucp opened a new pull request, #8892: URL: https://github.com/apache/arrow-rs/pull/8892
# Page index behavior: handle partial / missing indexes without panics --- ## Goal Ensure Parquet page indexes behave correctly when some or all columns lack indexes by: - Handling `None` at the leaf of `ParquetColumnIndex` and `ParquetOffsetIndex` as “no index for this column chunk” - Avoiding panics in normal usage when indexes are legitimately missing - Adding tests that lock in the expected behavior for partial and fully missing page indexes --- ## Which issue does this PR close? Closes [#8818](https://github.com/apache/arrow-rs/issues/8818). --- ## Rationale for this change With the earlier PRs: - Page index types now model per-column optionality as `Vec<Vec<Option<...>>>` - Column index semantics are normalized so missing indexes use `None` However, several call sites and higher-level APIs still effectively assumed that page indexes were present for all requested columns once page index reading was enabled. This could lead to: - `unwrap()` / `as_ref().unwrap()` panics when a specific column chunk had no index - Confusing behavior when some columns had page indexes and others did not - No explicit tests covering these partial- and no-index cases This PR focuses on making the runtime behavior robust for these scenarios and documenting it via tests. --- ## What changes are included in this PR? ### Safer handling of per-column `Option` page indexes - Update page-iterator, selection, and in-memory row-group code paths to: - Treat `None` in `ParquetOffsetIndex[row_group][column]` or `ParquetColumnIndex[row_group][column]` as “no page index for this column chunk” - Fall back to non-indexed behavior (e.g., scanning full columns or skipping page-level pruning) instead of panicking - Where `expect(...)` is still used, it is reserved for true internal invariants such as: - “page index should be present when page index reading is explicitly enabled and this code path is only constructed under that assumption” - These invariants are now spelled out in error messages to make assumptions explicit. ### Row selection and read-plan behavior - `RowSelection` and related selection helpers are updated to: - Take `&[Option<OffsetIndexMetaData>]` where appropriate - Skip page-level pruning for columns whose offset index is `None` - Respect the current `RowSelectionPolicy` while avoiding panics for missing indexes ### In-memory row group and async/arrow readers - `InMemoryRowGroup` now: - Uses `Option<&[Option<OffsetIndexMetaData>]>` for offset indexes - Uses page indexes where present to compute sparse read ranges - Falls back to reading the entire column when no offset index is present, instead of unconditionally indexing into `offset_index[idx]` - Async and synchronous readers are updated to: - Assume offset/column indexes may be missing on a per-column basis - Still assert the invariants they depend on when page index usage is explicitly requested (e.g., in tests that are specifically about index behavior) ### `PageIndexPolicy` behavior - With per-column `Option` modeling, `parse_offset_index` now: - Builds a `Vec<Vec<Option<OffsetIndexMetaData>>>` so missing offset indexes are represented as `None` - Avoids invalidating all page indexes when some columns are missing indexes under non-`Required` policies - The practical effect is: - For optional page index use, partial presence no longer forces “invalidate everything” - Call sites must and do handle per-column `None` in a defined, non-panicking way ### Tests for partial and missing page indexes - Add focused tests (e.g., in `parquet/tests/arrow_reader/statistics.rs`) that cover: 1. All columns have page indexes (baseline behavior) 2. Some columns have column indexes while others do not: - Verify that metadata reading and page-index-based logic do not panic - Verify that the “missing” columns are represented as `None` at the page index leaf 3. No page indexes at all (page indexes disabled): - Verify that metadata reports `None` for top-level page indexes - Verify that reading still works without panics and falls back to non-indexed behavior - These tests document expected behavior and guard against future regressions. --- ## Are these changes tested? **Yes.** - All existing parquet tests continue to pass. - New and updated tests are included to explicitly cover: - Partial column index presence -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
