joe-ucp opened a new pull request, #8893: URL: https://github.com/apache/arrow-rs/pull/8893
# Page index column semantics: use `None` instead of `ColumnIndexMetaData::NONE` --- ## Goal Normalize column page index semantics so that missing column indexes are represented as `None` at the leaf of `ParquetColumnIndex`, treating `ColumnIndexMetaData::NONE` as a legacy/special marker instead of the general “no index” representation. --- ## Which issue does this PR close? Part of [#8818](https://github.com/apache/arrow-rs/issues/8818). --- ## Rationale for this change After introducing: ```rust pub type ParquetColumnIndex = Vec<Vec<Option<ColumnIndexMetaData>>>; ``` there were still code paths that used `Some(ColumnIndexMetaData::NONE)` to represent *“no column index for this column chunk”*. That left us with two parallel representations for “missing”: - `None` - `Some(ColumnIndexMetaData::NONE)` This is confusing for callers and undermines the advantages of modeling optionality with `Option`. The more idiomatic and type-safe approach is: - Use `None` to express *“no index for this (row_group, column)”* - Use `Some(ColumnIndexMetaData::...)` only when an actual index is present - Treat `ColumnIndexMetaData::NONE` as a legacy/special-case sentinel, not the general “missing index” marker This PR updates parsing, writing, and consumers to align with that model. --- ## What changes are included in this PR? - ### Parser semantics for column indexes - `parse_column_index` now sets: - `ParquetColumnIndex[row_group][column] = None` when there is no column index range for that column chunk - `Some(ColumnIndexMetaData::...)` when column index data is present - This replaces the previous behavior that could return `Some(ColumnIndexMetaData::NONE)` in these cases. - ### Writer / metadata plumbing - Ensure writer paths and internal builders only produce `Some(ColumnIndexMetaData::...)` when a column index actually exists. - When no column index is produced, the leaf is set to `None` instead of `Some(NONE)`. - ### `ColumnIndexMetaData::NONE` documentation and usage - Document `ColumnIndexMetaData::NONE` as a legacy marker and clarify it must NOT be used to represent missing column indexes in `ParquetColumnIndex`; `None` on the outer `Option` should be used instead. - Update consumers (e.g., in `arrow_reader/statistics.rs`) to: - Treat `None` as “no column index for this column chunk” - Treat `Some(ci)` as “column index present” - Only special-case `ColumnIndexMetaData::NONE` where it has specific legacy meaning for page-level stats, not as the general “no index” signal. - ### Consumers and statistics logic - Update iterator macros and helpers to work with `Iterator<Item = (usize, &'a Option<ColumnIndexMetaData>)>` instead of the non-optional type. - For statistics: - `None` → fill with `len` `None` entries (no stats for that chunk) - `Some(ColumnIndexMetaData::...)` → derive min/max/null_count as before - This ensures: - Column-index-based pruning is skipped when `None` is encountered - The code no longer relies on `ColumnIndexMetaData::NONE` as the primary “missing” signal - ### Tests - Update tests that previously expected `Some(ColumnIndexMetaData::NONE)` to now expect `None` for missing column indexes. - Preserve existing expectations where `ColumnIndexMetaData::NONE` has specific meaning, but not as the default representation of “no index”. --- ## Are these changes tested? **Yes.** - Existing parquet tests have been updated where necessary and are passing. - Page-index-related tests (including statistics and reader tests) are updated to assert the new semantics and are passing. Example commands: cargo test --package parquet --lib cargo test --package parquet --test arrow_reader -- test_page_index* # or equivalent page-index tests --- ## Are there any user-facing changes? **Yes, in terms of how missing column indexes are surfaced:** - Callers that inspect `ParquetColumnIndex` should now treat: - `None` at the leaf as “no column index for this column chunk” - `Some(ColumnIndexMetaData::...)` as “column index present” - Any downstream code relying on `Some(ColumnIndexMetaData::NONE)` as a missing-index marker will need to be updated to check for `None` instead. The on-disk Parquet encoding remains unchanged. Most observable differences should be: - More consistent representation of missing column indexes - Fewer surprises around mixed `None` / `Some(NONE)` representations - No new panics introduced by this normalization -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
