joe-ucp opened a new pull request, #8860:
URL: https://github.com/apache/arrow-rs/pull/8860

   - Decode page encoding stats to a compact `EncodingMask` by default
   - Add `PageEncodingStatsMode { Mask, Full, Skip }` and 
`ParquetPageEncodingStats`
   - Keep on-disk Parquet metadata format unchanged
   - Update tests, benchmarks and docs for the new default
   
   # Which issue does this PR close?
   
   - Closes #8859.
   
   # Rationale for this change
   
   Decoding page encoding statistics into a full `Vec<PageEncodingStats>` for 
every column can allocate a lot of memory and is rarely needed by callers. A 
compact mask representation keeps enough information for most use cases (which 
encodings appear) while reducing allocations and metadata parsing cost.
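
To illustrate the idea of "which encodings appear", here is a minimal sketch of a bitmask derived from a list of stats. The types below are simplified stand-ins written for this example, not the crate's actual `EncodingMask` or `PageEncodingStats` definitions; the `insert`, `contains`, and `mask_from_stats` names are assumptions. The discriminants follow the encoding ids in the Parquet Thrift definition.

```rust
/// Stand-in for the Parquet encoding enum (subset of variants).
/// Discriminants match the ids in the Parquet Thrift definition.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Encoding {
    Plain = 0,
    PlainDictionary = 2,
    Rle = 3,
    DeltaBinaryPacked = 5,
    RleDictionary = 8,
    ByteStreamSplit = 9,
}

/// Hypothetical compact mask: one bit per encoding id, 4 bytes in total.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct EncodingMask(u32);

impl EncodingMask {
    /// Record that `e` appears at least once.
    fn insert(&mut self, e: Encoding) {
        self.0 |= 1u32 << (e as u32);
    }

    /// Check whether `e` was recorded.
    fn contains(self, e: Encoding) -> bool {
        self.0 & (1u32 << (e as u32)) != 0
    }
}

/// Simplified stats entry (the real struct also carries a page type).
#[allow(dead_code)]
struct PageEncodingStats {
    encoding: Encoding,
    count: i32,
}

/// Collapse a list of stats into a mask; per-encoding counts are dropped.
fn mask_from_stats(stats: &[PageEncodingStats]) -> EncodingMask {
    let mut mask = EncodingMask::default();
    for s in stats {
        mask.insert(s.encoding);
    }
    mask
}
```

The trade-off is visible in `mask_from_stats`: the mask answers "was this encoding used?" in a fixed 4 bytes per column, but it cannot recover page counts, which is why a full mode is still offered.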
   
   At the same time, we still want to support users who rely on the full stats, 
and also allow completely skipping stats when they are not needed.
   
   # What changes are included in this PR?
   
   - Introduce `PageEncodingStatsMode` with three modes:
     - `Mask` (default): decode stats to a compact `EncodingMask`
     - `Full`: decode full `Vec<PageEncodingStats>` as before
     - `Skip`: do not decode page encoding stats
   - Introduce `ParquetPageEncodingStats` to store stats in memory as either:
     - `Mask(EncodingMask)` or
     - `Full(Vec<PageEncodingStats>)`
   - Extend `ParquetMetaDataOptions` with `page_encoding_stats_mode`, 
defaulting to `Mask`
   - Wire `PageEncodingStatsMode` into the Thrift metadata decoder to:
     - derive a mask from the decoded stats,
     - preserve the full vector, or
     - skip decoding entirely, depending on the mode
   - Keep serialization/on-disk format unchanged (only full stats are written 
when present)
   - Update `ColumnChunkMetaData` to use `ParquetPageEncodingStats` and add 
helpers:
     - `page_encoding_stats()`
     - `page_encoding_stats_mask()`
     - `page_encoding_stats_full()`
   - Update and extend tests and benchmarks for all three modes
   - Update documentation and `CHANGELOG.md` to describe the new default and 
how to opt into full stats
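
The relationship between the three modes, the in-memory storage enum, and the accessor helpers listed above could look roughly like the sketch below. It is a self-contained approximation, not the code in this PR: `decode_stats` and its `raw` input are hypothetical stand-ins for the Thrift decoder, the mask is a plain `u32` rather than `EncodingMask`, and the helper return types are guesses.

```rust
/// The three decode modes named in this PR; `Mask` is the default.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum PageEncodingStatsMode {
    #[default]
    Mask,
    Full,
    Skip,
}

/// Simplified stats entry (the real struct also carries a page type).
#[allow(dead_code)]
#[derive(Clone, Debug, PartialEq, Eq)]
struct PageEncodingStats {
    encoding_id: u8, // Thrift encoding id, assumed to be < 32
    count: i32,
}

/// In-memory storage: either a compact mask or the full vector.
#[derive(Clone, Debug, PartialEq, Eq)]
enum ParquetPageEncodingStats {
    Mask(u32), // stand-in for `EncodingMask`
    Full(Vec<PageEncodingStats>),
}

/// Hypothetical decode step. `raw` stands in for the Thrift-encoded list
/// of (encoding id, page count) pairs.
fn decode_stats(raw: &[(u8, i32)], mode: PageEncodingStatsMode) -> Option<ParquetPageEncodingStats> {
    match mode {
        // Skip: nothing is read or allocated.
        PageEncodingStatsMode::Skip => None,
        // Mask: fold every entry into one bit per encoding id.
        PageEncodingStatsMode::Mask => {
            let bits = raw.iter().fold(0u32, |m, &(id, _)| m | (1u32 << id));
            Some(ParquetPageEncodingStats::Mask(bits))
        }
        // Full: keep every entry, as before this change.
        PageEncodingStatsMode::Full => Some(ParquetPageEncodingStats::Full(
            raw.iter()
                .map(|&(encoding_id, count)| PageEncodingStats { encoding_id, count })
                .collect(),
        )),
    }
}

/// Stand-in for `ColumnChunkMetaData`, reduced to the one relevant field.
struct ColumnChunkMetaData {
    page_encoding_stats: Option<ParquetPageEncodingStats>,
}

#[allow(dead_code)]
impl ColumnChunkMetaData {
    /// Whatever was decoded, in whichever representation.
    fn page_encoding_stats(&self) -> Option<&ParquetPageEncodingStats> {
        self.page_encoding_stats.as_ref()
    }

    /// `Some` only when the metadata was decoded in `Mask` mode.
    fn page_encoding_stats_mask(&self) -> Option<u32> {
        match &self.page_encoding_stats {
            Some(ParquetPageEncodingStats::Mask(m)) => Some(*m),
            _ => None,
        }
    }

    /// `Some` only when the metadata was decoded in `Full` mode.
    fn page_encoding_stats_full(&self) -> Option<&Vec<PageEncodingStats>> {
        match &self.page_encoding_stats {
            Some(ParquetPageEncodingStats::Full(v)) => Some(v),
            _ => None,
        }
    }
}
```

In this sketch a caller that wrote code against the full vector would see `None` from `page_encoding_stats_full()` under the new default, which is the behavior change the PR asks users to opt out of explicitly.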
   
   # Are these changes tested?
   
   Yes:
   
   - `cargo fmt -p parquet`
   - `cargo clippy -p parquet -- -D warnings`
   - `cargo test -p parquet`
     - including new tests for Mask / Full / Skip modes and edge cases
   - `cargo bench -p parquet -- benches::metadata`
     - Mask mode performs at least as well as the previous default, with Full 
and Skip behaving as expected
   
   # Are there any user-facing changes?
   
   Yes:
   
   - Metadata decoding now defaults to `Mask` mode instead of always 
decoding the full `Vec<PageEncodingStats>`.
   - Users who need the full stats can opt in via:
     - `ParquetMetaDataOptions::with_page_encoding_stats_mode(PageEncodingStatsMode::Full)`
   - Users who do not need page encoding stats can select 
`PageEncodingStatsMode::Skip`.
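
As a rough picture of the opt-in, the builder-style option might behave like the self-contained mock below. Only the `with_page_encoding_stats_mode` name and the `Mask` default come from this description; the `new` constructor, the getter, and the struct layout are assumptions made for the example, not the crate's actual definitions.

```rust
/// The three decode modes named in this PR; `Mask` is the default.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum PageEncodingStatsMode {
    #[default]
    Mask,
    Full,
    Skip,
}

/// Mock of the options struct, reduced to the one field discussed here.
#[derive(Clone, Debug, Default)]
struct ParquetMetaDataOptions {
    page_encoding_stats_mode: PageEncodingStatsMode,
}

#[allow(dead_code)]
impl ParquetMetaDataOptions {
    /// Hypothetical constructor: starts from the defaults (`Mask`).
    fn new() -> Self {
        Self::default()
    }

    /// Builder-style setter named in the PR description.
    fn with_page_encoding_stats_mode(mut self, mode: PageEncodingStatsMode) -> Self {
        self.page_encoding_stats_mode = mode;
        self
    }

    /// Hypothetical getter the decoder would consult.
    fn page_encoding_stats_mode(&self) -> PageEncodingStatsMode {
        self.page_encoding_stats_mode
    }
}
```

With this mock, `ParquetMetaDataOptions::new()` reports `Mask`, while chaining `.with_page_encoding_stats_mode(PageEncodingStatsMode::Full)` restores the previous full-vector behavior.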
   
   This is an in-memory behavior change intended for the next **major 
release**; the Parquet on-disk format is unchanged.
   

