Re: [I] Optimized decoding of Parquet Statistics, `null_pages` and `null_counts` [arrow-rs]

via GitHub Mon, 02 Feb 2026 07:27:52 -0800


alamb commented on issue #9296:
URL: https://github.com/apache/arrow-rs/issues/9296#issuecomment-3835863883


   My personal suggestion is:
   1. Defer decoding statistics entirely when parsing metadata (just skip the
      statistics)
   2. Decode the statistics directly into Arrow arrays (i.e. the correct
      columnar format) when requested
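To make the two steps concrete, here is a minimal sketch in plain Rust. Everything in it is hypothetical: `LazyStats`, `MinColumn`, `encode_toy`, and the 9-byte toy encoding are invented for illustration and are not part of the parquet or arrow crates. The "parse" step keeps the encoded bytes untouched, and decoding happens per column, on request, straight into one contiguous values buffer plus a validity mask (roughly the shape of an Arrow primitive array).

```rust
// Hypothetical sketch: these types and the toy encoding are NOT the parquet crate's API.

/// Size of one toy entry: 1 presence byte followed by 8 little-endian bytes.
const ENTRY: usize = 9;

/// Statistics left undecoded at metadata-parse time.
/// Toy layout: for each row group, for each column, one `ENTRY`-sized record.
struct LazyStats {
    encoded: Vec<u8>,
    num_columns: usize,
}

/// Columnar result for one column: values and validity stored contiguously.
#[derive(Debug, PartialEq)]
struct MinColumn {
    values: Vec<i64>,
    valid: Vec<bool>,
}

impl LazyStats {
    /// Decode the min statistic of a single column across all row groups.
    /// Columns that are never requested are never decoded.
    fn decode_mins(&self, column: usize) -> MinColumn {
        assert!(column < self.num_columns, "column index out of range");
        let num_row_groups = self.encoded.len() / (ENTRY * self.num_columns);
        let mut values = Vec::with_capacity(num_row_groups);
        let mut valid = Vec::with_capacity(num_row_groups);
        for rg in 0..num_row_groups {
            let start = (rg * self.num_columns + column) * ENTRY;
            let entry = &self.encoded[start..start + ENTRY];
            let present = entry[0] == 1;
            let mut bytes = [0u8; 8];
            bytes.copy_from_slice(&entry[1..]);
            // Missing statistics get a placeholder value and a `false` validity bit.
            values.push(if present { i64::from_le_bytes(bytes) } else { 0 });
            valid.push(present);
        }
        MinColumn { values, valid }
    }
}

/// Build the toy encoding from `row_groups[row_group][column]` (test helper).
fn encode_toy(row_groups: &[Vec<Option<i64>>]) -> Vec<u8> {
    let mut out = Vec::new();
    for rg in row_groups {
        for stat in rg {
            match stat {
                Some(v) => {
                    out.push(1);
                    out.extend_from_slice(&v.to_le_bytes());
                }
                None => {
                    out.push(0);
                    out.extend_from_slice(&[0u8; 8]);
                }
            }
        }
    }
    out
}
```

In this sketch each `decode_mins` call makes two allocations per requested column regardless of the number of row groups. A real design would have to deal with the Thrift-encoded footer and variable-length values, which is where the API questions come in.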
   
   This would solve several sources of inefficiency today:
   1. Many small allocations in `ParquetMetaData` (one allocation for each page
      and one for each column in each row group)
   2. Inefficient conversion that has to walk through all those small
      allocations and copy them into an Array
   3. Decoding (with allocations) the column statistics for columns that are
      never read by any query
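The cost pattern in the list above can be illustrated with a simplified, hypothetical row-oriented layout. `ChunkStats`, `RowGroup`, `gather_mins`, and `owned_min_buffers` are made-up names, and the shapes only approximate how per-chunk statistics are held; they are not the parquet crate's actual types. Each column chunk owns its own small buffer, and building a column of statistics means visiting every row group and copying out of each buffer.

```rust
// Hypothetical shapes for illustration only; not the parquet crate's types.

/// Row-oriented statistics: each column chunk owns its own small byte buffer.
struct ChunkStats {
    min: Option<Vec<u8>>,
}

struct RowGroup {
    columns: Vec<ChunkStats>,
}

/// Walk every row group and copy one column's min value out of its own buffer.
/// Entries that are missing or not exactly 8 bytes become `None`.
fn gather_mins(row_groups: &[RowGroup], column: usize) -> Vec<Option<i64>> {
    row_groups
        .iter()
        .map(|rg| {
            let bytes = rg.columns.get(column)?.min.as_ref()?;
            if bytes.len() != 8 {
                return None;
            }
            let mut arr = [0u8; 8];
            arr.copy_from_slice(bytes);
            Some(i64::from_le_bytes(arr))
        })
        .collect()
}

/// Count the separately owned min buffers: up to one per column per row group,
/// created whether or not the column is ever queried.
fn owned_min_buffers(row_groups: &[RowGroup]) -> usize {
    row_groups
        .iter()
        .map(|rg| rg.columns.iter().filter(|c| c.min.is_some()).count())
        .sum()
}
```

With R row groups and C columns this layout holds up to R × C separate buffers for the min values alone, whereas a columnar layout needs on the order of one buffer per requested column.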
   
   I think the API design is probably the trickiest part of this project.


