alamb commented on issue #9296:
URL: https://github.com/apache/arrow-rs/issues/9296#issuecomment-3835863883

My personal suggestion is:
1. Defer decoding statistics entirely when parsing metadata (just skip the statistics)
2. Decode the statistics directly into arrow arrays (aka the correct columnar format) when requested

This would solve several sources of inefficiency today:
1. Many small allocations in ParquetMetadata (one allocation for each page and one for each column for each row group)
2. Inefficient conversion having to walk down all those little allocations and copy them into an Array
3. Decoding (w/ allocations) the column statistics for columns that are never read in the queries

I think the API design is probably the trickiest part of this project

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
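
The two-step suggestion in the comment above (skip statistics at parse time, then decode one column's statistics straight into columnar buffers on request) could look roughly like the sketch below. Everything in it is an assumption made for illustration: `ColumnStats`, `encode_row_group`, and `decode_column` are hypothetical names, the byte layout is a toy stand-in for the real Thrift-encoded metadata, and plain `Vec`s stand in for Arrow arrays so that the sketch needs no external crates.

```rust
/// Columnar min/max statistics for ONE column across all row groups.
/// Hypothetical type: it mimics the layout of an Arrow primitive array
/// (contiguous values plus a validity mask) without depending on arrow-rs.
#[derive(Debug, PartialEq)]
pub struct ColumnStats {
    pub mins: Vec<i64>,
    pub maxes: Vec<i64>,
    pub valid: Vec<bool>,
}

/// Toy encoding (an assumption, NOT the Parquet Thrift format): for each
/// column, one flag byte (0 = no statistics, 1 = present); when present it
/// is followed by the min and max as 8-byte little-endian i64 values.
pub fn encode_row_group(stats: &[Option<(i64, i64)>]) -> Vec<u8> {
    let mut out = Vec::new();
    for s in stats {
        match s {
            Some((min, max)) => {
                out.push(1);
                out.extend_from_slice(&min.to_le_bytes());
                out.extend_from_slice(&max.to_le_bytes());
            }
            None => out.push(0),
        }
    }
    out
}

/// Reads a little-endian i64 at `pos`, or returns None if the buffer is too short.
fn read_i64(buf: &[u8], pos: usize) -> Option<i64> {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(buf.get(pos..pos + 8)?);
    Some(i64::from_le_bytes(bytes))
}

/// Decodes statistics for `column` only, straight into columnar buffers.
/// The raw bytes of each row group are kept undecoded until this call.
/// Columns before the requested one are skipped by advancing a cursor;
/// nothing is allocated for them. Returns None on truncated or malformed input.
pub fn decode_column(row_groups: &[Vec<u8>], column: usize) -> Option<ColumnStats> {
    let n = row_groups.len();
    // Three allocations in total, regardless of the number of row groups.
    let mut out = ColumnStats {
        mins: Vec::with_capacity(n),
        maxes: Vec::with_capacity(n),
        valid: Vec::with_capacity(n),
    };
    for buf in row_groups {
        let mut pos = 0;
        // Skip the columns that were not requested.
        for _ in 0..column {
            match *buf.get(pos)? {
                0 => pos += 1,
                1 => pos += 17, // flag byte + two 8-byte values
                _ => return None,
            }
        }
        match *buf.get(pos)? {
            1 => {
                out.mins.push(read_i64(buf, pos + 1)?);
                out.maxes.push(read_i64(buf, pos + 9)?);
                out.valid.push(true);
            }
            0 => {
                // Placeholder slots, masked out by `valid`.
                out.mins.push(0);
                out.maxes.push(0);
                out.valid.push(false);
            }
            _ => return None,
        }
    }
    Some(out)
}
```

In this sketch a caller would hold on to the raw per-row-group bytes after parsing and call `decode_column(&raw, i)` only for columns a query actually prunes on. How that maps onto a real public API is left open here, as it is in the comment.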
