etseidl commented on PR #8160: URL: https://github.com/apache/arrow-rs/pull/8160#issuecomment-3192922581
I will say that the page indexes are pretty darn expensive to parse, and the file used for the benchmark (`parquet-testing/data/all_types_tiny_pages.parquet`) is pretty pathological. Looking into where the time goes: the offset index is hobbled by being defined as an array of structs, which adds considerable overhead to the parsing. The column index is a struct of arrays that parses very quickly, but it must then be transformed into an array of structs after decoding, and that takes a good bit of time. Copying the min/max statistics for byte arrays accounts for the majority of that time. (Note that the test file does not contain the level histograms; those would be very costly as well if present.)

We could look into rethinking how we represent the column index. Perhaps saving the bytes read and presenting slices rather than copies would work, at least for the histograms in the column index; we may be stuck with copying the min/max values.

@alamb, not sure how radical you want to go here 😅
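To make the "slices rather than copies" idea concrete, here is a minimal sketch. Every type and method name in it (`ColumnIndexBytes`, `PageStatsOwned`, `PageStatsRef`, `to_owned_pages`, `page`) is hypothetical and exists only for illustration; none of them are the parquet crate's actual types. It assumes the decoded bytes are kept in a single buffer, with per-page ranges pointing into it.

```rust
use std::ops::Range;

// Hypothetical types for illustration only; NOT the parquet crate's API.

/// Struct-of-arrays view of a column index that keeps the decoded bytes
/// in one buffer and records where each page's min/max value lives.
struct ColumnIndexBytes {
    buf: Vec<u8>,
    null_pages: Vec<bool>,
    min_ranges: Vec<Range<usize>>,
    max_ranges: Vec<Range<usize>>,
}

/// Array-of-structs element that owns its statistics (needs a copy per value).
struct PageStatsOwned {
    is_null_page: bool,
    min: Vec<u8>,
    max: Vec<u8>,
}

/// Array-of-structs style element that borrows from the buffer (no copy).
struct PageStatsRef<'a> {
    is_null_page: bool,
    min: &'a [u8],
    max: &'a [u8],
}

impl ColumnIndexBytes {
    fn num_pages(&self) -> usize {
        self.null_pages.len()
    }

    /// Eager transform: allocates and copies min/max bytes for every page.
    fn to_owned_pages(&self) -> Vec<PageStatsOwned> {
        (0..self.num_pages())
            .map(|i| PageStatsOwned {
                is_null_page: self.null_pages[i],
                min: self.buf[self.min_ranges[i].clone()].to_vec(),
                max: self.buf[self.max_ranges[i].clone()].to_vec(),
            })
            .collect()
    }

    /// Lazy view: builds a per-page struct on demand, borrowing the bytes.
    fn page(&self, i: usize) -> PageStatsRef<'_> {
        PageStatsRef {
            is_null_page: self.null_pages[i],
            min: &self.buf[self.min_ranges[i].clone()],
            max: &self.buf[self.max_ranges[i].clone()],
        }
    }
}
```

The trade-off in this sketch is that `PageStatsRef` ties its lifetime to the buffer, so callers that need owned statistics would still pay for a copy, but only for the pages they actually touch.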