etseidl commented on PR #8160:
URL: https://github.com/apache/arrow-rs/pull/8160#issuecomment-3192922581

   I will say that the page indexes are pretty darn expensive to parse, and the 
file used for the benchmark 
(`parquet-testing/data/all_types_tiny_pages.parquet`) is pretty pathological. 
Looking into where the time goes, the offset index is hobbled by the fact that 
it's defined as an array of structs, which adds considerable overhead to the 
parsing. The column index is a struct of arrays that parses very quickly, but 
then must be transformed into an array of structs after decoding, so that takes 
a good bit of time. Copying of the min/max statistics for byte arrays takes the 
majority of that time (note that the test file does not contain the level 
histograms...those would be very costly as well if present). We could look into 
rethinking how we represent the column index. Perhaps saving the bytes read and 
presenting slices rather than copies would work (at least for the 
histograms in the column index...we may be stuck with copying the min/max values).
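
To make the struct-of-arrays vs. array-of-structs point concrete, here is a rough sketch of the two approaches. All of the type and function names (`ColumnIndexSoA`, `PageStatsOwned`, `PageView`, `to_owned_pages`, `page_views`) are made up for illustration and are not the actual arrow-rs or Thrift definitions. The layout is heavily simplified: one min value per page, and a flattened definition level histogram with a fixed number of entries per page.

```rust
/// Struct-of-arrays layout, loosely following how the column index is
/// serialized: one vector per field, one entry per page.
/// Hypothetical simplified type, not the arrow-rs definition.
pub struct ColumnIndexSoA {
    pub null_pages: Vec<bool>,
    pub min_values: Vec<Vec<u8>>,
    pub null_counts: Vec<i64>,
    /// Flattened histograms: `levels_per_page` entries for each page.
    pub def_level_histograms: Vec<i64>,
    pub levels_per_page: usize,
}

/// Array-of-structs element that owns its data (copies on conversion).
pub struct PageStatsOwned {
    pub is_null_page: bool,
    pub min: Vec<u8>,
    pub null_count: i64,
    pub def_level_histogram: Vec<i64>,
}

/// Per-page view that borrows from the decoded index instead of copying.
pub struct PageView<'a> {
    pub is_null_page: bool,
    pub min: &'a [u8],
    pub null_count: i64,
    pub def_level_histogram: &'a [i64],
}

/// Copying transform: every byte array and histogram is cloned per page.
pub fn to_owned_pages(index: &ColumnIndexSoA) -> Vec<PageStatsOwned> {
    let n = index.levels_per_page;
    (0..index.null_pages.len())
        .map(|i| PageStatsOwned {
            is_null_page: index.null_pages[i],
            min: index.min_values[i].clone(),
            null_count: index.null_counts[i],
            def_level_histogram: index.def_level_histograms[i * n..(i + 1) * n].to_vec(),
        })
        .collect()
}

/// Borrowing transform: pages are slices into the existing buffers.
pub fn page_views(index: &ColumnIndexSoA) -> Vec<PageView<'_>> {
    let n = index.levels_per_page;
    (0..index.null_pages.len())
        .map(|i| PageView {
            is_null_page: index.null_pages[i],
            min: &index.min_values[i],
            null_count: index.null_counts[i],
            def_level_histogram: &index.def_level_histograms[i * n..(i + 1) * n],
        })
        .collect()
}
```

In this sketch `page_views` allocates only the outer `Vec`, while `to_owned_pages` allocates once per byte array and once per histogram. Whether the same trick is practical for min/max in the real reader depends on how the decoded bytes are retained, which is the open question above.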
   
   @alamb, not sure how radical you want to go here 😅 

