[I] Reading arrow schemas from parquet files is expensive [datafusion]

via GitHub Fri, 15 May 2026 02:47:00 -0700


fpetkovski opened a new issue, #22200:
URL: https://github.com/apache/datafusion/issues/22200


   We have a specific use case in one of our deployments where a small subset 
of files ends up serving heavy reads, many of which are point lookups. Profiles 
show that most of the CPU time is spent inferring the Arrow schema from the 
`ARROW:schema` Parquet metadata. The other expensive part is rebuilding the 
bloom filter on the predicate column over and over again. 
   
   In our case we know the Arrow schema for each file and are okay with 
providing it ourselves. One option is to add it as an optional field on 
`PartitionedFile`, so that the opener uses it when set and only falls back to 
inferring it from the Parquet footer otherwise. I don't yet have a good 
solution for reusing bloom filters, but I am open to ideas on how to inject 
more information into the Parquet opener ahead of time. I am also happy to open 
a separate issue for that.
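
   To make the proposal concrete, here is a rough, std-only sketch of the 
"use the provided schema if set, otherwise infer" rule. It is not DataFusion 
code: `Schema`, `FileEntry`, `Opener`, `known_schema`, `infer_schema` and 
`resolve_schema` are all hypothetical stand-ins (for the Arrow schema, 
`PartitionedFile` and the Parquet opener), and the real field name and opener 
plumbing would likely look different.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Simplified stand-in for an Arrow schema: ordered (column name, type name) pairs.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<(String, String)>,
}

/// Hypothetical stand-in for `PartitionedFile` with the proposed optional field.
#[derive(Debug, Clone)]
pub struct FileEntry {
    pub path: String,
    /// Caller-supplied schema (assumed field name); `None` keeps the current behaviour.
    pub known_schema: Option<Arc<Schema>>,
}

/// Hypothetical opener that counts how often it falls back to inference.
#[derive(Default)]
pub struct Opener {
    pub inferences: AtomicUsize,
}

impl Opener {
    /// Stand-in for the expensive step (decoding `ARROW:schema` from the footer).
    /// It returns a fixed one-column schema purely for illustration.
    fn infer_schema(&self, _path: &str) -> Arc<Schema> {
        self.inferences.fetch_add(1, Ordering::Relaxed);
        Arc::new(Schema {
            fields: vec![("id".to_string(), "Int64".to_string())],
        })
    }

    /// Prefer the caller-supplied schema; infer only when none was provided.
    pub fn resolve_schema(&self, file: &FileEntry) -> Arc<Schema> {
        match &file.known_schema {
            Some(schema) => Arc::clone(schema),
            None => self.infer_schema(&file.path),
        }
    }
}
```

   With something along these lines, a caller that already tracks schemas 
would set the optional field once per file and repeated point lookups would 
only pay for an `Arc` clone, while files without a provided schema would keep 
going through the existing inference path.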
   
   The flamegraph below is taken from one of our production deployments, and I 
have narrowed it to the stack frames doing Parquet file reads.
   
   <img width="1352" height="549" alt="Image" 
src="https://github.com/user-attachments/assets/26bef2d4-e8b4-4c44-9931-dc1573b9b358"
 />


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


