fpetkovski opened a new issue, #22200: URL: https://github.com/apache/datafusion/issues/22200
We have a specific use case in one of our deployments where a small subset of files ends up serving heavy reads, many of which are point lookups. I am noticing in profiles that most of the CPU time is spent inferring the Arrow schema from the `ARROW:schema` Parquet metadata. The other expensive part is rebuilding the bloom filter on the predicate column over and over again.

In our case we know the Arrow schema for each file and are okay with providing it ourselves. One option is to add it as an optional field on `PartitionedFile`, so that the opener uses it when set instead of inferring the schema from the Parquet footer.

I don't yet have a good solution for reusing bloom filters, but I am open to ideas on what can be done to inject more information into the Parquet opener ahead of time. I am also happy to open a separate issue for that.

The flamegraph below is taken from one of our production deployments, and I have focused it only on the stack frames doing Parquet file reads.

<img width="1352" height="549" alt="Image" src="https://github.com/user-attachments/assets/26bef2d4-e8b4-4c44-9931-dc1573b9b358" />

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
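To make the `PartitionedFile` proposal concrete, here is a minimal sketch of the "use the supplied schema if set, otherwise infer it" logic. It is not DataFusion code: `Schema` is a stand-in for the Arrow schema type, while `PartitionedFileSketch`, its `arrow_schema` field, and `resolve_schema` are hypothetical names chosen for illustration. The `infer_from_footer` closure stands in for whatever the opener does today to decode `ARROW:schema`.

```rust
use std::sync::Arc;

/// Stand-in for the Arrow schema type; only field names are modeled here.
#[derive(Debug, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

/// Hypothetical, heavily simplified `PartitionedFile` carrying the proposed
/// optional schema. The real struct has many more fields.
pub struct PartitionedFileSketch {
    pub path: String,
    /// Assumed new field: a caller-supplied schema. `None` keeps the
    /// current behavior of inferring the schema from the Parquet footer.
    pub arrow_schema: Option<Arc<Schema>>,
}

/// Returns the caller-supplied schema when present. The (expensive)
/// inference closure runs only when no schema was supplied.
pub fn resolve_schema<F>(file: &PartitionedFileSketch, infer_from_footer: F) -> Arc<Schema>
where
    F: FnOnce(&str) -> Schema,
{
    match &file.arrow_schema {
        // Cheap path: share the existing schema, no footer decoding.
        Some(schema) => Arc::clone(schema),
        // Fallback path: same work the opener does today.
        None => Arc::new(infer_from_footer(&file.path)),
    }
}
```

Because the field is optional, existing callers that leave it as `None` would see no change in behavior under this sketch; only callers that already know their schemas would skip the inference step.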
