mbutrovich opened a new pull request, #2100: URL: https://github.com/apache/iceberg-rust/pull/2100
## Which issue does this PR close?

While running DataFusion Comet with an Iceberg workload that generates ~80,000 `FileScanTask` objects passed to the `ArrowReader`, we see the majority of CPU time spent in `get_metadata` calls via `ArrowReader::create_parquet_record_batch_stream_builder`. This is a screenshot of the CPU time flame graph from one of the executors in this Spark job:

<img width="1427" height="450" alt="Screenshot 2026-02-02 at 6 40 19 AM" src="https://github.com/user-attachments/assets/95e9e884-6e14-4cfb-8e81-afbafa5e8fcb" />

I suspect the `ArrowReader` is processing multiple `FileScanTask`s for the same Parquet data files and fetching the same metadata each time, burning CPU cycles on parsing and adding extra object store calls.

## What changes are included in this PR?

- `ParquetMetadataCache`, modeled after the behavior in delete_filter.rs. I made the key a composite of the file location and whether the page index was requested, since serving a request with the page index set to `true` from an entry cached with `false` would yield improper results.
- `ArrowReader` has a metadata cache.
- `BasicDeleteFileLoader` has a metadata cache.

## Are these changes tested?

- New test in reader.rs; all existing tests pass.
- We will also run this in Comet CI, and try the pipeline described above with the experimental Comet branch pointing to this branch.
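To illustrate the composite cache key described above, here is a minimal sketch using only the standard library. It is not the code from this PR: `FileMetadata`, `MetadataKey`, `MetadataCache`, and `get_or_load` are hypothetical names, the metadata type is a stand-in for the real parsed Parquet metadata, and the loader is a plain synchronous closure rather than an async object store read.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parsed Parquet metadata (not the real type).
#[derive(Debug)]
pub struct FileMetadata {
    pub num_rows: u64,
    pub has_page_index: bool,
}

/// Composite key: file location plus whether the page index was requested.
/// Keeping the flag in the key means a request that needs the page index
/// is never served an entry that was loaded without it.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct MetadataKey {
    pub location: String,
    pub with_page_index: bool,
}

/// Hypothetical cache shared across scan tasks that touch the same file.
#[derive(Default)]
pub struct MetadataCache {
    entries: Mutex<HashMap<MetadataKey, Arc<FileMetadata>>>,
}

impl MetadataCache {
    /// Returns the cached metadata for the key, calling `load` only on a miss.
    /// The lock is held while loading, which keeps this sketch simple; an
    /// async implementation would avoid holding a lock across I/O.
    pub fn get_or_load<F>(
        &self,
        location: &str,
        with_page_index: bool,
        load: F,
    ) -> Arc<FileMetadata>
    where
        F: FnOnce() -> FileMetadata,
    {
        let key = MetadataKey {
            location: location.to_string(),
            with_page_index,
        };
        let mut entries = self.entries.lock().unwrap();
        Arc::clone(entries.entry(key).or_insert_with(|| Arc::new(load())))
    }
}
```

In this sketch, two lookups for the same location with the same flag share one `Arc`, while a lookup for the same location with the flag flipped triggers a separate load and a separate entry.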
