mbutrovich opened a new pull request, #2100:
URL: https://github.com/apache/iceberg-rust/pull/2100

   ## Which issue does this PR close?
   
   
   While running DataFusion Comet with an Iceberg workload that generates 
~80,000 `FileScanTask` objects passed to the `ArrowReader`, we see the majority 
of CPU time spent in `get_metadata` calls made via 
`ArrowReader::create_parquet_record_batch_stream_builder`.
   
   This is a screenshot from the CPU time flame graph from one of the executors 
in this Spark job:
   <img width="1427" height="450" alt="Screenshot 2026-02-02 at 6 40 19 AM" src="https://github.com/user-attachments/assets/95e9e884-6e14-4cfb-8e81-afbafa5e8fcb" />
   
   I suspect the `ArrowReader` is processing multiple `FileScanTask`s for the 
same Parquet data files and fetching the same metadata each time, burning CPU 
cycles on repeated parsing and adding extra object store calls.
   
   ## What changes are included in this PR?
   
   
   - Added `ParquetMetadataCache`, modeled after the behavior in 
delete_filter.rs. The key is a composite of the file location and whether the 
page index was requested, since serving a `true` request from an entry cached 
with `false` would yield incorrect results.
   - `ArrowReader` has a metadata cache.
   - `BasicDeleteFileLoader` has a metadata cache.
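
To illustrate why the key combines the location with the page-index flag, here is a minimal sketch using only the standard library. It is not the code in this PR: `MetadataCacheSketch`, `MetadataKey`, `FakeMetadata`, and `get_or_load` are hypothetical names, and the real cache may use a different map, locking, or eviction strategy.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical composite key: file location plus whether the page index
/// was requested when the metadata was loaded.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct MetadataKey {
    location: String,
    with_page_index: bool,
}

/// Stand-in for parsed Parquet metadata (illustration only).
#[derive(Debug)]
struct FakeMetadata {
    has_page_index: bool,
}

/// Minimal thread-safe cache sketch; an assumption-laden illustration,
/// not the PR's `ParquetMetadataCache`.
#[derive(Default)]
struct MetadataCacheSketch {
    entries: Mutex<HashMap<MetadataKey, Arc<FakeMetadata>>>,
}

impl MetadataCacheSketch {
    /// Returns the cached metadata for (location, with_page_index), calling
    /// `load` only on a miss. The lock is held while loading, which keeps the
    /// sketch simple but serializes concurrent loads.
    fn get_or_load<F>(&self, location: &str, with_page_index: bool, load: F) -> Arc<FakeMetadata>
    where
        F: FnOnce() -> FakeMetadata,
    {
        let key = MetadataKey {
            location: location.to_string(),
            with_page_index,
        };
        let mut entries = self.entries.lock().unwrap();
        let value = Arc::clone(entries.entry(key).or_insert_with(|| Arc::new(load())));
        value
    }
}
```

With this shape, two lookups for the same file with the same flag share one `Arc`, while a lookup with `with_page_index = true` never receives an entry that was loaded without the page index.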
   
   ## Are these changes tested?
   
   
   - Added a new test in reader.rs; all existing tests pass.
   - We will also run this in Comet CI and try the pipeline described above 
with the experimental Comet branch pointing to this branch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

