jpugliesi commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2557778425
Just to contribute some findings: we also hit this case, where `pyiceberg`'s scan planning (`plan_files`) was surprisingly slow when reading manifest files from GCS. Switching `py-io-impl` from `pyiceberg.io.pyarrow.PyArrowFileIO` to `pyiceberg.io.fsspec.FsspecFileIO` improved performance significantly.

Attached are screenshots of traces (run on my laptop) showing the performance difference we've consistently observed between the two `py-io-impl`s:

`pyiceberg.io.pyarrow.PyArrowFileIO`:
<img width="1709" alt="image" src="https://github.com/user-attachments/assets/a7b094fa-4dbd-4b6c-95c8-b6644e099c1d" />

`pyiceberg.io.fsspec.FsspecFileIO`:
<img width="1705" alt="image" src="https://github.com/user-attachments/assets/a4df475d-f4dc-4c84-b55e-2b7932ead6c2" />

With `PyArrowFileIO`, it looks like there is some resource contention. We tried tuning various things, such as [`ARROW_IO_THREADS`](https://arrow.apache.org/docs/cpp/threading.html#cpu-vs-i-o), but ultimately never identified the root cause.
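
For anyone who wants to try the same switch, here is a minimal sketch of how the comparison could be set up. The helper names `catalog_properties` and `time_plan_files` are hypothetical, as are the catalog name and table identifier you would pass in; only the `py-io-impl` property and the two FileIO class paths come from the comment above. It assumes the rest of your catalog configuration (URI, credentials, and so on) is supplied in `base`.

```python
import time

PYARROW_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"
FSSPEC_IO = "pyiceberg.io.fsspec.FsspecFileIO"


def catalog_properties(base: dict, use_fsspec: bool = True) -> dict:
    """Return a copy of `base` with `py-io-impl` set to the chosen FileIO.

    Hypothetical helper: `base` holds whatever catalog properties you
    already use; the input dict is not modified.
    """
    props = dict(base)
    props["py-io-impl"] = FSSPEC_IO if use_fsspec else PYARROW_IO
    return props


def time_plan_files(catalog_name: str, table_identifier: str, props: dict):
    """Time scan planning for one table; returns (seconds, number of tasks).

    Hypothetical helper. pyiceberg is imported lazily so that
    `catalog_properties` can be used without it installed.
    """
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(catalog_name, **props)
    table = catalog.load_table(table_identifier)

    start = time.perf_counter()
    # plan_files() returns an iterable of scan tasks; materialize it so the
    # manifest reads are included in the measurement.
    tasks = list(table.scan().plan_files())
    return time.perf_counter() - start, len(tasks)
```

Calling `time_plan_files` once with `catalog_properties(base, use_fsspec=False)` and once with `catalog_properties(base, use_fsspec=True)` against the same table should give a rough side-by-side number. This measures wall-clock time only, so it will not by itself explain where the contention comes from.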
