[GitHub] [arrow] ericl opened a new issue #12501: [python] Parallel parquet metadata resolution?

GitBox Wed, 23 Feb 2022 17:08:58 -0800


ericl opened a new issue #12501:
URL: https://github.com/apache/arrow/issues/12501



   Hi all,
   
   We're trying to improve the metadata resolution performance of pyarrow's 
ParquetDataset in Ray (https://github.com/ray-project/ray/issues/21274). 
Metadata resolution is a bottleneck when directories in cloud storage contain 
thousands of files.
   
   We tried both multiple threads and multiple processes, and each approach seems to have issues:
   - With threads: there is a mutex preventing concurrent metadata resolution 
of arrow::dataset::ParquetFileFragments.
   - With processes: Serializing a ParquetFileFragment to send to another 
process triggers metadata resolution (again hitting the lock issue).
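
   For illustration, the thread-based attempt can be sketched roughly as 
below. This is a reconstruction under assumptions, not the code used in Ray: 
`resolve_one` and `resolve_all` are hypothetical helper names, and the only 
pyarrow call relied on is `ParquetFileFragment.ensure_complete_metadata()`.

```python
from concurrent.futures import ThreadPoolExecutor


def resolve_one(fragment):
    # Hypothetical helper. ensure_complete_metadata() asks the fragment to
    # load its Parquet footer if it has not been read yet.
    fragment.ensure_complete_metadata()
    return fragment


def resolve_all(fragments, max_workers=16):
    # Hypothetical helper: submit one resolution call per fragment to a
    # thread pool. map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(resolve_one, fragments))
```

   With pyarrow installed, `fragments` could come from 
`list(pyarrow.dataset.dataset(path, format="parquet").get_fragments())`. 
Because of the mutex described above, these calls may still effectively run 
one at a time.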
   
   Is there a recommended way of achieving parallelism here?


