jorisvandenbossche commented on issue #12501: URL: https://github.com/apache/arrow/issues/12501#issuecomment-1050954366
It might not be exactly what you need for Ray, but a related issue is that the actual "dataset discovery" (listing all files, etc.) is currently single-threaded, and that is something that might be possible to parallelize on Arrow's side: https://issues.apache.org/jira/browse/ARROW-8137

If we had an option to force loading the metadata during dataset discovery (instead of later, when it is first accessed), that could also speed up serialization of the fragments, since all metadata would already have been read at that point.
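
To illustrate the kind of parallel discovery meant here, below is a rough standard-library sketch that lists the directories of each level of a tree concurrently. It is not how Arrow implements discovery, and it does not touch Parquet metadata. The names `list_dir` and `discover_files` and the `max_workers` default are made up for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_dir(path):
    """List one directory and return (subdirectories, files) as full paths."""
    subdirs, files = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            elif entry.is_file():
                files.append(entry.path)
    return subdirs, files


def discover_files(root, max_workers=8):
    """Walk `root` level by level, listing each level's directories in parallel.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    found = []
    pending = [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            next_level = []
            # Each directory at the current depth is listed on a worker thread.
            for subdirs, files in pool.map(list_dir, pending):
                found.extend(files)
                next_level.extend(subdirs)
            pending = next_level
    # Sort so the result does not depend on directory listing order.
    return sorted(found)
```

Threads are a reasonable fit because listing is I/O-bound, which matters most on remote filesystems where every listing call is a network round trip. The same pattern could in principle be extended to read each file's metadata in the worker, which is the "load metadata during discovery" idea above.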
