westonpace commented on pull request #12323: URL: https://github.com/apache/arrow/pull/12323#issuecomment-1054726309
TL;DR: We can solve this, we probably want to solve this, but it will involve some C++ effort.

Sorry for the delay in looking at this. We certainly have some options here, and this is a great chance to put some scaffolding we've laid down to good use. The fact that the parquet reader works here is actually a fluke that we will someday fix (:laughing:) with [ARROW-14974](https://issues.apache.org/jira/browse/ARROW-14974).

There are two thread pools in Arrow: the CPU thread pool and the I/O thread pool. The CPU thread pool has one thread per core, and those threads are expected to do lots of heavy CPU work. The I/O thread pool may have few threads (e.g. if reading from a hard disk) or many threads (e.g. if reading from S3), and those threads are expected to spend most of their time waiting. CPU threads should generally not block for long periods of time, so when they have to do something slow (like read from disk) they put the task on the I/O thread pool and add a callback on the CPU thread pool to deal with the result.

When `use_threads` is false we typically interpret that as "don't use up a bunch of CPU for this task" and we limit the CPU thread pool. Ideally we limit it to the calling thread; in some cases (e.g. the execution engine) we limit it to one CPU thread plus the calling thread (though I'm working on that as we speak). What we don't usually do is limit the I/O thread pool in any way. We have the tooling to do this (basically the queue you mentioned), but we will need to do some work to wire everything up.

We can probably come up with a "limit all CPU and I/O tasks to the R thread" solution more easily than a "use the CPU thread pool for CPU tasks but limit all I/O tasks to the R thread" solution, but the latter should be possible too. It will also be easier to support the whole-table readers & writers initially and then later add support for the streaming APIs.

Also, this will have some performance impact when reading multiple files. For example, if you were to read a multi-file dataset over curl you would generally want to issue parallel HTTP reads, but if we're only allowed to use a single thread for the read then that won't work. We could probably address that particular performance impact if the underlying technology supports an asynchronous API (as it seems R's curl package does), so we could have three thread pools! (:dizzy:)

What's the timing on this? I'm a little busy at the moment, but I should be able to find some time this week to sketch a solution for the `read_feather` call (which could be adapted for `read_csv`, or I could sketch the solution for `read_csv` first).
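
To illustrate the CPU/I-O hand-off described above, here is a rough C++ sketch (not code from this PR). `BlockingRead()` and `Decode()` are hypothetical placeholders standing in for "slow I/O" and "CPU-heavy decode"; the plumbing uses the existing executor/future utilities:

```cpp
// Rough sketch of the two-pool hand-off, not code from this PR.
#include <iostream>
#include <memory>

#include <arrow/buffer.h>
#include <arrow/io/interfaces.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/util/future.h>
#include <arrow/util/thread_pool.h>

// Placeholder for a slow, blocking read (disk, S3, curl, ...).
arrow::Result<std::shared_ptr<arrow::Buffer>> BlockingRead() {
  return arrow::Buffer::FromString("bytes fetched slowly");
}

// Placeholder for CPU-heavy work done on those bytes (decompress/decode).
arrow::Result<std::shared_ptr<arrow::RecordBatch>> Decode(
    const std::shared_ptr<arrow::Buffer>& raw) {
  return arrow::Status::NotImplemented("decode is just a stand-in in this sketch");
}

arrow::Future<std::shared_ptr<arrow::RecordBatch>> ReadAndDecodeAsync() {
  arrow::internal::Executor* io_executor = arrow::io::default_io_context().executor();
  arrow::internal::Executor* cpu_executor = arrow::internal::GetCpuThreadPool();

  // 1. The blocking part runs on the I/O pool so CPU threads never sit
  //    around waiting on it.
  arrow::Future<std::shared_ptr<arrow::Buffer>> read_fut =
      arrow::DeferNotOk(io_executor->Submit(BlockingRead));

  // 2. Transfer the result back to the CPU pool and do the heavy work there.
  return cpu_executor->Transfer(std::move(read_fut))
      .Then([](const std::shared_ptr<arrow::Buffer>& raw) { return Decode(raw); });
}

int main() {
  // result() blocks the calling thread until both stages have finished.
  auto maybe_batch = ReadAndDecodeAsync().result();
  std::cout << maybe_batch.status().ToString() << std::endl;
  return 0;
}
```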

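As for the `use_threads` point: the knobs that exist today only shrink the pools; even at capacity 1 the work still runs on an Arrow-owned background thread rather than on the calling R thread, and pinning the work to the caller is the part that still needs wiring. A minimal sketch of those existing knobs (again, not the proposed fix):

```cpp
// The capacity knobs that exist today.  Shrinking a pool caps parallelism,
// but tasks still run on Arrow's own background threads, not on the
// calling (R) thread.
#include <arrow/io/interfaces.h>
#include <arrow/status.h>
#include <arrow/util/thread_pool.h>

arrow::Status LimitArrowThreadPools() {
  // Roughly what use_threads = FALSE maps to for CPU work today.
  ARROW_RETURN_NOT_OK(arrow::SetCpuThreadPoolCapacity(1));
  // The I/O pool is normally left alone; it can be capped too, but capping it
  // is not the same as running the I/O on the caller's thread.
  ARROW_RETURN_NOT_OK(arrow::io::SetIOThreadPoolCapacity(1));
  return arrow::Status::OK();
}
```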