westonpace commented on pull request #12323:
URL: https://github.com/apache/arrow/pull/12323#issuecomment-1054726309


   TL;DR: We can solve this, we probably want to solve this, but it will 
involve some C++ effort.
   
   Sorry for the delay in looking at this.  We certainly have some options 
here.  This is a great chance to start putting some scaffolding we've laid down 
to good use.  The fact that the parquet reader works here is actually a fluke 
that we will someday fix (:laughing:) with 
[ARROW-14974](https://issues.apache.org/jira/browse/ARROW-14974).
   
   There are two thread pools in Arrow: the CPU thread pool and the I/O thread 
pool.  The CPU thread pool has one thread per core and those threads are 
expected to do lots of heavy CPU work.  The I/O thread pool may have few 
threads (e.g. if reading from a hard disk) or many threads (e.g. if reading 
from S3), and those threads are expected to spend most of their time waiting.
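
   For context, both pools are global and can be inspected or resized from 
C++.  Here is a minimal sketch, assuming the helpers live where I remember 
them (arrow/util/thread_pool.h and arrow/io/interfaces.h); worth checking 
against the headers in your build:

```cpp
#include <iostream>

#include <arrow/io/interfaces.h>     // arrow::io::{Get,Set}IOThreadPoolCapacity
#include <arrow/status.h>
#include <arrow/util/thread_pool.h>  // arrow::{Get,Set}CpuThreadPoolCapacity

int main() {
  // CPU pool: defaults to one thread per core, intended for heavy compute.
  std::cout << "CPU threads: " << arrow::GetCpuThreadPoolCapacity() << "\n";

  // I/O pool: sized independently of the core count, intended for waiting.
  std::cout << "I/O threads: " << arrow::io::GetIOThreadPoolCapacity() << "\n";

  // Both capacities can be adjusted at runtime.
  arrow::Status st = arrow::SetCpuThreadPoolCapacity(2);
  if (st.ok()) st = arrow::io::SetIOThreadPoolCapacity(16);
  if (!st.ok()) std::cerr << st.ToString() << "\n";
  return 0;
}
```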
   
   CPU threads should generally not block for long periods of time.  So when 
they have to do something slow (like reading from disk), they hand the task 
off to the I/O thread pool and add a callback on the CPU thread pool to deal 
with the result.
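
   In rough terms the pattern looks like this (plain standard C++ for 
illustration only; Arrow's actual implementation goes through its own 
executor/future machinery, and SlowRead/Decode below are made-up placeholders):

```cpp
#include <future>
#include <iostream>
#include <string>
#include <vector>

// Placeholder for a blocking read (disk, S3, ...); this belongs on the I/O pool.
std::string SlowRead(const std::string& path) { return "raw-bytes-of-" + path; }

// Placeholder for heavy decoding work; this belongs on the CPU pool.
std::vector<int> Decode(const std::string& raw) {
  return std::vector<int>(raw.size(), 1);
}

int main() {
  // A CPU thread hands the blocking read to a helper thread (the "I/O pool")
  // instead of blocking on it itself...
  std::future<std::string> raw =
      std::async(std::launch::async, SlowRead, "part-0.feather");

  // ...and the decode step plays the role of the "callback on the CPU thread
  // pool" that consumes the result.  (In Arrow the continuation is scheduled
  // asynchronously rather than through a blocking get() like this.)
  std::vector<int> decoded = Decode(raw.get());
  std::cout << "decoded " << decoded.size() << " values\n";
  return 0;
}
```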
   
   When `use_threads` is false we typically interpret that as "don't use up a 
bunch of CPU for this task" and we limit the CPU thread pool.  Ideally we 
limit it to just the calling thread.  In some cases (e.g. the execution 
engine) we limit it to one CPU thread plus the calling thread (though I'm 
working on that as we speak).
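
   Concretely, on the C++ side that combination usually looks something like 
the sketch below (hedged: I'm going from memory on the exact header and option 
names, e.g. arrow/csv/options.h and ReadOptions::use_threads, so treat it as 
approximate):

```cpp
#include <arrow/csv/options.h>       // arrow::csv::ReadOptions
#include <arrow/status.h>
#include <arrow/util/thread_pool.h>  // arrow::SetCpuThreadPoolCapacity

void ConfigureSingleThreadedCsvRead() {
  // Per-reader switch: tell the CSV reader not to parallelize its own work.
  arrow::csv::ReadOptions read_options = arrow::csv::ReadOptions::Defaults();
  read_options.use_threads = false;

  // Global switch: shrink the CPU pool itself.  Note that neither of these
  // touches the I/O pool, which is exactly the gap discussed in this thread.
  arrow::Status st = arrow::SetCpuThreadPoolCapacity(1);
  if (!st.ok()) {
    // Report the error however is appropriate for the caller.
  }
}
```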
   
   What we don't usually do is limit the I/O thread pool in any way.  We have 
the tooling to do this (basically the queue that you mentioned) but we will 
need to do some work to wire everything up.  We can probably come up with a 
"limit all CPU and I/O tasks to the R thread" solution more easily than a "use 
the CPU thread pool for CPU tasks but limit all I/O tasks to the R thread" 
solution, but the latter should still be possible.  It will also be easier to 
support the whole-table readers & writers initially and then later add support 
for the streaming APIs.
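
   The "limit everything to the R thread" approach is essentially a task queue 
that the calling thread drains itself.  A toy illustration of the concept in 
plain C++ (this is not the actual Arrow tooling, just the shape of the idea):

```cpp
#include <deque>
#include <functional>
#include <iostream>

// Toy "calling-thread executor": work gets enqueued, and the single thread
// that owns the queue (the R thread, in the scenario above) drains it between
// its other duties.  No locking is needed because only one thread touches it.
class CallingThreadExecutor {
 public:
  void Submit(std::function<void()> task) { tasks_.push_back(std::move(task)); }

  // Run queued tasks until none are left.  Tasks may enqueue follow-up tasks
  // (e.g. "read a block" enqueues "decode the block"), which is how both CPU
  // work and I/O work can be funneled through the one calling thread.
  void RunUntilIdle() {
    while (!tasks_.empty()) {
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
  }

 private:
  std::deque<std::function<void()>> tasks_;
};

int main() {
  CallingThreadExecutor exec;
  exec.Submit([&exec] {
    std::cout << "pretend I/O: read a block\n";
    exec.Submit([] { std::cout << "pretend CPU: decode the block\n"; });
  });
  exec.RunUntilIdle();
  return 0;
}
```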
   
   Also, this will have some performance impact when reading multiple files.  
For example, if you were to read a multi-file dataset over curl you would 
generally want to issue parallel HTTP reads, but if we're only allowed to use 
a single thread for the read then that won't work.  We could probably address 
that particular performance impact if the underlying technology supports an 
asynchronous API (as R's curl package appears to), and then we could have 
three thread pools! (:dizzy:)
   
   What's the timing on this?  I'm a little busy at the moment but I should be 
able to find some time this week to sketch a solution for the `read_feather` 
call (which could then be adapted for `read_csv`), or I could sketch the 
solution for `read_csv` first.

