paleolimbot commented on pull request #12323:
URL: https://github.com/apache/arrow/pull/12323#issuecomment-1055461092


   Thanks!
   
   > What's the timing on this?
   
   There is no particular rush on `read_feather()` and `read_csv_arrow()` 
working with R connections. It doesn't have to be solved for this PR, either, 
although if this PR is merged it would be best to fix that before the next CRAN 
release.
   
   > We can probably come up with a "limit all CPU and I/O tasks to the R 
thread" solution more easily than a "use the CPU thread pool for CPU tasks but 
limit all I/O tasks to the R thread" but the latter should probably be possible.
   
   I'm still wrapping my head around the specifics here, but because they might 
be related I'll list the "calling the R thread" possibilities I've run into 
recently in case any of them makes one of those options more obvious to pursue.
   
   - This PR, when a user wants to use some Arrow machinery but needs to 
implement the `InputStream` or `OutputStream` as R functions because for 
whatever reason the filesystem/input stream type isn't implemented in Arrow C++ 
or the R bindings
   - A user has a `RecordBatchReader` whose get-next-batch method is an R 
function. I haven't had time to look into it properly, but this crashes every 
time I've tried to put it into the query engine (it works for `read_table()`, 
though). Possibly related is a `RecordBatchReader.from_batches()` that was 
imported from Python via the C interface, which also crashes when put into the 
query engine (but not in `read_table()`).
   - An extension type implemented in R that has a custom `ExtensionEquals()` 
method (just starting this in #12467).
   - A compute function that wraps an R function (e.g., for things like 
geospatial operators whose external dependencies are impractical or impossible 
to include in the arrow R package)
   
   From the R end, I know there is a way to request the evaluation of something 
on the main thread from elsewhere; however, there needs to be an event loop on 
the main thread checking for tasks for that to work. I don't know much about it 
but I do know it has been used elsewhere for packages like Shiny and plumber 
that accept HTTP requests and funnel them to R functions.
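   
   To make the event-loop requirement concrete, here is a minimal sketch of the 
pattern in Python (not R, and not what Shiny or plumber actually do). 
`MainThreadExecutor`, `submit()`, `run_pending()` and `read_chunk()` are all 
hypothetical names made up for illustration: a worker thread queues a call, and 
the call only runs when the main thread's loop picks it up.

```python
import queue
import threading
from concurrent.futures import Future


class MainThreadExecutor:
    """Hypothetical helper: queued calls run on whichever thread calls run_pending()."""

    def __init__(self):
        self._tasks = queue.Queue()

    def submit(self, fn, *args):
        # Safe to call from any thread; returns a Future the caller can wait on.
        fut = Future()
        self._tasks.put((fut, fn, args))
        return fut

    def run_pending(self, timeout=0.1):
        # Runs at most one queued task on the calling thread.
        try:
            fut, fn, args = self._tasks.get(timeout=timeout)
        except queue.Empty:
            return False
        try:
            fut.set_result(fn(*args))
        except Exception as exc:
            fut.set_exception(exc)
        return True


def read_chunk(nbytes):
    # Stand-in for an R function that must only be evaluated on the main thread.
    return threading.get_ident(), b"x" * nbytes


executor = MainThreadExecutor()
results = []


def worker():
    # Stand-in for an I/O or CPU thread that needs something from the main thread.
    fut = executor.submit(read_chunk, 4)
    results.append(fut.result())  # blocks until the main thread services the task


t = threading.Thread(target=worker)
t.start()
while t.is_alive():  # the "event loop" on the main thread
    executor.run_pending()
t.join()

ran_on, data = results[0]
```

   If the main thread never entered the `while` loop (for example, because it 
was itself blocked waiting on the worker), `fut.result()` would never return, 
which is the deadlock risk behind the event-loop requirement.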
   
   > Although, we could probably address that particular performance impact if 
the underlying technology has support for an asynchronous API (as it seems that 
R's curl package does)
   
   In my mind, supporting R connections is more about providing a (possibly 
slow) workaround for things that Arrow C++ or the R bindings can't do yet 
(e.g., URLs). I do know that the async API for curl from the R end is along the 
lines of `open_async(url, function(chunk, is_last_chunk))`. R connections are a 
pain and if there are more use-cases along these lines it might be worth 
investing in some C struct definitions where it's clear that callable members 
must be thread safe.
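   
   As a rough illustration of why thread safety matters for such callable 
members, the sketch below (Python, with entirely hypothetical names) adapts a 
push-style `(chunk, is_last_chunk)` callback into the pull-style `read(nbytes)` 
that an input stream needs. `fake_open_async()` is only a stand-in that delivers 
chunks from another thread; it is not curl's API, whose exact signature is only 
approximated above.

```python
import threading


class ChunkBuffer:
    """Hypothetical adapter: chunks are pushed by a callback and pulled by read()."""

    def __init__(self):
        self._cond = threading.Condition()
        self._data = bytearray()
        self._done = False

    def on_chunk(self, chunk, is_last_chunk):
        # May be called from a different thread than read(), hence the lock.
        with self._cond:
            self._data.extend(chunk)
            self._done = self._done or is_last_chunk
            self._cond.notify_all()

    def read(self, nbytes):
        # Blocks until nbytes are available or the producer has finished.
        with self._cond:
            while len(self._data) < nbytes and not self._done:
                self._cond.wait()
            out = bytes(self._data[:nbytes])
            del self._data[:nbytes]
            return out


def fake_open_async(callback):
    # Stand-in for an asynchronous download: delivers chunks from another thread.
    chunks = [b"hello ", b"arrow ", b"world"]

    def produce():
        for i, chunk in enumerate(chunks):
            callback(chunk, i == len(chunks) - 1)

    producer_thread = threading.Thread(target=produce)
    producer_thread.start()
    return producer_thread


buf = ChunkBuffer()
producer = fake_open_async(buf.on_chunk)
first = buf.read(5)    # waits until at least 5 bytes have arrived
rest = buf.read(100)   # waits for the last chunk, then returns what is left
producer.join()
```

   The condition variable is what makes `on_chunk()` and `read()` safe to call 
from different threads; a C struct of callbacks would need to document the same 
guarantee for each member.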


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
