paleolimbot opened a new pull request #12323:
URL: https://github.com/apache/arrow/pull/12323


   This PR adds support for arbitrary R "connection" objects as input and 
output streams. In particular, it adds support for sockets, URLs, and other 
IO operations that are implemented as R connections (e.g., in the 
[archive](https://github.com/r-lib/archive#archive) package). The gist of it is 
that you should be able to do this:
   
   ``` r
   # remotes::install_github("paleolimbot/arrow/r@r-connections")
   library(arrow, warn.conflicts = FALSE)
   
   addr <- "https://github.com/apache/arrow/raw/master/r/inst/v0.7.1.parquet"
   
   stream <- arrow:::make_readable_file(addr)
   rawToChar(as.raw(stream$Read(4)))
   #> [1] "PAR1"
   stream$close()
   
   stream <- arrow:::make_readable_file(url(addr, open = "rb"))
   rawToChar(as.raw(stream$Read(4)))
   #> [1] "PAR1"
   stream$close()
   ```
   
   There are two serious issues that prevent this PR from being useful. First, 
it calls C-level functions that R considers "non-API" entry points, which 
`R CMD check` flags with a NOTE:
   
       > checking compiled code ... NOTE
         File ‘arrow/libs/arrow.so’:
           Found non-API calls to R: ‘R_GetConnection’, ‘R_ReadConnection’,
             ‘R_WriteConnection’
         
         Compiled code should not call non-API entry points in R.
   
   We can get around this by calling back into R (in the same way this PR 
implements `Tell()` and `Close()`). We could also go all out and implement the 
other half (exposing `InputStream`/`OutputStream`s as R connections) and ask 
for an exemption (at least one R package, curl, does this). The archive package 
seems to expose connections without a NOTE on the CRAN check page, so perhaps 
there is also a workaround.
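
A minimal sketch of what "calling back into R" could look like for the read and write paths, assuming the C++ stream methods invoke plain R functions instead of `R_ReadConnection()`/`R_WriteConnection()`. The helper names `connection_read()`, `connection_write()`, and `connection_tell()` are hypothetical and are not part of arrow or of this PR. They use only base R connection functions:

```r
# Hypothetical helpers (not part of arrow): R-level functions that a C++
# InputStream/OutputStream could call instead of the non-API C entry points.

connection_read <- function(con, n_bytes) {
  # readBin() returns at most n_bytes raw values; fewer (possibly zero)
  # when the connection reaches the end of its input
  readBin(con, what = raw(), n = n_bytes)
}

connection_write <- function(con, bytes) {
  # writeBin() writes a raw vector to the connection as-is
  writeBin(bytes, con)
  invisible(length(bytes))
}

connection_tell <- function(con) {
  # seek() without `where` reports the current position without moving it
  seek(con)
}

# Usage with a temporary file standing in for a URL or socket connection
path <- tempfile()
out <- file(path, open = "wb")
connection_write(out, charToRaw("PAR1 and the rest of the file"))
close(out)

inp <- file(path, open = "rb")
magic <- connection_read(inp, 4L)
position <- connection_tell(inp)
close(inp)
unlink(path)

rawToChar(magic)
```

The trade-off is one R function call per `Read()`/`Write()`, which is slower than the C entry points, and each call would still have to run on the R main thread.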
   
   Second, we get a crash when passing the input stream to most functions. I 
think this is because the `Read()` method is getting called from another thread 
but it also could be an error in my implementation. If the issue is threading, 
we would have to arrange a way to queue jobs for the R main thread (e.g., how 
the [later](https://github.com/r-lib/later#background-tasks) package does it) 
and a way to ping it occasionally to fetch the results. This is complicated but 
might be useful for other reasons (supporting evaluation of R functions in more 
places). It also might be more work than it's worth.
   
   ``` r
   # remotes::install_github("paleolimbot/arrow/r@r-connections")
   library(arrow, warn.conflicts = FALSE)
   
   addr <- "https://github.com/apache/arrow/raw/master/r/inst/v0.7.1.parquet"
   read_parquet(addr)
   ```
   
   ```
   *** caught segfault ***
   address 0x28, cause 'invalid permissions'
   
   Traceback:
    1: parquet___arrow___FileReader__OpenFile(file, props)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


