paleolimbot opened a new pull request #12323: URL: https://github.com/apache/arrow/pull/12323
This is a PR to support arbitrary R "connection" objects as Input and Output streams. In particular, this adds support for sockets (), URLs, and some other IO operations that are implemented as R connections (e.g., in the [archive](https://github.com/r-lib/archive#archive) package). The gist of it is that you should be able to do this:

``` r
# remotes::install_github("paleolimbot/arrow/r@r-connections")
library(arrow, warn.conflicts = FALSE)

addr <- "https://github.com/apache/arrow/raw/master/r/inst/v0.7.1.parquet"

stream <- arrow:::make_readable_file(addr)
rawToChar(as.raw(stream$Read(4)))
#> [1] "PAR1"
stream$close()

stream <- arrow:::make_readable_file(url(addr, open = "rb"))
rawToChar(as.raw(stream$Read(4)))
#> [1] "PAR1"
stream$close()
```

There are two serious issues that prevent this PR from being useful.

First, it uses functions that R considers "non-API" functions from the C API.

> checking compiled code ... NOTE
> File ‘arrow/libs/arrow.so’:
> Found non-API calls to R: ‘R_GetConnection’, ‘R_ReadConnection’, ‘R_WriteConnection’
>
> Compiled code should not call non-API entry points in R.

We can get around this by calling back into R (in the same way this PR implements `Tell()` and `Close()`). We could also go all out and implement the other half (exposing `InputStream`/`OutputStream`s as R connections) and ask for an exemption (at least one R package, curl, does this). The archive package seems to expose connections without a NOTE on the CRAN check page, so perhaps there is also a workaround.

Second, we get a crash when passing the input stream to most functions. I think this is because the `Read()` method is getting called from another thread but it also could be an error in my implementation. If the issue is threading, we would have to arrange a way to queue jobs for the R main thread (e.g., how the [later](https://github.com/r-lib/later#background-tasks) package does it) and a way to ping it occasionally to fetch the results.
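
To make the queueing idea concrete, here is a minimal C++ sketch of the pattern: a worker thread posts a job, blocks on a future, and the main thread periodically drains the queue and runs the job itself. Everything in it is hypothetical and only for illustration. `MainThreadQueue`, `ReadResult`, and `ReadViaMainThread` are not part of Arrow, R, or the later package, and a `std::string` stands in for the R connection that would really be read.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical helper: jobs are posted from any thread but only ever
// executed by the thread that calls RunPending() (the "main" thread).
class MainThreadQueue {
 public:
  // Safe to call from any thread. Returns a future for the job's result.
  std::future<std::string> Post(std::function<std::string()> job) {
    std::packaged_task<std::string()> task(std::move(job));
    std::future<std::string> result = task.get_future();
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push(std::move(task));
    return result;
  }

  // Call from the main thread only. Runs every queued job on the calling
  // thread and returns how many jobs were run.
  int RunPending() {
    int n_run = 0;
    for (;;) {
      std::packaged_task<std::string()> task;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) break;
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // executed outside the lock, on the calling thread
      ++n_run;
    }
    return n_run;
  }

 private:
  std::mutex mutex_;
  std::queue<std::packaged_task<std::string()>> tasks_;
};

struct ReadResult {
  std::string bytes;  // what the worker received
  bool ran_on_main;   // whether the read itself executed on the main thread
};

// Simulates a Read() requested from a worker thread. The string `data`
// stands in for the R connection (an assumption made for this sketch).
ReadResult ReadViaMainThread(const std::string& data, std::size_t n_bytes) {
  MainThreadQueue queue;
  const std::thread::id main_id = std::this_thread::get_id();
  std::atomic<bool> done{false};
  ReadResult result{"", false};

  std::thread worker([&] {
    std::future<std::string> fut = queue.Post([&] {
      result.ran_on_main = (std::this_thread::get_id() == main_id);
      return data.substr(0, n_bytes);
    });
    result.bytes = fut.get();  // blocks until the main thread runs the job
    done = true;
  });

  // The main thread "pings" the queue until the worker has its answer.
  while (!done) {
    queue.RunPending();
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  worker.join();
  return result;
}
```

With this sketch, `ReadViaMainThread("PAR1....", 4)` hands the worker the first four bytes while the read itself runs on the calling thread. In the real package the polling loop would have to be driven by something on the R side, and that is the part the sketch does not address.
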
This is complicated but might be useful for other reasons (supporting evaluation of R functions in more places). It also might be more work than it's worth.

``` r
# remotes::install_github("paleolimbot/arrow/r@r-connections")
library(arrow, warn.conflicts = FALSE)

addr <- "https://github.com/apache/arrow/raw/master/r/inst/v0.7.1.parquet"
read_parquet(addr)
```

```
 *** caught segfault ***
address 0x28, cause 'invalid permissions'

Traceback:
 1: parquet___arrow___FileReader__OpenFile(file, props)
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
