amoeba opened a new issue, #40938:
URL: https://github.com/apache/arrow/issues/40938
### Describe the enhancement requested
For many common use cases, the bindings the R package registers (`?acero`)
allow users to write dplyr pipelines without pulling the data into R. When they
run into cases where an operation isn't supported, they get an error that at
least notifies them they've hit the edge:
> Error: Expression $foo not supported in Arrow
> Call collect() first to pull data into R.
However, there are some cases where this edge is confusing, which can make it
hard for users to understand how to navigate it.
<details>
<summary>Setup</summary>
```r
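dat <- data.frame(a = rep("A", 10))
write.csv(dat, "data.csv", row.names = FALSE)
# mtcars.csv is read by the add_one example further down
write.csv(mtcars, "mtcars.csv", row.names = FALSE)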
library(dplyr)
library(arrow)
```
</details>
For example, if I try to use a function I wrote in a pipeline, I get our
usual error:
```r
udf <- function(x) gsub("A", "B", x)
open_dataset("data.csv", format = "csv") |>
  mutate(b = udf(a)) |>
  collect()
```
> Error: Expression udf(a) not supported in Arrow
> Call collect() first to pull data into R.
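Following the error's advice does work, at the cost of materializing the whole dataset in R before the `mutate()` runs. A minimal sketch, reusing the same toy data and unregistered `udf` (the `result` name is only for illustration):

```r
library(dplyr)
library(arrow)

# Same toy data and unregistered function as above.
write.csv(data.frame(a = rep("A", 10)), "data.csv", row.names = FALSE)
udf <- function(x) gsub("A", "B", x)

# collect() materializes the dataset as a data frame in R first, so the
# mutate() below is evaluated by dplyr in R, not by Acero.
result <- open_dataset("data.csv", format = "csv") |>
  collect() |>
  mutate(b = udf(a))
```

This is fine for small data, but it gives up the benefit of not pulling the data into R.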
If I happen to know about `register_scalar_function`, I can modify my UDF
slightly and make it work:
```r
udf <- function(context, x) gsub("A", "B", x)
register_scalar_function("udf", udf, string(), string(), auto_convert = TRUE)
open_dataset("data.csv", format = "csv") |>
  mutate(b = udf(a)) |>
  collect()
```
This succeeds and would lead the user to think it's not pulling the data
into R at any point.
What's confusing is that while the above snippet with the unregistered UDF
fails, this isn't true for all UDFs. This works:
```r
add_one <- function(x) { x + 1 }
open_dataset("mtcars.csv", format = "csv") |>
  mutate(gear_one = add_one(gear)) |>
  collect()
```
So the core question here is: why does the UDF that calls `+` work while the
one that calls `gsub` doesn't? They both have bindings. I think the answer(s)
might be an opportunity to improve both the function of this part of the
package and its documentation.
From a discussion on rOpenSci Slack, we may have some related issues worth
linking here:
- https://github.com/apache/arrow/issues/29667
- https://github.com/apache/arrow/issues/20372
### Component(s)
R