amoeba opened a new issue, #40938:
URL: https://github.com/apache/arrow/issues/40938

   ### Describe the enhancement requested
   
   For many common use cases, the bindings the R package registers (`?acero`) 
allow users to write dplyr pipelines without pulling the data into R. When they 
run into cases where an operation isn't supported, they get an error that at 
least notifies them they've hit the edge:
   
   > Error: Expression $foo not supported in Arrow
   > Call collect() first to pull data into R.
   
   However, there are some cases where this edge is confusing, which can make 
it hard for users to understand how to navigate it.
   
   <details>
   <summary>Setup</summary>
   
   ```r
   dat <- data.frame(a = rep("A", 10))
   write.csv(dat, "data.csv", row.names = FALSE)
   
   library(dplyr)
   library(arrow)
   ```
   </details>
   
   For example, if I try to use a function I wrote in a pipeline, I get our 
usual error:
   
   ```r
   udf <- function(x) gsub("A", "B", x)
   open_dataset("data.csv", format = "csv") |>
     mutate(b = udf(a)) |>
     collect()
   ```
   
   > Error: Expression udf(a) not supported in Arrow
   > Call collect() first to pull data into R.
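   
   A minimal sketch of the route the error message suggests, assuming the 
`data.csv` file from the setup above: calling `collect()` before `mutate()` 
materializes the data as a data frame, so the plain R function then runs on 
ordinary R vectors. The name `res` is only for illustration.
   
   ```r
   library(dplyr)
   library(arrow)
   
   # Recreate the setup file so the example is self-contained
   dat <- data.frame(a = rep("A", 10))
   write.csv(dat, "data.csv", row.names = FALSE)
   
   udf <- function(x) gsub("A", "B", x)
   
   res <- open_dataset("data.csv", format = "csv") |>
     collect() |>           # pull the data into R first
     mutate(b = udf(a))     # udf() now sees a plain character vector
   ```
   
   This loads the whole table into memory, so it is only practical when the 
data fits in R.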
   
   If I happen to know about `register_scalar_function`, I can modify my UDF 
slightly and make it work:
   ```r
   udf <- function(context, x) gsub("A", "B", x)
   register_scalar_function("udf", udf, string(), string(), auto_convert = TRUE)
   open_dataset("data.csv", format = "csv") |>
     mutate(b = udf(a)) |>
     collect()
   ```
   
   This succeeds, which could lead the user to think the data is never pulled 
into R at any point.
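   
   One way to check where the work happens is to record calls from inside the 
registered function. This is only a sketch: it assumes side effects on an R 
environment are visible from the registered function, and `udf_counted` and 
`calls` are hypothetical names, not part of arrow.
   
   ```r
   library(dplyr)
   library(arrow)
   
   # Recreate the setup file so the example is self-contained
   dat <- data.frame(a = rep("A", 10))
   write.csv(dat, "data.csv", row.names = FALSE)
   
   # Environment used to record what the R function sees (hypothetical helper)
   calls <- new.env()
   calls$n <- 0L
   calls$class <- NA_character_
   
   udf_counted <- function(context, x) {
     calls$n <- calls$n + 1L        # count each invocation from Arrow
     calls$class <- class(x)[1]     # record the class of the input
     gsub("A", "B", x)
   }
   register_scalar_function("udf_counted", udf_counted, string(), string(), auto_convert = TRUE)
   
   res <- open_dataset("data.csv", format = "csv") |>
     mutate(b = udf_counted(a)) |>
     collect()
   
   calls$n      # number of times the R function body ran
   calls$class  # class of the input the R function received
   ```
   
   A non-zero count means the R function body ran in the R session on data 
handed over from Arrow; how many times it runs depends on how the data is 
batched.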
   
   What's confusing is that while the above snippet with the unregistered UDF 
fails, this isn't true for all UDFs. This works:
   
   ```r
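   # mtcars.csv is not created in the setup; write it from the built-in data
   write.csv(mtcars, "mtcars.csv", row.names = FALSE)
   add_one <- function(x) { x + 1 }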
   open_dataset("mtcars.csv", format = "csv") |>
     mutate(gear_one = add_one(gear)) |>
     collect()
   ```
   
   So the core question here is: why does the UDF that calls `+` work while one 
that calls `gsub` doesn't? They both have bindings. I think the answer(s) might 
be an opportunity to improve both the function of this part of the package and 
its documentation.
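   
   One possible reading, which this issue does not confirm, is that operators 
and ordinary functions behave differently when handed a non-vector object. The 
sketch below uses only base R and a hypothetical `lazy_col` class (a stand-in, 
not an arrow class) to show the general mechanism: `+` dispatches on the class 
of its arguments, while `gsub` coerces its input and never sees the class.
   
   ```r
   # Hypothetical stand-in for a lazy column reference; NOT an arrow class
   lazy_col <- function(name) structure(list(repr = name), class = "lazy_col")
   
   # Arithmetic operators dispatch through the S3 group generic "Ops"
   Ops.lazy_col <- function(e1, e2) {
     lhs <- if (inherits(e1, "lazy_col")) e1$repr else deparse(e1)
     rhs <- if (inherits(e2, "lazy_col")) e2$repr else deparse(e2)
     lazy_col(paste0("(", lhs, " ", .Generic, " ", rhs, ")"))
   }
   
   add_one <- function(x) { x + 1 }
   udf <- function(x) gsub("A", "B", x)
   
   out_add <- add_one(lazy_col("gear"))
   out_add$repr                    # "(gear + 1)": still a lazy expression
   
   out_udf <- udf(lazy_col("a"))
   inherits(out_udf, "lazy_col")   # FALSE: gsub() coerced its input
   ```
   
   Whether this is what actually happens inside the package is an assumption; 
the sketch only shows why two user functions that look alike can take 
different paths.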
   
   From a discussion on rOpenSci Slack, we may have some related issues worth 
linking here:
   
   - https://github.com/apache/arrow/issues/29667
   - https://github.com/apache/arrow/issues/20372
   
   ### Component(s)
   
   R

