Sam Albers created ARROW-11925:
----------------------------------

             Summary: Add `between` method for arrow_dplyr_query
                 Key: ARROW-11925
                 URL: https://issues.apache.org/jira/browse/ARROW-11925
             Project: Apache Arrow
          Issue Type: New Feature
          Components: R
            Reporter: Sam Albers

Would you consider a PR to add a `between` method for `arrow_dplyr_query` objects? Even an implementation written directly in R would still benefit from Arrow's speed. Here is what I am thinking.

Typical usage of `between`:
{code:java}
library(dplyr)
library(arrow)

iris %>%
  filter(between(Petal.Length, 1, 1.1))
{code}

Here is a mocked-up version of the method:
{code:java}
between_mock <- function(x, left, right) {
  # Both bounds must be scalars
  if (length(left) != 1) {
    rlang::abort("`left` must be length 1")
  }
  if (length(right) != 1) {
    rlang::abort("`right` must be length 1")
  }
  # Inclusive range check built only from comparison operators
  x >= left & x <= right
}
{code}

I think that because `dplyr` implements `between` in C++ for efficiency, it doesn't work out of the box on an Arrow Dataset:
{code:java}
open_dataset("nyc-taxi", partitioning = "year") %>%
  filter(year == 2014) %>%
  select(year, fare_amount) %>%
  filter(between(fare_amount, 10, 11)) %>%
  collect()

Error: Filter expression not supported for Arrow Datasets: between(fare_amount, 10, 11)
Call collect() first to pull data into R.
In addition: Warning message:
between() called on numeric vector with S3 class
Backtrace:
    x
 1. +-[ `%>%`(...) ]
 2. +-[ dplyr::collect(...) ]
 3. +-[ dplyr::filter(...) ]
 4. \-arrow:::filter.arrow_dplyr_query(...)
{code}

But even my simple implementation works fine:
{code:java}
open_dataset("nyc-taxi", partitioning = "year") %>%
  filter(year == 2014) %>%
  select(year, fare_amount) %>%
  filter(between_mock(fare_amount, 10, 11)) %>%
  collect()
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)