jonkeane commented on a change in pull request #11894:
URL: https://github.com/apache/arrow/pull/11894#discussion_r779925693
##########
File path: r/R/dataset-scan.R
##########
@@ -185,17 +185,36 @@ ScanTask <- R6Class("ScanTask",
#' `data.frame`? Default `TRUE`
#' @export
map_batches <- function(X, FUN, ..., .data.frame = TRUE) {
- if (.data.frame) {
- lapply <- map_dfr
- }
- scanner <- Scanner$create(ensure_group_vars(X))
+ # TODO(ARROW-15271): possibly refactor do_exec_plan to return a
RecordBatchReader
Review comment:
```suggestion
# TODO: ARROW-15271 possibly refactor do_exec_plan to return a
RecordBatchReader
```
Super minor, but most of our TODOs have the semicolon and then the issue
number (though there are others that do it with a space too)
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -453,15 +453,38 @@ test_that("Creating UnionDataset", {
})
test_that("map_batches", {
- skip("map_batches() is broken (ARROW-14029)")
ds <- open_dataset(dataset_dir, partitioning = "part")
+
+ # summarize returns arrow_dplyr_query, which gets collected into a tibble
expect_equal(
ds %>%
filter(int > 5) %>%
select(int, lgl) %>%
- map_batches(~ summarize(., min_int = min(int))),
+ map_batches(~ summarize(., min_int = min(int))) %>%
+ arrange(min_int),
tibble(min_int = c(6L, 101L))
)
+
+ # $num_rows returns integer vector
+ expect_equal(
+ ds %>%
+ filter(int > 5) %>%
+ select(int, lgl) %>%
+ map_batches(~ .$num_rows, .data.frame = FALSE) %>%
+ unlist() %>% # Returns list because .data.frame is FALSE
Review comment:
Thanks for this fantastic _extra_ clarification!
##########
File path: r/vignettes/dataset.Rmd
##########
@@ -290,6 +290,64 @@ rows match the filter. Relatedly, since Parquet files
contain row groups with
statistics on the data within, there may be entire chunks of data you can
avoid scanning because they have no rows where `total_amount > 100`.
+### Processing data in batches
Review comment:
I like it a lot. And I think it totally belongs here in a vignette
(especially in the tone you have here). But it wouldn't be bad to make an issue
to add to the cookbook as well (though don't feel obligated to do that right
now if you don't want to!).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]