jonkeane commented on a change in pull request #11894:
URL: https://github.com/apache/arrow/pull/11894#discussion_r779659047
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -453,15 +453,38 @@ test_that("Creating UnionDataset", {
})
test_that("map_batches", {
- skip("map_batches() is broken (ARROW-14029)")
ds <- open_dataset(dataset_dir, partitioning = "part")
+
+ # summarize returns arrow_dplyr_query
expect_equal(
ds %>%
filter(int > 5) %>%
select(int, lgl) %>%
- map_batches(~ summarize(., min_int = min(int))),
+ map_batches(~ summarize(., min_int = min(int))) %>%
+ arrange(min_int),
tibble(min_int = c(6L, 101L))
)
+
+ # $num_rows returns integer vector
+ expect_equal(
+ ds %>%
+ filter(int > 5) %>%
+ select(int, lgl) %>%
+ map_batches(~ .$num_rows, .data.frame = FALSE) %>%
+ as.numeric() %>%
+ sort(),
+ c(5, 10)
Review comment:
```suggestion
map_batches(~ .$num_rows, .data.frame = FALSE) %>%
sort(),
c(5L, 10L)
```
Does this work? `as.numeric()` in there is a little suspicious — what issues
do you have if you take it out?
##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -453,15 +453,38 @@ test_that("Creating UnionDataset", {
})
test_that("map_batches", {
- skip("map_batches() is broken (ARROW-14029)")
ds <- open_dataset(dataset_dir, partitioning = "part")
+
+ # summarize returns arrow_dplyr_query
expect_equal(
ds %>%
filter(int > 5) %>%
select(int, lgl) %>%
- map_batches(~ summarize(., min_int = min(int))),
+ map_batches(~ summarize(., min_int = min(int))) %>%
+ arrange(min_int),
tibble(min_int = c(6L, 101L))
)
+
+ # $num_rows returns integer vector
+ expect_equal(
+ ds %>%
+ filter(int > 5) %>%
+ select(int, lgl) %>%
+ map_batches(~ .$num_rows, .data.frame = FALSE) %>%
+ as.numeric() %>%
+ sort(),
+ c(5, 10)
+ )
+
+ # $Take returns RecordBatch
Review comment:
Is this comment accurate here? It looks like it's returning a tibble? Or
is that a side effect of `arrange()`?
##########
File path: r/R/dataset-scan.R
##########
@@ -174,8 +174,6 @@ ScanTask <- R6Class("ScanTask",
#' a `data.frame` for further aggregation, even if you couldn't fit the whole
#' `Dataset` result in memory.
#'
-#' This is experimental and not recommended for production use.
Review comment:
We might want to still keep the experimental label here — it's working
now, but as you mention, we might refactor it / have it possibly have different
behavior in the future.
##########
File path: r/R/dataset-scan.R
##########
@@ -185,17 +183,36 @@ ScanTask <- R6Class("ScanTask",
#' `data.frame`? Default `TRUE`
#' @export
map_batches <- function(X, FUN, ..., .data.frame = TRUE) {
- if (.data.frame) {
- lapply <- map_dfr
- }
- scanner <- Scanner$create(ensure_group_vars(X))
+ # TODO: possibly refactor do_exec_plan to return a RecordBatchReader
Review comment:
Would you mind making the Jira for this + put the number here? I don't
know of one off the top of my head but we should get one if we think we'll
(possibly) want to move in that direction
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]