[GitHub] [arrow] jonkeane commented on a change in pull request #11894: ARROW-14029: [R] Repair map_batches()

GitBox Thu, 06 Jan 2022 08:17:16 -0800


jonkeane commented on a change in pull request #11894:
URL: https://github.com/apache/arrow/pull/11894#discussion_r779659047




##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -453,15 +453,38 @@ test_that("Creating UnionDataset", {
 })
 
 test_that("map_batches", {
-  skip("map_batches() is broken (ARROW-14029)")
   ds <- open_dataset(dataset_dir, partitioning = "part")
+
+  # summarize returns arrow_dplyr_query
   expect_equal(
     ds %>%
       filter(int > 5) %>%
       select(int, lgl) %>%
-      map_batches(~ summarize(., min_int = min(int))),
+      map_batches(~ summarize(., min_int = min(int))) %>%
+      arrange(min_int),
     tibble(min_int = c(6L, 101L))
   )
+
+  # $num_rows returns integer vector
+  expect_equal(
+    ds %>%
+      filter(int > 5) %>%
+      select(int, lgl) %>%
+      map_batches(~ .$num_rows, .data.frame = FALSE) %>%
+      as.numeric() %>%
+      sort(),
+    c(5, 10)

Review comment:
       ```suggestion
         map_batches(~ .$num_rows, .data.frame = FALSE) %>%
         sort(),
       c(5L, 10L)
   ```
   
   Does this work? `as.numeric()` in there is a little suspicious — what issues 
do you have if you take it out?

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -453,15 +453,38 @@ test_that("Creating UnionDataset", {
 })
 
 test_that("map_batches", {
-  skip("map_batches() is broken (ARROW-14029)")
   ds <- open_dataset(dataset_dir, partitioning = "part")
+
+  # summarize returns arrow_dplyr_query
   expect_equal(
     ds %>%
       filter(int > 5) %>%
       select(int, lgl) %>%
-      map_batches(~ summarize(., min_int = min(int))),
+      map_batches(~ summarize(., min_int = min(int))) %>%
+      arrange(min_int),
     tibble(min_int = c(6L, 101L))
   )
+
+  # $num_rows returns integer vector
+  expect_equal(
+    ds %>%
+      filter(int > 5) %>%
+      select(int, lgl) %>%
+      map_batches(~ .$num_rows, .data.frame = FALSE) %>%
+      as.numeric() %>%
+      sort(),
+    c(5, 10)
+  )
+
+  # $Take returns RecordBatch

Review comment:
       Is this comment accurate here? It looks like it's returning a tibble? Or 
is that a side effect of `arrange()`?

##########
File path: r/R/dataset-scan.R
##########
@@ -174,8 +174,6 @@ ScanTask <- R6Class("ScanTask",
 #' a `data.frame` for further aggregation, even if you couldn't fit the whole
 #' `Dataset` result in memory.
 #'
-#' This is experimental and not recommended for production use.

Review comment:
       We might want to still keep the experimental label here — it's working 
now, but as you mention, we might refactor it / have it possibly have different 
behavior in the future.

##########
File path: r/R/dataset-scan.R
##########
@@ -185,17 +183,36 @@ ScanTask <- R6Class("ScanTask",
 #' `data.frame`? Default `TRUE`
 #' @export
 map_batches <- function(X, FUN, ..., .data.frame = TRUE) {
-  if (.data.frame) {
-    lapply <- map_dfr
-  }
-  scanner <- Scanner$create(ensure_group_vars(X))
+  # TODO: possibly refactor do_exec_plan to return a RecordBatchReader

Review comment:
       Would you mind making the Jira for this + put the number here? I don't 
know of one off the top of my head but we should get one if we think we'll 
(possibly) want to move in that direction




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] jonkeane commented on a change in pull request #11894: ARROW-14029: [R] Repair map_batches()

Reply via email to