[GitHub] [arrow] nealrichardson commented on a change in pull request #10269: ARROW-11705: [R] Support scalar value recycling in RecordBatch/Table$create()

GitBox Fri, 14 May 2021 07:43:08 -0700


nealrichardson commented on a change in pull request #10269:
URL: https://github.com/apache/arrow/pull/10269#discussion_r632571798




##########
File path: r/R/record-batch.R
##########
@@ -161,7 +161,18 @@ RecordBatch$create <- function(..., schema = NULL) {
     out <- RecordBatch__from_arrays(schema, arrays)
     return(dplyr::group_by(out, !!!dplyr::groups(arrays[[1]])))
   }
-  
+
+  # If any arrays are length 1, recycle them  
+  arr_lens <- map_int(arrays, length)

Review comment:
       FWIW there is a base R function `lengths()` that does this (though I 
don't recall what version it was added in)

##########
File path: r/R/record-batch.R
##########
@@ -161,7 +161,18 @@ RecordBatch$create <- function(..., schema = NULL) {
     out <- RecordBatch__from_arrays(schema, arrays)
     return(dplyr::group_by(out, !!!dplyr::groups(arrays[[1]])))
   }
-  
+
+  # If any arrays are length 1, recycle them  
+  arr_lens <- map_int(arrays, length)
+  if (length(arrays) > 1 && any(arr_lens == 1) && !all(arr_lens==1)) {
+    max_array_len <- max(arr_lens)
+    arrays <- modify2(
+      arrays,
+      arr_lens == 1,
+      ~if (.y) repeat_value_as_array(.x, max_array_len) else .x
+    )

Review comment:
       Call me 👴 but I personally find 
   
   ```
   arrays[arr_lens == 1] <- lapply(arrays[arr_lens == 1], 
repeat_value_as_array, max_array_len)
   ```
   
   easier to read than `modify2(...)`.

##########
File path: r/R/scalar.R
##########
@@ -33,7 +33,7 @@ Scalar <- R6Class("Scalar",
   public = list(
     ToString = function() Scalar__ToString(self),
     as_vector = function() Scalar__as_vector(self),
-    as_array = function() MakeArrayFromScalar(self),
+    as_array = function() MakeArrayFromScalar(self, 1L),

Review comment:
       If you pass this through, maybe you don't need `repeat_value_as_array`
   
   ```suggestion
       as_array = function(length = 1L) MakeArrayFromScalar(self, 
as.integer(length)),
   ```

##########
File path: r/tests/testthat/test-RecordBatch.R
##########
@@ -415,14 +414,42 @@ test_that("record_batch() handles null type 
(ARROW-7064)", {
   expect_equivalent(batch$schema,  schema(a = int32(), n = null()))
 })
 
-test_that("record_batch() scalar recycling", {
-  skip("Not implemented (ARROW-11705)")
+test_that("record_batch() scalar recycling with vectors", {
   expect_data_frame(
     record_batch(a = 1:10, b = 5),
     tibble::tibble(a = 1:10, b = 5)
   )
 })
 
+test_that("record_batch() scalar recycling with Scalars, Arrays, and 
ChunkedArrays", {
+  
+  expect_data_frame(
+    record_batch(a = Array$create(1:10), b = Scalar$create(5)),
+    tibble::tibble(a = 1:10, b = 5)
+  )
+  
+  expect_data_frame(
+    record_batch(a = Array$create(1:10), b = Array$create(5)),
+    tibble::tibble(a = 1:10, b = 5)
+  )
+  
+  expect_data_frame(
+    record_batch(a = Array$create(1:10), b = ChunkedArray$create(5)),
+    tibble::tibble(a = 1:10, b = 5)
+  )
+  
+})
+
+test_that("record_batch() no recycling with tibbles", {

Review comment:
       Why does this error, and should it?

##########
File path: r/R/util.R
##########
@@ -110,3 +110,21 @@ handle_embedded_nul_error <- function(e) {
   }
   stop(e)
 }
+
+#' Take an object of length 1 and repeat it.
+#' 
+#' @param object Object to be repeated - vector, `Scalar`, `Array`, or 
`ChunkedArray`
+#' @param n Number of repetitions
+#' 
+#' @return `Array` of length `n`
+#' 
+#' @keywords internal
+repeat_value_as_array <- function(object, n) {
+  if (length(object) != 1) {

Review comment:
       If you're checking length again here (unnecessary IMO since you're only 
calling this in the case where you've validated length == 1 already), you could 
simplify your `modify2()` wrapper and just `map` over all `arrays`, and in here 
only do the recycling if length == 1. 

##########
File path: r/R/record-batch.R
##########
@@ -161,7 +161,18 @@ RecordBatch$create <- function(..., schema = NULL) {
     out <- RecordBatch__from_arrays(schema, arrays)
     return(dplyr::group_by(out, !!!dplyr::groups(arrays[[1]])))
   }
-  
+
+  # If any arrays are length 1, recycle them  

Review comment:
       This code block seems to be repeated in both the Table and RecordBatch 
code, so it might be better factored out as a helper function.
   
   I also wonder whether the logic would actually be simpler in C++ because you 
could do it at a later point where you know exactly that you have a vector of 
Arrays and don't have to worry about whether it is an R vector, a Scalar, an 
Array, etc. See the `check_consistent_array_size` function for example in 
`r/src`--you could drop in around there and instead of erroring if you don't 
have consistent lengths, handle the recycling case. (Also ok to (1) ignore this 
suggestion or (2) defer to a followup)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #10269: ARROW-11705: [R] Support scalar value recycling in RecordBatch/Table$create()

Reply via email to