jonkeane commented on a change in pull request #11315:
URL: https://github.com/apache/arrow/pull/11315#discussion_r728368089



##########
File path: r/R/schema.R
##########
@@ -133,6 +133,23 @@ Schema <- R6Class("Schema",
         self$set_pointer(out$pointer())
         self
       }
+    },
+    r_metadata = function(new) {
+      # Helper for the R metadata that handles the serialization
+      # See also method on ArrowTabular
+      if (missing(new)) {
+        out <- self$metadata$r
+        if (!is.null(out)) {
+          # Can't unserialize NULL
+          out <- .unserialize_arrow_r_metadata(out)
+        }
+        # Returns either NULL or a named list
+        out
+      } else {
+        # Set the R metadata
+        self$metadata$r <- .serialize_arrow_r_metadata(new)
+        self

Review comment:
      Could we (should we?) merge this and the tabular method? They look the same to me (except for the comment).
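If they were merged, something like the following shape could work. This is only a sketch: `r_metadata_helper` is a hypothetical name, and base `serialize()`/`unserialize()` stand in for arrow's internal `.serialize_arrow_r_metadata()`/`.unserialize_arrow_r_metadata()`.

```r
# Hypothetical shared helper -- not part of arrow. Both the Schema and
# ArrowTabular active bindings could delegate here instead of duplicating
# the get/set logic. Base serialize()/unserialize() stand in for arrow's
# internal serialization helpers.
r_metadata_helper <- function(self, new) {
  if (missing(new)) {
    out <- self$metadata$r
    if (!is.null(out)) {
      # Can't unserialize NULL
      out <- unserialize(out)
    }
    # Returns either NULL or a named list
    out
  } else {
    # Set the R metadata
    self$metadata$r <- serialize(new, connection = NULL)
    self
  }
}
```

With the stand-in serializers, the get/set round trip can be exercised on any object that carries a `metadata` field.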

##########
File path: r/R/table.R
##########
@@ -141,20 +144,6 @@ Table <- R6Class("Table",
     num_columns = function() Table__num_columns(self),
     num_rows = function() Table__num_rows(self),
     schema = function() Table__schema(self),
-    metadata = function(new) {
-      if (missing(new)) {
-        # Get the metadata (from the schema)
-        self$schema$metadata
-      } else {
-        # Set the metadata
-        new <- prepare_key_value_metadata(new)
-        out <- Table__ReplaceSchemaMetadata(self, new)
-        # ReplaceSchemaMetadata returns a new object but we're modifying in place,
-        # so swap in that new C++ object pointer into our R6 object
-        self$set_pointer(out$pointer())
-        self
-      }
-    },

Review comment:
      `Table$metadata()` would now dispatch through the `ArrowTabular$metadata()` method/binding, correct?
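Assuming `Table` keeps inheriting from `ArrowTabular`, R6 falls back to the superclass's active binding when the subclass doesn't define one. A minimal sketch (demo class names are made up; requires the R6 package):

```r
library(R6)

# Stand-in for ArrowTabular: defines the metadata active binding.
TabularDemo <- R6Class("TabularDemo",
  active = list(
    metadata = function(new) {
      if (missing(new)) {
        private$meta          # get
      } else {
        private$meta <- new   # set
        self
      }
    }
  ),
  private = list(meta = NULL)
)

# Stand-in for Table: it defines no metadata binding of its own,
# so access dispatches to the inherited one.
TableDemo <- R6Class("TableDemo", inherit = TabularDemo)

tab <- TableDemo$new()
tab$metadata <- list(r = "serialized attributes")
tab$metadata
```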

##########
File path: r/tests/testthat/test-metadata.R
##########
@@ -240,7 +241,7 @@ test_that("Row-level metadata (does not) roundtrip in datasets", {
   skip_if_not_available("parquet")
 
   local_edition(3)

Review comment:
       ```suggestion
   ```
   
   This isn't related to the PR, but we no longer need this, right?

##########
File path: r/R/metadata.R
##########
@@ -129,6 +133,19 @@ remove_attributes <- function(x) {
 }
 
 arrow_attributes <- function(x, only_top_level = FALSE) {
+  if (inherits(x, "grouped_df")) {
+    # Keep only the group var names, not the rest of the cached data that dplyr
+    # uses, which may be large

Review comment:
      How large? We can do this as a follow-on if we want, but I would be curious whether removing it and then having dplyr regenerate it is worse or better (performance-wise) than serializing the large cache.
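For scale: a `grouped_df` caches a `groups` tibble whose `.rows` list column indexes every row, so it grows with `nrow()`, not just with the number of groups. A rough check of the cache size and of regenerating it from only the group variable names (requires dplyr; sizes will vary):

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble::tibble(a = seq_len(100000), b = rep(c("x", "y"), 50000))
gdf <- group_by(df, b)

# The cached grouping structure holds the row indices of every group:
cache <- attr(gdf, "groups")
print(object.size(cache))  # scales with nrow(df)

# Rebuilding the cache from just the group variable names, which is
# roughly what keeping only group_vars() in the metadata implies:
rebuilt <- group_by(ungroup(gdf), !!!syms(group_vars(gdf)))
identical(group_vars(rebuilt), group_vars(gdf))
```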

##########
File path: r/tests/testthat/test-parquet.R
##########
@@ -104,6 +104,12 @@ test_that("write_parquet() accepts RecordBatch too", {
   expect_equal(tab, Table$create(batch))
 })
 
+test_that("write_parquet() handles grouped_df", {
+  library(dplyr, warn.conflicts = FALSE)
+  df <- tibble::tibble(a = 1:4, b = 5) %>% group_by(b)
+  expect_parquet_roundtrip(df, as_data_frame = TRUE)

Review comment:
      Does this test confirm that dplyr deals with recreating the cache as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

