[GitHub] [arrow] nealrichardson commented on a change in pull request #11315: ARROW-13860: [R] arrow 5.0.0 write_parquet throws error writing grouped data.frame

GitBox Wed, 13 Oct 2021 13:00:43 -0700


nealrichardson commented on a change in pull request #11315:
URL: https://github.com/apache/arrow/pull/11315#discussion_r728403810




##########
File path: r/R/metadata.R
##########
@@ -129,6 +133,19 @@ remove_attributes <- function(x) {
 }
 
 arrow_attributes <- function(x, only_top_level = FALSE) {
+  if (inherits(x, "grouped_df")) {
+    # Keep only the group var names, not the rest of the cached data that dplyr
+    # uses, which may be large

Review comment:
       You basically get a data.frame of the distinct combinations of the 
grouping variables (roughly `distinct()` over `group_vars()`), plus a `.rows` 
list column of integer vectors holding the row indices that belong to each group:
   
   ```
   > mtcars %>% group_by(cyl, hp) %>% attr("groups")
   # A tibble: 23 × 3
        cyl    hp       .rows
      <dbl> <dbl> <list<int>>
    1     4    52         [1]
    2     4    62         [1]
    3     4    65         [1]
    4     4    66         [2]
    5     4    91         [1]
    6     4    93         [1]
    7     4    95         [1]
    8     4    97         [1]
    9     4   109         [1]
   10     4   113         [1]
   # … with 13 more rows
   ```
   
   So that clearly gets big both when you have lots of rows (every row index 
is stored in `.rows`) and when your groups have high cardinality (one row of 
this table per group). I don't think it makes sense for us to save it to 
feather/parquet, and we don't need to, because we can recreate it from just 
`group_vars()` on the round trip.
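
   A minimal sketch of that round trip, using plain dplyr rather than the 
actual arrow metadata code. The variable names (`grouped`, `vars`, `plain`, 
`restored`) are made up for illustration, and `ungroup()` merely stands in for 
writing and reading the file:

```r
library(dplyr)

grouped <- mtcars %>% group_by(cyl, hp)

# The only thing that needs to be stored: the grouping column names
vars <- group_vars(grouped)  # c("cyl", "hp")

# Compare the size of the cached index with the size of the names alone
object.size(attr(grouped, "groups"))
object.size(vars)

# Stand-in for the round trip: drop the grouping and its cached "groups" attribute
plain <- ungroup(grouped)

# Rebuild the grouping from the names; dplyr recomputes the .rows index
restored <- grouped_df(plain, vars)

identical(group_vars(restored), vars)
```

   The rebuilt `attr(restored, "groups")` is recomputed from the data, so 
nothing is lost by leaving it out of the file.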




