nealrichardson commented on a change in pull request #11315:
URL: https://github.com/apache/arrow/pull/11315#discussion_r728403810

##########
File path: r/R/metadata.R
##########
@@ -129,6 +133,19 @@ remove_attributes <- function(x) {
 }
 
 arrow_attributes <- function(x, only_top_level = FALSE) {
+  if (inherits(x, "grouped_df")) {
+    # Keep only the group var names, not the rest of the cached data that dplyr
+    # uses, which may be large

Review comment:
You basically get a data.frame that is `distinct(group_vars())` plus a list column of integer vectors of the row indices that match each condition:

```
> mtcars %>% group_by(cyl, hp) %>% attr("groups")
# A tibble: 23 × 3
     cyl    hp       .rows
   <dbl> <dbl> <list<int>>
 1     4    52         [1]
 2     4    62         [1]
 3     4    65         [1]
 4     4    66         [2]
 5     4    91         [1]
 6     4    93         [1]
 7     4    95         [1]
 8     4    97         [1]
 9     4   109         [1]
10     4   113         [1]
# … with 13 more rows
```

So clearly that gets big both when you have lots of rows and when you have high cardinality in your groups. I don't think it makes sense for us to save it to feather/parquet, and we don't need to because we can recreate it from just `group_vars()` on the round trip.
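
For illustration, here is a minimal sketch of the round trip described above, using only dplyr. It is not the code in this PR: `restore_groups()` is a hypothetical helper made up for the example, and the `ungroup()` call merely stands in for writing to and reading back from feather/parquet.

```r
library(dplyr)

grouped <- mtcars %>% group_by(cyl, hp)

# The cached data dplyr keeps: one row per group plus the `.rows` list column.
groups_attr <- attr(grouped, "groups")

# What is actually worth saving: just the grouping variable names.
vars <- group_vars(grouped)

# Hypothetical helper (an assumption for this sketch, not an arrow function):
# regroup a plain data frame from the saved variable names.
restore_groups <- function(df, vars) {
  group_by(df, across(all_of(vars)))
}

# Stand-in for the write/read round trip: drop the grouping and its cache.
plain <- ungroup(grouped)
restored <- restore_groups(plain, vars)

# The grouping is rebuilt from the names alone.
group_vars(restored)
n_groups(restored) == n_groups(grouped)

# Compare the size of the cached groups with the size of the names.
object.size(groups_attr)
object.size(vars)
```

Because dplyr recomputes the `.rows` index when `group_by()` is called, nothing beyond the variable names has to be stored in the file metadata.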