Sam Albers created ARROW-15679:
----------------------------------

             Summary: count should return an ungrouped dataframe
                 Key: ARROW-15679
                 URL: https://issues.apache.org/jira/browse/ARROW-15679
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 7.0.0
            Reporter: Sam Albers

Unless the input was already grouped, `dplyr::count` returns an ungrouped data.frame. The arrow implementation instead retains grouping variables in the result:
{code:java}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

tf1 <- tempfile()
dir.create(tf1)
starwars |> write_dataset(tf1)

# no group ----------------------------------------------------------------
## dplyr behaviour
count_dplyr_no_group <- starwars %>%
  count(gender, homeworld, species)
group_vars(count_dplyr_no_group)
#> character(0)

## arrow behaviour
count_arrow_no_group <- open_dataset(tf1) %>%
  count(gender, homeworld, species) %>%
  collect()
group_vars(count_arrow_no_group)
#> [1] "gender"    "homeworld"
{code}
If I am correct that this is undesired behaviour, I think it can be fixed [here|https://github.com/apache/arrow/blob/5ad5ddcafee8fada9cebb341df638b750c98efb7/r/R/dplyr-count.R#L20-L35] with this patch:
{code:java}
count.arrow_dplyr_query <- function(x, ..., wt = NULL, sort = FALSE, name = NULL) {
  if (!missing(...)) {
    out <- dplyr::group_by(x, ..., .add = TRUE)
  } else {
    out <- x
  }
  out <- dplyr::tally(out, wt = {{ wt }}, sort = sort, name = name)

  gv <- dplyr::group_vars(x)
  if (rlang::is_empty(gv)) {
    # Input was ungrouped, so return an ungrouped result
    out <- dplyr::ungroup(out)
  } else {
    # Restore original group vars
    out$group_by_vars <- gv
  }

  out
}
{code}
I can submit a PR with some tests if that would be helpful.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)