DavZim opened a new issue, #14872: URL: https://github.com/apache/arrow/issues/14872
### Describe the bug, including details regarding any error messages, version, and platform.

When collecting a query with multiple group_by + summarise statements, one variable gets wrongly assigned values from another variable.
When an ungroup is inserted, everything works fine again.

To reproduce, consider the following:

In the examples below, the variable `gender` should be `F`, or `M` and not `Group X`. When the `ungroup()` is inserted (second part), gender is again F/M and not Group X.

``` r
library(dplyr)
library(arrow)

# Create sample dataset
N <- 1000
set.seed(123)
orig_data <- tibble(
  code_group = sample(paste("Group", 1:2), N, replace = TRUE),
  year = sample(2015:2016, N, replace = TRUE),
  gender = sample(c("F", "M"), N, replace = TRUE),
  value = runif(N, 0, 10)
)
write_dataset(orig_data, "example")

# Query and replicate the error
(ds <- open_dataset("example/"))
#> FileSystemDataset with 1 Parquet file
#> code_group: string
#> year: int32
#> gender: string
#> value: double

ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 2 × 4
#> # Groups:   code_group [2]
#>   code_group gender  value    NN
#>   <chr>      <chr>   <dbl> <int>
#> 1 Group 1    Group 1  724.     4
#> 2 Group 2    Group 2  661.     4
```

**ERROR** the gender variable is replaced by the values of the group variable

``` r
ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  ungroup() |> #< Added this line...
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 4 × 4
#> # Groups:   code_group [2]
#>   code_group gender value    NN
#>   <chr>      <chr>  <dbl> <int>
#> 1 Group 1    F       724.     2
#> 2 Group 2    M       627.     2
#> 3 Group 1    M       658.     2
#> 4 Group 2    F       661.     2
```

**Note** now after inserting the `ungroup()` between the group-by - summarise calls, gender is not replaced

Quick look at the query (note Node 4 where `"gender": code_group`)

``` r
ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  show_query()
#> ExecPlan with 8 nodes:
#> 7:SinkNode{}
#>   6:ProjectNode{projection=[code_group, gender, value, NN]}
#>     5:GroupByNode{keys=["code_group", "gender"], aggregates=[
#>       hash_max(value, {skip_nulls=false, min_count=0}),
#>       hash_sum(NN, {skip_nulls=true, min_count=1}),
#>     ]}
#>       4:ProjectNode{projection=[value, "NN": 1, code_group, "gender": code_group]} #< gender is wrongfully mapped to code_group!
#>         3:ProjectNode{projection=[year, code_group, gender, value]}
#>           2:GroupByNode{keys=["year", "code_group", "gender"], aggregates=[
#>             hash_sum(value, {skip_nulls=false, min_count=0}),
#>           ]}
#>             1:ProjectNode{projection=[value, year, code_group, gender]}
#>               0:SourceNode{}
```

Note that this was also asked [here on SO](https://stackoverflow.com/q/74710844/3048453)

### Component(s)

R
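As a cross-check, the expected result can be computed on the in-memory tibble with plain dplyr, without going through Arrow at all. The following is a minimal sketch, assuming the same `orig_data` definition as in the reproduction above; the name `expected`, the explicit `.groups = "drop"` arguments, and the final `arrange()` are additions for illustration and are not part of the original report.

```r
library(dplyr)

# Same sample data as in the report (same seed, same column order).
N <- 1000
set.seed(123)
orig_data <- tibble(
  code_group = sample(paste("Group", 1:2), N, replace = TRUE),
  year = sample(2015:2016, N, replace = TRUE),
  gender = sample(c("F", "M"), N, replace = TRUE),
  value = runif(N, 0, 10)
)

# In-memory reference: plain dplyr, no Arrow involved.
# `.groups = "drop"` removes any leftover grouping after each summarise.
expected <- orig_data |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value), .groups = "drop") |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n(), .groups = "drop") |>
  arrange(code_group, gender)

# `gender` must only contain the original labels, never a code_group value.
stopifnot(all(expected$gender %in% c("F", "M")))

print(expected)
```

Sorting both this reference and the collected Arrow result by `code_group` and `gender` before comparing them avoids spurious differences from row order, which the Arrow query does not guarantee.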