Jonathan Keane created ARROW-16700:
--------------------------------------

             Summary: [C++] [R] [Datasets] aggregates on partitioning columns
                 Key: ARROW-16700
                 URL: https://issues.apache.org/jira/browse/ARROW-16700
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, R
            Reporter: Jonathan Keane

When summarizing a whole dataset (without group_by) and aggregating over a partitioning column, arrow returns wrong results:

{code:r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- expand.grid(
  some_nulls = c(0L, 1L, 2L),
  year = 2010:2023,
  month = 1:12,
  day = 1:30
)

path <- tempfile()
dir.create(path)
write_dataset(df, path, partitioning = c("year", "month"))

ds <- open_dataset(path)

# with arrow the mins/maxes are off for partitioning columns
ds %>%
  summarise(n = n(),
            min_year = min(year), min_month = min(month), min_day = min(day),
            max_year = max(year), max_month = max(month), max_day = max(day)) %>%
  collect()
#> # A tibble: 1 × 7
#>       n min_year min_month min_day max_year max_month max_day
#>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
#> 1 15120     2023         1       1     2023        12      30

# compared to what we get with dplyr
df %>%
  summarise(n = n(),
            min_year = min(year), min_month = min(month), min_day = min(day),
            max_year = max(year), max_month = max(month), max_day = max(day)) %>%
  collect()
#>       n min_year min_month min_day max_year max_month max_day
#> 1 15120     2010         1       1     2023        12      30

# even min alone is off:
ds %>%
  summarise(min_year = min(year)) %>%
  collect()
#> # A tibble: 1 × 1
#>   min_year
#>      <int>
#> 1     2016

# but non-partitioning columns are fine:
ds %>%
  summarise(min_day = min(day)) %>%
  collect()
#> # A tibble: 1 × 1
#>   min_day
#>     <int>
#> 1       1

# and with a group_by, the result looks correct:
ds %>%
  group_by(some_nulls) %>%
  summarise(min_year = min(year)) %>%
  collect()
#> # A tibble: 3 × 2
#>   some_nulls min_year
#>        <int>    <int>
#> 1          0     2010
#> 2          1     2010
#> 3          2     2010
{code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)