[ https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560667#comment-17560667 ]
Jonathan Keane commented on ARROW-16700:
----------------------------------------

[~westonpace] not sure if this is related to ARROW-16904 or ARROW-16807, but this is another wrong-data ticket we should take a look at

> [C++] [R] [Datasets] aggregates on partitioning columns
> -------------------------------------------------------
>
>                 Key: ARROW-16700
>                 URL: https://issues.apache.org/jira/browse/ARROW-16700
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Priority: Blocker
>             Fix For: 9.0.0, 8.0.1
>
>
> When summarizing a whole dataset (without group_by) with an aggregate over a
> partitioning column, arrow returns wrong data:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
>
> df <- expand.grid(
>   some_nulls = c(0L, 1L, 2L),
>   year = 2010:2023,
>   month = 1:12,
>   day = 1:30
> )
>
> path <- tempfile()
> dir.create(path)
> write_dataset(df, path, partitioning = c("year", "month"))
>
> ds <- open_dataset(path)
>
> # with arrow the mins/maxes are off for partitioning columns
> ds %>%
>   summarise(
>     n = n(),
>     min_year = min(year), min_month = min(month), min_day = min(day),
>     max_year = max(year), max_month = max(month), max_day = max(day)
>   ) %>%
>   collect()
> #> # A tibble: 1 × 7
> #>       n min_year min_month min_day max_year max_month max_day
> #>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
> #> 1 15120     2023         1       1     2023        12      30
>
> # compared to what we get with dplyr
> df %>%
>   summarise(
>     n = n(),
>     min_year = min(year), min_month = min(month), min_day = min(day),
>     max_year = max(year), max_month = max(month), max_day = max(day)
>   ) %>%
>   collect()
> #>       n min_year min_month min_day max_year max_month max_day
> #> 1 15120     2010         1       1     2023        12      30
>
> # even min alone is off:
> ds %>%
>   summarise(min_year = min(year)) %>%
>   collect()
> #> # A tibble: 1 × 1
> #>   min_year
> #>      <int>
> #> 1     2016
>
> # but non-partitioning columns are fine:
> ds %>%
>   summarise(min_day = min(day)) %>%
>   collect()
> #> # A tibble: 1 × 1
> #>   min_day
> #>     <int>
> #> 1       1
>
>
> # But with a group_by, this seems ok
> ds %>%
>   group_by(some_nulls) %>%
>   summarise(min_year = min(year)) %>%
>   collect()
> #> # A tibble: 3 × 2
> #>   some_nulls min_year
> #>        <int>    <int>
> #> 1          0     2010
> #> 2          1     2010
> #> 3          2     2010
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)