[ https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562721#comment-17562721 ]
Weston Pace commented on ARROW-16700: ------------------------------------- {quote}tl;dr it looks like min/max itself is broken and only aggregates only the last partition it sees (in each thread); every Consume call casually ignores and overrides this->state. Extraordinary claims demand extraordinary evidence, especially since I only started looking at Acero in-depth and installed R about five hours ago, so here's my complete analysis for posterity...{quote} Yes, min/max is broken in the way you describe. This is captured in ARROW-16904. [~octalene] is working on updating the unit tests so that we can reproduce this issue correctly. Sorry, I hadn't realized this JIRA also involved a min/max and I hadn't realized the project workaround you mentioned. However, that is excellent analysis. You are correct that a project node would fix this issue. However, a project node isn't normally inserted immediately after a scan node. In fact, what is happening here, is that R always inserts a project node immediately _before_ an aggregate node. Either way, that is a pretty significant workaround, as the project & filter nodes are already satisfied by the guarantee. Still, I'd like to leave this issue in place for the moment, though maybe it doesn't need to be a blocker. Future Substrait queries would, in theory, be able to create plans without the preceding project node. At some point the scan node will not emit these columns unless they are asked for so I don't think a project to satisfy the column-dropping emit will always be necessary either. > [C++] [R] [Datasets] aggregates on partitioning columns > ------------------------------------------------------- > > Key: ARROW-16700 > URL: https://issues.apache.org/jira/browse/ARROW-16700 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R > Reporter: Jonathan Keane > Assignee: Jeroen van Straten > Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0, 8.0.1 > > Time Spent: 0.5h > Remaining Estimate: 0h > > When summarizing a whole dataset (without group_by) with an aggregate, and > summarizing a partitioned column, arrow returns wrong data: > {code:r} > library(arrow, warn.conflicts = FALSE) > library(dplyr, warn.conflicts = FALSE) > df <- expand.grid( > some_nulls = c(0L, 1L, 2L), > year = 2010:2023, > month = 1:12, > day = 1:30 > ) > path <- tempfile() > dir.create(path) > write_dataset(df, path, partitioning = c("year", "month")) > ds <- open_dataset(path) > # with arrow the mins/maxes are off for partitioning columns > ds %>% > summarise(n = n(), min_year = min(year), min_month = min(month), min_day = > min(day), max_year = max(year), max_month = max(month), max_day = max(day)) > %>% > collect() > #> # A tibble: 1 × 7 > #> n min_year min_month min_day max_year max_month max_day > #> <int> <int> <int> <int> <int> <int> <int> > #> 1 15120 2023 1 1 2023 12 30 > # comapred to what we get with dplyr > df %>% > summarise(n = n(), min_year = min(year), min_month = min(month), min_day = > min(day), max_year = max(year), max_month = max(month), max_day = max(day)) > %>% > collect() > #> n min_year min_month min_day max_year max_month max_day > #> 1 15120 2010 1 1 2023 12 30 > # even min alone is off: > ds %>% > summarise(min_year = min(year)) %>% > collect() > #> # A tibble: 1 × 1 > #> min_year > #> <int> > #> 1 2016 > > # but non-partitioning columns are fine: > ds %>% > summarise(min_day = min(day)) %>% > collect() > #> # A tibble: 1 × 1 > #> min_day > #> <int> > #> 1 1 > > > # But with a group_by, this seems ok > ds %>% > group_by(some_nulls) %>% > summarise(min_year = min(year)) %>% > collect() > #> # A tibble: 3 × 2 > #> some_nulls min_year > #> <int> <int> > #> 1 0 2010 > #> 2 1 2010 > #> 3 2 2010 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)