[ https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562732#comment-17562732 ]
Jeroen van Straten commented on ARROW-16700:
--------------------------------------------

So I guess that basically means that my PR is linked to the wrong issue? I'm not sure how to neatly resolve that, at least not without closing my PR and waiting for [~octalene] to fix min/max (and test it appropriately) instead.

For the guarantee issue that remains, I'm inclined to place the blame on Scanner rather than all the nodes other than filter and project, and to just insert the code for a trivial projection into Scanner to leverage the existing SimplifyWithGuarantee implementation. I'm assuming that an expression that only selects an existing field will just result in a pointer copy, and that evaluating a literal expression to a scalar is also cheap (at least if the literal isn't massive). What do you think?

> Future Substrait queries would, in theory, be able to create plans without
> the preceding project node.

I'm not 100% sure on that one because of the tag fields that Scanner normally adds, which Substrait wouldn't know about. It feels fragile to leave them in because I imagine they would affect column indices after a join if not treated carefully. But I get your point.

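To make the trivial-projection idea above concrete, here is a rough sketch in plain Python. It is a toy model only and does not use Arrow's API: FieldRef, Literal, simplify_with_guarantee and scan_fragment are hypothetical names invented for illustration, with simplify_with_guarantee standing in for the role that SimplifyWithGuarantee plays in the comment. The assumption is that each fragment carries a guarantee such as year == 2010, and that the scanner projects every requested field through that guarantee before emitting a batch.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class FieldRef:
    """Hypothetical expression that selects an existing column by name."""
    name: str


@dataclass(frozen=True)
class Literal:
    """Hypothetical expression that evaluates to a single scalar."""
    value: int


Expression = Union[FieldRef, Literal]


def simplify_with_guarantee(expr: Expression, guarantee: Dict[str, int]) -> Expression:
    """Toy stand-in: turn a field reference into a literal when the
    guarantee pins that field to one value for the whole fragment."""
    if isinstance(expr, FieldRef) and expr.name in guarantee:
        return Literal(guarantee[expr.name])
    return expr


def scan_fragment(
    columns: Dict[str, List[int]],
    num_rows: int,
    guarantee: Dict[str, int],
    projected_fields: List[str],
) -> Dict[str, List[int]]:
    """Emit one batch, applying a trivial projection to every requested field."""
    batch = {}
    for name in projected_fields:
        expr = simplify_with_guarantee(FieldRef(name), guarantee)
        if isinstance(expr, Literal):
            # Partition column: materialize the guaranteed scalar for each row.
            batch[name] = [expr.value] * num_rows
        else:
            # Stored column: reuse the existing list object, no data copy.
            batch[name] = columns[expr.name]
    return batch


# Two fragments of a dataset partitioned by year; "day" is stored in the files.
fragments = [
    ({"day": [1, 2, 3]}, {"year": 2010}),
    ({"day": [1, 2, 3]}, {"year": 2023}),
]
batches = [scan_fragment(cols, 3, g, ["year", "day"]) for cols, g in fragments]

# A downstream aggregate now sees real values for the partition column.
min_year = min(min(b["year"]) for b in batches)
max_year = max(max(b["year"]) for b in batches)
```

In this sketch the stored column is passed through by reference and only the partition column is synthesized, which mirrors the assumption in the comment that both paths are cheap. Whether the real Scanner can do this without disturbing the tag fields is not addressed here.
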
> [C++] [R] [Datasets] aggregates on partitioning columns
> -------------------------------------------------------
>
>                 Key: ARROW-16700
>                 URL: https://issues.apache.org/jira/browse/ARROW-16700
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Assignee: Jeroen van Straten
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 9.0.0, 8.0.1
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> When summarizing a whole dataset (without group_by) with an aggregate, and
> summarizing a partitioned column, arrow returns wrong data:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> df <- expand.grid(
>   some_nulls = c(0L, 1L, 2L),
>   year = 2010:2023,
>   month = 1:12,
>   day = 1:30
> )
> path <- tempfile()
> dir.create(path)
> write_dataset(df, path, partitioning = c("year", "month"))
> ds <- open_dataset(path)
> # with arrow the mins/maxes are off for partitioning columns
> ds %>%
>   summarise(
>     n = n(),
>     min_year = min(year), min_month = min(month), min_day = min(day),
>     max_year = max(year), max_month = max(month), max_day = max(day)
>   ) %>%
>   collect()
> #> # A tibble: 1 × 7
> #>       n min_year min_month min_day max_year max_month max_day
> #>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
> #> 1 15120     2023         1       1     2023        12      30
> # compared to what we get with dplyr
> df %>%
>   summarise(
>     n = n(),
>     min_year = min(year), min_month = min(month), min_day = min(day),
>     max_year = max(year), max_month = max(month), max_day = max(day)
>   ) %>%
>   collect()
> #>       n min_year min_month min_day max_year max_month max_day
> #> 1 15120     2010         1       1     2023        12      30
> # even min alone is off:
> ds %>%
>   summarise(min_year = min(year)) %>%
>   collect()
> #> # A tibble: 1 × 1
> #>   min_year
> #>      <int>
> #> 1     2016
>
> # but non-partitioning columns are fine:
> ds %>%
>   summarise(min_day = min(day)) %>%
>   collect()
> #> # A tibble: 1 × 1
> #>   min_day
> #>     <int>
> #> 1       1
>
>
> # But with a group_by, this seems ok
> ds %>%
>   group_by(some_nulls) %>%
>   summarise(min_year = min(year)) %>%
>   collect()
> #> # A tibble: 3 × 2
> #>   some_nulls min_year
> #>        <int>    <int>
> #> 1          0     2010
> #> 2          1     2010
> #> 3          2     2010
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)