[jira] [Commented] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns

Jeroen van Straten (Jira) Tue, 05 Jul 2022 10:35:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562771#comment-17562771
 ]


Jeroen van Straten commented on ARROW-16700:
--------------------------------------------

> I think you just need to update the title of the PR. I'm not sure if that 
> removes the linkage to this issue but that's probably ok. It should add 
> linkage to the correct PR.

The (more) correct issue being 
[ARROW-16904|https://issues.apache.org/jira/projects/ARROW/issues/ARROW-16904], 
right?

> I'm not entirely sure I understand your point on the join. The partition 
> columns I think can just be normal columns as far as Substrait is considered. 
> Are you maybe thinking of the __fragment_index, __filename, etc. columns?

The latter, I think 
([these|https://github.com/apache/arrow/blob/c1a1f47b8a2772fc270832902e7d788ee467ea08/cpp/src/arrow/dataset/scanner.cc#L917-L920]
 to be specific). I don't know if it would actually be an issue at all, but it 
doesn't feel right that Scanner makes more columns than ReadRel if we're going 
to be treating them as if they map one-to-one to each other.

As for a resolution to this issue, I've now locally rewritten MakeExecBatch to 
take the guarantee as its argument, use ExtractKnownFieldValues to turn it into 
the map of constant columns (helper of SimplifyWithGuarantee; turned out to be 
way easier to just use that, and it's also more fit for this purpose), and 
prefer the constant scalar from that over the actual incoming data if known. 
It's currently giving me failures, but my code is also still a mess from all 
the debug printing so it's not that surprising.

> [C++] [R] [Datasets] aggregates on partitioning columns
> -------------------------------------------------------
>
>                 Key: ARROW-16700
>                 URL: https://issues.apache.org/jira/browse/ARROW-16700
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Assignee: Jeroen van Straten
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 9.0.0, 8.0.1
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When summarizing a whole dataset (without group_by) with an aggregate, and 
> summarizing a partitioned column, arrow returns wrong data:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> df <- expand.grid(
>   some_nulls = c(0L, 1L, 2L),
>   year = 2010:2023,
>   month = 1:12,
>   day = 1:30
> )
> path <- tempfile()
> dir.create(path)
> write_dataset(df, path, partitioning = c("year", "month"))
> ds <- open_dataset(path)
> # with arrow the mins/maxes are off for partitioning columns
> ds %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month), min_day = 
> min(day), max_year = max(year), max_month = max(month), max_day = max(day)) 
> %>% 
>   collect()
> #> # A tibble: 1 × 7
> #>       n min_year min_month min_day max_year max_month max_day
> #>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
> #> 1 15120     2023         1       1     2023        12      30
> # comapred to what we get with dplyr
> df %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month), min_day = 
> min(day), max_year = max(year), max_month = max(month), max_day = max(day)) 
> %>% 
>   collect()
> #>       n min_year min_month min_day max_year max_month max_day
> #> 1 15120     2010         1       1     2023        12      30
> # even min alone is off:
> ds %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_year
> #>      <int>
> #> 1     2016
>   
> # but non-partitioning columns are fine:
> ds %>%
>   summarise(min_day = min(day)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_day
> #>     <int>
> #> 1       1
>   
>   
> # But with a group_by, this seems ok
> ds %>%
>   group_by(some_nulls) %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 3 × 2
> #>   some_nulls min_year
> #>        <int>    <int>
> #> 1          0     2010
> #> 2          1     2010
> #> 3          2     2010
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns

Reply via email to