[ https://issues.apache.org/jira/browse/ARROW-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562627#comment-17562627 ]

Jeroen van Straten commented on ARROW-16904:
--------------------------------------------

I probably fixed this as part of
[https://issues.apache.org/jira/projects/ARROW/issues/ARROW-16700] /
[https://github.com/apache/arrow/pull/13509]. min/max was not computed
correctly when multiple Consume calls were chained on the same
ScalarAggregator instance: only the last call affected the state. I'm not
deep enough into Acero to know under which circumstances it follows that
pattern (which was broken and isn't tested) and under which circumstances it
calls Consume only once per instance and then Merges the instances (which
works correctly and is tested), though.
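
To make the difference between the two call patterns concrete, here is a
minimal, self-contained sketch. It does not use the real
arrow::compute::ScalarAggregator API; the MinAggregator type and its methods
are invented for illustration. It shows how a Consume that overwrites the
state (instead of folding the batch into it) gives the wrong answer when
several batches are chained through one instance, while the
one-Consume-per-instance-then-Merge pattern still works.

{code:cpp}
// Illustrative stand-in for a scalar aggregator; NOT the actual Arrow API.
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

struct MinAggregator {
  int state = std::numeric_limits<int>::max();

  // Broken variant: each call *replaces* the state with the minimum of the
  // current batch, so only the last batch consumed matters.
  void ConsumeBroken(const std::vector<int>& batch) {
    state = *std::min_element(batch.begin(), batch.end());
  }

  // Correct variant: each call folds the batch minimum into the existing state.
  void Consume(const std::vector<int>& batch) {
    state = std::min(state, *std::min_element(batch.begin(), batch.end()));
  }

  // Merge combines the states of two aggregator instances.
  void Merge(const MinAggregator& other) {
    state = std::min(state, other.state);
  }
};

int main() {
  std::vector<int> batch1 = {1, 2, 3};        // batch holding the true minimum
  std::vector<int> batch2 = {131073, 131074}; // a later batch / row group

  // Pattern 1: chained Consume calls on a single instance (the broken path).
  MinAggregator chained;
  chained.ConsumeBroken(batch1);
  chained.ConsumeBroken(batch2);
  assert(chained.state == 131073);  // only the last batch is reflected, not 1

  // Pattern 2: one Consume per instance, then Merge (the path that worked).
  MinAggregator a, b;
  a.Consume(batch1);
  b.Consume(batch2);
  a.Merge(b);
  assert(a.state == 1);
  return 0;
}
{code}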

> [C++] min/max not deterministic if Parquet files have multiple row groups
> -------------------------------------------------------------------------
>
>                 Key: ARROW-16904
>                 URL: https://issues.apache.org/jira/browse/ARROW-16904
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 8.0.0
>         Environment: $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 20.04.4 LTS
> Release:        20.04
> Codename:       focal
>            Reporter: Robert On
>            Assignee: Aldrin M
>            Priority: Blocker
>             Fix For: 9.0.0
>
>
> The following code produces non-deterministic results when computing the
> minimum value of sequences of 1e5 and 1e6 integers.
> {code:r}
> library(dplyr)  # provides the %>% pipe used below
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 100,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e5), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
>
> sapply(1:100, function(x) {
>   # create parquet file with a val column with numbers 1 to 1,000,000
>   arrow::write_parquet(
>     data.frame(val = 1:1e6), "test.parquet")
>   # find minimum value
>   arrow::open_dataset("test.parquet") %>%
>     dplyr::summarise(min_val = min(val)) %>%
>     dplyr::collect() %>% dplyr::pull(min_val)
> }) %>% table()
> {code}
> The first 100 simulations, using numbers 1 to 1e5, find the minimum value (1)
> all 100 times.
> The second 100 simulations, using numbers 1 to 1e6, find the minimum value (1)
> only 65 out of 100 times. The remaining runs return 131073, 262145, and
> 393217 (each one past a multiple of 131072) 25, 8, and 2 times respectively.
> {code:r}
> .
>   1
> 100
>
> .
>      1 131073 262145 393217
>     65     25      8      2
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
