Robert On created ARROW-16904:
---------------------------------

Summary: dplyr summarise using min/max aggregate function non-deterministic for large number of elements
Key: ARROW-16904
URL: https://issues.apache.org/jira/browse/ARROW-16904
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 8.0.0
Environment: $ lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description: Ubuntu 20.04.4 LTS
    Release: 20.04
    Codename: focal
Reporter: Robert On

The following code produces a non-deterministic result when computing the minimum value of a sequence of 1e6 integers.

{code:java}
library(dplyr)  # attaches %>%, which the pipelines below rely on

sapply(1:100, function(x) {
  # create parquet file with a val column with numbers 1 to 100,000
  arrow::write_parquet(
    data.frame(val = 1:1e5), "test.parquet")

  arrow::open_dataset("test.parquet") %>%
    dplyr::summarise(min_val = min(val)) %>%
    dplyr::collect() %>%
    dplyr::pull(min_val)
}) %>% table()

sapply(1:100, function(x) {
  # create parquet file with a val column with numbers 1 to 1,000,000
  arrow::write_parquet(
    data.frame(val = 1:1e6), "test.parquet")

  arrow::open_dataset("test.parquet") %>%
    dplyr::summarise(min_val = min(val)) %>%
    dplyr::collect() %>%
    dplyr::pull(min_val)
}) %>% table()
{code}

The first 100 runs, using the numbers 1 to 1e5, find the minimum (1) all 100 times.

The second 100 runs, using the numbers 1 to 1e6, find the minimum (1) only 65 out of 100 times. The remaining runs return 131073, 262145, and 393217 (each of the form k * 131072 + 1, where 131072 = 2^17) 25, 8, and 2 times respectively.

{code:java}
.
  1
100

.
     1 131073 262145 393217
    65     25      8      2
{code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)