Edward Visel created ARROW-16808: ------------------------------------ Summary: [C++] count_distinct aggregates incorrectly across row groups Key: ARROW-16808 URL: https://issues.apache.org/jira/browse/ARROW-16808 Project: Apache Arrow Issue Type: Bug Environment: > arrow::arrow_info() Arrow package version: 8.0.0.9000
Capabilities: dataset TRUE substrait FALSE parquet TRUE json TRUE s3 TRUE utf8proc TRUE re2 TRUE snappy TRUE gzip TRUE brotli TRUE zstd TRUE lz4 TRUE lz4_frame TRUE lzo FALSE bz2 TRUE jemalloc TRUE mimalloc FALSE Memory: Allocator jemalloc Current 37.25 Kb Max 925.42 Kb Runtime: SIMD Level none Detected SIMD Level none Build: C++ Library Version 9.0.0-SNAPSHOT C++ Compiler AppleClang C++ Compiler Version 13.1.6.13160021 Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9 Reporter: Edward Visel Fix For: 9.0.0, 8.0.1 When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results: {code:r} library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> <chr> <int> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 <NA> 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> <int> #> 1 5 {code} If the file is stored as a single row group, results are correct. When grouped, results are correct. I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}: {code:python} import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') print(pa.compute.count_distinct(starwars.column('sex')).as_py()) #> 15 print(pa.compute.unique(starwars.column('sex'))) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #> null #> ] {code} This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files. -- This message was sent by Atlassian Jira (v8.20.7#820007)