[ https://issues.apache.org/jira/browse/ARROW-16808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Visel closed ARROW-16808. -------------------------------- Resolution: Duplicate Duplicate of [ARROW-16807] > [C++] count_distinct aggregates incorrectly across row groups > ------------------------------------------------------------- > > Key: ARROW-16808 > URL: https://issues.apache.org/jira/browse/ARROW-16808 > Project: Apache Arrow > Issue Type: Bug > Environment: > arrow::arrow_info() > Arrow package version: 8.0.0.9000 > Capabilities: > > dataset TRUE > substrait FALSE > parquet TRUE > json TRUE > s3 TRUE > utf8proc TRUE > re2 TRUE > snappy TRUE > gzip TRUE > brotli TRUE > zstd TRUE > lz4 TRUE > lz4_frame TRUE > lzo FALSE > bz2 TRUE > jemalloc TRUE > mimalloc FALSE > Memory: > > Allocator jemalloc > Current 37.25 Kb > Max 925.42 Kb > Runtime: > > SIMD Level none > Detected SIMD Level none > Build: > > C++ Library Version 9.0.0-SNAPSHOT > C++ Compiler AppleClang > C++ Compiler Version 13.1.6.13160021 > Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9 > Reporter: Edward Visel > Priority: Blocker > Fix For: 9.0.0, 8.0.1 > > > When reading from parquet files with multiple row groups, {{count_distinct}} > (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results: > {code:r} > library(dplyr, warn.conflicts = FALSE) > path <- tempfile(fileext = '.parquet') > arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) > ds <- arrow::open_dataset(path) > ds %>% count(sex) %>% collect() > #> # A tibble: 5 × 2 > #> sex n > #> <chr> <int> > #> 1 male 60 > #> 2 none 6 > #> 3 female 16 > #> 4 hermaphroditic 1 > #> 5 <NA> 4 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> <int> > #> 1 19 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> <int> > #> 1 17 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> <int> > #> 1 17 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> <int> > #> 1 16 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> <int> > #> 1 16 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> <int> > #> 1 17 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> <int> > #> 1 17 > # correct > ds %>% collect() %>% summarise(n = n_distinct(sex)) > #> # A tibble: 1 × 1 > #> n > #> <int> > #> 1 5 > {code} > If the file is stored as a single row group, results are correct. When > grouped, results are correct. > I can reproduce this in Python as well using the same file and > {{pyarrow.compute.count_distinct}}: > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > pa.__version__ > #> 8.0.0 > starwars = > pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') > print(pa.compute.count_distinct(starwars.column('sex')).as_py()) > #> 15 > print(pa.compute.unique(starwars.column('sex'))) > #> [ > #> "male", > #> "none", > #> "female", > #> "hermaphroditic", > #> null > #> ] > {code} > This seems likely to be the same problem in this StackOverflow question: > https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array > which is working from orc files. -- This message was sent by Atlassian Jira (v8.20.7#820007)