[ https://issues.apache.org/jira/browse/ARROW-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Visel updated ARROW-16807:
---------------------------------
    Description: 
When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, the results are correct. The results are 
also correct when the aggregation is grouped (e.g. after {{group_by()}}), as in the 
{{count(sex)}} call above.
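
As a cross-check of the single-row-group case from pyarrow (a sketch only: the file 
paths below are placeholders for the temp file written above, and {{row_group_size}} 
in {{pyarrow.parquet.write_table}} is used to force a single row group):

{code:python}
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Placeholder paths: substitute the temp parquet file written above.
multi_path = 'starwars_multi.parquet'
single_path = 'starwars_single.parquet'

tbl = pq.read_table(multi_path)
# Rewriting with row_group_size >= the number of rows stores a single row group.
pq.write_table(tbl, single_path, row_group_size=tbl.num_rows)

col = pq.read_table(single_path).column('sex')
# With a single row group (a single chunk), the count is expected to be stable
# across runs and no larger than the 5 unique values shown above.
print(pc.count_distinct(col).as_py())
{code}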

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>    null
#> ]
{code}
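
As the issue title suggests, the count appears to be aggregated incorrectly across row 
groups (chunks) rather than computed over the whole column. A sketch of how to inspect 
this from pyarrow (the file name is a placeholder for the same parquet file; one chunk 
per row group is assumed):

{code:python}
import pyarrow.compute as pc
import pyarrow.parquet as pq

# 'starwars.parquet' is a placeholder for the temp file written above.
col = pq.read_table('starwars.parquet').column('sex')

# Reading a multi-row-group file typically yields one chunk per row group.
print(col.num_chunks)
print([pc.count_distinct(chunk).as_py() for chunk in col.chunks])

# The whole-column count should not exceed the number of values returned by
# unique() (5, including the null), yet it is 15 in the run above.
print(pc.count_distinct(col).as_py())
print(len(pc.unique(col)))
{code}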

This seems likely to be the same problem as in this StackOverflow question, which 
starts from ORC files: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array



> [C++] count_distinct aggregates incorrectly across row groups
> -------------------------------------------------------------
>
>                 Key: ARROW-16807
>                 URL: https://issues.apache.org/jira/browse/ARROW-16807
>             Project: Apache Arrow
>          Issue Type: Bug
>         Environment: > arrow::arrow_info()
> Arrow package version: 8.0.0.9000
> Capabilities:
>                
> dataset    TRUE
> substrait FALSE
> parquet    TRUE
> json       TRUE
> s3         TRUE
> utf8proc   TRUE
> re2        TRUE
> snappy     TRUE
> gzip       TRUE
> brotli     TRUE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2        TRUE
> jemalloc   TRUE
> mimalloc  FALSE
> Memory:
>                    
> Allocator  jemalloc
> Current    37.25 Kb
> Max       925.42 Kb
> Runtime:
>                         
> SIMD Level          none
> Detected SIMD Level none
> Build:
>                                                              
> C++ Library Version                            9.0.0-SNAPSHOT
> C++ Compiler                                       AppleClang
> C++ Compiler Version                          13.1.6.13160021
> Git ID               d9d78946607f36e25e9d812a5cc956bd00ab2bc9
>            Reporter: Edward Visel
>            Priority: Blocker
>             Fix For: 9.0.0, 8.0.1
>


