[jira] [Commented] (ARROW-15075) [C++][Dataset] Implement Dataset for JSON format

2022-08-29 Thread Edward Visel (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597245#comment-17597245
 ] 

Edward Visel commented on ARROW-15075:
--

After this is implemented, [the Google Quickdraw 
dataset](https://github.com/googlecreativelab/quickdraw-dataset) is a nice 
freely-available dataset in ndjson to use for benchmarking and demos and such

> [C++][Dataset] Implement Dataset for JSON format
> 
>
> Key: ARROW-15075
> URL: https://issues.apache.org/jira/browse/ARROW-15075
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Reporter: Will Jones
> Priority: Major
>  Labels: dataset
>
> We already have support for reading individual files, but not yet for reading 
> datasets. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-15075) [C++][Dataset] Implement Dataset for JSON format

2022-08-29 Thread Edward Visel (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597245#comment-17597245
 ] 

Edward Visel edited comment on ARROW-15075 at 8/29/22 3:04 PM:
---

After this is implemented, [the Google Quickdraw 
dataset|https://github.com/googlecreativelab/quickdraw-dataset] is a nice 
freely-available dataset in ndjson to use for benchmarking and demos and such


was (Author: alistaire):
After this is implemented, [the Google Quickdraw 
dataset](https://github.com/googlecreativelab/quickdraw-dataset) is a nice 
freely-available dataset in ndjson to use for benchmarking and demos and such

> [C++][Dataset] Implement Dataset for JSON format
> 
>
> Key: ARROW-15075
> URL: https://issues.apache.org/jira/browse/ARROW-15075
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Reporter: Will Jones
> Priority: Major
>  Labels: dataset
>
> We already have support for reading individual files, but not yet for reading 
> datasets. 





[jira] [Updated] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups

2022-06-10 Thread Edward Visel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Visel updated ARROW-16807:
-
Description: 
When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, 
results are correct.

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

pa.compute.count_distinct(starwars.column('sex')).as_py()
#> 15
pa.compute.unique(starwars.column('sex'))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem in this StackOverflow question: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
 which is working from orc files.

  was:
When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, 
results are correct.

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem in this StackOverflow question: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
 which is working from orc files.


> [C++] count_distinct aggregates incorrectly across row groups
> -
>
> Key: ARROW-16807
> URL: https://issues.apache.org/jira/browse/ARROW-16807
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: > 

[jira] [Updated] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups

2022-06-10 Thread Edward Visel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Visel updated ARROW-16807:
-
Description: 
When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, 
results are correct.

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem in this StackOverflow question: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
 which is working from orc files.

  was:
When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, 
results are correct.

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem in this StackOverflow question: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
 which is working from orc files.


> [C++] count_distinct aggregates incorrectly across row groups
> -
>
> Key: ARROW-16807
> URL: https://issues.apache.org/jira/browse/ARROW-16807
> Project: Apache Arrow
>  Issue Type: Bug
> 

[jira] [Created] (ARROW-16808) [C++] count_distinct aggregates incorrectly across row groups

2022-06-10 Thread Edward Visel (Jira)
Edward Visel created ARROW-16808:


 Summary: [C++] count_distinct aggregates incorrectly across row 
groups
 Key: ARROW-16808
 URL: https://issues.apache.org/jira/browse/ARROW-16808
 Project: Apache Arrow
  Issue Type: Bug
 Environment: > arrow::arrow_info()
Arrow package version: 8.0.0.9000

Capabilities:

dataset    TRUE
substrait  FALSE
parquet    TRUE
json       TRUE
s3         TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo        FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   FALSE

Memory:

Allocator  jemalloc
Current    37.25 Kb
Max        925.42 Kb

Runtime:

SIMD Level           none
Detected SIMD Level  none

Build:

C++ Library Version   9.0.0-SNAPSHOT
C++ Compiler          AppleClang
C++ Compiler Version  13.1.6.13160021
Git ID                d9d78946607f36e25e9d812a5cc956bd00ab2bc9
Reporter: Edward Visel
 Fix For: 9.0.0, 8.0.1


When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, 
results are correct.

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem in this StackOverflow question: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
 which is working from orc files.
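
To make the symptom concrete: a distinct count computed over several chunks (row groups) is only right if the per-chunk state is merged before counting. The sketch below is a plain-Python illustration of that general point, using made-up values. It is not a diagnosis of the Arrow kernel, whose root cause is not stated in this report.

```python
# Made-up chunks standing in for row groups; None stands in for a null value.
chunks = [
    ["male", "female", None],
    ["male", "none"],
    ["hermaphroditic", "female", None],
]


def naive_count_distinct(chunks):
    """Incorrect: adds per-chunk distinct counts, so values that appear
    in more than one chunk are counted more than once."""
    return sum(len(set(chunk)) for chunk in chunks)


def merged_count_distinct(chunks):
    """Correct: merges the distinct values of every chunk, then counts once."""
    seen = set()
    for chunk in chunks:
        seen.update(chunk)
    return len(seen)


naive = naive_count_distinct(chunks)
merged = merged_count_distinct(chunks)
print(naive, merged)
```

With a single chunk the two functions agree, which is consistent with the observation that single-row-group files give correct results.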





[jira] [Closed] (ARROW-16808) [C++] count_distinct aggregates incorrectly across row groups

2022-06-10 Thread Edward Visel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Visel closed ARROW-16808.

Resolution: Duplicate

Duplicate of [ARROW-16807]

> [C++] count_distinct aggregates incorrectly across row groups
> -
>
> Key: ARROW-16808
> URL: https://issues.apache.org/jira/browse/ARROW-16808
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: > arrow::arrow_info()
> Arrow package version: 8.0.0.9000
> Capabilities:
>
> dataset    TRUE
> substrait  FALSE
> parquet    TRUE
> json       TRUE
> s3         TRUE
> utf8proc   TRUE
> re2        TRUE
> snappy     TRUE
> gzip       TRUE
> brotli     TRUE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo        FALSE
> bz2        TRUE
> jemalloc   TRUE
> mimalloc   FALSE
> Memory:
>
> Allocator  jemalloc
> Current    37.25 Kb
> Max        925.42 Kb
> Runtime:
>
> SIMD Level           none
> Detected SIMD Level  none
> Build:
>
> C++ Library Version   9.0.0-SNAPSHOT
> C++ Compiler          AppleClang
> C++ Compiler Version  13.1.6.13160021
> Git ID                d9d78946607f36e25e9d812a5cc956bd00ab2bc9
> Reporter: Edward Visel
> Priority: Blocker
> Fix For: 9.0.0, 8.0.1
>
>
> When reading from parquet files with multiple row groups, {{count_distinct}} 
> (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:
> {code:r}
> library(dplyr, warn.conflicts = FALSE)
> path <- tempfile(fileext = '.parquet')
> arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)
> ds <- arrow::open_dataset(path)
> ds %>% count(sex) %>% collect()
> #> # A tibble: 5 × 2
> #>   sex                n
> #>   <chr>          <int>
> #> 1 male              60
> #> 2 none               6
> #> 3 female            16
> #> 4 hermaphroditic     1
> #> 5 <NA>               4
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    19
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    16
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    16
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> ds %>% summarise(n = n_distinct(sex)) %>% collect()
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1    17
> # correct
> ds %>% collect() %>% summarise(n = n_distinct(sex))
> #> # A tibble: 1 × 1
> #>       n
> #>   <int>
> #> 1     5
> {code}
> If the file is stored as a single row group, results are correct. When 
> grouped, results are correct.
> I can reproduce this in Python as well using the same file and 
> {{pyarrow.compute.count_distinct}}:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> pa.__version__
> #> 8.0.0
> starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')
> print(pa.compute.count_distinct(starwars.column('sex')).as_py())
> #> 15
> print(pa.compute.unique(starwars.column('sex')))
> #> [
> #>   "male",
> #>   "none",
> #>   "female",
> #>   "hermaphroditic",
> #>   null
> #> ]
> {code}
> This seems likely to be the same problem in this StackOverflow question: 
> https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
>  which is working from orc files.





[jira] [Created] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups

2022-06-10 Thread Edward Visel (Jira)
Edward Visel created ARROW-16807:


 Summary: [C++] count_distinct aggregates incorrectly across row 
groups
 Key: ARROW-16807
 URL: https://issues.apache.org/jira/browse/ARROW-16807
 Project: Apache Arrow
  Issue Type: Bug
 Environment: > arrow::arrow_info()
Arrow package version: 8.0.0.9000

Capabilities:

dataset    TRUE
substrait  FALSE
parquet    TRUE
json       TRUE
s3         TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo        FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   FALSE

Memory:

Allocator  jemalloc
Current    37.25 Kb
Max        925.42 Kb

Runtime:

SIMD Level           none
Detected SIMD Level  none

Build:

C++ Library Version   9.0.0-SNAPSHOT
C++ Compiler          AppleClang
C++ Compiler Version  13.1.6.13160021
Git ID                d9d78946607f36e25e9d812a5cc956bd00ab2bc9
Reporter: Edward Visel
 Fix For: 9.0.0, 8.0.1


When reading from parquet files with multiple row groups, {{count_distinct}} 
(wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:
{code:r}
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
{code}

If the file is stored as a single row group, results are correct. When grouped, 
results are correct.

I can reproduce this in Python as well using the same file and 
{{pyarrow.compute.count_distinct}}:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>   null
#> ]
{code}

This seems likely to be the same problem in this StackOverflow question: 
https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
 which is working from orc files.





[jira] [Updated] (ARROW-16188) [R] Fix excess "Handling string data with embedded nuls" warning in tests

2022-04-13 Thread Edward Visel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Visel updated ARROW-16188:
-
Summary: [R] Fix excess "Handling string data with embedded nuls" warning 
in tests  (was: Fix excess "Handling string data with embedded nuls" warning in 
tests)

> [R] Fix excess "Handling string data with embedded nuls" warning in tests
> -
>
> Key: ARROW-16188
> URL: https://issues.apache.org/jira/browse/ARROW-16188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Reporter: Edward Visel
> Priority: Major
>
> The R tests raise a warning
> {code:java}
> ══ Warnings 
> 
> 1. Handling string data with embedded nuls (test-RecordBatch.R:547:3) - 
> Stripping '\0' (nul) from character vector{code}
> That test already has an expectation for the warning, so the warning is 
> likely being raised repeatedly and the excess warnings are not suppressed.
> AC: Warning not arising in tests because it is handled in an appropriate 
> manner





[jira] [Created] (ARROW-16188) Fix excess "Handling string data with embedded nuls" warning in tests

2022-04-13 Thread Edward Visel (Jira)
Edward Visel created ARROW-16188:


 Summary: Fix excess "Handling string data with embedded nuls" 
warning in tests
 Key: ARROW-16188
 URL: https://issues.apache.org/jira/browse/ARROW-16188
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Edward Visel


The R tests raise a warning
{code:java}
══ Warnings 

1. Handling string data with embedded nuls (test-RecordBatch.R:547:3) - 
Stripping '\0' (nul) from character vector{code}
That test already has an expectation for the warning, so the warning is likely 
being raised repeatedly and the excess warnings are not suppressed.

AC: Warning not arising in tests because it is handled in an appropriate manner
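
One way to check whether a warning fires more often than a test expects is to record every occurrence and count them. The sketch below shows the idea with Python's standard `warnings` module rather than testthat. `strip_nuls` is a hypothetical stand-in, not the arrow function, and the message text is copied from the test output above.

```python
import warnings


def strip_nuls(values):
    """Hypothetical stand-in: strips NUL characters and warns once per
    affected string, so several warnings can come from a single call."""
    cleaned = []
    for value in values:
        if "\0" in value:
            warnings.warn("Stripping '\\0' (nul) from character vector", UserWarning)
            value = value.replace("\0", "")
        cleaned.append(value)
    return cleaned


with warnings.catch_warnings(record=True) as caught:
    # "always" records every occurrence instead of only the first per location.
    warnings.simplefilter("always")
    result = strip_nuls(["a\0b", "c\0d", "ef"])

warning_count = len(caught)
print(result, warning_count)
```

A test that asserts on `warning_count` makes a repeated warning visible instead of letting the extra occurrences leak into the test summary.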





[jira] [Assigned] (ARROW-14168) [R] Warn only once about arrow function differences

2022-04-11 Thread Edward Visel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Visel reassigned ARROW-14168:


Assignee: Edward Visel

> [R] Warn only once about arrow function differences
> ---
>
> Key: ARROW-14168
> URL: https://issues.apache.org/jira/browse/ARROW-14168
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
> Reporter: Neal Richardson
> Assignee: Edward Visel
> Priority: Major
>  Labels: good-first-issue
>
> When someone calls median or quantile, we warn them that it is approximate. 
> In interactive sessions, this happens only the first time for {{median}} 
> and {{quantile}} 
> https://github.com/apache/arrow/blob/d197ad31c3d7c16ecee74cb76a71ce397e905b3b/r/R/dplyr-summarize.R#L107-L111
> But when we test, the session is not interactive so we get a number of 
> spurious extra warnings. Because of this we set a local testthat edition 
> https://github.com/apache/arrow/blob/d197ad31c3d7c16ecee74cb76a71ce397e905b3b/r/tests/testthat/test-dplyr-summarize.R#L299-L301
>  which swallows any extra warnings.
> We can either only warn once regardless of interactivity (which is probably 
> totally fine for these warnings), or we could use something like 
> {{local_interactive}} in these tests so that we only get the first warning 
> https://rlang.r-lib.org/reference/is_interactive.html
>  
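
The first option above (warn once regardless of interactivity) can be sketched as a small helper that remembers which warnings it has already emitted. This is a Python illustration of the pattern only. `warn_once`, `approximate_median`, and the message text are hypothetical and do not reflect how the R package implements it.

```python
import warnings

# Keys of warnings that have already been emitted in this process.
_already_warned = set()


def warn_once(key, message):
    """Emit `message` the first time `key` is seen; stay silent afterwards."""
    if key in _already_warned:
        return False
    _already_warned.add(key)
    warnings.warn(message, UserWarning, stacklevel=2)
    return True


def approximate_median(values):
    """Hypothetical stand-in for a function whose result is approximate."""
    warn_once("median", "median is approximate (illustrative message)")
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    first = approximate_median([3, 1, 2])
    second = approximate_median([9, 7, 8])

print(first, second, len(caught))
```

Because the bookkeeping lives in the helper and not in the session state, the behaviour is the same in interactive and non-interactive runs, so tests see exactly one warning.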


