[jira] [Commented] (ARROW-15075) [C++][Dataset] Implement Dataset for JSON format
[ https://issues.apache.org/jira/browse/ARROW-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597245#comment-17597245 ] Edward Visel commented on ARROW-15075: -- After this is implemented, [the Google Quickdraw dataset](https://github.com/googlecreativelab/quickdraw-dataset) is a nice freely-available dataset in ndjson to use for benchmarking and demos and such > [C++][Dataset] Implement Dataset for JSON format > > > Key: ARROW-15075 > URL: https://issues.apache.org/jira/browse/ARROW-15075 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Will Jones >Priority: Major > Labels: dataset > > We already have support for reading individual files, but not yet for reading > datasets. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-15075) [C++][Dataset] Implement Dataset for JSON format
[ https://issues.apache.org/jira/browse/ARROW-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597245#comment-17597245 ] Edward Visel edited comment on ARROW-15075 at 8/29/22 3:04 PM: --- After this is implemented, [the Google Quickdraw dataset|https://github.com/googlecreativelab/quickdraw-dataset] is a nice freely-available dataset in ndjson to use for benchmarking and demos and such was (Author: alistaire): After this is implemented, [the Google Quickdraw dataset](https://github.com/googlecreativelab/quickdraw-dataset) is a nice freely-available dataset in ndjson to use for benchmarking and demos and such > [C++][Dataset] Implement Dataset for JSON format > > > Key: ARROW-15075 > URL: https://issues.apache.org/jira/browse/ARROW-15075 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Will Jones >Priority: Major > Labels: dataset > > We already have support for reading individual files, but not yet for reading > datasets. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups
[ https://issues.apache.org/jira/browse/ARROW-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edward Visel updated ARROW-16807: - Description: When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results: {code:r} library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> #> 1 5 {code} If the file is stored as a single row group, results are correct. When grouped, results are correct. 
I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}: {code:python} import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') pa.compute.count_distinct(starwars.column('sex')).as_py() #> 15 pa.compute.unique(starwars.column('sex')) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #>null #> ] {code} This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files. was: When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results: {code:r} library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> #> 1 5 {code} If the file is stored as a single row group, results are correct. 
When grouped, results are correct. I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}: {code:python} import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') print(pa.compute.count_distinct(starwars.column('sex')).as_py()) #> 15 print(pa.compute.unique(starwars.column('sex'))) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #>null #> ] {code} This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files. > [C++] count_distinct aggregates incorrectly across row groups > - > > Key: ARROW-16807 > URL: https://issues.apache.org/jira/browse/ARROW-16807 > Project: Apache Arrow > Issue Type: Bug > Environment: >
[jira] [Updated] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups
[ https://issues.apache.org/jira/browse/ARROW-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edward Visel updated ARROW-16807: - Description: When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by {{n_distinct}} in R) returns inaccurate and inconsistent results: {code:r} library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> #> 1 5 {code} If the file is stored as a single row group, results are correct. When grouped, results are correct. 
I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}: {code:python} import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') print(pa.compute.count_distinct(starwars.column('sex')).as_py()) #> 15 print(pa.compute.unique(starwars.column('sex'))) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #>null #> ] {code} This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files. was: When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results: {code:r} library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> #> 1 5 {code} If the file is stored as a single row group, results are correct. 
When grouped, results are correct. I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}: {code:python} import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') print(pa.compute.count_distinct(starwars.column('sex')).as_py()) #> 15 print(pa.compute.unique(starwars.column('sex'))) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #>null #> ] {code} This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files. > [C++] count_distinct aggregates incorrectly across row groups > - > > Key: ARROW-16807 > URL: https://issues.apache.org/jira/browse/ARROW-16807 > Project: Apache Arrow > Issue Type: Bug >
[jira] [Created] (ARROW-16808) [C++] count_distinct aggregates incorrectly across row groups
Edward Visel created ARROW-16808: Summary: [C++] count_distinct aggregates incorrectly across row groups Key: ARROW-16808 URL: https://issues.apache.org/jira/browse/ARROW-16808 Project: Apache Arrow Issue Type: Bug Environment: > arrow::arrow_info() Arrow package version: 8.0.0.9000 Capabilities: datasetTRUE substrait FALSE parquetTRUE json TRUE s3 TRUE utf8proc TRUE re2TRUE snappy TRUE gzip TRUE brotli TRUE zstd TRUE lz4TRUE lz4_frame TRUE lzo FALSE bz2TRUE jemalloc TRUE mimalloc FALSE Memory: Allocator jemalloc Current37.25 Kb Max 925.42 Kb Runtime: SIMD Level none Detected SIMD Level none Build: C++ Library Version9.0.0-SNAPSHOT C++ Compiler AppleClang C++ Compiler Version 13.1.6.13160021 Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9 Reporter: Edward Visel Fix For: 9.0.0, 8.0.1 When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results: {code:r} library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> #> 1 5 {code} If the 
file is stored as a single row group, results are correct. When grouped, results are correct. I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}: {code:python} import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') print(pa.compute.count_distinct(starwars.column('sex')).as_py()) #> 15 print(pa.compute.unique(starwars.column('sex'))) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #>null #> ] {code} This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Closed] (ARROW-16808) [C++] count_distinct aggregates incorrectly across row groups
[ https://issues.apache.org/jira/browse/ARROW-16808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edward Visel closed ARROW-16808. Resolution: Duplicate Duplicate of [ARROW-16807] > [C++] count_distinct aggregates incorrectly across row groups > - > > Key: ARROW-16808 > URL: https://issues.apache.org/jira/browse/ARROW-16808 > Project: Apache Arrow > Issue Type: Bug > Environment: > arrow::arrow_info() > Arrow package version: 8.0.0.9000 > Capabilities: > > datasetTRUE > substrait FALSE > parquetTRUE > json TRUE > s3 TRUE > utf8proc TRUE > re2TRUE > snappy TRUE > gzip TRUE > brotli TRUE > zstd TRUE > lz4TRUE > lz4_frame TRUE > lzo FALSE > bz2TRUE > jemalloc TRUE > mimalloc FALSE > Memory: > > Allocator jemalloc > Current37.25 Kb > Max 925.42 Kb > Runtime: > > SIMD Level none > Detected SIMD Level none > Build: > > C++ Library Version9.0.0-SNAPSHOT > C++ Compiler AppleClang > C++ Compiler Version 13.1.6.13160021 > Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9 >Reporter: Edward Visel >Priority: Blocker > Fix For: 9.0.0, 8.0.1 > > > When reading from parquet files with multiple row groups, {{count_distinct}} > (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results: > {code:r} > library(dplyr, warn.conflicts = FALSE) > path <- tempfile(fileext = '.parquet') > arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) > ds <- arrow::open_dataset(path) > ds %>% count(sex) %>% collect() > #> # A tibble: 5 × 2 > #> sex n > #> > #> 1 male 60 > #> 2 none 6 > #> 3 female 16 > #> 4 hermaphroditic 1 > #> 5 4 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 19 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 17 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 17 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 16 > ds %>% summarise(n = 
n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 16 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 17 > ds %>% summarise(n = n_distinct(sex)) %>% collect() > #> # A tibble: 1 × 1 > #> n > #> > #> 1 17 > # correct > ds %>% collect() %>% summarise(n = n_distinct(sex)) > #> # A tibble: 1 × 1 > #> n > #> > #> 1 5 > {code} > If the file is stored as a single row group, results are correct. When > grouped, results are correct. > I can reproduce this in Python as well using the same file and > {{pyarrow.compute.count_distinct}}: > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > pa.__version__ > #> 8.0.0 > starwars = > pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') > print(pa.compute.count_distinct(starwars.column('sex')).as_py()) > #> 15 > print(pa.compute.unique(starwars.column('sex'))) > #> [ > #> "male", > #> "none", > #> "female", > #> "hermaphroditic", > #>null > #> ] > {code} > This seems likely to be the same problem in this StackOverflow question: > https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array > which is working from orc files. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16807) [C++] count_distinct aggregates incorrectly across row groups
Edward Visel created ARROW-16807: Summary: [C++] count_distinct aggregates incorrectly across row groups Key: ARROW-16807 URL: https://issues.apache.org/jira/browse/ARROW-16807 Project: Apache Arrow Issue Type: Bug Environment: > arrow::arrow_info() Arrow package version: 8.0.0.9000 Capabilities: datasetTRUE substrait FALSE parquetTRUE json TRUE s3 TRUE utf8proc TRUE re2TRUE snappy TRUE gzip TRUE brotli TRUE zstd TRUE lz4TRUE lz4_frame TRUE lzo FALSE bz2TRUE jemalloc TRUE mimalloc FALSE Memory: Allocator jemalloc Current37.25 Kb Max 925.42 Kb Runtime: SIMD Level none Detected SIMD Level none Build: C++ Library Version9.0.0-SNAPSHOT C++ Compiler AppleClang C++ Compiler Version 13.1.6.13160021 Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9 Reporter: Edward Visel Fix For: 9.0.0, 8.0.1 When reading from parquet files with multiple row groups, {{count_distinct}} (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results: {code:r} library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> #> 1 5 {code} If the 
file is stored as a single row group, results are correct. When grouped, results are correct. I can reproduce this in Python as well using the same file and {{pyarrow.compute.count_distinct}}: {code:python} import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chcgn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') print(pa.compute.count_distinct(starwars.column('sex')).as_py()) #> 15 print(pa.compute.unique(starwars.column('sex'))) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #>null #> ] {code} This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16188) [R] Fix excess "Handling string data with embedded nuls" warning in tests
[ https://issues.apache.org/jira/browse/ARROW-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edward Visel updated ARROW-16188: - Summary: [R] Fix excess "Handling string data with embedded nuls" warning in tests (was: Fix excess "Handling string data with embedded nuls" warning in tests) > [R] Fix excess "Handling string data with embedded nuls" warning in tests > - > > Key: ARROW-16188 > URL: https://issues.apache.org/jira/browse/ARROW-16188 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Edward Visel >Priority: Major > > The R tests raise a warning > {code:java} > ══ Warnings > > 1. Handling string data with embedded nuls (test-RecordBatch.R:547:3) - > Stripping '\0' (nul) from character vector{code} > That test already has an expectation for the warning, so it is likely > being raised repeatedly and the excess are not suppressed. > AC: Warning not arising in tests because it is handled in an appropriate > manner -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16188) Fix excess "Handling string data with embedded nuls" warning in tests
Edward Visel created ARROW-16188: Summary: Fix excess "Handling string data with embedded nuls" warning in tests Key: ARROW-16188 URL: https://issues.apache.org/jira/browse/ARROW-16188 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Edward Visel The R tests raise a warning {code:java} ══ Warnings 1. Handling string data with embedded nuls (test-RecordBatch.R:547:3) - Stripping '\0' (nul) from character vector{code} That test already has an expectation for the warning, so it is likely being raised repeatedly and the excess are not suppressed. AC: Warning not arising in tests because it is handled in an appropriate manner -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14168) [R] Warn only once about arrow function differences
[ https://issues.apache.org/jira/browse/ARROW-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edward Visel reassigned ARROW-14168: Assignee: Edward Visel > [R] Warn only once about arrow function differences > --- > > Key: ARROW-14168 > URL: https://issues.apache.org/jira/browse/ARROW-14168 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Edward Visel >Priority: Major > Labels: good-first-issue > > When someone calls median or quantile, we warn them that it is approximate. > In interactive sessions, this happens only the first time for {{median}} > and {{quantile}} > https://github.com/apache/arrow/blob/d197ad31c3d7c16ecee74cb76a71ce397e905b3b/r/R/dplyr-summarize.R#L107-L111 > But when we test, the session is not interactive so we get a number of > spurious extra warnings. Because of this we set a local testthat edition > https://github.com/apache/arrow/blob/d197ad31c3d7c16ecee74cb76a71ce397e905b3b/r/tests/testthat/test-dplyr-summarize.R#L299-L301 > which swallows any extra warnings. > We can either only warn once regardless of interactivity (which is probably > totally fine for these warnings), or we could use something like > {{local_interactive}} in these tests so that we only get the first warning > https://rlang.r-lib.org/reference/is_interactive.html > -- This message was sent by Atlassian Jira (v8.20.1#820001)
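The "warn once regardless of interactivity" option discussed in the issue can be sketched in Python terms: gate each warning behind a session-level flag. The key and function names here are hypothetical illustrations, not the arrow R package's API.

```python
import warnings

# Session-level record of warnings already emitted.
_seen = set()

def warn_once(key: str, message: str) -> None:
    """Emit a given warning at most once per session, whether or not the
    session is interactive."""
    if key in _seen:
        return
    _seen.add(key)
    warnings.warn(message)

# Calling twice produces a single warning: the second call is suppressed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_once("median.approximate", "median() is approximate in Arrow")
    warn_once("median.approximate", "median() is approximate in Arrow")

print(len(caught))  # 1
```

The same effect in R could come from rlang's once-per-session warning frequency rather than an interactivity check, which would also keep non-interactive test runs quiet.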