[jira] [Updated] (ARROW-13865) Writing moderate-size parquet files of nested dataframes from R slows down/process hangs
[ https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sheffield updated ARROW-13865:
-----------------------------------
Description:

I observed a significant slowdown in parquet writes (and ultimately the process just hangs for minutes without completing) while writing moderate-size nested dataframes from R. I have replicated the issue on macOS and Ubuntu so far. An example:

```
testdf <- dplyr::tibble(
  id = uuid::UUIDgenerate(n = 5000),
  l1 = as.list(lapply(1:5000, function(x) runif(1000))),
  l2 = as.list(lapply(1:5000, function(x) rnorm(1000)))
)

testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))

# This works
arrow::write_parquet(testdf_long, "testdf_long.parquet")

# This write does not complete within a few minutes in my testing but throws no errors
arrow::write_parquet(testdf, "testdf.parquet")
```

I can't guess at why this is true, but the slowdown is closely tied to row counts:

```
# screenshot attached; 12ms, 56ms, and 680ms respectively
microbenchmark::microbenchmark(
  arrow::write_parquet(testdf[1, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
  times = 5
)
```

I'm using the CRAN version 5.0.0 in both cases.

The sessionInfo() for Ubuntu is:

R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arrow_5.0.0

And sessionInfo() for macOS is:

R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arrow_5.0.0
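The reported medians (12 ms, 56 ms, and 680 ms for 1, 10, and 100 rows) already grow faster than the row count. A quick sketch (Python; the numbers are the ones reported above, the helper name is my own) makes the superlinear growth explicit, which is at least consistent with the write appearing to hang at 5000 rows:

```python
# Reported write_parquet timings from the microbenchmark above (milliseconds),
# keyed by number of rows written.
timings_ms = {1: 12, 10: 56, 100: 680}

def growth_ratio(timings, n_small, n_large):
    """How much the observed time grew between two row counts."""
    return timings[n_large] / timings[n_small]

# Under linear scaling, a 10x row increase would give a ~10x time increase.
r_1_to_10 = growth_ratio(timings_ms, 1, 10)      # ~4.7x
r_10_to_100 = growth_ratio(timings_ms, 10, 100)  # ~12.1x, already worse than linear
```

This is only arithmetic on the three reported data points, not a measurement of arrow itself.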
[jira] [Created] (ARROW-13865) Writing moderate-size parquet files of nested dataframes from R slows down/process hangs
John Sheffield created ARROW-13865:
-----------------------------------
             Summary: Writing moderate-size parquet files of nested dataframes from R slows down/process hangs
                 Key: ARROW-13865
                 URL: https://issues.apache.org/jira/browse/ARROW-13865
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 5.0.0
            Reporter: John Sheffield
         Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png

I observed a significant slowdown in parquet writes (and ultimately the process just hangs for minutes without completing) while writing moderate-size nested dataframes from R. I have replicated the issue on macOS and Ubuntu so far. An example:

```
testdf <- dplyr::tibble(
  id = uuid::UUIDgenerate(n = 5000),
  l1 = as.list(lapply(1:5000, function(x) runif(1000))),
  l2 = as.list(lapply(1:5000, function(x) rnorm(1000)))
)

testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))

# This works
arrow::write_parquet(testdf_long, "testdf_long.parquet")

# This write does not complete within a few minutes in my testing but throws no errors
arrow::write_parquet(testdf, "testdf.parquet")
```

I can't guess at why this is true, but the slowdown is closely tied to row counts:

```
# screenshot attached; 12ms, 56ms, and 680ms respectively
microbenchmark::microbenchmark(
  arrow::write_parquet(testdf[1, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
  times = 5
)
```

I'm using the CRAN version 5.0.0 in both cases.

The sessionInfo() for Ubuntu is:

R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arrow_5.0.0

And sessionInfo() for macOS is:

R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arrow_5.0.0

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
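For scale, a sketch of what the reproduction above materializes (Python; pure arithmetic on the sizes stated in the report, nothing Arrow-specific): the nested frame has 5000 rows whose two list cells each hold 1000 doubles, and unnesting yields the same scalar values as a 5,000,000-row long frame, yet per the report only the long layout writes quickly.

```python
# Sizes from the reproduction above: 5000 rows, two list columns of 1000 values each.
n_rows, list_len = 5000, 1000

nested_rows = n_rows                          # one row per id; l1/l2 are list cells
long_rows = n_rows * list_len                 # tidyr::unnest(): one row per (id, element)
scalars_per_list_column = n_rows * list_len   # same data volume in either layout

# The long layout, which the report says writes without trouble, has 5,000,000 rows.
```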
[jira] [Comment Edited] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256576#comment-17256576 ]

John Sheffield edited comment on ARROW-11067 at 12/30/20, 4:07 PM:
-------------------------------------------------------------------
Hm, the plot thickens. I just replicated Weston's results for the arrow_sample_data.csv script in a few environments; the results suggest it might be a Mac-running-R-4.0 issue.

* *Success:* In a container (`rocker/geospatial:4.0.2`; the container itself is Ubuntu 20.04 LTS running on a GCE instance running Debian 10), I also see Weston's result of all successes, but using R 4.0.2 instead of his 3.6.3.

{code:java}
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_2.0.0 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 readr_1.3.1 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5 cellranger_1.1.0 pillar_1.4.7 compiler_4.0.2 dbplyr_1.4.4 tools_4.0.2 digest_0.6.27 bit_4.0.4 jsonlite_1.7.1
[10] lubridate_1.7.9 lifecycle_0.2.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.9 reprex_0.3.0 cli_2.2.0 DBI_1.1.0 rstudioapi_0.13
[19] haven_2.3.1 withr_2.3.0 xml2_1.3.2 httr_1.4.2 fs_1.5.0 generics_0.0.2 vctrs_0.3.5 hms_0.5.3 bit64_4.0.5
[28] grid_4.0.2 tidyselect_1.1.0 glue_1.4.2 R6_2.5.0 fansi_0.4.1 readxl_1.3.1 farver_2.0.3 modelr_0.1.8 blob_1.2.1
[37] magrittr_2.0.1 backports_1.1.10 scales_1.1.1 ellipsis_0.3.1 rvest_0.3.6 assertthat_0.2.1 colorspace_2.0-0 stringi_1.5.3 munsell_0.5.0
[46] broom_0.7.0 crayon_1.3.4
{code}

* *Failure:* In a fresh Mac R environment running the latest macOS (Big Sur 11.1 20C69) and R 4.0.3, the alternating success/failure pattern still shows up:

{code:java}
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_2.0.0 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5 cellranger_1.1.0 pillar_1.4.7 compiler_4.0.3 dbplyr_2.0.0 tools_4.0.3 digest_0.6.27 bit_4.0.4 jsonlite_1.7.2
[10] lubridate_1.7.9.2 lifecycle_0.2.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.9 reprex_0.3.0 cli_2.2.0 DBI_1.1.0 rstudioapi_0.13
[19] haven_2.3.1 withr_2.3.0 xml2_1.3.2 httr_1.4.2 fs_1.5.0 generics_0.1.0 vctrs_0.3.5 hms_0.5.3 bit64_4.0.5
[28] grid_4.0.3 tidyselect_1.1.0 glue_1.4.2 R6_2.5.0 fansi_0.4.1 readxl_1.3.1 farver_2.0.3 modelr_0.1.8 magrittr_2.0.1
[37] backports_1.2.1 scales_1.1.1 ellipsis_0.3.1 rvest_0.3.6 assertthat_0.2.1 colorspace_2.0-0 stringi_1.5.3 munsell_0.5.0 broom_0.7.2
[46] crayon_1.3.4
{code}
[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256576#comment-17256576 ]

John Sheffield commented on ARROW-11067:
----------------------------------------
Hm, the plot thickens. I just replicated Weston's results for the arrow_sample_data.csv script in a few environments; the results suggest it might be a Mac-running-R-4.0 issue.

* *Success:* In a container (`rocker/geospatial:4.0.2`; the container itself is Ubuntu 20.04 LTS running on a GCE instance running Debian 10), I also see Weston's result of all successes, but using R 4.0.2 instead of his 3.6.3.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_2.0.0 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 readr_1.3.1 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5 cellranger_1.1.0 pillar_1.4.7 compiler_4.0.2 dbplyr_1.4.4 tools_4.0.2 digest_0.6.27 bit_4.0.4 jsonlite_1.7.1
[10] lubridate_1.7.9 lifecycle_0.2.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.9 reprex_0.3.0 cli_2.2.0 DBI_1.1.0 rstudioapi_0.13
[19] haven_2.3.1 withr_2.3.0 xml2_1.3.2 httr_1.4.2 fs_1.5.0 generics_0.0.2 vctrs_0.3.5 hms_0.5.3 bit64_4.0.5
[28] grid_4.0.2 tidyselect_1.1.0 glue_1.4.2 R6_2.5.0 fansi_0.4.1 readxl_1.3.1 farver_2.0.3 modelr_0.1.8 blob_1.2.1
[37] magrittr_2.0.1 backports_1.1.10 scales_1.1.1 ellipsis_0.3.1 rvest_0.3.6 assertthat_0.2.1 colorspace_2.0-0 stringi_1.5.3 munsell_0.5.0
[46] broom_0.7.0 crayon_1.3.4

* *Failure:* In a fresh Mac R environment running the latest macOS (Big Sur 11.1 20C69) and R 4.0.3, the alternating success/failure pattern still shows up:

{code:java}
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_2.0.0 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5 cellranger_1.1.0 pillar_1.4.7 compiler_4.0.3 dbplyr_2.0.0 tools_4.0.3 digest_0.6.27 bit_4.0.4 jsonlite_1.7.2
[10] lubridate_1.7.9.2 lifecycle_0.2.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.9 reprex_0.3.0 cli_2.2.0 DBI_1.1.0 rstudioapi_0.13
[19] haven_2.3.1 withr_2.3.0 xml2_1.3.2 httr_1.4.2 fs_1.5.0 generics_0.1.0 vctrs_0.3.5 hms_0.5.3 bit64_4.0.5
[28] grid_4.0.3 tidyselect_1.1.0 glue_1.4.2 R6_2.5.0 fansi_0.4.1 readxl_1.3.1 farver_2.0.3 modelr_0.1.8 magrittr_2.0.1
[37] backports_1.2.1 scales_1.1.1 ellipsis_0.3.1 rvest_0.3.6 assertthat_0.2.1 colorspace_2.0-0 stringi_1.5.3 munsell_0.5.0 broom_0.7.2
[46] crayon_1.3.4
{code}

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> ------------------------------------------------------------------------
>
>                 Key: ARROW-11067
>                 URL: https://issues.apache.org/jira/browse/ARROW-11067
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: John Sheffield
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: arrow_explanation.png, arrow_failure_cases.csv, arrow_failure_cases.csv, arrowbug1.png, arrowbug1.png, demo_data.csv
>
> A sample file is attached, showing 10 rows each of strings with consistent failures (false_na = TRUE) and consistent successes (false_na = FALSE). The strings are in the column `json_string` – if relevant, they are geojsons with min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table shown), the files are imported correctly and there are no NAs in the json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column ends up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so this might not be limited to the R interface, but I can't help debug much further upstream.
[jira] [Comment Edited] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256218#comment-17256218 ]

John Sheffield edited comment on ARROW-11067 at 12/30/20, 1:35 AM:
-------------------------------------------------------------------
(Sorry for the fragmented report here, but I figured out a way to really isolate the issue.)

The string read failures are deterministic and predictable, and the content of the strings doesn't seem to matter – only length. Success and failure switch at every integer multiple of 32 * 1024 characters. Writing N = floor(nchar / (32 * 1024)):

* For N in [0, 1), i.e. string lengths between 0 and 32767 characters, all reads succeed.
* For N in [1, 2), i.e. string lengths between 32768 and 65535, all reads fail.
* The same pattern repeats until we hit LongString limits: if N is 0 or even, the read succeeds; if N is odd, it fails.

Code:

{code:java}
library(tidyverse)
library(arrow)

generate_string <- function(n) {
  paste0(sample(c(LETTERS, letters), size = n, replace = TRUE), collapse = "")
}

sample_breaks <- (1:60L * 16L * 1024L)
sample_lengths <- sample_breaks - 1

set.seed(1234)
test_strings <- purrr::map_chr(sample_lengths, generate_string)
readr::write_csv(data.frame(str = test_strings, strlen = sample_lengths),
                 "arrow_sample_data.csv")

arrow::read_csv_arrow("arrow_sample_data.csv") %>%
  dplyr::mutate(failed_case = ifelse(is.na(str), "failed", "succeeded")) %>%
  dplyr::select(-str) %>%
  ggplot(data = ., aes(x = (strlen / (32 * 1024)), y = failed_case)) +
  geom_point(aes(color = ifelse(floor(strlen / (32 * 1024)) %% 2 == 0, "even", "odd")),
             size = 3) +
  scale_x_continuous(breaks = seq(0, 30)) +
  labs(x = "string length / (32 * 1024) : integer multiple of 32kb",
       y = "string read success/failure",
       color = "even/odd multiple of 32kb")
{code}

!arrow_explanation.png!
[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256218#comment-17256218 ]

John Sheffield commented on ARROW-11067:
----------------------------------------
(Sorry for the fragmented report here, but I figured out a way to really isolate the issue.)

The string read failures are deterministic and predictable, and the content of the strings doesn't seem to matter – only length. Success and failure switch at every integer multiple of 32 * 1024 characters. Writing N = floor(nchar / (32 * 1024)):

* For N in [0, 1), i.e. string lengths between 0 and 32767 characters, all reads succeed.
* For N in [1, 2), i.e. string lengths between 32768 and 65535, all reads fail.
* The same pattern repeats until we hit LongString limits: if N is 0 or even, the read succeeds; if N is odd, it fails.

Code:

{code:java}
library(tidyverse)
library(arrow)

generate_string <- function(n) {
  paste0(sample(c(LETTERS, letters), size = n, replace = TRUE), collapse = "")
}

sample_breaks <- (1:60L * 16L * 1024L)
sample_lengths <- sample_breaks - 1

set.seed(1234)
test_strings <- purrr::map_chr(sample_lengths, generate_string)
readr::write_csv(data.frame(str = test_strings, strlen = sample_lengths),
                 "arrow_sample_data.csv")

arrow::read_csv_arrow("arrow_sample_data.csv") %>%
  dplyr::mutate(failed_case = ifelse(is.na(str), "failed", "succeeded")) %>%
  dplyr::select(-str) %>%
  ggplot(data = ., aes(x = (strlen / (32 * 1024)), y = failed_case)) +
  geom_point(aes(color = ifelse(floor(strlen / (32 * 1024)) %% 2 == 0, "even", "odd")),
             size = 3) +
  scale_x_continuous(breaks = seq(0, 30)) +
  labs(x = "string length / (32 * 1024) : integer multiple of 32kb",
       y = "string read success/failure",
       color = "even/odd multiple of 32kb")
{code}

!arrow_explanation.png!
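The block-parity rule described in this comment can be sketched as a tiny predictor (Python; the function name is mine, the rule is exactly the one stated above — a read fails iff floor(nchar / (32 * 1024)) is odd):

```python
BLOCK = 32 * 1024  # 32768-character block size at which the reported behavior flips

def read_expected_to_fail(strlen: int) -> bool:
    """Predict the reported outcome: the read fails iff the 32 KiB block index is odd."""
    return (strlen // BLOCK) % 2 == 1

# Boundary checks matching the bullets above.
assert not read_expected_to_fail(32767)  # N in [0, 1): succeeds
assert read_expected_to_fail(32768)      # N in [1, 2): fails
assert read_expected_to_fail(65535)      # still N in [1, 2): fails
assert not read_expected_to_fail(65536)  # N in [2, 3): succeeds

# The reprex's lengths (k * 16 KiB - 1) land two per 32 KiB block, so the
# predicted outcomes alternate in pairs: succeed, succeed, fail, fail, ...
sample_lengths = [k * 16 * 1024 - 1 for k in range(1, 61)]
predicted = [read_expected_to_fail(n) for n in sample_lengths]
```

This mirrors the even/odd coloring in the attached plot, not the Arrow reader itself.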
> [R] read_csv_arrow silently fails to read some strings and returns nulls
> ------------------------------------------------------------------------
>
>                 Key: ARROW-11067
>                 URL: https://issues.apache.org/jira/browse/ARROW-11067
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: John Sheffield
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: arrow_explanation.png, arrow_failure_cases.csv, arrow_failure_cases.csv, arrowbug1.png, arrowbug1.png, demo_data.csv
>
> A sample file is attached, showing 10 rows each of strings with consistent failures (false_na = TRUE) and consistent successes (false_na = FALSE). The strings are in the column `json_string` – if relevant, they are geojsons with min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table shown), the files are imported correctly and there are no NAs in the json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column ends up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so this might not be limited to the R interface, but I can't help debug much further upstream.
>
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
>
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1)))   # Scalar 0.5
> mean(is.na(bbb$json_string))  # 0
> mean(is.na(ccc$json_string))  # 0
> {code}
>
> * arrow 2.0 (latest CRAN)
> * readr 1.4.0
> * data.table 1.13.2
> * R version 4.0.1 (2020-06-06)
> * macOS Catalina 10.15.7 / x86_64-apple-darwin17.0

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sheffield updated ARROW-11067:
-----------------------------------
    Attachment: arrow_explanation.png

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sheffield updated ARROW-11067:
---
Attachment: arrowbug1.png
[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256207#comment-17256207 ]

John Sheffield commented on ARROW-11067:

I pulled a few string lengths over a much larger dataset and found something useful. There is a very definite "striping" of success/failure patterns: failures begin at an nchar of 32,767; the failures then stop and all cases succeed between 65,685 and 98,832 chars; then we switch back to failures. The graph below captures it all. (Unfortunately, I can't share the full dataset this came from for confidentiality reasons, but I'm betting I can recreate the effect on simulated data. I also attached the distribution of character counts by success/failure; this is the CSV behind the plot, dropping cases below 30k characters, which succeeded 100% of the time.)

[^arrow_failure_cases.csv] !arrowbug1.png!
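The boundary reported above is suggestive: 32,767 is 2^15 − 1, the maximum value of a signed 16-bit integer, which may hint at an offset or buffer-size issue rather than anything about the string contents (though that is speculation, not something the report establishes). As a hedged sketch, not part of the original report, a small stdlib-only Python script can generate a synthetic, shareable CSV with fields straddling that boundary:

```python
import csv
import random
import string

def synthetic_field(n):
    """Return a single field of exactly n printable characters."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=n))

# Lengths chosen to straddle the suspected 2**15 - 1 boundary.
boundary = 2**15 - 1  # 32767, the int16 maximum
lengths = [boundary - 1, boundary, boundary + 1, 2 * boundary, 3 * boundary]

# Write one long string per row, plus an id column, as in demo_data.csv.
with open("boundary_repro.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "json_string"])
    for i, n in enumerate(lengths):
        writer.writerow([i, synthetic_field(n)])
```

Reading boundary_repro.csv back with arrow::read_csv_arrow and checking which rows come back as NA would confirm whether field length alone triggers the failure, independent of the confidential data.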
[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sheffield updated ARROW-11067:
---
Attachment: arrow_failure_cases.csv
[jira] [Issue Comment Deleted] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sheffield updated ARROW-11067:
---
Comment: was deleted (was: I pulled a few strings over a much larger dataset and came to something useful. There is an extremely definite 'striping' of success/failure patterns beginning at nchar of 32,767 (where failures start); then the failures stop and all cases succeed between 65,685 and 98,832 chars; and then we switch back to failures. The graph below captures it all. (Unfortunately, can't share the full dataset this came from for confidentiality reasons, but I'm betting that I can recreate the effect on something simulated. I also attached the distribution of character counts by success/failure – this is the CSV behind the plot, dropping cases below 30k characters which 100% succeeded.) [^arrow_failure_cases.csv] !arrowbug1.png!)
[jira] [Comment Edited] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256206#comment-17256206 ]

John Sheffield edited comment on ARROW-11067 at 12/29/20, 11:38 PM:

I pulled a few strings over a much larger dataset and came to something useful. There is an extremely definite 'striping' of success/failure patterns beginning at nchar of 32,767 (where failures start); then the failures stop and all cases succeed between 65,685 and 98,832 chars; and then we switch back to failures. The graph below captures it all. (Unfortunately, can't share the full dataset this came from for confidentiality reasons, but I'm betting that I can recreate the effect on something simulated. I also attached the distribution of character counts by success/failure – this is the CSV behind the plot, dropping cases below 30k characters which 100% succeeded.)

[^arrow_failure_cases.csv] !arrowbug1.png!

was (Author: jms):
I pulled a few strings over a much larger dataset and came to something useful. There is an extremely definite 'striping' of success/failure patterns beginning at nchar of 32,767 (where failures start); then the failures stop and all cases succeed between 65,685 and 98,832 chars; and then we switch back to failures. The graph below captures it all. (Unfortunately, can't share the full dataset this came from for confidentiality reasons, but I'm betting that I can recreate the effect on something simulated. I also attached the distribution of character counts by success/failure – this is the CSV behind the plot, dropping cases below 30k characters which 100%[^arrow_failure_cases.csv] succeeded.) !arrowbug1.png!
[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256206#comment-17256206 ]

John Sheffield commented on ARROW-11067:

I pulled a few strings over a much larger dataset and came to something useful. There is an extremely definite 'striping' of success/failure patterns beginning at nchar of 32,767 (where failures start); then the failures stop and all cases succeed between 65,685 and 98,832 chars; and then we switch back to failures. The graph below captures it all. (Unfortunately, can't share the full dataset this came from for confidentiality reasons, but I'm betting that I can recreate the effect on something simulated. I also attached the distribution of character counts by success/failure – this is the CSV behind the plot, dropping cases below 30k characters which 100% succeeded.)

[^arrow_failure_cases.csv] !arrowbug1.png!
[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sheffield updated ARROW-11067:
---
Attachment: arrow_failure_cases.csv
[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls
[ https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sheffield updated ARROW-11067:
---
Attachment: arrowbug1.png
[jira] [Created] (ARROW-11067) read_csv_arrow silently fails to read some strings and returns nulls
John Sheffield created ARROW-11067:
---
Summary: read_csv_arrow silently fails to read some strings and returns nulls
Key: ARROW-11067
URL: https://issues.apache.org/jira/browse/ARROW-11067
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: John Sheffield
Attachments: demo_data.csv
[jira] [Updated] (ARROW-10485) open_dataset(): specifying partition when hive_style =TRUE fails silently
[ https://issues.apache.org/jira/browse/ARROW-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sheffield updated ARROW-10485:
---
Description:

> open_dataset(): specifying partition when hive_style = TRUE fails silently
>
> Key: ARROW-10485
> URL: https://issues.apache.org/jira/browse/ARROW-10485
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 2.0.0
> Environment: MacOS Catalina 10.15.7 (19H2), R 4.0.1, arrow R package v2.0.0
> Reporter: John Sheffield
> Priority: Minor
>
> When writing a dataset with hive_style = TRUE (now the default), that dataset has to be opened without an explicit definition of the partitions to work as expected. Even if the correct partition is specified, any query against the partition field returns 0 rows.
>
> From my perspective as a user, I'd want this to raise an error (not just a warning), probably when open_dataset() is first called.
>
> ```
> data("mtcars")
> arrow::write_dataset(dataset = mtcars, path = "mtcarstest", partitioning = "cyl", format = "parquet", hive_style = TRUE)
>
> mtc1 <- arrow::open_dataset("mtcarstest", partitioning = "cyl")
> mtc2 <- arrow::open_dataset("mtcarstest")
>
> # Returns 0 rows:
> mtc1 %>%
>   dplyr::filter(cyl == 4) %>%
>   collect()
>
> # Works as expected:
> mtc2 %>%
>   dplyr::filter(cyl == 4) %>%
>   collect()
> ```
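For intuition on the failure mode described above: a hive-style path already names the partition key in each directory segment (e.g. mtcarstest/cyl=4/part-0.parquet), whereas directory-style partitioning stores only the bare value (mtcarstest/4/part-0.parquet). A plausible reading of the bug, not confirmed by the report, is that declaring partitioning = "cyl" switches the reader to directory-style interpretation, so the literal segment "cyl=4" becomes the value of cyl and a filter on cyl == 4 can never match. This illustrative Python sketch (not arrow's actual implementation) contrasts the two interpretations:

```python
def parse_hive_segment(segment):
    """Hive-style: the segment itself carries key=value."""
    key, sep, value = segment.partition("=")
    if not sep:
        raise ValueError(f"not a hive-style segment: {segment!r}")
    return key, value

def parse_directory_segment(segment, declared_key):
    """Directory-style: the caller declares the key; the segment is the raw value."""
    return declared_key, segment

segment = "cyl=4"

# Hive-style autodetection recovers the intended value:
print(parse_hive_segment(segment))              # ('cyl', '4')

# Declaring the partition over a hive-style path takes the whole
# segment as the value, so a filter on cyl == 4 matches nothing:
print(parse_directory_segment(segment, "cyl"))  # ('cyl', 'cyl=4')
```

If this is the mechanism, it also explains why the failure is silent: "cyl=4" is a perfectly valid (if nonsensical) partition value, so nothing errors out.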
[jira] [Created] (ARROW-10485) open_dataset(): specifying partition when hive_style =TRUE fails silently
John Sheffield created ARROW-10485:
---
Summary: open_dataset(): specifying partition when hive_style = TRUE fails silently
Key: ARROW-10485
URL: https://issues.apache.org/jira/browse/ARROW-10485
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 2.0.0
Environment: MacOS Catalina 10.15.7 (19H2), R 4.0.1, arrow R package v2.0.0
Reporter: John Sheffield