[jira] [Updated] (ARROW-13865) Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

2021-09-02 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-13865:
---
Description: 
I observed a significant slowdown in parquet writes (and ultimately the process just hangs for minutes without completing) while writing moderate-size nested dataframes from R. I have replicated the issue on macOS and Ubuntu so far.
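For scale (editor's back-of-envelope arithmetic, not from the report): the test frame below holds 5,000 rows with two 1,000-element list columns, i.e. about 10 million doubles, roughly 80 MB of raw values.

```python
# Back-of-envelope size of the nested test frame described below:
# 5,000 rows x 2 list columns x 1,000 doubles each, 8 bytes per double.
rows, list_cols, list_len = 5000, 2, 1000
n_values = rows * list_cols * list_len   # total doubles in the list columns
raw_mb = n_values * 8 / 1e6              # raw payload in megabytes
print(n_values, raw_mb)  # 10000000 80.0
```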

 

An example:

```
testdf <- dplyr::tibble(
  id = uuid::UUIDgenerate(n = 5000),
  l1 = as.list(lapply(1:5000, function(x) runif(1000))),
  l2 = as.list(lapply(1:5000, function(x) rnorm(1000)))
)

testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))

# This works
arrow::write_parquet(testdf_long, "testdf_long.parquet")

# This write does not complete within a few minutes in my testing but throws no errors
arrow::write_parquet(testdf, "testdf.parquet")
```

I can't guess at why this is true, but the slowdown is closely tied to row 
counts:

```
# screenshot attached; 12ms, 56ms, and 680ms respectively.
microbenchmark::microbenchmark(
  arrow::write_parquet(testdf[1, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
  times = 5
)
```

I'm using the CRAN version 5.0.0 in both cases. The sessionInfo() for Ubuntu is:

R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_5.0.0

And sessionInfo() for macOS is:

R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
 [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] arrow_5.0.0

[jira] [Created] (ARROW-13865) Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

2021-09-02 Thread John Sheffield (Jira)
John Sheffield created ARROW-13865:
--

 Summary: Writing moderate-size parquet files of nested dataframes 
from R slows down/process hangs
 Key: ARROW-13865
 URL: https://issues.apache.org/jira/browse/ARROW-13865
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 5.0.0
Reporter: John Sheffield
 Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-30 Thread John Sheffield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256576#comment-17256576
 ] 

John Sheffield edited comment on ARROW-11067 at 12/30/20, 4:07 PM:
---

Hm, the plot thickens. I just replicated Weston's results for the 
arrow_sample_data.csv script in a few environments that suggest it might be a 
Mac-running-R4.0 issue. 
 * *Success:* In a container (`rocker/geospatial:4.0.2`; the container itself is Ubuntu 20.04 LTS, running on a GCE instance running Debian 10), I also see Weston's result of all successes, but using R 4.0.2 instead of his 3.6.3.

 
{code:java}
> sessionInfo()
R version 4.0.2 (2020-06-22)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Ubuntu 20.04 LTS
Matrix products: default
 BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 
LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 
 [6] LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C 
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
 [1] stats graphics grDevices utils datasets methods base
other attached packages:
 [1] arrow_2.0.0 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 
readr_1.3.1 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0
loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5 cellranger_1.1.0 pillar_1.4.7 compiler_4.0.2 dbplyr_1.4.4 
tools_4.0.2 digest_0.6.27 bit_4.0.4 jsonlite_1.7.1 
 [10] lubridate_1.7.9 lifecycle_0.2.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.9 
reprex_0.3.0 cli_2.2.0 DBI_1.1.0 rstudioapi_0.13 
 [19] haven_2.3.1 withr_2.3.0 xml2_1.3.2 httr_1.4.2 fs_1.5.0 generics_0.0.2 
vctrs_0.3.5 hms_0.5.3 bit64_4.0.5 
 [28] grid_4.0.2 tidyselect_1.1.0 glue_1.4.2 R6_2.5.0 fansi_0.4.1 readxl_1.3.1 
farver_2.0.3 modelr_0.1.8 blob_1.2.1 
 [37] magrittr_2.0.1 backports_1.1.10 scales_1.1.1 ellipsis_0.3.1 rvest_0.3.6 
assertthat_0.2.1 colorspace_2.0-0 stringi_1.5.3 munsell_0.5.0 
 [46] broom_0.7.0 crayon_1.3.4
{code}
 
 * *Failure:* In a fresh Mac R environment running the latest macOS (Big Sur 11.1 20C69) and R 4.0.3, the alternating success/failure pattern still shows up:

 
{code:java}
> sessionInfo() 
R version 4.0.3 (2020-10-10) 
Platform: x86_64-apple-darwin17.0 (64-bit) 
Running under: macOS Big Sur 10.16 
Matrix products: default 
LAPACK: 
/Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib 
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 
attached base packages: [1] stats graphics grDevices utils datasets methods 
base 
other attached packages: [1] arrow_2.0.0 forcats_0.5.0 stringr_1.4.0 
dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 
tidyverse_1.3.0 
loaded via a namespace (and not attached): [1] Rcpp_1.0.5 cellranger_1.1.0 
pillar_1.4.7 compiler_4.0.3 dbplyr_2.0.0 tools_4.0.3 digest_0.6.27 bit_4.0.4 
jsonlite_1.7.2 [10] lubridate_1.7.9.2 lifecycle_0.2.0 gtable_0.3.0 
pkgconfig_2.0.3 rlang_0.4.9 reprex_0.3.0 cli_2.2.0 DBI_1.1.0 rstudioapi_0.13 
[19] haven_2.3.1 withr_2.3.0 xml2_1.3.2 httr_1.4.2 fs_1.5.0 generics_0.1.0 
vctrs_0.3.5 hms_0.5.3 bit64_4.0.5 [28] grid_4.0.3 tidyselect_1.1.0 glue_1.4.2 
R6_2.5.0 fansi_0.4.1 readxl_1.3.1 farver_2.0.3 modelr_0.1.8 magrittr_2.0.1 [37] 
backports_1.2.1 scales_1.1.1 ellipsis_0.3.1 rvest_0.3.6 assertthat_0.2.1 
colorspace_2.0-0 stringi_1.5.3 munsell_0.5.0 broom_0.7.2 [46] crayon_1.3.4
{code}


[jira] [Comment Edited] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256218#comment-17256218
 ] 

John Sheffield edited comment on ARROW-11067 at 12/30/20, 1:35 AM:
---

(Sorry for the fragmented report here, but I figured out a way to really isolate the issue.)

 

The string read failures are deterministic and predictable, and the content of the strings doesn't seem to matter – only length. Success and failure alternate at every integer multiple of *N * (32 * 1024) characters*:
 * For N in [0,1), string lengths between 0 and 32,767 characters, all reads succeed.
 * For N in [1,2), string lengths between 32,768 and 65,535 characters, all reads fail.
 * The same pattern repeats until we hit LongString limits: if floor(nchar / (32 * 1024)) is 0 or even, the read succeeds. If floor(nchar / (32 * 1024)) is odd, it fails.

Code:
{code:java}
library(tidyverse)
library(arrow)

generate_string <- function(n){
  paste0(sample(c(LETTERS, letters), size = n, replace = TRUE), collapse = "")
}

sample_breaks <- (1:60L * 16L * 1024L)
sample_lengths <- sample_breaks - 1
set.seed(1234)

test_strings <- purrr::map_chr(sample_lengths, generate_string)

readr::write_csv(data.frame(str = test_strings, strlen = sample_lengths),
 "arrow_sample_data.csv")

arrow::read_csv_arrow("arrow_sample_data.csv") %>%
  dplyr::mutate(failed_case = ifelse(is.na(str), "failed", "succeeded")) %>%
  dplyr::select(-str) %>%
  ggplot(data = ., aes(x = (strlen / (32 * 1024)), y = failed_case)) +
  geom_point(aes(color = ifelse(floor(strlen / (32 * 1024)) %% 2 == 0, "even", 
"odd")), size = 3) +
  scale_x_continuous(breaks = seq(0, 30)) +
  labs(x = "string length / (32 * 1024) : integer multiple of 32kb",
   y = "string read success/failure",
   color = "even/odd multiple of 32kb")
{code}
 

!arrow_explanation.png!
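The parity rule above can be sketched as a tiny predicate (editor's illustration; `predicted_to_fail` is a hypothetical helper, not part of arrow):

```python
# A string of length n is predicted to fail iff floor(n / (32 * 1024)) is odd,
# matching the alternating even/odd pattern plotted above.
BLOCK = 32 * 1024

def predicted_to_fail(n: int) -> bool:
    return (n // BLOCK) % 2 == 1

# Lengths just below each 16 KiB multiple, as in the sample script:
for k in range(1, 7):
    n = k * 16 * 1024 - 1
    print(n, predicted_to_fail(n))
```

Under this rule the sample lengths alternate in pairs: 16,383 and 32,767 succeed, 49,151 and 65,535 fail, and so on.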


> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_explanation.png, arrow_failure_cases.csv, 
> arrow_failure_cases.csv, arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0



[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Attachment: arrow_explanation.png

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_explanation.png, arrow_failure_cases.csv, 
> arrow_failure_cases.csv, arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the file is imported correctly and there are no NAs in the 
> json_string column.
> When I read it with arrow::read_csv_arrow, 50% of the json_string values 
> end up as NA. Setting as_data_frame to TRUE or FALSE does not change the 
> behavior, so this might not be limited to the R interface, but I can't 
> debug much further upstream.
>  
>  
> {code}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  





[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Attachment: arrowbug1.png

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrow_failure_cases.csv, 
> arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the file is imported correctly and there are no NAs in the 
> json_string column.
> When I read it with arrow::read_csv_arrow, 50% of the json_string values 
> end up as NA. Setting as_data_frame to TRUE or FALSE does not change the 
> behavior, so this might not be limited to the R interface, but I can't 
> debug much further upstream.
>  
>  
> {code}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  





[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256207#comment-17256207
 ] 

John Sheffield commented on ARROW-11067:


I ran the check over a much larger dataset and found something useful. There 
is a very clear 'striping' of success/failure patterns: failures begin at an 
nchar of 32,767; all cases succeed between 65,685 and 98,832 characters; and 
then we switch back to failures. The graph below captures it.

(Unfortunately, I can't share the full dataset this came from for 
confidentiality reasons, but I'm betting that I can recreate the effect on 
simulated data. I also attached the distribution of character counts by 
success/failure – this is the CSV behind the plot, dropping the cases below 
30k characters, which succeeded 100% of the time.)

[^arrow_failure_cases.csv]

 

!arrowbug1.png!
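The striping described above should be reproducible without the confidential data. A minimal simulation sketch (the file name and the exact lengths are illustrative; note that 32,767 is 2^15 - 1, which hints at a signed 16-bit length limit somewhere in the chain):

```r
# Write strings whose lengths straddle the reported boundaries, then check
# which ones arrow::read_csv_arrow() returns as NA on the round trip.
lens <- c(30000, 32766, 32767, 32768, 70000, 99000, 150000)
df <- data.frame(
  id          = seq_along(lens),
  json_string = vapply(lens, function(n) strrep("a", n), character(1)),
  stringsAsFactors = FALSE
)
write.csv(df, "striping_test.csv", row.names = FALSE)

back <- arrow::read_csv_arrow("striping_test.csv")
# One row per test length: TRUE marks a string that came back as NA.
data.frame(len = lens, false_na = is.na(back$json_string))
```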

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrow_failure_cases.csv, 
> arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the file is imported correctly and there are no NAs in the 
> json_string column.
> When I read it with arrow::read_csv_arrow, 50% of the json_string values 
> end up as NA. Setting as_data_frame to TRUE or FALSE does not change the 
> behavior, so this might not be limited to the R interface, but I can't 
> debug much further upstream.
>  
>  
> {code}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  





[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Attachment: arrow_failure_cases.csv

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrow_failure_cases.csv, 
> arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the file is imported correctly and there are no NAs in the 
> json_string column.
> When I read it with arrow::read_csv_arrow, 50% of the json_string values 
> end up as NA. Setting as_data_frame to TRUE or FALSE does not change the 
> behavior, so this might not be limited to the R interface, but I can't 
> debug much further upstream.
>  
>  
> {code}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  





[jira] [Issue Comment Deleted] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Comment: was deleted

(was: I pulled a few strings over a much larger dataset and came to something 
useful. There is an extremely definite 'striping' of success/failure patterns 
beginning at nchar of 32,767 (where failures start); then the failures stop and 
all cases succeed between 65,685 and 98,832 chars; and then we switch back to 
failures. The graph below captures it all.   

(Unfortunately, can't share the full dataset this came from for confidentiality 
reasons, but I'm betting that I can recreate the effect on something simulated. 
I also attached the distribution of character counts by success/failure – this 
is the CSV behind the plot, dropping cases below 30k characters which 100% 
succeeded.)

[^arrow_failure_cases.csv]

 

!arrowbug1.png!)

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the file is imported correctly and there are no NAs in the 
> json_string column.
> When I read it with arrow::read_csv_arrow, 50% of the json_string values 
> end up as NA. Setting as_data_frame to TRUE or FALSE does not change the 
> behavior, so this might not be limited to the R interface, but I can't 
> debug much further upstream.
>  
>  
> {code}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  





[jira] [Comment Edited] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256206#comment-17256206
 ] 

John Sheffield edited comment on ARROW-11067 at 12/29/20, 11:38 PM:


I ran the check over a much larger dataset and found something useful. There 
is a very clear 'striping' of success/failure patterns: failures begin at an 
nchar of 32,767; all cases succeed between 65,685 and 98,832 characters; and 
then we switch back to failures. The graph below captures it.

(Unfortunately, I can't share the full dataset this came from for 
confidentiality reasons, but I'm betting that I can recreate the effect on 
simulated data. I also attached the distribution of character counts by 
success/failure – this is the CSV behind the plot, dropping the cases below 
30k characters, which succeeded 100% of the time.)

[^arrow_failure_cases.csv]

 

!arrowbug1.png!


was (Author: jms):
I pulled a few strings over a much larger dataset and came to something useful. 
There is an extremely definite 'striping' of success/failure patterns beginning 
at nchar of 32,767 (where failures start); then the failures stop and all cases 
succeed between 65,685 and 98,832 chars; and then we switch back to failures. 
The graph below captures it all.   

(Unfortunately, can't share the full dataset this came from for confidentiality 
reasons, but I'm betting that I can recreate the effect on something simulated. 
I also attached the distribution of character counts by success/failure – this 
is the CSV behind the plot, dropping cases below 30k characters which 
100%[^arrow_failure_cases.csv] succeeded.)

!arrowbug1.png!

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the file is imported correctly and there are no NAs in the 
> json_string column.
> When I read it with arrow::read_csv_arrow, 50% of the json_string values 
> end up as NA. Setting as_data_frame to TRUE or FALSE does not change the 
> behavior, so this might not be limited to the R interface, but I can't 
> debug much further upstream.
>  
>  
> {code}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  





[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256206#comment-17256206
 ] 

John Sheffield commented on ARROW-11067:


I ran the check over a much larger dataset and found something useful. There 
is a very clear 'striping' of success/failure patterns: failures begin at an 
nchar of 32,767; all cases succeed between 65,685 and 98,832 characters; and 
then we switch back to failures. The graph below captures it.

(Unfortunately, I can't share the full dataset this came from for 
confidentiality reasons, but I'm betting that I can recreate the effect on 
simulated data. I also attached the distribution of character counts by 
success/failure – this is the CSV behind the plot, dropping the cases below 
30k characters, which succeeded 100% of the time.)

[^arrow_failure_cases.csv]

!arrowbug1.png!

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the file is imported correctly and there are no NAs in the 
> json_string column.
> When I read it with arrow::read_csv_arrow, 50% of the json_string values 
> end up as NA. Setting as_data_frame to TRUE or FALSE does not change the 
> behavior, so this might not be limited to the R interface, but I can't 
> debug much further upstream.
>  
>  
> {code}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  





[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Attachment: arrow_failure_cases.csv

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the file is imported correctly and there are no NAs in the 
> json_string column.
> When I read it with arrow::read_csv_arrow, 50% of the json_string values 
> end up as NA. Setting as_data_frame to TRUE or FALSE does not change the 
> behavior, so this might not be limited to the R interface, but I can't 
> debug much further upstream.
>  
>  
> {code}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  





[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Attachment: arrowbug1.png

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the file is imported correctly and there are no NAs in the 
> json_string column.
> When I read it with arrow::read_csv_arrow, 50% of the json_string values 
> end up as NA. Setting as_data_frame to TRUE or FALSE does not change the 
> behavior, so this might not be limited to the R interface, but I can't 
> debug much further upstream.
>  
>  
> {code}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  





[jira] [Created] (ARROW-11067) read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)
John Sheffield created ARROW-11067:
--

 Summary: read_csv_arrow silently fails to read some strings and 
returns nulls
 Key: ARROW-11067
 URL: https://issues.apache.org/jira/browse/ARROW-11067
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: John Sheffield
 Attachments: demo_data.csv

A sample file is attached, showing 10 rows each of strings with consistent 
failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
strings are in the column `json_string` – if relevant, they are geojsons with 
min nchar of 33,229 and max nchar of 202,515.

When I read this sample file with other R CSV readers (readr and data.table 
shown), the file is imported correctly and there are no NAs in the 
json_string column.

When I read it with arrow::read_csv_arrow, 50% of the json_string values 
end up as NA. Setting as_data_frame to TRUE or FALSE does not change the 
behavior, so this might not be limited to the R interface, but I can't 
debug much further upstream.

 

 
{code}
aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
bbb <- data.table::fread("demo_data.csv")
ccc <- readr::read_csv("demo_data.csv")
mean(is.na(aaa1$json_string)) # 0.5
mean(is.na(aaa2$column(1))) # Scalar 0.5
mean(is.na(bbb$json_string)) # 0
mean(is.na(ccc$json_string)) # 0{code}
 

 
 * arrow 2.0 (latest CRAN)
 * readr 1.4.0
 * data.table 1.13.2
 * R version 4.0.1 (2020-06-06)
 * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
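For anyone without the attachment, a stand-in for demo_data.csv can be generated along these lines (purely illustrative; the real file contains GeoJSON strings between 33,229 and 202,515 characters, and whether a substitute triggers the bug will depend on the same length thresholds):

```r
# Build a stand-in for demo_data.csv: 10 strings in the length range that
# fails and 10 short ones that read back fine, mirroring the attachment.
long_strs  <- vapply(1:10, function(i) strrep("x", 35000 + i), character(1))
short_strs <- vapply(1:10, function(i) strrep("x", 1000 + i), character(1))
demo <- data.frame(
  false_na    = rep(c(TRUE, FALSE), each = 10),
  json_string = c(long_strs, short_strs),
  stringsAsFactors = FALSE
)
write.csv(demo, "demo_data_substitute.csv", row.names = FALSE)

# Re-run the comparison from the report against the substitute:
aaa <- arrow::read_csv_arrow("demo_data_substitute.csv")
mean(is.na(aaa$json_string))
```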

 

 





[jira] [Updated] (ARROW-10485) open_dataset(): specifying partition when hive_style =TRUE fails silently

2020-11-03 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-10485:
---
Description: 
When a dataset is written with hive_style = TRUE (now the default), it has to 
be opened without an explicit partitioning argument to work as expected. Even 
if the correct partition column is specified, any query filtering on the 
partition field returns 0 rows.

From my perspective as a user, I'd want this to fail with an explicit error 
(not just a warning), probably when open_dataset() is first called.

```

data("mtcars")
 arrow::write_dataset(dataset = mtcars, path = "mtcarstest", partitioning = 
"cyl", format = "parquet", hive_style = TRUE)

mtc1 <- arrow::open_dataset("mtcarstest", partitioning = "cyl")
 mtc2 <- arrow::open_dataset("mtcarstest")

mtc1 %>%
    dplyr::filter(cyl == 4) %>%
    collect()

mtc2 %>%
    dplyr::filter(cyl == 4) %>%
    collect()

```

  was:
When writing a dataset with hive_style = TRUE, now the default, that dataset 
has to be opened without an explicit definition of the partitions to work as 
expected. Even if the correct partition is specified, any query to the dataset 
on the partition field returns 0 rows.

 

From my eyes as a user, I'd want this to error out specifically (not just 
warn), probably when first calling open_dataset().

```

data("mtcars")
arrow::write_dataset(dataset = mtcars, path = "mtcarstest",
 partitioning = "cyl", format = "parquet",
 hive_style = TRUE)

mtc1 <- arrow::open_dataset("mtcarstest", partitioning = "cyl")
mtc2 <- arrow::open_dataset("mtcarstest")

mtc1 %>%
 dplyr::filter(cyl == 4) %>%
 collect()

mtc2 %>%
 dplyr::filter(cyl == 4) %>%
 collect()

```


> open_dataset(): specifying partition when hive_style =TRUE fails silently
> -
>
> Key: ARROW-10485
> URL: https://issues.apache.org/jira/browse/ARROW-10485
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 2.0.0
> Environment: MacOS Catalina 10.15.7 (19H2), R 4.0.1, arrow R package 
> v2.0.0
>Reporter: John Sheffield
>Priority: Minor
>
> When a dataset is written with hive_style = TRUE (now the default), it has 
> to be opened without an explicit partitioning argument to work as expected. 
> Even if the correct partition column is specified, any query filtering on 
> the partition field returns 0 rows.
>  
> From my perspective as a user, I'd want this to fail with an explicit error 
> (not just a warning), probably when open_dataset() is first called.
> ```
> data("mtcars")
>  arrow::write_dataset(dataset = mtcars, path = "mtcarstest", partitioning = 
> "cyl", format = "parquet", hive_style = TRUE)
> mtc1 <- arrow::open_dataset("mtcarstest", partitioning = "cyl")
>  mtc2 <- arrow::open_dataset("mtcarstest")
> mtc1 %>%
>     dplyr::filter(cyl == 4) %>%
>     collect()
> mtc2 %>%
>     dplyr::filter(cyl == 4) %>%
>     collect()
> ```





[jira] [Created] (ARROW-10485) open_dataset(): specifying partition when hive_style =TRUE fails silently

2020-11-03 Thread John Sheffield (Jira)
John Sheffield created ARROW-10485:
--

 Summary: open_dataset(): specifying partition when hive_style 
=TRUE fails silently
 Key: ARROW-10485
 URL: https://issues.apache.org/jira/browse/ARROW-10485
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 2.0.0
 Environment: MacOS Catalina 10.15.7 (19H2), R 4.0.1, arrow R package 
v2.0.0
Reporter: John Sheffield


When a dataset is written with hive_style = TRUE (now the default), it has to 
be opened without an explicit partitioning argument to work as expected. Even 
if the correct partition column is specified, any query filtering on the 
partition field returns 0 rows.

From my perspective as a user, I'd want this to fail with an explicit error 
(not just a warning), probably when open_dataset() is first called.

```

data("mtcars")
arrow::write_dataset(dataset = mtcars, path = "mtcarstest",
 partitioning = "cyl", format = "parquet",
 hive_style = TRUE)

mtc1 <- arrow::open_dataset("mtcarstest", partitioning = "cyl")
mtc2 <- arrow::open_dataset("mtcarstest")

mtc1 %>%
 dplyr::filter(cyl == 4) %>%
 collect()

mtc2 %>%
 dplyr::filter(cyl == 4) %>%
 collect()

```
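A quick way to confirm the silent failure is to compare row counts from the two handles (a sketch; it assumes the mtcarstest directory written by the snippet above):

```r
library(dplyr)

# With hive-style directories (cyl=4/, cyl=6/, ...), the partition column is
# recovered from the path names; passing partitioning = "cyl" on top of that
# is what breaks filtering.
mtc1 <- arrow::open_dataset("mtcarstest", partitioning = "cyl")
mtc2 <- arrow::open_dataset("mtcarstest")

n1 <- mtc1 %>% filter(cyl == 4) %>% collect() %>% nrow()
n2 <- mtc2 %>% filter(cyl == 4) %>% collect() %>% nrow()

# As reported: n1 is 0 (the silent failure), while n2 is 11, the true count
# of four-cylinder cars in mtcars.
c(with_partitioning_arg = n1, without = n2)
```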


