[jira] [Updated] (ARROW-13865) [C++][R] Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13865:

Fix Version/s: 7.0.0

> [C++][R] Writing moderate-size parquet files of nested dataframes from R 
> slows down/process hangs
> -
>
> Key: ARROW-13865
> URL: https://issues.apache.org/jira/browse/ARROW-13865
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 5.0.0
>Reporter: John Sheffield
>Priority: Major
> Fix For: 7.0.0
>
> Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png
>
>
> I observed a significant slowdown in Parquet writes (ultimately the process 
> hangs for minutes without completing) while writing moderate-size nested 
> data frames from R. I have reproduced the issue on macOS and Ubuntu so far.
>  
> An example:
> ```
> testdf <- dplyr::tibble(
>  id = uuid::UUIDgenerate(n = 5000),
>  l1 = as.list(lapply(1:5000, (function( x ) runif(1000)))),
>  l2 = as.list(lapply(1:5000, (function( x ) rnorm(1000))))
>  )
> testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))
>  
>  # This works
> arrow::write_parquet(testdf_long, "testdf_long.parquet")
>  # This write does not complete within a few minutes in my testing but throws
>  # no errors
>  arrow::write_parquet(testdf, "testdf.parquet")
> ```
> I can't guess at why this is true, but the slowdown is closely tied to row 
> counts:
> ```
>  # screenshot attached; 12ms, 56ms, and 680ms respectively.
> microbenchmark::microbenchmark(
>  arrow::write_parquet(testdf[1, ], "testdf.parquet"),
>  arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
>  arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
>  times = 5
>  )
> ```
> I'm using the CRAN version 5.0.0 in both cases. The sessionInfo() for Ubuntu 
> is
>  R version 4.0.5 (2021-03-31)
>  Platform: x86_64-pc-linux-gnu (64-bit)
>  Running under: Ubuntu 20.04.3 LTS
> Matrix products: default
>  BLAS/LAPACK: 
> /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
> locale:
>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 
> LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C 
>  [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C 
> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
>  [1] stats graphics grDevices utils datasets methods base
> other attached packages:
>  [1] arrow_5.0.0
> And sessionInfo for MacOS is:
>  R version 4.0.1 (2020-06-06)
>  Platform: x86_64-apple-darwin17.0 (64-bit)
>  Running under: macOS Catalina 10.15.7
> Matrix products: default
>  BLAS: 
> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
>  LAPACK: 
> /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
> locale:
>  [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
>  [1] stats graphics grDevices utils datasets methods base
> other attached packages:
>  [1] arrow_5.0.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13865) [C++][R] Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

2021-09-02 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13865:

Summary: [C++][R] Writing moderate-size parquet files of nested dataframes 
from R slows down/process hangs  (was: Writing moderate-size parquet files of 
nested dataframes from R slows down/process hangs)

> [C++][R] Writing moderate-size parquet files of nested dataframes from R 
> slows down/process hangs
> -
>
> Key: ARROW-13865
> URL: https://issues.apache.org/jira/browse/ARROW-13865
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
>Reporter: John Sheffield
>Priority: Major
> Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png





[jira] [Updated] (ARROW-13865) [C++][R] Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

2021-09-02 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13865:

Component/s: C++

> [C++][R] Writing moderate-size parquet files of nested dataframes from R 
> slows down/process hangs
> -
>
> Key: ARROW-13865
> URL: https://issues.apache.org/jira/browse/ARROW-13865
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 5.0.0
>Reporter: John Sheffield
>Priority: Major
> Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png


