[ https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-13865:
------------------------------------
    Fix Version/s: 7.0.0

> [C++][R] Writing moderate-size parquet files of nested dataframes from R slows down/process hangs
> -------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13865
>                 URL: https://issues.apache.org/jira/browse/ARROW-13865
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 5.0.0
>            Reporter: John Sheffield
>            Priority: Major
>             Fix For: 7.0.0
>
>         Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png
>
>
> I observed a significant slowdown in parquet writes (and ultimately the process just hangs for minutes without completion) while writing moderate-size nested dataframes from R. I have replicated the issue on MacOS and Ubuntu so far.
>
> An example:
> ```
> testdf <- dplyr::tibble(
>   id = uuid::UUIDgenerate(n = 5000),
>   l1 = as.list(lapply(1:5000, (function( x ) runif(1000)))),
>   l2 = as.list(lapply(1:5000, (function( x ) rnorm(1000))))
> )
> testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))
>
> # This works
> arrow::write_parquet(testdf_long, "testdf_long.parquet")
> # This write does not complete within a few minutes on my testing but throws no errors
> arrow::write_parquet(testdf, "testdf.parquet")
> ```
> I can't guess at why this is true, but the slowdown is closely tied to row counts:
> ```
> # screenshot attached; 12ms, 56ms, and 680ms respectively.
> microbenchmark::microbenchmark(
>   arrow::write_parquet(testdf[1, ], "testdf.parquet"),
>   arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
>   arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
>   times = 5
> )
> ```
> I'm using the CRAN version 5.0.0 in both cases.
> The sessionInfo() for Ubuntu is
> R version 4.0.5 (2021-03-31)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
> Matrix products: default
> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] arrow_5.0.0
>
> And sessionInfo for MacOS is:
> R version 4.0.1 (2020-06-06)
> Platform: x86_64-apple-darwin17.0 (64-bit)
> Running under: macOS Catalina 10.15.7
> Matrix products: default
> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] arrow_5.0.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)