[ https://issues.apache.org/jira/browse/ARROW-14020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Keane resolved ARROW-14020.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 11277
[https://github.com/apache/arrow/pull/11277]

> [R] Writing datafames with list columns is slow and scales poorly with nesting level
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-14020
>                 URL: https://issues.apache.org/jira/browse/ARROW-14020
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 5.0.0
>         Environment: Windows 10 x64
>            Reporter: Miles McBain
>            Assignee: Jonathan Keane
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 6.0.0
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Writing data frames that contain list columns seems much slower than expected:
> ``` r
> library(tidyverse)
> #> Warning: package 'tidyverse' was built under R version 4.1.1
> #> Warning: package 'tibble' was built under R version 4.1.1
> #> Warning: package 'readr' was built under R version 4.1.1
> library(arrow)
> #> Warning: package 'arrow' was built under R version 4.1.1
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #>     timestamp
> dummy <- tibble(
>   points = rep(list(seq(6)), 2e6),
>   index = seq(2e6)
> )
> # very slooooooow
> system.time(write_parquet(dummy, "dummy.parquet"))
> #>    user  system elapsed
> #>   55.64    0.11   55.98
> dummy_txt <- mutate(dummy, points = map_chr(points, deparse))
> # orders of magnitude faster
> system.time(write_parquet(dummy_txt, "dummytext.parquet"))
> #>    user  system elapsed
> #>    0.24    0.02    0.25
> ```
> <sup>Created on 2021-09-17 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
> <details style="margin-bottom:10px;">
> <summary>Session info</summary>
> ``` r
> sessioninfo::session_info()
> #> - Session info ---------------------------------------------------------------
> #>  setting  value
> #>  version  R version 4.1.0 (2021-05-18)
> #>  os       Windows 10 x64
> #>  system   x86_64, mingw32
> #>  ui       RTerm
> #>  language (EN)
> #>  collate  English_Australia.1252
> #>  ctype    English_Australia.1252
> #>  tz       Australia/Brisbane
> #>  date     2021-09-17
> #>
> #> - Packages -------------------------------------------------------------------
> #>  package     * version date       lib source
> #>  arrow       * 5.0.0.2 2021-09-05 [1] CRAN (R 4.1.1)
> #>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
> #>  backports     1.2.1   2020-12-09 [1] CRAN (R 4.1.0)
> #>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.0)
> #>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.0)
> #>  broom         0.7.7   2021-06-13 [1] CRAN (R 4.1.0)
> #>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.1.0)
> #>  cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.0)
> #>  colorspace    2.0-2   2021-06-24 [1] CRAN (R 4.1.0)
> #>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
> #>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
> #>  dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.1.0)
> #>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
> #>  dplyr       * 1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
> #>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
> #>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
> #>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
> #>  forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.1.0)
> #>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
> #>  generics      0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
> #>  ggplot2     * 3.3.5   2021-06-25 [1] CRAN (R 4.1.0)
> #>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
> #>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.1.0)
> #>  haven         2.4.1   2021-04-23 [1] CRAN (R 4.1.0)
> #>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
> #>  hms           1.1.0   2021-05-17 [1] CRAN (R 4.1.0)
> #>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
> #>  httr          1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
> #>  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.1.0)
> #>  knitr         1.33    2021-04-24 [1] CRAN (R 4.1.0)
> #>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.1.0)
> #>  lubridate     1.7.10  2021-02-26 [1] CRAN (R 4.1.0)
> #>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
> #>  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.1.0)
> #>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.1.0)
> #>  pillar        1.6.2   2021-07-29 [1] CRAN (R 4.1.0)
> #>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
> #>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
> #>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
> #>  Rcpp          1.0.7   2021-07-07 [1] CRAN (R 4.1.0)
> #>  readr       * 2.0.1   2021-08-10 [1] CRAN (R 4.1.1)
> #>  readxl        1.3.1   2019-03-13 [1] CRAN (R 4.1.0)
> #>  reprex        2.0.0   2021-04-02 [1] CRAN (R 4.1.0)
> #>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
> #>  rmarkdown     2.9     2021-06-15 [1] CRAN (R 4.1.0)
> #>  rvest         1.0.1   2021-07-26 [1] CRAN (R 4.1.0)
> #>  scales        1.1.1   2020-05-11 [1] CRAN (R 4.1.0)
> #>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
> #>  stringi       1.7.4   2021-08-25 [1] CRAN (R 4.1.1)
> #>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
> #>  styler        1.4.1   2021-03-30 [1] CRAN (R 4.1.0)
> #>  tibble      * 3.1.4   2021-08-25 [1] CRAN (R 4.1.1)
> #>  tidyr       * 1.1.3   2021-03-03 [1] CRAN (R 4.1.0)
> #>  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
> #>  tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.1.1)
> #>  tzdb          0.1.2   2021-07-20 [1] CRAN (R 4.1.0)
> #>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
> #>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
> #>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
> #>  xfun          0.24    2021-06-15 [1] CRAN (R 4.1.0)
> #>  xml2          1.3.2   2020-04-23 [1] CRAN (R 4.1.0)
> #>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
> #>
> #> [1] C:/Users/msmcbain/libs/R
> #> [2] C:/R/R-4.1.0/library
> ```
> </details>
> In this case it's actually faster to convert the list columns to text and do
> the write, than to write with the list columns.
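For anyone relying on the text workaround described above, here is a minimal base R sketch of the round trip. The helper names `list_col_to_text()` and `text_to_list_col()` are hypothetical and not part of arrow. Each element is collapsed to a single string because `deparse()` can return several strings for long elements, in which case `map_chr(points, deparse)` would error. Doubles may lose precision under the default `deparse()` settings, so treat this as a stopgap for integer or character data rather than a general serialisation scheme.

```r
# Hypothetical helpers (not part of arrow): turn each list element into one
# string with deparse(), and restore it with parse() + eval().
list_col_to_text <- function(x) {
  # deparse() may split long elements over several strings, so collapse them
  vapply(x, function(el) paste(deparse(el), collapse = ""), character(1))
}

text_to_list_col <- function(txt) {
  # Only use on text produced by list_col_to_text(): eval() runs whatever
  # code the string contains
  lapply(txt, function(s) eval(parse(text = s)))
}

# Small stand-in for the `points` column in the report
points <- rep(list(seq(6)), 5)

points_txt <- list_col_to_text(points)   # character vector, one string per row
restored <- text_to_list_col(points_txt) # back to a list column

# One extra level of nesting, as in the `dummy2` case
nested <- rep(list(list(seq(6))), 5)
nested_restored <- text_to_list_col(list_col_to_text(nested))
```

The character vector is what would be written in place of the list column; the restore step has to be run again after reading the file back, which is part of the cost of this workaround.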
> This issue also affects write_arrow:
> ``` r
> library(tidyverse)
> #> Warning: package 'tidyverse' was built under R version 4.1.1
> #> Warning: package 'tibble' was built under R version 4.1.1
> #> Warning: package 'readr' was built under R version 4.1.1
> library(arrow)
> #> Warning: package 'arrow' was built under R version 4.1.1
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #>     timestamp
> dummy <- tibble(
>   points = rep(list(seq(6)), 2e6),
>   index = seq(2e6)
> )
> # very slooooooow
> system.time(write_arrow(dummy, "dummy.parquet"))
> #> Warning: Use 'write_ipc_stream' or 'write_feather' instead.
> #>    user  system elapsed
> #>   56.95    0.08   57.13
> dummy_txt <- mutate(dummy, points = map_chr(points, deparse))
> # orders of magnitude faster
> system.time(write_arrow(dummy_txt, "dummytext.parquet"))
> #> Warning: Use 'write_ipc_stream' or 'write_feather' instead.
> #>    user  system elapsed
> #>    0.06    0.01    0.10
> ```
> <sup>Created on 2021-09-17 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
> Interestingly the performance seems to degrade exponentially with the nesting
> level of the lists:
> ```r
> # add a level of nesting
> dummy2 <- tibble(
>   points = rep(list(list(seq(6))), 2e6),
>   index = seq(2e6)
> )
> # order of magnitude slower again, lost patience waiting for it to return
> system.time(write_parquet(dummy2, "dummy2.parquet"))
> ```
> This has implications for {sf} dataframes which use list columns to
> represent spatial data structures.
> Arrow/parquet are pretty much not viable for moderate to large spatial data in R:
> ```r
> # options(timeout = 1000)
> remotes::install_github("wfmackey/absmapsdata")
> library(absmapsdata)
> # doesn't return in a reasonable amount of time
> write_arrow(absmapsdata::sa12016, "sa1.parquet")
> # can use the same workaround as above by converting geometry to a vector of
> # well-known text, but it takes time and bloats the files
> ```
> Possibly related to https://issues.apache.org/jira/browse/ARROW-12529 ?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)