[ https://issues.apache.org/jira/browse/ARROW-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace updated ARROW-14736:
--------------------------------
    Labels: dataset  (was: )

> [C++][R] Opening a multi-file dataset and writing a re-partitioned version of it fails
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-14736
>                 URL: https://issues.apache.org/jira/browse/ARROW-14736
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 6.0.0
>         Environment: M1 Mac, macOS Monterey 12.0.1, 16 GB RAM
>                      R 4.1.1, {arrow} R package 6.0.0.2 (release) & 6.0.0.9000 (dev)
>            Reporter: Dragoș Moldovan-Grünfeld
>            Priority: Major
>              Labels: dataset
>         Attachments: image-2021-11-17-14-43-37-127.png, image-2021-11-17-14-54-42-747.png, image-2021-11-17-14-55-08-597.png
>
> Attempting to open a multi-file dataset and write a re-partitioned version of it fails, apparently because the data is collected into memory first. This happens for both wide and long data.
>
> Steps to reproduce the issue:
>
> 1. Create a large dataset (100k columns, 300k rows), write it to disk, and create 20 copies of it. Each file will have a footprint of roughly 7.5 GB.
> {code:r}
> library(arrow)
> library(dplyr)
> library(fs)
>
> rows <- 300000
> cols <- 100000
> partitions <- 20
>
> wide_df <- as.data.frame(
>   matrix(
>     sample(1:32767, rows * cols / partitions, replace = TRUE),
>     ncol = cols)
> )
>
> schem <- sapply(colnames(wide_df), function(nm) {int16()})
> schem <- do.call(schema, schem)
>
> wide_tab <- Table$create(wide_df, schema = schem)
> write_parquet(wide_tab, "~/Documents/arrow_playground/wide.parquet")
>
> fs::dir_create("~/Documents/arrow_playground/wide_ds")
> for (i in seq_len(partitions)) {
>   file.copy("~/Documents/arrow_playground/wide.parquet",
>             glue::glue("~/Documents/arrow_playground/wide_ds/wide-{i-1}.parquet"))
> }
>
> ds_wide <- open_dataset("~/Documents/arrow_playground/wide_ds/")
> {code}
>
> All of the following steps fail:
>
> 2. Creating and writing a partitioned version of {{ds_wide}}:
> {code:r}
> ds_wide %>%
>   mutate(grouper = round(V1 / 1024)) %>%
>   write_dataset("~/Documents/arrow_playground/partitioned",
>                 partitioning = "grouper",
>                 format = "parquet")
> {code}
>
> 3. Writing a non-partitioned dataset:
> {code:r}
> ds_wide %>%
>   write_dataset("~/Documents/arrow_playground/partitioned",
>                 format = "parquet")
> {code}
>
> 4. Creating the partitioning variable first and then attempting to write:
> {code:r}
> ds2 <- ds_wide %>%
>   mutate(grouper = round(V1 / 1024))
>
> ds2 %>%
>   write_dataset("~/Documents/arrow_playground/partitioned",
>                 partitioning = "grouper",
>                 format = "parquet")
> {code}
>
> 5. Attempting to write to CSV:
> {code:r}
> ds_wide %>%
>   write_dataset("~/Documents/arrow_playground/csv_writing/test.csv",
>                 format = "csv")
> {code}
>
> None of the failures seem to originate in R code, and they all result in similar behaviour: the R session consumes increasing amounts of RAM until it crashes.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)