[ 
https://issues.apache.org/jira/browse/ARROW-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-14736:
--------------------------------
    Labels: dataset  (was: )

> [C++][R] Opening a multi-file dataset and writing a re-partitioned version of
> it fails
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-14736
>                 URL: https://issues.apache.org/jira/browse/ARROW-14736
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 6.0.0
>         Environment: M1 Mac, macOS Monterey 12.0.1, 16 GB RAM
> R 4.1.1, {arrow} R package 6.0.0.2 (release) & 6.0.0.9000 (dev)
>            Reporter: Dragoș Moldovan-Grünfeld
>            Priority: Major
>              Labels: dataset
>         Attachments: image-2021-11-17-14-43-37-127.png, 
> image-2021-11-17-14-54-42-747.png, image-2021-11-17-14-55-08-597.png
>
>
> Attempting to open a multi-file dataset and write a re-partitioned version of
> it fails, apparently because the data is collected into memory first. This
> happens for both wide and long data.
> Steps to reproduce the issue:
> 1. Create a large dataset (100k columns, 300k rows in total) and write it to
> disk, then create 20 copies of it. Each file has a footprint of roughly 7.5 GB.
> {code:r}
> library(arrow)
> library(dplyr)
> library(fs)
> rows <- 300000
> cols <- 100000
> partitions <- 20
> wide_df <- as.data.frame(
>   matrix(
>     sample(1:32767, rows * cols / partitions, replace = TRUE), 
>     ncol = cols)
> )
> schem <- sapply(colnames(wide_df), function(nm) {int16()})
> schem <- do.call(schema, schem)
> wide_tab <- Table$create(wide_df, schema = schem)
> write_parquet(wide_tab, "~/Documents/arrow_playground/wide.parquet")
> fs::dir_create("~/Documents/arrow_playground/wide_ds")
> for (i in seq_len(partitions)) {
>   file.copy("~/Documents/arrow_playground/wide.parquet",
>             glue::glue("~/Documents/arrow_playground/wide_ds/wide-{i-1}.parquet"))
> }
> ds_wide <- open_dataset("~/Documents/arrow_playground/wide_ds/")
> {code}
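> As a sanity check (not one of the failing steps, same paths as above), opening
> the dataset is lazy, so inspecting the schema should be cheap even with 100k
> columns:
> {code:r}
> # Only metadata is read here; this should return quickly and use little memory.
> length(names(ds_wide))   # expected: 100000
> head(names(ds_wide))     # "V1" "V2" ...
> {code}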
> All the following steps fail:
> 2. Creating and writing a partitioned version of {{ds_wide}}.
> {code:r}
> ds_wide %>%
>   mutate(grouper = round(V1 / 1024)) %>%
>   write_dataset("~/Documents/arrow_playground/partitioned",
>                 partitioning = "grouper",
>                 format = "parquet")
> {code}
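> For comparison (a hedged diagnostic, not part of the failing steps above), the
> same pipeline restricted to a handful of columns may help isolate whether the
> failure is tied to the very wide schema rather than to {{write_dataset()}} itself:
> {code:r}
> # Hypothetical narrow variant of step 2; the output path is illustrative only.
> ds_wide %>%
>   select(V1:V10) %>%
>   mutate(grouper = round(V1 / 1024)) %>%
>   write_dataset("~/Documents/arrow_playground/partitioned_narrow",
>                 partitioning = "grouper",
>                 format = "parquet")
> {code}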
> 3. Writing a non-partitioned dataset:
> {code:r}
> ds_wide %>%
>   write_dataset("~/Documents/arrow_playground/partitioned",
>                 format = "parquet")
> {code}
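> Similarly (hedged diagnostic), limiting the number of rows first may show whether
> the total data volume or the column count triggers the memory growth:
> {code:r}
> # Hypothetical: head() on a Dataset materialises only the first rows.
> ds_wide %>%
>   head(1000) %>%
>   write_dataset("~/Documents/arrow_playground/partitioned_head",
>                 format = "parquet")
> {code}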
> 4. Creating the partitioning variable first and then attempting to write:
> {code:r}
> ds2 <- ds_wide %>%
>   mutate(grouper = round(V1 / 1024))
> ds2 %>%
>   write_dataset("~/Documents/arrow_playground/partitioned",
>                 partitioning = "grouper",
>                 format = "parquet")
> {code}
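> A hedged aside on partition cardinality: {{grouper}} can only take about 33
> distinct values for inputs in 1:32767, so the number of partitions should not be
> the problem. This can be confirmed from a single source file with a plain read:
> {code:r}
> # Hypothetical check, reading just one column of one file into memory.
> v1 <- read_parquet("~/Documents/arrow_playground/wide.parquet",
>                    col_select = "V1")$V1
> length(unique(round(v1 / 1024)))   # roughly 33 groups
> {code}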
> 5. Attempting to write to CSV:
> {code:r}
> ds_wide %>% 
>   write_dataset("~/Documents/arrow_playground/csv_writing/test.csv",
>                 format = "csv")
> {code}
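> Note that {{write_dataset()}} expects a directory rather than a file name, so a
> hedged variant pointing at a directory would rule that out as the cause:
> {code:r}
> # Hypothetical path; write_dataset() creates the CSV files inside this directory.
> ds_wide %>%
>   write_dataset("~/Documents/arrow_playground/csv_writing",
>                 format = "csv")
> {code}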
> None of the failures seems to originate in R code, and they all result in
> similar behaviour: the R session consumes increasing amounts of RAM until it
> crashes.
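> A possible workaround sketch (hedged, assuming the per-file copies created in
> step 1): process one input file at a time so that only a single table is held in
> memory per iteration, writing each file's repartitioned output to its own
> subdirectory.
> {code:r}
> # Hypothetical workaround, not a fix for the underlying issue.
> files <- fs::dir_ls("~/Documents/arrow_playground/wide_ds/", glob = "*.parquet")
> for (f in files) {
>   tab <- read_parquet(f)   # one file (~7.5 GB) in memory at a time
>   out <- file.path("~/Documents/arrow_playground/partitioned_by_file",
>                    fs::path_ext_remove(fs::path_file(f)))
>   tab %>%
>     mutate(grouper = round(V1 / 1024)) %>%
>     write_dataset(out, partitioning = "grouper", format = "parquet")
> }
> {code}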



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
