[ https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zsolt Kegyes-Brassai updated ARROW-16320:
-----------------------------------------
    Attachment: 100m_2_rds.jpg

> Dataset re-partitioning consumes considerable amount of memory
> --------------------------------------------------------------
>
>                 Key: ARROW-16320
>                 URL: https://issues.apache.org/jira/browse/ARROW-16320
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>         Attachments: 100m_1_create.jpg, 100m_2_rds.jpg, Rgui_mem.jpg,
> Rstudio_env.jpg, Rstudio_mem.jpg
>
>
> A short background: I was trying to create a dataset from a big pile of csv
> files (a couple of hundred). In the first step the csv files were parsed and
> saved to parquet files, because there were many inconsistencies between the
> csv files. In a subsequent step the dataset was re-partitioned using one
> column (code_key).
>
> {code:java}
> new_dataset <- open_dataset(
>   temp_parquet_folder,
>   format = "parquet",
>   unify_schemas = TRUE
> )
> new_dataset |>
>   group_by(code_key) |>
>   write_dataset(
>     folder_repartitioned_dataset,
>     format = "parquet"
>   )
> {code}
>
> This re-partitioning consumed a considerable amount of memory (5 GB).
> * Is this normal behavior, or a bug?
> * Is there any rule of thumb to estimate the memory requirement for a
> dataset re-partitioning? (This is important when scaling up this approach.)
> The drawback is that this memory is not freed after the re-partitioning
> (I am using RStudio).
> {{gc()}} is useless in this situation, and there is no object associated
> with the re-partitioning in the {{R}} environment that could be removed from
> memory (using the {{rm()}} function).
> * How can one regain the memory used by the re-partitioning?
> The rationale behind choosing dataset re-partitioning: if my understanding
> is correct, in the current arrow version appending does not work when
> writing parquet files/datasets. (The original csv files were partly
> partitioned according to a different variable.)
> Can you recommend a better approach?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)