[ https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527717#comment-17527717 ]
Zsolt Kegyes-Brassai commented on ARROW-16320:
----------------------------------------------

Hi [~westonpace]. Thank you for your prompt answer.

Sorry, I forgot to describe the environment: I am using a laptop with 64-bit Windows 10, R 4.1.2, and fairly up-to-date R packages (arrow 7.0.0). I am running my scripts from the RStudio IDE.

I checked the memory utilization both in the RStudio Environment pane and in the Windows Task Manager. Both show an increase of around 5.6 GB: in RStudio from 300 MB to 5.9 GB (Task Manager shows about 250 MB more, most probably the memory occupied by the IDE itself). There is no (new) visible object in the RStudio Environment that can be associated with this re-partitioning activity, and this memory remains occupied until the RStudio session (or the R project) is closed. I waited 15 minutes before closing the IDE.

> Dataset re-partitioning consumes considerable amount of memory
> --------------------------------------------------------------
>
> Key: ARROW-16320
> URL: https://issues.apache.org/jira/browse/ARROW-16320
> Project: Apache Arrow
> Issue Type: Improvement
> Affects Versions: 7.0.0
> Reporter: Zsolt Kegyes-Brassai
> Priority: Minor
>
> A short background: I was trying to create a dataset from a big pile of CSV
> files (a couple of hundred). In a first step the CSVs were parsed and saved to
> parquet files, because there were many inconsistencies between the CSV files.
> In a subsequent step the dataset was re-partitioned using one column (code_key).
>
> {code:java}
> new_dataset <- open_dataset(
>   temp_parquet_folder,
>   format = "parquet",
>   unify_schemas = TRUE
> )
> new_dataset |>
>   group_by(code_key) |>
>   write_dataset(
>     folder_repartitioned_dataset,
>     format = "parquet"
>   )
> {code}
>
> This re-partitioning consumed a considerable amount of memory (5 GB).
> * Is this normal behavior, or a bug?
> * Is there any rule of thumb to estimate the memory requirement for a
> dataset re-partitioning?
> (This is important when scaling up this approach.)
> The drawback is that this memory is not freed up after the
> re-partitioning (I am using RStudio).
> {{gc()}} is useless in this situation, and there is no object associated
> with the re-partitioning in the {{R}} environment that could be removed from
> memory (using the {{rm()}} function).
> * How can one regain the memory used by the re-partitioning?
>
> The rationale behind choosing dataset re-partitioning: if my
> understanding is correct, appending is not supported when writing parquet
> files/datasets in the current arrow version. (The original CSV files were
> partly partitioned according to a different variable.)
> Can you recommend any better approach?

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
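For reference, the grouped write in the quoted issue can also be expressed through {{write_dataset()}}'s {{partitioning}} argument. A minimal self-contained sketch (using a small toy data frame in a temporary directory, not the reporter's parquet folder, so the memory behavior at scale is not reproduced here):

```r
library(arrow)
library(dplyr)

# Toy data standing in for the reporter's dataset (hypothetical values)
df <- data.frame(code_key = rep(c("a", "b"), each = 5), value = 1:10)

out_dir <- file.path(tempdir(), "repartitioned")

# Equivalent to group_by(code_key) |> write_dataset(...):
# the partitioning argument tells write_dataset which column(s)
# to split the output into hive-style subdirectories by
write_dataset(df, out_dir, format = "parquet", partitioning = "code_key")

# Reading the dataset back: one code_key=... subdirectory per group
reread <- open_dataset(out_dir) |> collect()
```

Both forms stream the data through the dataset writer; the sketch only illustrates the API shape, not a workaround for the retained memory.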