[ https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527717#comment-17527717 ]

Zsolt Kegyes-Brassai commented on ARROW-16320:
----------------------------------------------

Hi [~westonpace]. Thank you for your prompt answer.


Sorry, I forgot to describe the environment: I am using a laptop with 64-bit 
Windows 10, R 4.1.2, and fairly up-to-date R packages (arrow 7.0.0). I am 
running my scripts from the RStudio IDE.


I was checking the memory utilization both in the RStudio Environment pane and 
in the Windows Task Manager. 
Both show an increase of around 5.6 GB: in RStudio, from 300 MB to 5.9 GB (the 
Task Manager shows about 250 MB more, most probably the memory occupied by the 
IDE itself). 
There is no new visible object in the RStudio Environment that can be 
associated with this re-partitioning activity.
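
As far as I understand, the RStudio Environment pane only tracks R objects, so 
memory held by Arrow's own C++ allocator would not show up there anyway. Here 
is a minimal sketch of how I am checking the Arrow side, assuming the 
{{default_memory_pool()}} binding in the arrow package reflects the state of 
the underlying allocator:

{code:r}
library(arrow)

# Arrow allocates through its own memory pool, outside the R heap,
# so the Environment pane cannot see these allocations.
pool <- default_memory_pool()
pool$backend_name     # e.g. "mimalloc" or "jemalloc", depending on the build
pool$bytes_allocated  # bytes currently held by Arrow's allocator

gc()                  # frees R objects, but should not touch Arrow's pool
pool$bytes_allocated  # if this stays high, Arrow (or its allocator) holds the memory
{code}

If {{bytes_allocated}} stays high after {{gc()}}, the retained memory belongs 
to Arrow or is cached by its allocator rather than to any R object.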


This memory remained occupied until the RStudio session (or the R project) 
was closed. I waited 15 minutes before closing the IDE.
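
For scaling this up, one workaround I am considering (untested beyond small 
samples; {{temp_parquet_folder}} and {{folder_repartitioned_dataset}} are the 
placeholders from the quoted snippet below) is to write one partition at a 
time, so that only a single partition's data is materialized at once:

{code:r}
library(arrow)
library(dplyr)

new_dataset <- open_dataset(temp_parquet_folder, format = "parquet",
                            unify_schemas = TRUE)

# pull the distinct partition keys into R first (a single, small column)
keys <- new_dataset |>
  select(code_key) |>
  collect() |>
  distinct() |>
  pull(code_key)

# write one partition per iteration to bound peak memory
for (key in keys) {
  new_dataset |>
    filter(code_key == key) |>
    write_dataset(
      file.path(folder_repartitioned_dataset, paste0("code_key=", key)),
      format = "parquet"
    )
}
{code}

This trades throughput for bounded peak memory, but it may sidestep the 
retained allocation until the underlying behavior is understood.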

> Dataset re-partitioning consumes considerable amount of memory
> --------------------------------------------------------------
>
>                 Key: ARROW-16320
>                 URL: https://issues.apache.org/jira/browse/ARROW-16320
>             Project: Apache Arrow
>          Issue Type: Improvement
>    Affects Versions: 7.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>
> A short background: I was trying to create a dataset from a big pile of CSV 
> files (a couple of hundred). In a first step, the CSVs were parsed and saved 
> to Parquet files, because there were many inconsistencies between the CSV 
> files. In a subsequent step, the dataset was re-partitioned using one column 
> (code_key).
>  
> {code:r}
> library(arrow)
> library(dplyr)
> 
> # open all intermediate parquet files as a single dataset
> new_dataset <- open_dataset(
>   temp_parquet_folder, 
>   format = "parquet",
>   unify_schemas = TRUE
>   )
> 
> # re-partition by code_key while writing the dataset back out
> new_dataset |> 
>   group_by(code_key) |> 
>   write_dataset(
>     folder_repartitioned_dataset, 
>     format = "parquet"
>   )
> {code}
>  
> This re-partitioning consumed a considerable amount of memory (5 GB). 
>  * Is this normal behavior, or a bug?
>  * Is there any rule of thumb to estimate the memory requirement for a 
> dataset re-partitioning? (This is important when scaling up this approach.)
> The drawback is that this memory space is not freed up after the 
> re-partitioning (I am using RStudio). 
> {{gc()}} is useless in this situation, and there is no object associated with 
> the re-partitioning in the {{R}} environment that could be removed from 
> memory (using the {{rm()}} function).
>  * How can one regain the memory space used by the re-partitioning?
> The rationale behind choosing dataset re-partitioning: if my understanding 
> is correct, in the current arrow version appending is not supported when 
> writing Parquet files/datasets. (The original CSV files were partly 
> partitioned according to a different variable.)
> Can you recommend any better approach?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
