[ https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528511#comment-17528511 ]

Weston Pace commented on ARROW-16320:
-------------------------------------

What do you get from the following?

{noformat}
a <- arrow::read_parquet(
  here::here("db", "large_parquet", "part-0.parquet"),
  as_data_frame = FALSE  # keep an Arrow Table (not a tibble) so nbytes() is available
)
a$nbytes()
{noformat}

The {{nbytes}} function should print a pretty decent approximation of the C 
memory referenced by {{a}}.

{{lobstr::obj_size}} prints only the R memory used (I think).

{{fs::file_size}} is going to give you the size of the file on disk, which is 
encoded and possibly compressed.  Some parquet files can be much larger in 
memory than they are on disk, so it is not unheard of for a 620MB parquet 
file to end up occupying gigabytes in memory (11 GB seems a little extreme, 
but it is within the realm of possibility).
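
As a minimal sketch (assuming the same file as above), the three measurements 
can be compared side by side:

{noformat}
library(arrow)

path <- here::here("db", "large_parquet", "part-0.parquet")
a <- read_parquet(path, as_data_frame = FALSE)  # Arrow Table, not a tibble

a$nbytes()           # approximate C memory referenced by the Table
lobstr::obj_size(a)  # R-side memory only (a thin wrapper around the C data)
fs::file_size(path)  # encoded/compressed size on disk
{noformat}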

> Dataset re-partitioning consumes considerable amount of memory
> --------------------------------------------------------------
>
>                 Key: ARROW-16320
>                 URL: https://issues.apache.org/jira/browse/ARROW-16320
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>         Attachments: Rgui_mem.jpg, Rstudio_env.jpg, Rstudio_mem.jpg
>
>
> A short background: I was trying to create a dataset from a big pile of csv 
> files (a couple of hundred). In a first step the csv files were parsed and 
> saved to parquet files, because there were many inconsistencies between the 
> csv files. In a subsequent step the dataset was re-partitioned using one 
> column (code_key).
>  
> {code:r}
> library(arrow)
> library(dplyr)
> 
> # Open the intermediate parquet files as one dataset
> new_dataset <- open_dataset(
>   temp_parquet_folder, 
>   format = "parquet",
>   unify_schemas = TRUE
> )
> 
> # Re-partition by code_key while writing out
> new_dataset |> 
>   group_by(code_key) |> 
>   write_dataset(
>     folder_repartitioned_dataset, 
>     format = "parquet"
>   )
> {code}
>  
> This re-partitioning consumed a considerable amount of memory (5 GB). 
>  * Is this normal behavior, or a bug?
>  * Is there any rule of thumb to estimate the memory requirement for a 
> dataset re-partitioning? (This is important when scaling up this approach.)
> The drawback is that this memory is not freed up after the re-partitioning 
> (I am using RStudio). 
> {{gc()}} is useless in this situation, and there is no object associated 
> with the re-partitioning in the {{R}} environment that could be removed 
> from memory (using the {{rm()}} function).
>  * How can one regain the memory used by the re-partitioning?
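> As a minimal sketch (assuming the memory in question sits in the arrow 
> package's default memory pool), Arrow's allocator can be inspected 
> directly, independent of what RStudio reports:
> {code:r}
> library(arrow)
> 
> pool <- default_memory_pool()
> pool$backend_name     # allocator in use, e.g. "jemalloc"
> pool$bytes_allocated  # bytes Arrow currently holds
> pool$max_memory       # high-water mark of the pool
> {code}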
> The rationale behind choosing the dataset re-partitioning: if my 
> understanding is correct, in the current arrow version appending is not 
> supported when writing parquet files/datasets (the original csv files were 
> partly partitioned according to a different variable).
> Can you recommend a better approach?


