[jira] [Commented] (ARROW-16452) [R] After dataset scan, some RAM is left consumed until a garbage collection pass

Neal Richardson (Jira) Mon, 09 May 2022 11:21:04 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533950#comment-17533950
 ]


Neal Richardson commented on ARROW-16452:
-----------------------------------------

We already use a custom memory pool to connect Arrow's allocations to gc(): 
https://github.com/apache/arrow/blob/master/r/src/memorypool.cpp

Is there a change we could make there? I don't think we want to call gc() 
every time before returning from an exec plan, because that would make the 
return slower. I don't know whether running it in a background thread would 
work.
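
To put numbers on that trade-off, here is a minimal sketch, assuming the arrow 
package is installed. gc_release_report is a hypothetical helper for 
illustration, not part of arrow. It reads bytes_allocated from the default 
memory pool before and after a single gc() pass and times the pass, so the 
memory released can be weighed against the added latency.

```r
library(arrow)

# Hypothetical helper (not part of arrow): report how many bytes of Arrow
# memory a single gc() pass releases and how long that pass takes.
gc_release_report <- function(pool = arrow::default_memory_pool()) {
  before <- pool$bytes_allocated
  # Time one full garbage collection pass.
  elapsed <- system.time(gc(verbose = FALSE))[["elapsed"]]
  after <- pool$bytes_allocated
  list(
    bytes_before   = before,
    bytes_after    = after,
    bytes_released = before - after,
    gc_seconds     = elapsed
  )
}

report <- gc_release_report()
print(report)
```

Running this right after a collect() would show whether the memory released 
is worth the cost of an unconditional gc() call.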

> [R] After dataset scan, some RAM is left consumed until a garbage collection 
> pass
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-16452
>                 URL: https://issues.apache.org/jira/browse/ARROW-16452
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Weston Pace
>            Priority: Major
>
> This might be "not a bug" but I wonder if we can do something better here.  
> When I create and execute a dplyr query there is a bunch of RAM that is left 
> allocated until the next GC pass.
> Since R's garbage collection is triggered only by RAM that R itself has 
> allocated, this extra memory (which can be quite substantial) might never 
> be freed.
> Perhaps we should just manually trigger a gc pass after running an execution 
> plan?  Or it may be good to get a better understanding of what exactly this 
> memory is being used for.
> In the example below I load ~2GB of data but after the collect there is ~3GB 
> used.  I wait 10 seconds to ensure it's not jemalloc.  Then I run {{gc()}} 
> manually and ~1GB is freed.
> {noformat}
> > dataset = arrow::open_dataset('/home/pace/dev/data/dataset/parquet/5')
> > default_memory_pool()$bytes_allocated
> [1] 64
> > x <- dataset %>% collect(as_data_frame=FALSE)
> > arrow::default_memory_pool()$bytes_allocated
> [1] 2921135104
> > Sys.sleep(10)
> > arrow::default_memory_pool()$bytes_allocated
> [1] 2921135104
> > gc()
>           used (Mb) gc trigger (Mb) max used (Mb)
> Ncells  917099 49.0    1498168 80.1  1498168 80.1
> Vcells 1649894 12.6    8388608 64.0  2617403 20.0
> > arrow::default_memory_pool()$bytes_allocated
> [1] 2028716480
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
