[ https://issues.apache.org/jira/browse/ARROW-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533950#comment-17533950 ]
Neal Richardson commented on ARROW-16452:
-----------------------------------------

We already are using a custom memory pool to connect it to gc(): https://github.com/apache/arrow/blob/master/r/src/memorypool.cpp

Is there a change we could make there? I don't think we always want to call gc() before returning after running an exec plan because that will make the return slower. IDK if putting it in a background thread would work?

> [R] After dataset scan, some RAM is left consumed until a garbage collection pass
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-16452
>                 URL: https://issues.apache.org/jira/browse/ARROW-16452
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Weston Pace
>            Priority: Major
>
> This might be "not a bug" but I wonder if we can do something better here.
> When I create and execute a dplyr query there is a bunch of RAM that is left
> allocated until the next GC pass.
> Since R's garbage collection is only based on RAM that R has allocated this
> extra memory (which can be quite substantial) might never be freed.
> Perhaps we should just manually trigger a gc pass after running an execution
> plan? Or it may be good to get a better understanding of what exactly this
> memory is being used for.
> In the example below I load ~2GB of data but after the collect there is ~3GB
> used. I wait 10 seconds to ensure it's not jemalloc. Then I run {{gc()}}
> manually and ~1GB is freed.
> {noformat}
> > dataset = arrow::open_dataset('/home/pace/dev/data/dataset/parquet/5')
> > default_memory_pool()$bytes_allocated
> [1] 64
> > x <- dataset %>% collect(as_data_frame=FALSE)
> > arrow::default_memory_pool()$bytes_allocated
> [1] 2921135104
> > Sys.sleep(10)
> > arrow::default_memory_pool()$bytes_allocated
> [1] 2921135104
> > gc()
>           used (Mb) gc trigger (Mb) max used (Mb)
> Ncells  917099 49.0    1498168 80.1  1498168 80.1
> Vcells 1649894 12.6    8388608 64.0  2617403 20.0
> > arrow::default_memory_pool()$bytes_allocated
> [1] 2028716480
> {noformat}