[ https://issues.apache.org/jira/browse/ARROW-18176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lucas Mation updated ARROW-18176:
---------------------------------
    Description:
I first posted on StackOverflow, [here.|https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak]

I am having trouble using arrow in R. First, I saved some {{data.tables}} that were about 50-60GB in memory ({{d}} in the code chunk) to a parquet file using:

{{d %>% write_dataset(f, format='parquet') # f is the directory name}}

Then I try to open the dataset, select the relevant variables, and collect them:

{{tic()}}
{{d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect # myvars is a vector of variable names}}
{{toc()}}

I did this conversion for 3 sets of data.tables (unfortunately, the data is confidential, so I can't include it in the example). In one set, I was able to {{open>select>collect}} the desired table in about 60s, obtaining a 10GB object (after variable selection).

For the other two sets, the command caused a memory leak. tic()-toc() returned after 80s, but the object name (d2) never appeared in RStudio's "Environment" panel, and memory use kept creeping up until it occupied most of the available RAM of the server, and then R crashed. Note that the original dataset, without subsetting columns, was smaller than 60GB and the server had 512GB.

Any ideas on what could be going on here?

UPDATE: today I noticed a few more things.

1) If the collected object is small enough (3 cols, 66 million rows), R will unfreeze. The console becomes responsive and the object shows up in the Environment panel. But memory use keeps going up (by small amounts, because the underlying data is small). While this is happening, issuing a gc() command reduces the memory use, but it then starts growing again.

2) Even after "rm(d2)" and "gc()", the R session that issued the arrow commands still uses around 60-70GB of RAM. The only way to release it is to close the R session.
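Since the original data is confidential, a small reproduction sketch with synthetic data may help narrow this down. It follows the same write_dataset > open_dataset > select > collect steps described above. Everything else is an assumption made for illustration: the column names, the row count, the temporary directory, and the idea of comparing the byte counter of arrow's {{default_memory_pool()}} before and after the collect. It is not a diagnosis of the reported behaviour.

```r
library(arrow)
library(dplyr)

# Synthetic stand-in for the confidential tables (hypothetical columns and size).
n <- 1e5
d <- data.frame(
  id = seq_len(n),
  x  = rnorm(n),
  y  = sample(letters, n, replace = TRUE),
  z  = runif(n)
)

# f is the dataset directory, as in the report; here a temp dir (assumption).
f <- file.path(tempdir(), "arrow_18176_demo")
write_dataset(d, f, format = "parquet")

myvars <- c("id", "x", "y")  # hypothetical subset of columns

# Arrow's allocator keeps its own counters, separate from R's heap.
pool <- default_memory_pool()
cat("allocator backend:", pool$backend_name, "\n")
before <- pool$bytes_allocated

d2 <- open_dataset(f) %>%
  select(all_of(myvars)) %>%
  collect()
dims <- dim(d2)
after_collect <- pool$bytes_allocated

# Drop the result and run the garbage collector, as described in the UPDATE.
rm(d2)
invisible(gc())
after_gc <- pool$bytes_allocated

print(c(before = before, after_collect = after_collect, after_gc = after_gc))
```

If {{after_gc}} falls back near {{before}} while the operating system still reports high memory for the R process, one possible reading is that the allocator is holding on to freed memory rather than arrow leaking buffers. To test that guess, the {{ARROW_DEFAULT_MEMORY_POOL}} environment variable can be set to {{system}} before loading arrow and the run repeated.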
> [R] arrow::open_dataset %>% select(myvars) %>% collect causes memory leak
> -------------------------------------------------------------------------
>
>                 Key: ARROW-18176
>                 URL: https://issues.apache.org/jira/browse/ARROW-18176
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Lucas Mation
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)