[ 
https://issues.apache.org/jira/browse/ARROW-18176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Mation updated ARROW-18176:
---------------------------------
    Description: 
I first posted this on StackOverflow, 
[here|https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak].

I am having trouble using arrow in R. First, I saved some {{data.tables}} that 
were about 50-60 GB in memory ({{d}} in the code chunk below) to a Parquet 
dataset using:
 
{{d %>% write_dataset(f, format='parquet') # f is the directory name}}

Then I try to open the dataset, select the relevant variables, and collect:
 
{code:r}
tic()
d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect()  # myvars is a vector of variable names
toc()
{code}

I did this conversion for 3 sets of data.tables (unfortunately, the data is 
confidential, so I can't include it in the example). For one set, I was able to 
{{open>select>collect}} the desired table in about 60s, obtaining a 10 GB 
table (after variable selection).
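
Since the real data is confidential, here is a scaled-down synthetic sketch of 
the same workflow (the column names, sizes, and {{myvars}} below are made up; 
the real tables are far larger):

{code:r}
library(arrow)
library(dplyr)
library(tictoc)

# Synthetic stand-in for the confidential data (far smaller than the real ~60 GB)
d <- data.frame(
  id = seq_len(1e6),
  x  = rnorm(1e6),
  y  = sample(letters, 1e6, replace = TRUE)
)

f <- tempfile()  # stand-in for the dataset directory
d %>% write_dataset(f, format = "parquet")

myvars <- c("id", "x")  # stand-in for the real variable list
tic()
d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect()
toc()
{code}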

For the other two sets, the command caused a memory leak. {{tic()}}-{{toc()}} 
returned after 80s, but the object name ({{d2}}) never appeared in RStudio's 
Environment panel, and memory use kept creeping up until it occupied most of 
the server's available RAM, at which point R crashed. Note that the original 
dataset, without subsetting columns, was smaller than 60 GB, and the server 
has 512 GB of RAM.

Any ideas on what could be going on here?
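
One thing I still need to try, based on the allocator note in Arrow's 
environment-variable docs (so this is an untested guess for my case, not a 
confirmed fix): switching Arrow's memory pool to the system allocator, which 
reportedly has to happen before arrow is first loaded in the session:

{code:r}
# Untested guess: must run before library(arrow) is first loaded
Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")  # instead of jemalloc/mimalloc
library(arrow)
{code}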

UPDATE: today I noticed a few more things.

1) If the collected object is small enough (3 cols, 66 million rows), R will 
unfreeze: the console becomes responsive and the object shows up in the 
Environment panel. But memory use keeps going up (by small amounts, because 
the underlying data is small). While this is happening, issuing a {{gc()}} 
command reduces memory use, but it then starts growing again (see the sketch 
after point 2).

2) Even after {{rm(d2)}} and {{gc()}}, the R session that issued the arrow 
commands still uses around 60-70 GB of RAM... The only way to release it is to 
close the R session.
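
In case it helps with diagnosis, here is a hedged sketch of how I've been 
watching points 1 and 2 (the polling loop is just illustrative, and the 
memory-pool accessors are the ones named in the arrow R docs):

{code:r}
library(arrow)

# Point 1: R's reported heap use shrinks right after gc(), then creeps up again
for (i in 1:12) {
  print(gc())  # the "(Mb)" columns show R-side memory
  Sys.sleep(5)
}

# Point 2: check whether the retained memory is held by Arrow's C++ memory
# pool, which gc() does not track
pool <- default_memory_pool()
pool$backend_name     # allocator in use, e.g. "jemalloc" on Linux builds
pool$bytes_allocated  # bytes currently held by live Arrow allocations
pool$max_memory       # session high-water mark, in bytes
{code}
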
> [R] arrow::open_dataset %>% select(myvars) %>% collect causes memory leak
> -------------------------------------------------------------------------
>
>                 Key: ARROW-18176
>                 URL: https://issues.apache.org/jira/browse/ARROW-18176
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Lucas Mation
>            Priority: Major
>

