westonpace commented on issue #40224:
URL: https://github.com/apache/arrow/issues/40224#issuecomment-1968990632
> The data is in Parquet format and is already partitioned by "year" and "location". When I try to run this, it gradually uses more and more of my RAM until it crashes.

How were you measuring RAM? Were you looking at the RSS of the process? Or were you looking at the amount of free/available memory?

I would also be surprised if that plan itself was leaking / accumulating memory. If it is R-specific, then maybe R is accumulating everything before the call to `write_dataset`? I seem to remember that being an R fallback at some point when creating plans.

In Python, the `write_dataset` call can take a record batch reader as input. I think you actually end up with two Acero plans: the first is the one you shared, and the second is just `source -> write` (where the first plan's output is the "source" node in the second plan). However, in R, it might be more natural to make a `source -> project -> write` plan instead of a `source -> project -> sink` plan in this situation.
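For reference, a minimal Python sketch of what I mean by feeding `write_dataset` a record batch reader, so batches stream from the scan into the write rather than being collected first. The paths are placeholders, and I'm assuming the "year"/"location" partitioning from the original question:

```python
import pyarrow.dataset as ds

# Hypothetical source dataset, partitioned by "year" and "location" as in the question.
source = ds.dataset(
    "path/to/source",
    format="parquet",
    partitioning=["year", "location"],
)

# Scanning produces a RecordBatchReader, so batches can stream into the write
# instead of being materialized as one big table in memory.
reader = source.scanner().to_reader()

ds.write_dataset(
    reader,
    "path/to/destination",
    format="parquet",
    partitioning=["year", "location"],
)
```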
