westonpace commented on issue #40224:
URL: https://github.com/apache/arrow/issues/40224#issuecomment-1968990632
> The data is in Parquet format and is already partitioned by "year" and "location". When I try to run this, it gradually uses more and more of my RAM until it crashes.

How were you measuring RAM? Were you looking at the RSS of the process? Or were you looking at the amount of free/available memory?

I would also be surprised if that plan itself was leaking / accumulating memory. If it is R-specific, then maybe R is accumulating everything before the call to `write_dataset`? I seem to remember that being an R fallback at some point when creating plans.

In Python, the `write_dataset` call can take a record batch reader as input. I think you actually end up with two Acero plans: the first is the one you shared, and the second is just `source -> write` (where the first plan's output is the "source" node in the second plan). However, in R, it might be more natural to make a `source -> project -> write` plan instead of a `source -> project -> sink` plan in this situation.
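For reference, a minimal Python sketch of what I mean by feeding `write_dataset` a record batch reader, so batches stream from the scan into the write rather than being collected first. The paths are placeholders, and I'm assuming the "year"/"location" partitioning from the original question:

```python
import pyarrow.dataset as ds

# Hypothetical source dataset, partitioned by "year" and "location" as in the question.
source = ds.dataset(
    "path/to/source",
    format="parquet",
    partitioning=["year", "location"],
)

# Scanning produces a RecordBatchReader, so batches can stream into the write
# instead of being materialized as one big table in memory.
reader = source.scanner().to_reader()

ds.write_dataset(
    reader,
    "path/to/destination",
    format="parquet",
    partitioning=["year", "location"],
)
```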
