Hi Thomas,

Could you share the stack trace of your OOM and, if possible, a code snippet of your pipeline? AFAIK, usually only “large” GroupByKey transforms, caused by “hot keys”, may lead to OOM with the SparkRunner.
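If a hot key does turn out to be the culprit, one common mitigation is to replace the GroupByKey with Combine.perKey, so that values are pre-combined on the workers instead of all being buffered under a single key. A minimal, untested sketch against the Java SDK (the keys and the sum are just placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class HotKeyCombine {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    PCollection<KV<String, Long>> totals =
        p.apply(Create.of(KV.of("hot", 1L), KV.of("hot", 1L), KV.of("cold", 1L)))
         // Combine.perKey lets the runner lift the combiner: partial sums are
         // computed worker-side, so a hot key never has to materialize all of
         // its values in one JVM (unlike GroupByKey followed by a summing DoFn).
         .apply(Combine.perKey(Sum.ofLongs()));

    p.run().waitUntilFinish();
  }
}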
— Alexey

> On 26 Apr 2021, at 08:23, Thomas Fredriksen (External) <thomas.fredrik...@cognite.com> wrote:
>
> Good morning,
>
> We are ingesting a very large dataset into our database using Beam on Spark.
> The dataset is available through a REST-like API and is sliced in such a way
> that in order to obtain the whole dataset, we must make around 24000 API
> calls.
>
> All in all, this results in 24000 CSV files that need to be parsed and then
> written to our database.
>
> Unfortunately, we are encountering some OutOfMemoryErrors along the way.
> From what we have gathered, this is due to data being queued between
> transforms in the pipeline. To mitigate this, we tried to implement a
> streaming scheme in which the requests are streamed to the request executor
> and the results then flow on to the database. This too produced the
> OOM error.
>
> What are the best ways to implement such pipelines so as to minimize the
> memory footprint? Are there any differences between runners that we should
> be aware of here (e.g. between Dataflow and Spark)?
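Regarding the memory-footprint question above: one pattern that tends to help on any runner is to represent each of the ~24000 requests as a single small element, break fusion with a Reshuffle before the expensive fetch step, and emit parsed rows one at a time rather than whole files. A rough, untested sketch with the Java SDK; FetchCsvFn, fetchLines, and buildRequestUrls are made-up placeholders, not Beam APIs:

import java.util.Collections;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;

public class IngestSketch {

  // Hypothetical fetch-and-parse step: one request descriptor in, many rows out.
  static class FetchCsvFn extends DoFn<String, String> {
    @ProcessElement
    public void process(@Element String requestUrl, OutputReceiver<String> out) {
      // Emit the response line by line so a whole CSV file is never held in
      // memory at once. fetchLines stands in for your HTTP client + CSV parser.
      for (String row : fetchLines(requestUrl)) {
        out.output(row);
      }
    }

    private Iterable<String> fetchLines(String url) {
      return Collections.emptyList(); // placeholder
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(Create.of(buildRequestUrls()).withCoder(StringUtf8Coder.of()))
        // Reshuffle breaks fusion, so the cheap "create URLs" step and the
        // expensive fetch step are distributed across workers independently.
        .apply(Reshuffle.viaRandomKey())
        .apply(ParDo.of(new FetchCsvFn()));
        // ...followed by your parse/transform steps and the database write.
    p.run().waitUntilFinish();
  }

  private static List<String> buildRequestUrls() {
    return Collections.emptyList(); // placeholder for the ~24000 request descriptors
  }
}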