The result of my query can fit in memory if I use a 12GB heap per Spark executor. That makes the job quite inefficient, because past the JDBC load the job runs fine with a 4GB heap for the main heavy lifting - the JDBC data is not the main data set, just metadata.
I just ran the same JdbcIO read code on the Spark and Flink runners. Flink did
not blow up on memory. So it seems this is a limitation of SparkRunner.

On Fri, Oct 25, 2019 at 5:28 PM Ryan Skraba <r...@skraba.com> wrote:
> One more thing to try -- depending on your pipeline, you can disable
> the "auto-reshuffle" of JdbcIO.Read by setting
> withOutputParallelization(false)
>
> This is particularly useful if (1) you do aggressive and cheap
> filtering immediately after the read or (2) you do your own
> repartitioning action like GroupByKey after the read.
>
> Given your investigation into the heap, I doubt this will help! I'll
> take a closer look at the DoFnOutputManager. In the meantime, is
> there anything particular about your job that might help
> investigate?
>
> All my best, Ryan
>
> On Fri, Oct 25, 2019 at 2:47 PM Jozef Vilcek <jozo.vil...@gmail.com> wrote:
> >
> > I agree I might have been too quick to say that DoFn output needs to
> > fit in memory. Actually I am not sure what the Beam model says on this
> > matter and what the output managers of particular runners do about it.
> >
> > But SparkRunner definitely has an issue here. I did try setting a small
> > `fetchSize` for JdbcIO as well as changing `storageLevel` to
> > MEMORY_AND_DISK. All fail with OOM.
> > When looking at the heap, most of it is used by the linked-list
> > multimap of the DoFnOutputManager here:
> > https://github.com/apache/beam/blob/v2.15.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/MultiDoFnFunction.java#L234
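For reference, here is a minimal sketch of a JdbcIO read configured with the two knobs discussed above (withFetchSize and withOutputParallelization(false)). The driver, connection string, query, and class name are placeholders for illustration, not details from this job:

    import java.sql.ResultSet;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class JdbcReadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(
            JdbcIO.<String>read()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "org.postgresql.Driver", "jdbc:postgresql://db-host:5432/mydb")
                        .withUsername("user")
                        .withPassword("secret"))
                .withQuery("SELECT payload FROM events")
                // Ask the JDBC driver to stream results in small batches instead of
                // buffering the whole ResultSet.
                .withFetchSize(1000)
                // Skip JdbcIO's internal reshuffle; useful when the pipeline filters
                // aggressively or does its own GroupByKey right after the read.
                .withOutputParallelization(false)
                .withCoder(StringUtf8Coder.of())
                .withRowMapper((ResultSet rs) -> rs.getString(1)));
        // ... downstream transforms would go here ...

        p.run().waitUntilFinish();
      }
    }

On the SparkRunner, the storage level mentioned above can be passed as the pipeline option --storageLevel=MEMORY_AND_DISK; as noted in the thread, neither that nor a smaller fetchSize avoided the OOM in this case.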