One more thing to try -- depending on your pipeline, you can disable
the "auto-reshuffle" of JdbcIO.Read by setting
`withOutputParallelization(false)`.

This is particularly useful if (1) you do aggressive and cheap
filtering immediately after the read, or (2) you do your own
repartitioning, such as a GroupByKey, after the read.
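For reference, the flag hangs off the JdbcIO.Read builder. A minimal
sketch follows -- the driver class, JDBC URL, query, and row mapper
here are placeholders, not details from your job:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical read: substitute your own datasource and query.
PCollection<String> rows = pipeline.apply(
    JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://host:5432/db"))
        .withQuery("SELECT name FROM my_table")
        .withRowMapper((JdbcIO.RowMapper<String>) rs -> rs.getString(1))
        .withCoder(StringUtf8Coder.of())
        // Skip the reshuffle that JdbcIO normally inserts after the read:
        .withOutputParallelization(false));
```

With the flag set to false, the output of the read stays on the
workers that produced it, so any redistribution is left entirely to
the transforms you apply afterwards.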

Given your investigation into the heap, I doubt this will help!  I'll
take a closer look at the DoFnOutputManager.  In the meantime, is
there anything in particular about your job that might help the
investigation?

All my best, Ryan

On Fri, Oct 25, 2019 at 2:47 PM Jozef Vilcek <jozo.vil...@gmail.com> wrote:
>
> I agree I might have been too quick to claim that DoFn output needs to fit 
> in memory. Actually, I am not sure what the Beam model says on this matter, 
> or what the output managers of particular runners do about it.
>
> But SparkRunner definitely has an issue here. I did try setting a small 
> `fetchSize` for JdbcIO, as well as changing `storageLevel` to 
> MEMORY_AND_DISK. Both fail with an OOM.
> When looking at the heap, most of it is used by the linked-list multimap 
> of the DoFnOutputManager here:
> https://github.com/apache/beam/blob/v2.15.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/MultiDoFnFunction.java#L234
