The result of my query can fit in memory if I use a 12GB heap per Spark
executor. This makes the job quite inefficient, as in the past the job ran
fine with a 4GB heap for the main heavy lifting - the JDBC read is not the
main data set, just metadata.

I just ran the same JdbcIO read code on the Spark and Flink runners. Flink
did not blow up on memory, so this seems to be a limitation of the
SparkRunner.
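
For reference, the JdbcIO read I tested on both runners is roughly along
these lines (just a sketch - the driver, connection details, query and row
type are placeholders, not the real job):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class JdbcReadSketch {
  public static void main(String[] args) {
    // Runner (SparkRunner / FlinkRunner) is chosen via --runner on the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    PCollection<KV<String, String>> rows =
        p.apply("ReadFromJdbc",
            JdbcIO.<KV<String, String>>read()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "org.postgresql.Driver", "jdbc:postgresql://host/db")
                        .withUsername("user")
                        .withPassword("password"))
                .withQuery("SELECT id, payload FROM metadata_table")
                .withFetchSize(1000) // small fetch size; SparkRunner still OOMs
                // .withOutputParallelization(false) // the reshuffle toggle mentioned below
                .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
                .withRowMapper(
                    (JdbcIO.RowMapper<KV<String, String>>)
                        rs -> KV.of(rs.getString(1), rs.getString(2))));
    // ... downstream processing of the metadata ...

    p.run().waitUntilFinish();
  }
}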

On Fri, Oct 25, 2019 at 5:28 PM Ryan Skraba <r...@skraba.com> wrote:

> One more thing to try -- depending on your pipeline, you can disable
> the "auto-reshuffle" of JdbcIO.Read by setting
> withOutputParallelization(false)
>
> This is particularly useful if (1) you do aggressive and cheap
> filtering immediately after the read or (2) you do your own
> repartitioning action like GroupByKey after the read.
>
> Given your investigation into the heap, I doubt this will help!  I'll
> take a closer look at the DoFnOutputManager.  In the meantime, is
> there anything particularly about your job that might help
> investigate?
>
> All my best, Ryan
>
> On Fri, Oct 25, 2019 at 2:47 PM Jozef Vilcek <jozo.vil...@gmail.com>
> wrote:
> >
> > I agree I might have been too quick to say that DoFn output needs to fit
> > in memory. Actually, I am not sure what the Beam model says on this matter
> > or what the output managers of particular runners do about it.
> >
> > But SparkRunner definitely has an issue here. I did try setting a small
> > `fetchSize` for JdbcIO as well as changing `storageLevel` to
> > MEMORY_AND_DISK. All attempts fail with OOM.
> > When looking at the heap, most of it is used by the linked-list multimap
> > in the DoFnOutputManager here:
> >
> > https://github.com/apache/beam/blob/v2.15.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/MultiDoFnFunction.java#L234
> >
> >
>
