Hello!

If I remember correctly -- JdbcIO will use *one* DoFn instance to
read all of the rows, but that instance is not required to hold all of
the rows in memory at once.
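
Conceptually, a DoFn-based JDBC read looks something like the sketch
below.  This is not the actual JdbcIO source -- the class name,
connection URL and column access are made up for illustration -- but it
shows why the DoFn does not have to materialize the whole result: it
walks the ResultSet and emits one element at a time.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import org.apache.beam.sdk.transforms.DoFn;

  // Rough sketch of a DoFn-style JDBC read (hypothetical, not JdbcIO itself).
  static class ReadRowsFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
      String query = c.element();  // the query arrives as the input element
      try (Connection conn =
              DriverManager.getConnection("jdbc:postgresql://host/db");  // placeholder URL
          PreparedStatement stmt = conn.prepareStatement(query)) {
        stmt.setFetchSize(50_000);  // rows the driver buffers per round trip
        try (ResultSet rs = stmt.executeQuery()) {
          while (rs.next()) {
            c.output(rs.getString(1));  // emit each row as soon as it is read
          }
        }
      }
    }
  }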

The fetch size will, however, pull 50K rows at a time by default, and
those will all be held in memory on that single worker until they are
emitted.  You can adjust this with the withFetchSize(...) method on
JdbcIO.Read.
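
For example, something like this (the driver, connection string, query,
and row mapper are just placeholders; withFetchSize(...) is the knob I
mean):

  PCollection<String> rows =
      pipeline.apply(
          JdbcIO.<String>read()
              .withDataSourceConfiguration(
                  JdbcIO.DataSourceConfiguration.create(
                      "org.postgresql.Driver", "jdbc:postgresql://host/mydb"))
              .withQuery("SELECT name FROM my_table")
              .withRowMapper((JdbcIO.RowMapper<String>) rs -> rs.getString(1))
              .withCoder(StringUtf8Coder.of())
              // fewer rows per database round trip = less memory per worker
              .withFetchSize(10_000));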

By default, the JdbcIO.Read transform adds a "reshuffle", which
repartitions the records among all of the nodes in the cluster.  This
means that all of the rows need to fit into the total available memory
of the cluster (not just that one node's), especially if the RDD
underneath the PCollection is reused/persisted.  You can change the
persistence level to "MEMORY_AND_DISK" in this case if you want to
spill data to disk instead of failing your job:
https://github.com/apache/beam/blob/416f62bdd7fa092257921e4835a48094ebe1dda4/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java#L56
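
If I'm reading that options class right, that would be something like
the snippet below (setting --storageLevel=MEMORY_AND_DISK on the
command line should have the same effect):

  SparkPipelineOptions options =
      PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
  // Spill persisted RDDs to disk instead of failing when memory runs out.
  options.setStorageLevel("MEMORY_AND_DISK");
  Pipeline pipeline = Pipeline.create(options);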

I hope this helps!  Ryan




On Thu, Oct 24, 2019 at 4:26 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
> Hi
>
> JdbcIO is basically a DoFn, so it could load everything on a single executor
> (there's no obvious way to split).
>
> Is that what you mean?
>
> Regards
> JB
>
> On Oct 24, 2019, 15:26, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
>
> Hi,
>
> I need to read a big-ish data set via JdbcIO. This forced me to bump up the
> memory for my executor (I'm currently using the SparkRunner). It seems that
> JdbcIO requires all of the data to fit in memory, since it uses a DoFn to
> unfold the query into a list of elements.
>
> A BoundedSource would not need to fit the result in memory, but JdbcIO uses
> a DoFn. Also, in a recent discussion [1] it was suggested that BoundedSource
> should not be used, as it is obsolete.
>
> Has anyone faced this issue? What would be the best way to solve it? If the
> DoFn approach should be kept, then I can only think of splitting the query
> into ranges and trying to find the most fitting number of rows to read at once.
>
> I appreciate any thoughts.
>
> [1] 
> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Reading%20from%20RDB%2C%20ParDo%20or%20BoundedSource
>
>
