Hello! If I remember correctly -- the JdbcIO will use *one* DoFn instance to read all of the rows, but that instance is not required to hold all of the rows in memory.
The fetch size will, however, read 50K rows at a time by default, and those will all be held in memory in that single worker until they are emitted. You can adjust this setting with the withFetchSize(...) method on JdbcIO.Read.

By default, the JdbcIO.Read transform adds a "reshuffle", which will repartition the records among all of the nodes in the cluster. This means that all of the rows need to fit into the total available memory of the cluster (not just that one node), especially if the RDD underneath the PCollection is reused/persisted.

You can change the persistence level (the storageLevel pipeline option) to "MEMORY_AND_DISK" in this case if you want to spill data to disk instead of failing your job:

https://github.com/apache/beam/blob/416f62bdd7fa092257921e4835a48094ebe1dda4/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineOptions.java#L56

I hope this helps!

Ryan

On Thu, Oct 24, 2019 at 4:26 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
> Hi
>
> JdbcIO is basically a DoFn. So it could load everything on a single executor
> (there's no obvious way to split).
>
> Is that what you mean?
>
> Regards
> JB
>
> On 24 Oct 2019 at 15:26, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
>
> Hi,
>
> I need to read a big-ish data set via JdbcIO. This forced me to bump
> up memory for my executor (right now using SparkRunner). It seems that JdbcIO
> has a requirement to fit all data in memory, as it is using a DoFn to unfold
> the query into a list of elements.
>
> BoundedSource would not face the need to fit the result in memory, but JdbcIO is
> using a DoFn. Also, in a recent discussion [1] it was suggested that
> BoundedSource should not be used, as it is obsolete.
>
> Has anyone faced this issue? What would be the best way to solve it? If the DoFn
> should be kept, then I can only think of splitting the query into ranges and
> trying to find the best-fitting number of rows to read at once.
>
> I appreciate any thoughts.
>
> [1]
> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Reading%20from%20RDB%2C%20ParDo%20or%20BoundedSource
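
For what it's worth, the "split the query into ranges" idea from the original question could be sketched roughly like this. This is only an illustration, not something JdbcIO does for you: the RangeSplitter class, the my_table table and the numeric id column are all hypothetical, and it assumes the key is numeric and reasonably evenly distributed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits an inclusive id range into contiguous sub-ranges. */
public class RangeSplitter {

    /**
     * Splits [minId, maxId] (inclusive) into at most `parts` contiguous
     * ranges {lo, hi}, where lo is inclusive and hi is exclusive.
     */
    public static List<long[]> split(long minId, long maxId, int parts) {
        if (parts <= 0 || maxId < minId) {
            throw new IllegalArgumentException("need parts > 0 and maxId >= minId");
        }
        long total = maxId - minId + 1;           // number of ids to cover
        long step = (total + parts - 1) / parts;  // ceiling division
        List<long[]> ranges = new ArrayList<>();
        for (long lo = minId; lo <= maxId; lo += step) {
            long hi = Math.min(lo + step, maxId + 1); // clamp the last range
            ranges.add(new long[] {lo, hi});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Table and column names are assumptions, for illustration only.
        for (long[] r : split(1, 1_000_000, 4)) {
            System.out.printf(
                "SELECT * FROM my_table WHERE id >= %d AND id < %d%n", r[0], r[1]);
        }
    }
}
```

Each {lo, hi} pair could then become one input element whose values are bound as query parameters (JdbcIO.readAll() with a parameter setter looks like the natural fit), so that no single query has to return the whole table. The number of parts is something you would have to tune against the executor memory.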