Hi Josef,

JdbcIO per se does not require the result set to fit in memory. The issues come from the limitations of the context in which it runs:

- It indeed uses a DoFn to emit results. A DoFn is in general allowed to emit an unbounded number of results that don't necessarily have to fit in memory, but some runners may have this requirement (e.g. Spark probably does, Dataflow doesn't, not sure about the others).
- JdbcIO uses a database cursor provided by the underlying JDBC driver to read through the results. Again, depending on the particular JDBC driver, the cursor may or may not be able to stream the results without storing all of them in memory.
- The biggest issue, though, is that there's no way to automatically split the execution of a JDBC query into several sub-queries whose results together are equivalent to the result of the original query. Because of this, it is not possible to implement JdbcIO in a way that would *automatically* avoid scanning through the entire result set: scanning through it sequentially is the only way JDBC drivers (and most databases) allow you to access query results. Even if we chose to use BoundedSource, we wouldn't be able to implement the split() method.
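To illustrate the cursor point: whether the JDBC cursor streams is largely a matter of driver-specific configuration. A minimal plain-JDBC sketch (the streamingFetchSize helper is hypothetical; the driver-specific values reflect documented PostgreSQL and MySQL Connector/J behavior):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamingCursors {
    // Hypothetical helper: pick a fetch size that asks the driver to stream
    // rather than buffer the whole result set. PostgreSQL streams with any
    // positive fetch size (when auto-commit is off); MySQL Connector/J
    // streams only when the fetch size is Integer.MIN_VALUE.
    public static int streamingFetchSize(String jdbcUrl) {
        if (jdbcUrl.startsWith("jdbc:mysql:")) {
            return Integer.MIN_VALUE;
        }
        return 1000; // a modest batch size, treated as a hint by most drivers
    }

    public static void forEachRow(Connection conn, String query) throws SQLException {
        conn.setAutoCommit(false); // required for PostgreSQL to use a server-side cursor
        try (Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(streamingFetchSize(conn.getMetaData().getURL()));
            try (ResultSet rs = stmt.executeQuery(query)) {
                while (rs.next()) {
                    // process one row at a time; only one fetch batch is in memory
                }
            }
        }
    }
}
```

If the driver ignores the fetch-size hint (or the configuration above is missing), the entire result set can end up buffered client-side regardless of what the pipeline does with it.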
If you need to read query results in parallel, or to circumvent memory limitations of a particular runner or JDBC driver, you can use JdbcIO.readAll() and parameterize your query such that passing all the parameter values together adds up to the original query you wanted. Most likely it would be something like transforming "SELECT * FROM TABLE" into a family of queries "SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND ?" and passing primary key ranges that add up to the full range of the table's keys.

Note that whether this performs better will also depend on the database - e.g. if the database is already bottlenecked, then reading from it in parallel will not make things faster.

On Thu, Oct 24, 2019 at 7:26 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi
>
> JdbcIO is basically a DoFn. So it could load all on a single executor
> (there's no obvious way to split).
>
> Is that what you mean?
>
> Regards
> JB
>
> On 24 Oct. 2019 at 15:26, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
>
> Hi,
>
> I am in need of reading a big-ish data set via JdbcIO. This forced me to
> bump up memory for my executor (right now using SparkRunner). It seems that
> JdbcIO has a requirement to fit all data in memory, as it is using a DoFn to
> unfold the query into a list of elements.
>
> BoundedSource would not face the need to fit the result in memory, but JdbcIO
> is using DoFn. Also, in a recent discussion [1] it was suggested that
> BoundedSource should not be used as it is obsolete.
>
> Has anyone faced this issue? What would be the best way to solve it? If
> DoFn should be kept, then I can only think of splitting the query into ranges
> and trying to find the most fitting number of rows to read at once.
>
> I appreciate any thoughts.
>
> [1]
> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Reading%20from%20RDB%2C%20ParDo%20or%20BoundedSource
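P.S. For concreteness, a rough sketch of the range-splitting part of the readAll() approach described above. The PrimaryKeyRanges helper and all names are made up for illustration; the commented-out Beam wiring assumes JdbcIO.readAll()'s builder methods (withQuery, withParameterSetter, withRowMapper) and is a sketch, not a tested pipeline:

```java
import java.util.ArrayList;
import java.util.List;

public class PrimaryKeyRanges {
    // Hypothetical helper: split the key span [min, max] into at most `n`
    // contiguous [lo, hi] ranges whose union is exactly the original span.
    // Each range becomes one parameter element for the BETWEEN ? AND ? query.
    public static List<long[]> split(long min, long max, int n) {
        List<long[]> ranges = new ArrayList<>();
        long total = max - min + 1;
        long base = total / n;
        long remainder = total % n;
        long lo = min;
        for (int i = 0; i < n && lo <= max; i++) {
            long size = base + (i < remainder ? 1 : 0);
            long hi = lo + size - 1;
            ranges.add(new long[] {lo, hi});
            lo = hi + 1;
        }
        return ranges;
    }

    // The ranges would then feed JdbcIO.readAll(), roughly like this
    // (sketch only; table/column names and mapRow are made up):
    //
    //   pipeline
    //     .apply(Create.of(PrimaryKeyRanges.split(minKey, maxKey, 100)))
    //     .apply(JdbcIO.<long[], MyRow>readAll()
    //         .withDataSourceConfiguration(config)
    //         .withQuery("SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND ?")
    //         .withParameterSetter((range, stmt) -> {
    //             stmt.setLong(1, range[0]);
    //             stmt.setLong(2, range[1]);
    //         })
    //         .withRowMapper(rs -> mapRow(rs))
    //         .withCoder(myRowCoder));
}
```

The min/max keys themselves could come from a cheap "SELECT MIN(MY_PRIMARY_KEY), MAX(MY_PRIMARY_KEY) FROM TABLE" query up front; note that if the keys are sparse, some ranges will simply return fewer (or no) rows, which is harmless.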