Sorry, I just realized I've made a mistake. BoundedSource in some runners may not have the same "fits in memory" limitation as DoFns, so in that sense you're right - if it were done as a BoundedSource, it might work better in your case, even if it didn't read things in parallel.
On Thu, Oct 24, 2019 at 8:17 AM Eugene Kirpichov <kirpic...@google.com> wrote:
> Hi Josef,
>
> JdbcIO per se does not require the result set to fit in memory. The issues
> come from the limitations of the context in which it runs:
> - It indeed uses a DoFn to emit results; a DoFn is in general allowed to
> emit an unbounded number of results that does not necessarily have to fit
> in memory, but some runners may impose this requirement (e.g. Spark
> probably does, Dataflow doesn't; not sure about the others).
> - JdbcIO uses a database cursor provided by the underlying JDBC driver to
> read through the results. Depending on the particular JDBC driver, the
> cursor may or may not be able to stream the results without storing all of
> them in memory.
> - The biggest issue, though, is that there's no way to automatically split
> the execution of a JDBC query into several sub-queries whose results
> together are equivalent to the result of the original query. Because of
> this, it is not possible to implement JdbcIO in a way that would
> *automatically* avoid scanning through the entire result set: scanning
> through it sequentially is the only way JDBC drivers (and most databases)
> allow you to access query results. Even if we chose to use BoundedSource,
> we wouldn't be able to implement the split() method.
>
> If you need to read query results in parallel, or to circumvent memory
> limitations of a particular runner or JDBC driver, you can use
> JdbcIO.readAll() and parameterize your query such that passing all the
> parameter values together adds up to the original query you wanted. Most
> likely this would mean transforming "SELECT * FROM TABLE" into a family of
> queries "SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND ?" and
> passing primary key ranges that add up to the full range of the table's
> keys.
>
> Note that whether this performs better will also depend on the database -
> e.g. if the database is already bottlenecked, then reading from it in
> parallel will not make things faster.
>
> On Thu, Oct 24, 2019 at 7:26 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> Hi,
>>
>> JdbcIO is basically a DoFn, so it could load everything on a single
>> executor (there's no obvious way to split).
>>
>> Is that what you mean?
>>
>> Regards
>> JB
>>
>> On 24 Oct 2019 at 15:26, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
>>
>> Hi,
>>
>> I need to read a big-ish data set via JdbcIO. This forced me to bump up
>> memory for my executor (currently using SparkRunner). It seems that
>> JdbcIO requires all data to fit in memory, as it uses a DoFn to unfold
>> the query into a list of elements.
>>
>> A BoundedSource would not need to fit the result in memory, but JdbcIO
>> uses a DoFn. Also, in a recent discussion [1] it was suggested that
>> BoundedSource should not be used, as it is obsolete.
>>
>> Has anyone faced this issue? What would be the best way to solve it? If
>> the DoFn should be kept, then I can only think of splitting the query
>> into ranges and trying to find the most fitting number of rows to read
>> at once.
>>
>> I appreciate any thoughts.
>>
>> [1]
>> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Reading%20from%20RDB%2C%20ParDo%20or%20BoundedSource
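To make the readAll() approach concrete, here is a rough sketch of the key-range computation it needs. The KeyRanges class and its split() helper are hypothetical names invented for illustration (they are not part of Beam); only the query template "SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND ?" comes from the thread above. The idea is to split the table's primary-key space into contiguous, non-overlapping ranges whose union covers the full key space, so that the sub-queries together are equivalent to the original query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split the primary-key space [minKey, maxKey] into
// n contiguous, non-overlapping ranges covering the whole space. Each
// {lo, hi} pair becomes one parameter for the query
// "SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND ?".
public class KeyRanges {

    public static List<long[]> split(long minKey, long maxKey, int n) {
        List<long[]> ranges = new ArrayList<>();
        long total = maxKey - minKey + 1;
        long base = total / n;   // minimum number of keys per range
        long rem = total % n;    // first 'rem' ranges get one extra key
        long start = minKey;
        for (int i = 0; i < n; i++) {
            long size = base + (i < rem ? 1 : 0);
            ranges.add(new long[] { start, start + size - 1 });
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(1, 100, 4)) {
            System.out.println(r[0] + " - " + r[1]);
        }
        // prints: 1 - 25, 26 - 50, 51 - 75, 76 - 100
    }
}
```

In a Beam pipeline, the resulting ranges would then be made into a PCollection and fed to JdbcIO.readAll(), with a parameter setter binding r[0] and r[1] to the two placeholders; the exact wiring depends on the Beam version, so check the JdbcIO Javadoc for the current readAll() API.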