Hi Jozef,

JdbcIO per se does not require the result set to fit in memory. The issues
come from the limitations of the context in which it runs:
- It indeed uses a DoFn to emit results; a DoFn is in general allowed to
emit an unbounded number of results that don't necessarily have to fit in
memory, but some runners may have this requirement (e.g. Spark probably
does, Dataflow doesn't, not sure about the others).
- JdbcIO uses a database cursor provided by the underlying JDBC driver to
read through the results. Again, depending on the particular JDBC driver,
the cursor may or may not be able to stream the results without storing all
of them in memory.
- The biggest issue, though, is that there's no way to automatically split
the execution of a JDBC query into several sub-queries whose results
together are equivalent to the result of the original query. Because of
this, it is not possible to implement JdbcIO in a way that *automatically*
avoids scanning through the entire result set, because
scanning through the entire result set sequentially is the only way JDBC
drivers (and most databases) allow you to access query results. Even if we
chose to use BoundedSource, we wouldn't be able to implement the split()
method.

If you need to read query results in parallel, or to circumvent memory
limitations of a particular runner or JDBC driver, you can use
JdbcIO.readAll(), and parameterize your query such that the results for all
the parameter values together add up to the result of the original query. Most
likely it would be something like transforming "SELECT * FROM TABLE" to a
family of queries "SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND
?" and passing primary key ranges adding up to the full range of the
table's keys.
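As a sketch of that idea (the class and method names here are hypothetical, and it assumes a numeric primary key whose min/max you've already looked up): compute contiguous key sub-ranges whose union covers the whole key space, and feed each pair to the two "?" placeholders of the parameterized query.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyRanges {
  // Split the inclusive key range [min, max] into at most n contiguous
  // sub-ranges whose union is exactly [min, max].
  static List<long[]> splitRange(long min, long max, int n) {
    List<long[]> ranges = new ArrayList<>();
    long total = max - min + 1;
    long step = (total + n - 1) / n;  // ceiling division
    for (long lo = min; lo <= max; lo += step) {
      long hi = Math.min(lo + step - 1, max);
      ranges.add(new long[] {lo, hi});
    }
    return ranges;
  }

  public static void main(String[] args) {
    // E.g. keys 1..1000 split into 4 sub-queries of the form
    // "SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND ?":
    for (long[] r : splitRange(1, 1000, 4)) {
      System.out.println(r[0] + ".." + r[1]);
    }
    // In a Beam pipeline, these pairs would become the input elements for
    // JdbcIO.readAll(), with the parameter setter binding r[0] and r[1]
    // to the two '?' placeholders of each sub-query.
  }
}
```

This keeps each sub-query's result set small enough for the runner and the JDBC driver to handle, at the cost of having to know (or query for) the key bounds up front.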

Note that whether this performs better will also depend on the database:
e.g. if the database is already bottlenecked, then reading from it in
parallel will not make things faster.

On Thu, Oct 24, 2019 at 7:26 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi
>
> JdbcIO is basically a DoFn. So it could load all on a single executor
> (there's no obvious way to split).
>
> Is that what you mean?
>
> Regards
> JB
>
> On 24 Oct 2019, 15:26, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
>
> Hi,
>
> I need to read a big-ish data set via JdbcIO. This forced me to
> bump up memory for my executor (right now using SparkRunner). It seems that
> JdbcIO requires all data to fit in memory, as it is using a DoFn to
> unfold the query into a list of elements.
>
> BoundedSource would not face the need to fit the result in memory, but
> JdbcIO is using a DoFn. Also, in a recent discussion [1] it was suggested
> that BoundedSource should not be used as it is obsolete.
>
> Has anyone faced this issue? What would be the best way to solve it? If the
> DoFn should be kept, then I can only think of splitting the query into
> ranges and trying to find the most fitting number of rows to read at once.
>
> I appreciate any thoughts.
>
> [1]
> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Reading%20from%20RDB%2C%20ParDo%20or%20BoundedSource
>
>
>
