Sorry, I just realized I've made a mistake. BoundedSource in some runners may not have the same "fits in memory" limitation as DoFns, so in that sense you're right - if it were done as a BoundedSource, it might work better in your case, even if it didn't read things in parallel.
On Thu, Oct 24, 2019 at 8:17 AM Eugene Kirpichov <kirpic...@google.com> wrote:
> Hi Josef,
>
> JdbcIO per se does not require the result set to fit in memory. The issues
> come from the limitations of the context in which it runs:
> - It indeed uses a DoFn to emit results; a DoFn is in general allowed to
> emit an unbounded number of results that does not necessarily have to fit
> in memory, but some runners may impose this requirement (e.g. Spark
> probably does, Dataflow doesn't; not sure about the others).
> - JdbcIO uses a database cursor provided by the underlying JDBC driver to
> read through the results. Depending on the particular JDBC driver, the
> cursor may or may not be able to stream the results without storing all of
> them in memory.
> - The biggest issue, though, is that there's no way to automatically split
> the execution of a JDBC query into several sub-queries whose results
> together are equivalent to the result of the original query. Because of
> this, it is not possible to implement JdbcIO in a way that would
> *automatically* avoid scanning through the entire result set: scanning
> through it sequentially is the only way JDBC drivers (and most databases)
> allow you to access query results. Even if we chose to use BoundedSource,
> we wouldn't be able to implement the split() method.
>
> If you need to read query results in parallel, or to circumvent memory
> limitations of a particular runner or JDBC driver, you can use
> JdbcIO.readAll() and parameterize your query such that passing all the
> parameter values together adds up to the original query you wanted. Most
> likely this would mean transforming "SELECT * FROM TABLE" into a family of
> queries "SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND ?" and
> passing primary key ranges that add up to the full range of the table's
> keys.
>
> Note that whether this performs better will also depend on the database -
> e.g. if the database is already bottlenecked, then reading from it in
> parallel will not make things faster.
>
> On Thu, Oct 24, 2019 at 7:26 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> Hi,
>>
>> JdbcIO is basically a DoFn, so it could load everything on a single
>> executor (there's no obvious way to split).
>>
>> Is that what you mean?
>>
>> Regards
>> JB
>>
>> On 24 Oct 2019 at 15:26, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
>>
>> Hi,
>>
>> I need to read a big-ish data set via JdbcIO. This forced me to bump up
>> memory for my executor (currently using SparkRunner). It seems that
>> JdbcIO requires all data to fit in memory, as it uses a DoFn to unfold
>> the query into a list of elements.
>>
>> A BoundedSource would not need to fit the result in memory, but JdbcIO
>> uses a DoFn. Also, in a recent discussion [1] it was suggested that
>> BoundedSource should not be used, as it is obsolete.
>>
>> Has anyone faced this issue? What would be the best way to solve it? If
>> the DoFn should be kept, then I can only think of splitting the query
>> into ranges and trying to find the most fitting number of rows to read
>> at once.
>>
>> I appreciate any thoughts.
>>
>> [1]
>> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Reading%20from%20RDB%2C%20ParDo%20or%20BoundedSource
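To make the readAll() approach concrete, here is a rough sketch of the key-range computation it needs. The KeyRanges class and its split() helper are hypothetical names invented for illustration (they are not part of Beam); only the query template "SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND ?" comes from the thread above. The idea is to split the table's primary-key space into contiguous, non-overlapping ranges whose union covers the full key space, so that the sub-queries together are equivalent to the original query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split the primary-key space [minKey, maxKey] into
// n contiguous, non-overlapping ranges covering the whole space. Each
// {lo, hi} pair becomes one parameter for the query
// "SELECT * FROM TABLE WHERE MY_PRIMARY_KEY BETWEEN ? AND ?".
public class KeyRanges {

    public static List<long[]> split(long minKey, long maxKey, int n) {
        List<long[]> ranges = new ArrayList<>();
        long total = maxKey - minKey + 1;
        long base = total / n;   // minimum number of keys per range
        long rem = total % n;    // first 'rem' ranges get one extra key
        long start = minKey;
        for (int i = 0; i < n; i++) {
            long size = base + (i < rem ? 1 : 0);
            ranges.add(new long[] { start, start + size - 1 });
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(1, 100, 4)) {
            System.out.println(r[0] + " - " + r[1]);
        }
        // prints: 1 - 25, 26 - 50, 51 - 75, 76 - 100
    }
}
```

In a Beam pipeline, the resulting ranges would then be made into a PCollection and fed to JdbcIO.readAll(), with a parameter setter binding r[0] and r[1] to the two placeholders; the exact wiring depends on the Beam version, so check the JdbcIO Javadoc for the current readAll() API.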