Re: JdbcIO read needs to fit in memory

2019-10-29 Thread Jozef Vilcek
On Tue, Oct 29, 2019 at 10:04 AM Ryan Skraba wrote: > I didn't get a chance to try this out -- it sounds like a bug with the > SparkRunner, if you've tested it with FlinkRunner and it succeeded. > > From your description, it should be reproducible by reading any large > database table with the Sp

Re: JdbcIO read needs to fit in memory

2019-10-29 Thread Ryan Skraba
I didn't get a chance to try this out -- it sounds like a bug with the SparkRunner, if you've tested it with FlinkRunner and it succeeded. From your description, it should be reproducible by reading any large database table with the SparkRunner where the entire dataset is greater than the memory

Re: JdbcIO read needs to fit in memory

2019-10-29 Thread Jozef Vilcek
I cannot find anything in the docs about the expected behavior of a DoFn emitting an arbitrarily large number of elements in one processElement(). I wonder if the SparkRunner behavior is a bug, or just a difference (and a disadvantage in this case) in execution, more along the lines of runner capability matrix differences. Also, i

Re: JdbcIO read needs to fit in memory

2019-10-27 Thread Jozef Vilcek
typo in my previous message. I meant to say => JDBC is `not` the main data set, just metadata On Sun, Oct 27, 2019 at 6:00 PM Jozef Vilcek wrote: > Result of my query can fit the memory if I use 12GB heap per spark > executor. This makes the job quite inefficient as past JDBC load job runs > fin

Re: JdbcIO read needs to fit in memory

2019-10-27 Thread Jozef Vilcek
The result of my query can fit in memory if I use a 12GB heap per Spark executor. This makes the job quite inefficient, as a past JDBC load job ran fine with a 4GB heap to do the main heavy lifting - JDBC is the main data set, just metadata. I just ran the same JdbcIO read code on the Spark and Flink runne

Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Ryan Skraba
One more thing to try -- depending on your pipeline, you can disable the "auto-reshuffle" of JdbcIO.Read by setting withOutputParallelization(false). This is particularly useful if (1) you do aggressive and cheap filtering immediately after the read, or (2) you do your own repartitioning action like
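The suggestion above can be sketched as pipeline wiring. This is a configuration sketch only, not code from the thread: the driver, URL, table, and column names are placeholders, and it assumes a Beam version where JdbcIO.Read exposes withOutputParallelization and withFetchSize.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;

public class JdbcReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadRows", JdbcIO.<KV<Long, String>>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://host/db"))  // placeholder
        .withQuery("SELECT id, payload FROM big_table")             // placeholder
        .withFetchSize(1_000)             // rows buffered per JDBC fetch
        .withOutputParallelization(false) // skip the built-in reshuffle
        .withRowMapper(rs -> KV.of(rs.getLong("id"), rs.getString("payload")))
        .withCoder(KvCoder.of(VarLongCoder.of(), StringUtf8Coder.of())));

    p.run().waitUntilFinish();
  }
}
```

Skipping the reshuffle avoids materializing the whole result set for redistribution, at the cost of the read's output staying on one worker until you repartition yourself.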

Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Eugene Kirpichov
Yeah - in this case your primary option is to use JdbcIO.readAll() and shard your query, as suggested above. Alternative hypothesis: is the result set of your query actually big enough that it *shouldn't* fit in memory? Or could it be a matter of inefficient storage of its elements? Could you brie
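The "shard your query" idea behind JdbcIO.readAll() boils down to turning one big query into many range-bounded ones. A minimal, self-contained sketch of that splitting step (the shard count and id range are hypothetical; each pair would parameterize a query like `SELECT ... WHERE id BETWEEN ? AND ?` fed to readAll()):

```java
import java.util.ArrayList;
import java.util.List;

public class QuerySharding {
    /** Split the inclusive id range [min, max] into at most n contiguous shards. */
    static List<long[]> shardRange(long min, long max, int n) {
        List<long[]> shards = new ArrayList<>();
        long span = max - min + 1;
        long step = (span + n - 1) / n;  // ceiling division so no ids are dropped
        for (long lo = min; lo <= max; lo += step) {
            shards.add(new long[] {lo, Math.min(lo + step - 1, max)});
        }
        return shards;
    }

    public static void main(String[] args) {
        // Four shards over a hypothetical 1..1,000,000 id range.
        for (long[] s : shardRange(1, 1_000_000, 4)) {
            System.out.println(s[0] + ".." + s[1]);
        }
    }
}
```

Each shard can then be read independently, so no single worker has to hold the full result set.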

Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Jozef Vilcek
I agree I might have been too quick to claim that DoFn output needs to fit in memory. Actually, I am not sure what the Beam model says on this matter, or what the output managers of particular runners do about it. But SparkRunner definitely has an issue here. I did try setting a small `fetchSize` for JdbcIO as well as changing `

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Alexey Romanenko
Jozef, do you have any NPE stacktrace to share? > On 24 Oct 2019, at 15:26, Jozef Vilcek wrote: > > Hi, > > I am in a need to read a big-ish data set via JdbcIO. This forced me to bump > up memory for my executor (right now using SparkRunner). It seems that JdbcIO > has a requirement to fit a

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Ryan Skraba
Hello! If I remember correctly -- the JdbcIO will use *one* DoFn instance to read all of the rows, but that instance is not required to hold all of the rows in memory. The fetch size will, however, read 50K rows at a time by default and those will all be held in memory in that single worker until
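The fetch-size point above is easy to quantify with back-of-the-envelope arithmetic (the 2 KB average row size is a hypothetical figure, not from the thread):

```java
public class FetchSizeEstimate {
    /** Rough heap (bytes) needed to hold one JDBC fetch batch on a worker. */
    static long batchBytes(int fetchSize, int avgRowBytes) {
        return (long) fetchSize * avgRowBytes;
    }

    public static void main(String[] args) {
        // With hypothetical 2 KB rows, a 50,000-row fetch holds ~100 MB at
        // once on the reading worker; a 1,000-row fetch holds ~2 MB.
        System.out.println(batchBytes(50_000, 2_048) / (1024 * 1024) + " MB");
        System.out.println(batchBytes(1_000, 2_048) / (1024 * 1024) + " MB");
    }
}
```

This is only the batch held by the JDBC driver; any downstream buffering or reshuffle in the runner adds to it.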

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Eugene Kirpichov
Sorry, I just realized I've made a mistake. BoundedSource in some runners may not have the same "fits in memory" limitation as DoFns, so in that sense you're right - if it was done as a BoundedSource, perhaps it would work better in your case, even if it didn't read things in parallel. On Thu, Oc

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Eugene Kirpichov
Hi Jozef, JdbcIO per se does not require the result set to fit in memory. The issues come from the limitations of the context in which it runs: - It indeed uses a DoFn to emit results; a DoFn is in general allowed to emit an unbounded number of results that don't necessarily have to fit in memor

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Jean-Baptiste Onofré
Hi, JdbcIO is basically a DoFn, so it could load everything on a single executor (there's no obvious way to split). Is that what you mean? Regards, JB. On 24 Oct 2019 at 15:26, Jozef Vilcek wrote: Hi, I need to read a big-ish data set via JdbcIO. This forced me to bump up memory for my executor (right now u

JdbcIO read needs to fit in memory

2019-10-24 Thread Jozef Vilcek
Hi, I need to read a big-ish data set via JdbcIO. This forced me to bump up memory for my executor (right now using SparkRunner). It seems that JdbcIO requires all data to fit in memory, as it uses a DoFn to unfold the query into a list of elements. BoundedSource would not face the need