Re: JdbcIO read needs to fit in memory

2019-10-29 Thread Jozef Vilcek
On Tue, Oct 29, 2019 at 10:04 AM Ryan Skraba wrote:
> I didn't get a chance to try this out -- it sounds like a bug with the SparkRunner, if you've tested it with FlinkRunner and it succeeded.
> From your description, it should be reproducible by reading any large database table with the

Re: JdbcIO read needs to fit in memory

2019-10-29 Thread Ryan Skraba
I didn't get a chance to try this out -- it sounds like a bug with the SparkRunner, if you've tested it with FlinkRunner and it succeeded. From your description, it should be reproducible by reading any large database table with the SparkRunner where the entire dataset is greater than the memory

Re: JdbcIO read needs to fit in memory

2019-10-29 Thread Jozef Vilcek
I cannot find anything in the docs about the expected behavior of a DoFn emitting an arbitrarily large number of elements in one processElement(). I wonder if the SparkRunner behavior is a bug or just an execution difference (and a disadvantage in this case), more along the lines of runner capability matrix differences. Also,

Re: JdbcIO read needs to fit in memory

2019-10-27 Thread Jozef Vilcek
The result of my query can fit in memory if I use a 12GB heap per Spark executor. This makes the job quite inefficient, as the previous JDBC load job ran fine with a 4GB heap to do the main heavy lifting - the JDBC data is just metadata, not the main data set. I just ran the same JdbcIO read code on Spark and Flink
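For context, the executor heap sizes discussed above are typically set at submit time. A sketch with standard Spark flags; the jar, main class, and pipeline arguments below are placeholders, not from this thread:

```shell
# Sketch: standard spark-submit flags; jar, class, and pipeline args are placeholders.
# 12g is the heap the thread reports was needed for the JdbcIO read; 4g sufficed
# for the rest of the job.
spark-submit \
  --class com.example.MyBeamPipeline \
  --executor-memory 12g \
  my-pipeline.jar --runner=SparkRunner
```

The same value can also be set via the `spark.executor.memory` configuration property.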

Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Ryan Skraba
One more thing to try -- depending on your pipeline, you can disable the "auto-reshuffle" of JdbcIO.Read by setting withOutputParallelization(false). This is particularly useful if (1) you do aggressive and cheap filtering immediately after the read, or (2) you do your own repartitioning action
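For reference, the JdbcIO.Read options mentioned in this thread look roughly like the following. This is a sketch, not a runnable program: it assumes Beam's Java SDK and a JDBC driver on the classpath, and the connection details, query, and row mapper are placeholders:

```java
// Sketch only: assumes org.apache.beam.sdk.io.jdbc.JdbcIO is available.
// Connection URL, query, and myRowMapper are hypothetical placeholders.
PCollection<Row> rows = pipeline.apply(
    JdbcIO.<Row>read()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver", "jdbc:postgresql://host/db"))
        .withQuery("SELECT id, payload FROM big_table")
        .withRowMapper(myRowMapper)
        .withFetchSize(1_000)                // smaller batches per cursor fetch
        .withOutputParallelization(false));  // skip the auto-reshuffle
```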

Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Jozef Vilcek
I agree I might have been too quick to claim that DoFn output needs to fit in memory. Actually I am not sure what the Beam model says on this matter, or what the output managers of particular runners do about it. But SparkRunner definitely has an issue here. I did try setting a small `fetchSize` for JdbcIO as well as changing

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Alexey Romanenko
Jozef, do you have any NPE stacktrace to share?
> On 24 Oct 2019, at 15:26, Jozef Vilcek wrote:
>
> Hi,
>
> I am in a need to read a big-ish data set via JdbcIO. This forced me to bump up memory for my executor (right now using SparkRunner). It seems that JdbcIO has a requirement to fit

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Ryan Skraba
Hello! If I remember correctly -- the JdbcIO will use *one* DoFn instance to read all of the rows, but that instance is not required to hold all of the rows in memory. The fetch size will, however, read 50K rows at a time by default and those will all be held in memory in that single worker
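To illustrate the point above, that a fetch size bounds memory per batch rather than per result set, here is a minimal pure-Java sketch (hypothetical names, not Beam or JDBC API). A cursor-style reader pulls `fetchSize` rows at a time, so only one batch is resident at once, mirroring JdbcIO's default of 50K rows per fetch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: drain a large "result set" in batches of
// fetchSize rows. Memory held at any moment is one batch, not the full set.
class FetchSizeSketch {
    static int maxBatchResident = 0;

    // Simulates reading totalRows rows with a given driver fetch size.
    static long drain(int totalRows, int fetchSize) {
        long processed = 0;
        int offset = 0;
        while (offset < totalRows) {
            List<Integer> batch = new ArrayList<>();
            int end = Math.min(offset + fetchSize, totalRows);
            for (int i = offset; i < end; i++) {
                batch.add(i); // at most fetchSize rows held here
            }
            maxBatchResident = Math.max(maxBatchResident, batch.size());
            processed += batch.size(); // downstream consumes; batch is then freed
            offset = end;
        }
        return processed;
    }

    public static void main(String[] args) {
        long n = drain(1_000_000, 50_000); // 50K mirrors JdbcIO's default fetch size
        System.out.println(n + " rows, max resident " + maxBatchResident);
    }
}
```

Note this only models the driver-side fetch; whether a runner additionally buffers a DoFn's entire output (the suspected SparkRunner issue in this thread) is a separate question.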

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Eugene Kirpichov
Sorry, I just realized I've made a mistake. BoundedSource in some runners may not have the same "fits in memory" limitation as DoFn's, so in that sense you're right - if it was done as a BoundedSource, perhaps it would work better in your case, even if it didn't read things in parallel. On Thu,

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Jean-Baptiste Onofré
Hi, JdbcIO is basically a DoFn, so it could load everything on a single executor (there's no obvious way to split). Is that what you mean? Regards, JB
> On 24 Oct 2019 at 15:26, Jozef Vilcek wrote:
>
> Hi,
>
> I am in a need to read a big-ish data set via JdbcIO. This forced me to bump up memory for my executor (right now