On Tue, Oct 29, 2019 at 10:04 AM Ryan Skraba wrote:
I didn't get a chance to try this out -- it sounds like a bug with the
SparkRunner, if you've tested it with FlinkRunner and it succeeded.
From your description, it should be reproducible by reading any large
database table with the SparkRunner where the entire dataset is
greater than the memory
I cannot find anything in the docs about the expected behavior of a DoFn
emitting an arbitrarily large number of elements from one processElement()
call. I wonder whether the SparkRunner behavior is a bug, or just an
execution difference (and a disadvantage in this case) that belongs more
in the runner capability matrix.
Also, the result of my query can fit in memory if I use a 12GB heap per
Spark executor. This makes the job quite inefficient, as past the JDBC load
the job runs fine with a 4GB heap to do the main heavy lifting - the JDBC
read is the main data set; the rest is just metadata.
I just ran the same JdbcIO read code on Spark and Flink
One more thing to try -- depending on your pipeline, you can disable
the "auto-reshuffle" of JdbcIO.Read by setting
withOutputParallelization(false). This is particularly useful if (1) you
do aggressive and cheap filtering immediately after the read, or (2) you
do your own repartitioning action.
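A minimal sketch of what that configuration might look like. This is a
hedged example, not code from the thread: the driver class, JDBC URL, query,
and the MyRow type are all hypothetical placeholders.

```java
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.PCollection;

// Sketch only: MyRow, the driver, URL, and query are placeholders.
PCollection<MyRow> rows = pipeline.apply(
    JdbcIO.<MyRow>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://host/db"))
        .withQuery("SELECT id, payload FROM big_table")
        .withRowMapper(rs -> new MyRow(rs.getLong(1), rs.getString(2)))
        .withCoder(SerializableCoder.of(MyRow.class))
        // Skip JdbcIO's internal Reshuffle after the read; downstream
        // filtering or an explicit repartition is then responsible for
        // spreading the work across workers.
        .withOutputParallelization(false));
```

With the reshuffle disabled, the single-worker read output flows straight
into the next transform instead of being materialized and redistributed
first.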
I agree I might have been too quick to claim that a DoFn's output needs to
fit in memory. Actually, I am not sure what the Beam model says on this
matter, or what the output managers of particular runners do about it.
But the SparkRunner definitely has an issue here. I did try setting a small
`fetchSize` for JdbcIO as well as changing
Jozef, do you have any NPE stacktrace to share?
> On 24 Oct 2019, at 15:26, Jozef Vilcek wrote:
>
> Hi,
>
> I need to read a big-ish data set via JdbcIO. This forced me to bump
> up memory for my executor (right now using SparkRunner). It seems that JdbcIO
> has a requirement to fit
Hello!
If I remember correctly -- the JdbcIO will use *one* DoFn instance to
read all of the rows, but that instance is not required to hold all of
the rows in memory.
The fetch size will, however, read 50K rows at a time by default, and
those will all be held in memory on that single worker.
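For illustration, a hedged sketch of lowering the fetch size on a JdbcIO
read; the connection details, query, and MyRow type are made up for the
example.

```java
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;

// Sketch only: names and connection details are hypothetical.
JdbcIO.<MyRow>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
        "org.postgresql.Driver", "jdbc:postgresql://host/db"))
    .withQuery("SELECT id, payload FROM big_table")
    // Default fetch size is 50,000 rows per round trip; a smaller value
    // caps how many rows the single reading worker buffers at once,
    // trading memory for more database round trips.
    .withFetchSize(1000)
    .withRowMapper(rs -> new MyRow(rs.getLong(1), rs.getString(2)))
    .withCoder(SerializableCoder.of(MyRow.class));
```

Note that the fetch size bounds the rows buffered per round trip, not the
total output of the read, so it limits peak memory on the reader but does
not split the work across executors.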
Sorry, I just realized I've made a mistake. A BoundedSource in some runners
may not have the same "fits in memory" limitation as DoFns, so in that
sense you're right - if it were done as a BoundedSource, perhaps it would
work better in your case, even if it didn't read things in parallel.
Hi,

JdbcIO is basically a DoFn, so it could load everything on a single
executor (there's no obvious way to split). Is that what you mean?

Regards
JB

On 24 Oct 2019 15:26, Jozef Vilcek wrote:

Hi,

I am in a need to read a big-ish data set via JdbcIO. This forced me to
bump up memory for my executor (right now