On Tue, Oct 29, 2019 at 10:04 AM Ryan Skraba wrote:
I didn't get a chance to try this out -- it sounds like a bug with the
SparkRunner, if you've tested it with FlinkRunner and it succeeded.
From your description, it should be reproducible by reading any large
database table with the SparkRunner where the entire dataset is
greater than the memory…
I cannot find anything in the docs about the expected behavior of a DoFn
emitting an arbitrarily large number of elements from one processElement().
I wonder if the SparkRunner behavior is a bug or just a difference (and a
disadvantage in this case) in execution, more in the territory of runner
capability matrix differences.
Also, i…
Typo in my previous message. I meant to say: JDBC is `not` the main data
set, just metadata.
On Sun, Oct 27, 2019 at 6:00 PM Jozef Vilcek wrote:
The result of my query can fit in memory if I use a 12GB heap per Spark
executor. This makes the job quite inefficient, as a past JDBC load job runs
fine with a 4GB heap to do the main heavy lifting - JDBC is the main data
set, just metadata.
I just ran the same JdbcIO read code on the Spark and Flink runners…
One more thing to try -- depending on your pipeline, you can disable
the "auto-reshuffle" of JdbcIO.Read by setting
withOutputParallelization(false).
This is particularly useful if (1) you do aggressive and cheap
filtering immediately after the read or (2) you do your own
repartitioning action like…
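For illustration, here is a minimal sketch of a read configured that way (my
own untested example, not code from this thread; the driver, connection
string, query, and row type are placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class JdbcReadNoReshuffle {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(
            JdbcIO.<String>read()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver", "jdbc:postgresql://host:5432/mydb"))
                .withQuery("SELECT name FROM big_table")
                // Pull rows from the cursor in smaller batches than the 50K default.
                .withFetchSize(1000)
                // Skip the reshuffle that JdbcIO.Read inserts after the read by default.
                .withOutputParallelization(false)
                .withRowMapper(rs -> rs.getString(1))
                .withCoder(StringUtf8Coder.of()));

        p.run().waitUntilFinish();
      }
    }

Without the reshuffle, whatever follows the read stays fused to the single
reading bundle, which is exactly why it helps when the next step filters
aggressively and cheaply.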
Yeah - in this case your primary option is to use JdbcIO.readAll() and shard
your query, as suggested above.
Alternative hypothesis: is the result set of your query actually big enough
that it *shouldn't* fit in memory? Or could it be a matter of inefficient
storage of its elements? Could you brie…
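As a rough sketch of the readAll() sharding suggested above (again untested
and illustrative; the key ranges, table, and row type are invented, and in
practice you would derive the ranges from the table's min/max key):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.KV;

    public class JdbcShardedRead {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("KeyRanges",
                Create.of(
                    KV.of(0L, 1000000L),
                    KV.of(1000000L, 2000000L),
                    KV.of(2000000L, 3000000L)))
            .apply("ShardedJdbcRead",
                JdbcIO.<KV<Long, Long>, String>readAll()
                    .withDataSourceConfiguration(
                        JdbcIO.DataSourceConfiguration.create(
                            "org.postgresql.Driver", "jdbc:postgresql://host:5432/mydb"))
                    // Each input element reads one slice of the table, so no
                    // single worker has to hold the whole result set.
                    .withQuery("SELECT name FROM big_table WHERE id >= ? AND id < ?")
                    .withParameterSetter((range, stmt) -> {
                      stmt.setLong(1, range.getKey());
                      stmt.setLong(2, range.getValue());
                    })
                    .withRowMapper(rs -> rs.getString(1))
                    .withCoder(StringUtf8Coder.of()));

        p.run().waitUntilFinish();
      }
    }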
I agree I might have been too quick to claim that DoFn output needs to fit
in memory. Actually, I am not sure what the Beam model says on this matter
and what the output managers of particular runners do about it.
But SparkRunner definitely has an issue here. I did try setting a small
`fetchSize` for JdbcIO as well as changing `…
Jozef, do you have any NPE stacktrace to share?
Hello!
If I remember correctly -- the JdbcIO will use *one* DoFn instance to
read all of the rows, but that instance is not required to hold all of
the rows in memory.
The fetch size will, however, read 50K rows at a time by default, and
those will all be held in memory in that single worker until…
Sorry, I just realized I've made a mistake. BoundedSource in some runners
may not have the same "fits in memory" limitation as DoFns, so in that
sense you're right - if it was done as a BoundedSource, perhaps it would
work better in your case, even if it didn't read things in parallel.
Hi Jozef,
JdbcIO per se does not require the result set to fit in memory. The issues
come from the limitations of the context in which it runs:
- It indeed uses a DoFn to emit results; a DoFn is in general allowed to
emit an unbounded number of results that don't necessarily have to fit in
memory…
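To make that model point concrete, here is a tiny sketch of my own (not code
from this thread) of a DoFn whose single processElement() emits many
elements; each output is handed to the runner one at a time, so the DoFn
itself never buffers the full result, and whether the runner buffers it
downstream is runner-specific:

    import org.apache.beam.sdk.transforms.DoFn;

    // Emits `count` longs for every input element.
    public class ExplodeFn extends DoFn<Long, Long> {
      private final long count;

      public ExplodeFn(long count) {
        this.count = count;
      }

      @ProcessElement
      public void processElement(@Element Long base, OutputReceiver<Long> out) {
        // Each output() call streams one element to the runner's output
        // manager; nothing in the DoFn accumulates the full output.
        for (long i = 0; i < count; i++) {
          out.output(base + i);
        }
      }
    }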
Hi
JdbcIO is basically a DoFn, so it could load everything on a single executor
(there's no obvious way to split). Is that what you mean?
Regards
JB
On 24 Oct 2019 15:26, Jozef Vilcek wrote:
Hi,
I need to read a big-ish data set via JdbcIO. This forced me to bump up
memory for my executor (right now using SparkRunner). It seems that JdbcIO
has a requirement to fit all data in memory, as it is using a DoFn to
unfold the query into a list of elements.
BoundedSource would not face the need…