SplittableDoFn-based source doesn't efficiently scale up in Dataflow

Claire McGinty Tue, 03 May 2022 10:38:49 -0700

Hi Beam users,

I'm looking for input on one of our IOs that we recently migrated
<https://github.com/spotify/scio/pull/4260> to SplittableDoFn. When running
in Dataflow we saw performance gains in every aspect (vCPU hours, total
memory time) except for total elapsed time: the SplittableDoFn
implementation took 1.5x as long as the previous implementation to read
about 900GB of Parquet files.

It seems like the issue is that it isn't scaling up as much as the old
BoundedSource version. I ran the SplittableDoFn implementation a couple of
times to be sure, and it reliably scaled up to only 30%-50% of the maximum
number of workers that the old version reached. Both implementations of
this IO have the same base level of "splittability" (Parquet row groups),
so I'm not sure what the issue could be.
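
To illustrate what row-group-level splittability means here, a minimal
sketch in plain Python. This is not the scio or Beam code; the helper name
`split_row_groups` and the even-split policy are assumptions for
illustration only. The row groups of a file are treated as a half-open
index range, which can be divided into at most one sub-range per row group:

```python
from typing import List, Tuple


def split_row_groups(num_row_groups: int, desired_splits: int) -> List[Tuple[int, int]]:
    """Divide [0, num_row_groups) into contiguous half-open ranges.

    Hypothetical helper: a row group is the smallest unit of work, so the
    number of ranges never exceeds the number of row groups.
    """
    if num_row_groups <= 0 or desired_splits <= 0:
        return []
    splits = min(desired_splits, num_row_groups)
    base, extra = divmod(num_row_groups, splits)
    ranges = []
    start = 0
    for i in range(splits):
        # The first `extra` ranges take one additional row group each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

Under this assumption, asking for more splits than there are row groups
just yields one range per row group, so both implementations would share
the same upper bound on parallelism per file.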

I saw that an older user@ thread suggested using Dataflow Runner V2 as a
mitigation. I retried my job using Dataflow Prime and saw significant
improvement, but we're not able to migrate our entire fleet to V2 at this
time.

Is there any workaround for Dataflow Runner V1 to improve the scale-up for
SplittableDoFn sources?

Thanks!
Claire
