Hi Steve,

I think the major performance regression comes from
OutputAndTimeBoundedSplittableProcessElementInvoker[1], which will
checkpoint the DoFn based on time/output limit and use timers/state to
reschedule works.

[1]
https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/OutputAndTimeBoundedSplittableProcessElementInvoker.java

On Thu, Dec 3, 2020 at 9:40 AM Steve Niemitz <[email protected]> wrote:

> I have a pipeline that reads from pubsub, does some aggregation, and
> writes to various places.  Previously, in older versions of beam, when
> running this in the DirectRunner, messages would go through the pipeline
> almost instantly, making it very easy to debug locally, etc.
>
> However, after upgrading to beam 2.25, I noticed that it could take on the
> order of 5-10 minutes for messages to get from the pubsub read step to the
> next step in the pipeline (deserializing them, etc).  The subscription
> being read from has on the order of 100,000 elements/sec arriving in it.
>
> Setting --experiments=use_deprecated_read fixes it, and makes the pipeline
> behave as it did before.
>
> It seems like the SDF implementation in the DirectRunner here is causing
> some kind of issue, either buffering a very large amount of data before
> emitting it in a bundle, or something else.  Has anyone else run into this?
>

Reply via email to