The key in GroupIntoBatches is actually not semantically meaningful, and
for a batch pipeline the use of state/timers is not needed either. If all
you need to do is batch elements into groups of (at most) N, you can write
a DoFn that collects things in its process method and emits them when the
bat
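A minimal sketch of the buffering approach described above. The message is cut off, so the flush condition (emit when the buffer reaches N, then flush the remainder at the end of the bundle) is an assumption. The names `BatchElementsSketch` and `run_bundle` are hypothetical. The class is plain Python so it runs without Beam installed; in a real pipeline it would subclass `apache_beam.DoFn`, whose lifecycle hooks it mirrors.

```python
class BatchElementsSketch:
    """Plain-Python stand-in for a batching DoFn (hypothetical name).

    The method names mirror the DoFn lifecycle hooks
    start_bundle / process / finish_bundle.
    """

    def __init__(self, max_batch_size):
        self._max_batch_size = max_batch_size
        self._buffer = []

    def start_bundle(self):
        # Start every bundle with an empty buffer.
        self._buffer = []

    def process(self, element):
        # Collect the element; emit a batch once the buffer reaches the limit.
        self._buffer.append(element)
        if len(self._buffer) >= self._max_batch_size:
            batch, self._buffer = self._buffer, []
            yield batch

    def finish_bundle(self):
        # Flush whatever is left so no elements are dropped. In a real Beam
        # DoFn, output from finish_bundle has to be wrapped in a WindowedValue.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            yield batch


def run_bundle(fn, elements):
    """Drive the hooks by hand, roughly as a runner would for one bundle."""
    fn.start_bundle()
    out = []
    for element in elements:
        out.extend(fn.process(element))
    out.extend(fn.finish_bundle())
    return out


batches = run_bundle(BatchElementsSketch(3), range(7))
```

No key, state, or timer is involved, so a DoFn along these lines would not need the grouping step that a stateful transform requires. Batches can be smaller than N at bundle boundaries.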
Understood, thanks for the clarification, I'll need to look more in-depth
at my pipeline code then. I'm definitely observing that all steps
downstream from the Stateful step in my pipeline do not start until steps
upstream of the Stateful step are fully completed. The Stateful step is a
RateLimit
The GbkBeforeStatefulParDo is an implementation detail used to send all
elements with the same key to the same worker (so that they can share
state, which is itself partitioned by worker). This does cause a global
barrier in batch pipelines.
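As a rough illustration of why a group-by-key acts as a barrier, here is a toy model in plain Python. It is not Dataflow's implementation; `read_source`, `group_by_key`, and `stateful_step` are hypothetical stand-ins for the upstream step, the inserted GBK, and the stateful step.

```python
from collections import defaultdict


def read_source(elements, log):
    # Upstream step: record each element as it is produced.
    for key, value in elements:
        log.append(("read", key, value))
        yield key, value


def group_by_key(pairs):
    # Toy group-by-key: every pair is consumed before any group is emitted,
    # because a later pair could still belong to an earlier key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    yield from groups.items()


def stateful_step(grouped, log):
    # Downstream step: sees one key at a time with all of its values.
    for key, values in grouped:
        log.append(("process", key, list(values)))
        yield key, sum(values)


log = []
data = [("a", 1), ("b", 2), ("a", 3)]
results = list(stateful_step(group_by_key(read_source(data, log)), log))
```

In this toy run, every `read` entry in `log` precedes the first `process` entry, even though the steps are chained lazily. That matches the behaviour reported in the thread, where downstream steps do not start until the upstream steps have finished.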
On Thu, May 25, 2023 at 2:15 PM Evan Galpin wrote:
> H
Hi all,
I'm running into a scenario where I feel that Dataflow Overrides
(specifically BatchStatefulParDoOverrides.GbkBeforeStatefulParDo) are
unnecessarily causing a batch pipeline to "pause" throughput, since a GBK
needs to have processed all the data in a window before it can output.
Is it str
let's goo
On Thu, May 25, 2023 at 12:49 PM Carolina Escobar
wrote:
> *Get to know our speakers!*
>
> *Take a quick peek at our program:*
>
> - *Beam IO: CDAP and SparkReceiver IO Connectors Overview*
>   Alex Kosolapov and Elizaveta Lomteva give an overview of a Beam IO
*Get to know our speakers!*
*Take a quick peek at our program:*
- *Beam IO: CDAP and SparkReceiver IO Connectors Overview*
Alex Kosolapov and Elizaveta Lomteva give an overview of a Beam IO
development process and practical insights gained from experience
developing CDAP IO and