Splitting is part of the issue.
Other example issues are:
* "sources" that feed data into the pipeline are not required to
produce records in time order.
* timers can hold the output watermark and produce records that are
out of order with respect to time.
All of this time ordering has a cost to perfo
Thanks Reuven for the input and Wang for CC'ing to Reuven.
Generally you should not rely on PCollection being ordered
Is it because Beam splits a PCollection into multiple input splits and tries
to process them as efficiently as possible without considering time order?
This one is very confusing as I've b
As for the question of writing tests in the face of non-determinism,
you should look into TestStream. MyStatefulDoFn still needs to be
updated to not assume an ordering. (This can be done by setting timers
that provide guarantees that (modulo late data) one has seen all data
up to a certain timesta
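
To illustrate the timer idea, here is a minimal sketch in plain Java, not
actual Beam code. It buffers elements in whatever order they arrive and
releases them sorted by timestamp only once a watermark has passed them.
The names SortingBuffer, Event and flushUpTo are hypothetical; in a real
stateful DoFn the buffer would presumably live in a state cell and the
flush would run in an event-time timer callback.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Plain-Java simulation (assumption: stands in for state + timers, not Beam API).
class SortingBuffer {
    // A timestamped value; hypothetical type used only for this sketch.
    static final class Event {
        final long timestamp;
        final String value;

        Event(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    private final List<Event> buffer = new ArrayList<>();

    // Called for every arriving element, in whatever order it shows up.
    void add(Event e) {
        buffer.add(e);
    }

    // Stands in for an event-time timer firing at `watermark`: removes every
    // buffered element with timestamp <= watermark and returns those elements
    // sorted by timestamp. Elements with later timestamps stay buffered.
    List<Event> flushUpTo(long watermark) {
        List<Event> ready = new ArrayList<>();
        Iterator<Event> it = buffer.iterator();
        while (it.hasNext()) {
            Event e = it.next();
            if (e.timestamp <= watermark) {
                ready.add(e);
                it.remove();
            }
        }
        ready.sort(Comparator.comparingLong((Event e) -> e.timestamp));
        return ready;
    }
}
```

The downstream logic then only ever sees sorted batches, so it no longer
depends on arrival order. Late data (elements arriving after their
timestamp was already flushed) is not handled in this sketch.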
Generally you should not rely on PCollection being ordered, though there
have been discussions about adding some time-ordering semantics.
On Sun, Aug 23, 2020 at 9:06 PM Rui Wang wrote:
> Current Beam model does not guarantee an ordering after a GBK (i.e.
> Combine.perKey() in your). So you ca
The current Beam model does not guarantee an ordering after a GBK (i.e.
Combine.perKey() in your pipeline). So you cannot expect that the C step
sees elements in a specific order.
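
One way to cope with this is to make the downstream logic itself
insensitive to arrival order. Below is a rough sketch in plain Java (not
Beam's CombineFn API; the class and method names are hypothetical). It
compares a reduction that picks the value with the greatest timestamp
against one that takes whatever arrived last. It assumes a non-empty list
and distinct timestamps.

```java
import java.util.List;

// Plain-Java illustration (assumption: not Beam code). Each element is a
// two-slot array {timestamp, value}.
class OrderInsensitiveCombine {
    // Picks the value carrying the greatest timestamp. With distinct
    // timestamps the result is the same for any arrival order.
    static long latestByTimestamp(List<long[]> elements) {
        long bestTimestamp = Long.MIN_VALUE;
        long bestValue = 0;
        for (long[] e : elements) {
            if (e[0] > bestTimestamp) {
                bestTimestamp = e[0];
                bestValue = e[1];
            }
        }
        return bestValue;
    }

    // Order-dependent: returns the value of whichever element arrived last,
    // so the result changes when the input is reordered.
    static long lastSeen(List<long[]> elements) {
        return elements.get(elements.size() - 1)[1];
    }
}
```

With the same three elements fed in two different orders,
latestByTimestamp returns the same value both times, while lastSeen
returns whatever happened to be at the end of the list.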
As I recall, the Dataflow runner has very limited ordering support.
Hi +Reuven Lax, can you share your insights about it?
-Rui
Hi,
My Beam pipeline is designed to work with an unbounded source KafkaIO.
It roughly looks like below:
p.apply(KafkaIO.read() ...) // (A-1)
.apply(WithKeys.of(...).withKeyType(...))
.apply(Window.into(FixedWindows.of(...)))
.apply(Combine.perKey(...))