Hello,

I'm encountering an unexpected behavior in one of our pipeline, I hope you
might help me make sense of it.

This is a streaming pipeline implemented with the Java SDK (2.10) and
running on Dataflow.

* In the pipeline a FixedWindow is applied to the data with an allowed
lateness of a week (doesn't really matter).
* The windowed PCollection is then used in 2 separate branches: one regular
GroupByKey and and a second one which is Transform which makes have use of
State and Timers.
* In the stateful transform, incoming elements are added to a BagState
container until some condition on the elements is reached, and the elements
in the BagState are dispatched downstream. A timer makes sure that the
BagState is dispatched even if the condition is not met after some timeout
has expired.

Occasionally very late data enters the pipeline, with timestamps older than
the allowed lateness.
In these cases, the GroupByKey transform behaves as expected, and there
aren't any panes with the late data output downstream.
In the stateful transform, on the other hand, I see that these late
elements are processed and added to the BagState, but at a later point in
time (e.g. when the timer is triggered) the elements "disappear" from the
BagState and are no longer observable. This can happen multiple times, i.e.
late elements added to the BagState and then disappearing at a later time.

Is this an expected behavior? Is there some sort of late data "garbage
collection" which eventually removes late elements?

Thank you for your help, I hope my description is clear enough.

Regards,
Amit.

Reply via email to