See https://s.apache.org/beam-lateness for detailed rationale about where
the holds end up. It is a pretty massive read, but at this point I think
even the details there are relevant.
TL;DR is that the hold policy:
- never makes on-time data late
- never makes non-droppable data droppable
- ne
Hi Kenn,
I agree that for on-time elements, the hold has to respect the output
timestamps. For already late elements, it should be possible to
calculate the output timestamp independently. Currently, the output
watermark is *always* the value of the watermark hold, which might be
inappropriate for
Quick reply about one top-level thing: output timestamps and watermark
holds are closely related. A hold is precisely reserving the right to
output at a particular time. The part that is unintuitive is that these are
ever different. That is, really, a hack to allow downstream processing to
proceed
Minor self-correction - in property c) it MIGHT be possible to
update the output watermark to a time greater than the input watermark, _as long
as any future element cannot be assigned a timestamp that is less than the
output watermark_. That seems to be the case only for
TimestampCombiner.END_OF_WINDO
Hi Reuven,
the asynchronicity of the watermark update is what I was missing - it is
what relates the watermark hold to the element output timestamp. On the other
hand, we have some invariants that have to hold, namely:
a) element arriving as non-late MUST NOT be changed to late
b) element arriving as
This does seem non-intuitive, though I'm not sure what the best approach is.
The problem with using currentOutputWatermark as the output timestamp is
that Beam does not define watermark advancement to be synchronous, and at
least the Dataflow runner calculates watermarks completely independently of
Hi,
what seems the most "surprising" to me is that we are using
TimestampCombiners to actually do two (orthogonal) things:
a) calculate a watermark hold for a window, so on-time elements
emitted from a pane are not late in downstream processing
b) calculate timestamp of elements in output
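
To make the coupling in a) and b) concrete, here is a rough stand-alone sketch. It is only a simplified model, not the Beam API: CombinerSketch, PaneSketch and fire are hypothetical names invented for illustration, timestamps are plain millisecond longs, and late data is ignored entirely. It only shows that one combined value ends up serving both as the hold and as the output timestamp.

```java
import java.util.List;

/** Hypothetical model of a timestamp combiner; NOT the Beam TimestampCombiner class. */
enum CombinerSketch {
  EARLIEST, LATEST, END_OF_WINDOW;

  /** Combines element timestamps for a window whose maximum timestamp is windowMaxTimestamp. */
  long combine(List<Long> elementTimestamps, long windowMaxTimestamp) {
    switch (this) {
      case EARLIEST:
        // Smallest element timestamp; falls back to the window end if the pane is empty.
        return elementTimestamps.stream().mapToLong(Long::longValue).min().orElse(windowMaxTimestamp);
      case LATEST:
        // Largest element timestamp; falls back to the window end if the pane is empty.
        return elementTimestamps.stream().mapToLong(Long::longValue).max().orElse(windowMaxTimestamp);
      default:
        // END_OF_WINDOW ignores the element timestamps.
        return windowMaxTimestamp;
    }
  }
}

/** Hypothetical result of firing a pane in this simplified model. */
final class PaneSketch {
  final long watermarkHold;    // role a): value the output watermark is held to
  final long outputTimestamp;  // role b): timestamp assigned to the emitted element

  PaneSketch(long watermarkHold, long outputTimestamp) {
    this.watermarkHold = watermarkHold;
    this.outputTimestamp = outputTimestamp;
  }

  /** Assumption of this sketch: both roles are served by the same combined value. */
  static PaneSketch fire(CombinerSketch combiner, List<Long> elementTimestamps, long windowMaxTimestamp) {
    long combined = combiner.combine(elementTimestamps, windowMaxTimestamp);
    return new PaneSketch(combined, combined);
  }
}
```

In this model, picking a different combiner changes the hold and the output timestamp together; there is no way to change one without the other, which is the coupling described above.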
Hi,
I've come across "unexpected" model behaviour when dealing with late data
and custom timestamp combiners. Let's take the following pipeline as an
example:
final PCollection input = ...;
input.apply(
"GlobalWindows",
Window.into(new GlobalWindows())
.triggering(