Re: dealing with late data output timestamps

2020-06-02 Thread Kenneth Knowles
See https://s.apache.org/beam-lateness for detailed rationale about where the holds end up. It is a pretty massive read, but at this point I think even the details there are relevant. TL;DR is that the hold policy: - never makes on time data late - never makes non-droppable data droppable - ne

Re: dealing with late data output timestamps

2020-06-02 Thread Jan Lukavský
Hi Kenn, agree that for on-time elements, the hold has to respect the output timestamps. For already late elements, it should be possible to calculate the output timestamp independently. Currently, the output watermark is *always* the value of watermark hold, which might be inappropriate for

Re: dealing with late data output timestamps

2020-06-01 Thread Kenneth Knowles
Quick reply about one top-level thing: output timestamps and watermark holds are closely related. A hold is precisely reserving the right to output at a particular time. The part that is unintuitive is that these are ever different. That is, really, a hack to allow downstream processing to proceed

Re: dealing with late data output timestamps

2020-05-31 Thread Jan Lukavský
Minor self-correction - in the property c) it MIGHT be possible to update output watermark to time greater than input watermark, _as long as any future element cannot be assigned timestamp that is less than the output watermark_. That seems to be the case only for TimestampCombiner.END_OF_WINDO

Re: dealing with late data output timestamps

2020-05-31 Thread Jan Lukavský
Hi Reuven, the asynchronicity of watermark update is what I was missing - it is what relates watermarkhold with element output timestamp. On the other hand, we have some invariants that have to hold, namely:  a) element arriving as non-late MUST NOT be changed to late  b) element arriving as

Re: dealing with late data output timestamps

2020-05-29 Thread Reuven Lax
This does seem non intuitive, though I'm not sure what the best approach is. The problem with using currentOutputWatermark as the output timestamp is that Beam does not define watermark advancement to be synchronous, and at least the Dataflow runner calculates watermarks completely independent of

Re: dealing with late data output timestamps

2020-05-29 Thread Jan Lukavský
Hi, what seems the most "surprising" to me is that we are using TimestampCombiners to actually do two (orthogonal) things:  a) calculate a watermark hold for a window, so on-time elements emitted from a pane are not late in downstream processing  b) calculate timestamp of elements in output

dealing with late data output timestamps

2020-05-28 Thread David Morávek
Hi, I've came across "unexpected" model behaviour when dealing with late data and custom timestamp combiners. Let's take a following pipeline as an example: final PCollection input = ...; input.apply( "GlobalWindows", Window.into(new GlobalWindows()) .triggering(