That is a really good way to describe my mental model as well.

On Tue, Jan 7, 2020 at 12:20 PM Kenneth Knowles <[email protected]> wrote:
> On Tue, Jan 7, 2020 at 1:39 AM Jan Lukavský <[email protected]> wrote:
>
>> Hi Kenn,
>>
>> I see that my terminology seems not to be 100% aligned with Beam's. I'll work on that. :-)
>>
>> I agree with what you say, and by "late" I mostly meant "droppable" (arriving too late after the watermark).
>>
>> I'm definitely not proposing to get back to something like "out of order" == "late" or anything like that. I'm also aware that a stateful operation is a windowed operation, but the semantics of its windowing differ from those of a GBK. The difference is in how time moves in a GBK and how it moves in a stateful DoFn. Setting aside some details (early triggers, late data triggers), the main difference is that in the GBK case time hops just between window boundaries, while in a stateful DoFn time moves "smoothly" (with each watermark update). Now, this difference raises the question of why the definition of "droppable" data is the same for both types of operations, when there is a difference in how users "perceive" time. As the more generic operation, a stateful DoFn might deserve a more general definition of droppable data, one that degrades naturally to the GBK definition in the presence of "discrete time hops".
>
> I understand what you mean. On the other hand, I encourage thinking of event time spatially, not as time passing. That is a big part of unifying batch/streaming and real-time/archival processing. The event time window is a secondary key that partitions the data (merging windows are slightly more complex). All event time windows exist simultaneously. So for both stateful ParDo and GBK, I find it helpful to consider the perspective where all windows are processed simultaneously / in an arbitrary order, not assuming windows are ordered at all.
> Then you see that GBK and stateful ParDo do not really treat windows / the watermark differently: both of them process a stream of data for each (key, window) pair until the watermark informs them that the stream is expired, then they GC the state associated with that (key, window) pair.
>
> Kenn
>
>> This might have some consequences for how droppable data should be handled in the presence of (early) triggers, because triggering is actually what makes time "hop", so we might arrive at the conclusion that we might actually drop any data that has a timestamp less than "last trigger time + allowed lateness". This looks appealing to me, because IMO it has strong internal logical consistency. Although it is possible that it would drop more data, which is generally undesirable, I admit that.
>>
>> I'm looking for an explanation of why the current approach was chosen instead of the other.
>>
>> Jan
>>
>> On 1/7/20 12:52 AM, Kenneth Knowles wrote:
>>
>> This thread has a lot in it, so I am just top-posting.
>>
>> - Stateful DoFn is a windowed operation; state is per-window. When the window expires, any further inputs are dropped.
>> - "Late" is not synonymous with out-of-order. It doesn't really have an independent meaning.
>> - For a GBK/Combine, "late" means "not included prior to the on-time output", and "droppable" means "arriving after window expiry".
>> - For a stateful DoFn there is no real meaning to "late" except if one is talking about "droppable", which still means "arriving after window expiry". A user may have a special timer where they flip a flag and treat elements after the timer differently.
>>
>> I think the definition of when data is droppable is very simple. We explicitly moved to this definition, away from "out of order == late", because it is more robust and simpler to think about. Users saw lots of confusing behavior when we had the "out of order by allowed lateness == droppable" logic.
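The "spatial" view of event time that Kenn describes can be sketched as a toy model (plain Python, not Beam code; all names below are made up for illustration): every (key, window) pair is an independent stream that lives until the watermark passes the window's expiry (maxTimestamp + allowed lateness), at which point its state is GC'd and further input for that pair is droppable - the same rule for GBK and stateful ParDo.

```python
# Toy model of "all windows exist simultaneously": state is keyed by
# (key, window), and a stream expires when the watermark passes
# maxTimestamp + allowed lateness of its window. Not the Beam API.
from collections import defaultdict

ALLOWED_LATENESS = 0
WINDOW_SIZE = 10

def window_of(ts):
    start = (ts // WINDOW_SIZE) * WINDOW_SIZE
    return (start, start + WINDOW_SIZE)  # half-open [start, end)

def expired(window, watermark):
    # maxTimestamp of a half-open window is end - 1
    return window[1] - 1 + ALLOWED_LATENESS < watermark

def process(events):
    state = defaultdict(list)  # per-(key, window) state, as in stateful ParDo
    watermark = float("-inf")
    dropped = []
    for kind, payload in events:
        if kind == "watermark":
            watermark = max(watermark, payload)
            for kw in [kw for kw in state if expired(kw[1], watermark)]:
                del state[kw]  # stream expired: GC its state
        else:
            key, ts = payload
            window = window_of(ts)
            if expired(window, watermark):
                dropped.append((key, ts))  # droppable: window already expired
            else:
                state[(key, window)].append(ts)
    return state, dropped
```

For example, with the watermark at 12, an element with timestamp 4 lands in the already-expired window [0, 10) and is droppable, regardless of the order in which the (key, window) pairs are processed.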
>> Kenn
>>
>> On Mon, Jan 6, 2020 at 1:42 AM Jan Lukavský <[email protected]> wrote:
>>
>>> > Generally the watermark update can overtake elements, because runners can explicitly ignore late data in the watermark calculation (for good reason - those elements are already late, so no need to hold up the watermark advancing any more).
>>>
>>> This seems not to affect the decision of _not late_ vs. _late_, does it? If an element is late and gets ignored in the watermark calculation (whatever that includes in this context), then the watermark cannot move past elements that were marked as _not late_, and thus nothing can make them _late_.
>>>
>>> > For GBK, on-time data simply means the first pane marked as on time. For state+timers I don't think it makes sense for Beam to define on-time vs. late; rather, I think the user can come up with their own definition depending on their use case. For example, if you are buffering data into BagState and setting a timer to process it, it would be logical to say that any element that was buffered before the timer expired is on time, and any data that showed up after the timer fired is late. This would roughly correspond to what GBK does, and the answer would be very similar to simply comparing against the watermark (as the timers fire when the watermark advances).
>>>
>>> Yes, I'd say that stateful DoFns don't have a (well-defined) concept of a pane, because that is related to the concept of a trigger, and that is a concept of GBK (or windowed operations in general). The only semantic meaning of a window in a stateful DoFn is that it "scopes" state.
>>>
>>> This discussion might have gotten a little off the original question, so I'll try to rephrase it:
>>>
>>> Should a stateful DoFn drop *all* late data, not just data that arrives after window boundary + allowed lateness?
>>> Some arguments why I think it should:
>>>
>>> * in windowed operations (GBK), it is correct to drop data at window boundaries only, because time (as seen by the user) effectively hops only at these discrete time points
>>> * in a stateful DoFn, on the other hand, time moves "smoothly" (yes, with some granularity - millisecond, nanosecond, whatever - and with watermark updates only, but still)
>>> * it could therefore be argued that dropping late data immediately as time (again, from the user's perspective) moves, rather than at some more or less artificial boundary with only little semantic meaning, is consistent with both of the above properties
>>>
>>> The negative side effect of this would be that more data could be dropped, but... isn't this exactly what allowed lateness defines? I don't want to discuss the implications of such a change on user pipelines (and whether we can or cannot make it), just trying to build some theoretical understanding of the problem as a whole. The decision whether any change could / should be made can come afterwards.
>>>
>>> Thanks,
>>> Jan
>>>
>>> On 1/4/20 10:35 PM, Reuven Lax wrote:
>>>
>>> On Sat, Jan 4, 2020 at 12:13 PM Jan Lukavský <[email protected]> wrote:
>>>
>>>> > Yes, but invariants should hold. If I add a ParDo that drops late elements (or, more commonly, diverts the late elements to a different PCollection), then the result of that ParDo should _never_ introduce any more late data. This cannot be guaranteed simply with watermark checks. The ParDo may decide that the element was not late, but by the time it outputs the element the watermark may have advanced, causing the element to actually be late.
>>>>
>>>> This is actually very interesting. The question is - if I decide about lateness based on the output watermark of a PTransform, is it still the case that in downstream operator(s) the element could be changed from "not late" to "late"?
>>>> Provided the output watermark is updated synchronously based on input data (which it should be) and a watermark update cannot "overtake" elements, I think that the downstream decision should not change, so the invariant should hold. Or am I missing something?
>>>
>>> Generally the watermark update can overtake elements, because runners can explicitly ignore late data in the watermark calculation (for good reason - those elements are already late, so no need to hold up the watermark advancing any more).
>>>
>>> For GBK, on-time data simply means the first pane marked as on time. For state+timers I don't think it makes sense for Beam to define on-time vs. late; rather, I think the user can come up with their own definition depending on their use case. For example, if you are buffering data into BagState and setting a timer to process it, it would be logical to say that any element that was buffered before the timer expired is on time, and any data that showed up after the timer fired is late. This would roughly correspond to what GBK does, and the answer would be very similar to simply comparing against the watermark (as the timers fire when the watermark advances).
>>>
>>> Reuven
>>>
>>>> On 1/4/20 8:11 PM, Reuven Lax wrote:
>>>>
>>>> On Sat, Jan 4, 2020 at 11:03 AM Jan Lukavský <[email protected]> wrote:
>>>>
>>>>> On 1/4/20 6:14 PM, Reuven Lax wrote:
>>>>>
>>>>> There is a very good reason not to define lateness directly in terms of the watermark. The model does not make any guarantees that the watermark advances synchronously, and in fact for the Dataflow runner the watermark advances asynchronously (i.e. independently of element processing). This means that simply comparing an element timestamp against the watermark creates a race condition.
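The BagState-plus-timer pattern Reuven describes can be sketched as a toy model (plain Python, not the Beam state/timer API; the class and method names are made up): elements buffered before the timer fires are "on time" by the user's own definition, and anything arriving afterwards is "late".

```python
# Toy model of buffering into BagState and processing on an event-time
# timer. In Beam the timer would fire when the watermark passes its
# timestamp; here on_timer() is called explicitly to stand in for that.
class BufferingFn:
    def __init__(self):
        self.bag = []            # stands in for BagState
        self.timer_fired = False
        self.on_time_output = None
        self.late = []

    def process_element(self, element):
        if self.timer_fired:
            self.late.append(element)   # user-defined "late"
        else:
            self.bag.append(element)    # user-defined "on time"

    def on_timer(self):
        # Roughly where GBK would emit its on-time pane.
        self.timer_fired = True
        self.on_time_output = list(self.bag)
        self.bag.clear()

fn = BufferingFn()
fn.process_element("a")
fn.process_element("b")
fn.on_timer()            # watermark passed the timer's timestamp
fn.process_element("c")  # arrives after the timer fired: "late"
print(fn.on_time_output, fn.late)  # ['a', 'b'] ['c']
```

The flag flip in on_timer() is what makes the on-time/late distinction deterministic: it depends only on whether the element arrived before or after the timer, never on a direct watermark read.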
>>>>> There are cases where the answer could change depending on exactly when you examine the watermark, and if you examine it again while processing the same bundle you might come to a different conclusion about lateness.
>>>>>
>>>>> Due to the monotonicity of the watermark, I don't think that asynchronous updates of the watermark can change the answer from "late" to "not late". That seems fine to me.
>>>>
>>>> It's the other way around. You check to see whether an element is late and the answer is "not late." An instant later the answer changes to "late." This does cause many problems, and is why this was changed.
>>>>
>>>>> This non-determinism is undesirable when considering lateness, as it can break many invariants that users may rely on (e.g. I could write a ParDo that filtered all late data, yet still find late data showing up downstream of the ParDo, which would be very surprising). For that reason, the SDK always marks things as late based on deterministic signals, e.g. for a triggered GBK everything in the first post-watermark pane is marked as on time (no matter what the watermark is) and everything in subsequent panes is marked as late.
>>>>>
>>>>> Dropping latecomers will always be non-deterministic, that is certain. This is true even in the case where the watermark is updated synchronously with element processing, due to shuffling and varying (random) differences between processing and event time in upstream operator(s). The question was only whether a latecomer should be dropped only at window boundaries (which is a sort of artificial time boundary), or right away when spotted (in stateful DoFns only).
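The race Reuven points out can be illustrated with a toy sketch (plain Python, not Beam code; names are made up): a direct "timestamp < watermark" check can flip from "not late" to "late" between two reads, while a pane-index-based marking, once assigned, never changes.

```python
# Toy illustration of why "ts < watermark" is racy when the watermark
# advances asynchronously, and why pane-based marking is deterministic.
class AsyncWatermark:
    def __init__(self):
        self.value = 0
    def advance(self, to):
        # Watermarks are monotone: they only move forward.
        self.value = max(self.value, to)

wm = AsyncWatermark()
element_ts = 5

first_check = element_ts < wm.value    # False: judged "not late"
wm.advance(10)                         # background watermark update
second_check = element_ts < wm.value   # True: the answer flipped to "late"

# Deterministic alternative: the pane the element was assigned to.
pane_index = 0                         # first (on-time) pane
is_late = pane_index > 0               # stays False forever

print(first_check, second_check, is_late)  # False True False
```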
>>>>> Another question would be whether latecomers should be dropped based on the input or the output watermark; dropping based on the output watermark even seems to be stable, in the sense that all downstream operators should come to the same conclusion (this is a bit of a speculation).
>>>>
>>>> Yes, but invariants should hold. If I add a ParDo that drops late elements (or, more commonly, diverts the late elements to a different PCollection), then the result of that ParDo should _never_ introduce any more late data. This cannot be guaranteed simply with watermark checks. The ParDo may decide that the element was not late, but by the time it outputs the element the watermark may have advanced, causing the element to actually be late.
>>>>
>>>> In practice this is important. An early version of Dataflow (pre-Beam) implemented lateness by comparing against the watermark, and it caused no end of trouble for users.
>>>>
>>>>> FYI - this is also the reason why Beam does not currently give users direct access to the watermark. Its asynchronous nature can be very confusing, and often results in users writing bugs in their pipelines. We decided instead to expose easier-to-reason-about signals such as timers (triggered by the watermark), windows, and lateness.
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Sat, Jan 4, 2020 at 1:15 AM Jan Lukavský <[email protected]> wrote:
>>>>>
>>>>>> I realized the problem. I misinterpreted the LateDataDroppingDoFnRunner. It doesn't drop *all* late data (arriving after watermark - allowed lateness), but only data that arrives after maxTimestamp + allowedLateness of their respective windows.
>>>>>>
>>>>>> A stateful DoFn can run on the global window (which was the case in my tests), and there is no dropping then.
>>>>>> Two questions arise then:
>>>>>>
>>>>>> a) does it mean that this is one more argument for moving this logic to StatefulDoFnRunner? StatefulDoFnRunner performs state cleanup at window GC time, so without LateDataDroppingDoFnRunner, late data will see empty state and will produce wrong results.
>>>>>>
>>>>>> b) is this behavior generally intentional and correct? Windows and triggers are (in my point of view) features of GBK, not of stateful DoFn. Stateful DoFn is a low-level primitive, which can be viewed as operating on "instant" windows, and should then probably be defined as dropping every single element arriving after allowed lateness. This might relate to the question of whether operations should be built bottom-up, from the most primitive and generic ones to more specific ones - that is, GBK implemented on top of stateful DoFn and not vice versa.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>> On 1/4/20 1:03 AM, Steve Niemitz wrote:
>>>>>>
>>>>>> I do agree that the direct runner doesn't drop late data arriving at a stateful DoFn (I just tested as well).
>>>>>>
>>>>>> However, I believe this is consistent with other runners. I'm fairly certain (at least last time I checked) that at least Dataflow will also only drop late data at GBK operations, and NOT at stateful DoFns. Whether or not this is intentional is debatable; however, without being able to inspect the watermark inside the stateful DoFn, it'd be very difficult to do anything useful with late data.
>>>>>>
>>>>>> On Fri, Jan 3, 2020 at 5:47 PM Jan Lukavský <[email protected]> wrote:
>>>>>>
>>>>>>> I did write a test that checked whether data is dropped in a plain stateful DoFn. I did this as part of validating that PR [1] didn't drop more data when using @RequiresTimeSortedInput than it would without this annotation.
>>>>>>> This test failed and I didn't commit it yet.
>>>>>>>
>>>>>>> The test was basically as follows:
>>>>>>>
>>>>>>> - use TestStream to generate three elements with timestamps 2, 1 and 0
>>>>>>> - between the elements with timestamps 1 and 0, move the watermark to 1
>>>>>>> - use allowed lateness of zero
>>>>>>> - use a stateful DoFn that just emits arbitrary data for each input element
>>>>>>> - use Count.globally to count outputs
>>>>>>>
>>>>>>> The outcome was that the stateful DoFn using @RequiresTimeSortedInput output 2 elements; without the annotation it was 3 elements. I think the correct result would be 2 elements in this case. The difference is caused by the annotation (currently) having its own logic for dropping data, which could be removed if we agree that the data should be dropped in all cases.
>>>>>>>
>>>>>>> On 1/3/20 11:23 PM, Kenneth Knowles wrote:
>>>>>>>
>>>>>>> Did you write such a @Category(ValidatesRunner.class) test? I believe the Java direct runner does drop late data, for both GBK and stateful ParDo.
>>>>>>>
>>>>>>> Stateful ParDo is implemented on top of GBK:
>>>>>>> https://github.com/apache/beam/blob/64262a61402fad67d9ad8a66eaf6322593d3b5dc/runners/direct-java/src/main/java/org/apache/beam/runners/direct/ParDoMultiOverrideFactory.java#L172
>>>>>>>
>>>>>>> And GroupByKey, via DirectGroupByKey, via DirectGroupAlsoByWindow, does drop late data:
>>>>>>> https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/runners/direct-java/src/main/java/org/apache/beam/runners/direct/GroupAlsoByWindowEvaluatorFactory.java#L220
>>>>>>>
>>>>>>> I'm not sure why it has its own code, since ReduceFnRunner also drops late data, and it does use ReduceFnRunner (the same code path all Java-based runners use).
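The test steps above can be sketched as a runner-independent toy simulation (plain Python, not actual TestStream code) under the semantics Jan proposes, i.e. an element is dropped as soon as its timestamp plus allowed lateness falls behind the watermark:

```python
# Toy simulation of the described test: elements with timestamps 2, 1,
# a watermark advance to 1, then an element with timestamp 0, with
# allowed lateness of zero. Not Beam code.
def run(events, allowed_lateness=0):
    watermark = float("-inf")
    outputs = []
    for kind, value in events:
        if kind == "watermark":
            watermark = max(watermark, value)  # watermarks are monotone
        else:  # element with event timestamp `value`
            if value + allowed_lateness < watermark:
                continue  # droppable: too far behind the watermark
            outputs.append(value)
    return outputs

events = [("element", 2), ("element", 1), ("watermark", 1), ("element", 0)]
print(run(events))  # [2, 1] - the element at timestamp 0 is dropped
```

Under these semantics the count is 2, matching the @RequiresTimeSortedInput behavior described above; a runner that only drops at window expiry (global window, so never) would output all 3.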
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Fri, Jan 3, 2020 at 1:02 PM Jan Lukavský <[email protected]> wrote:
>>>>>>>
>>>>>>>> Yes, the non-reliability of late data dropping in a distributed runner is understood. But this is exactly where DirectRunner can play its role, because only there is it actually possible to emulate and test specific watermark conditions. A question regarding this for the Java DirectRunner - should we completely drop LateDataDroppingDoFnRunner and delegate the late data dropping to StatefulDoFnRunner? Seems logical to me, as if we agree that late data should always be dropped, then there would be no "valid" use of StatefulDoFnRunner without the late data dropping functionality.
>>>>>>>>
>>>>>>>> On 1/3/20 9:32 PM, Robert Bradshaw wrote:
>>>>>>>>
>>>>>>>> I agree; in fact we just recently enabled late data dropping in the direct runner in Python to be able to develop better tests for Dataflow.
>>>>>>>>
>>>>>>>> It should be noted, however, that in a distributed runner (absent the quiescence of TestStream) one can't *count* on late data being dropped at a certain point, and in fact (due to delays in fully propagating the watermark) late data can even become on-time, so the promises about what happens behind the watermark are necessarily a bit loose.
>>>>>>>>
>>>>>>>> On Fri, Jan 3, 2020 at 9:15 AM Luke Cwik <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I agree that the DirectRunner should drop late data. Late data dropping is optional, but the DirectRunner is used by many for testing, and we should have the same behaviour they would get on other runners, or users may be surprised.
>>>>>>>>> On Fri, Jan 3, 2020 at 3:33 AM Jan Lukavský <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I just found out that DirectRunner is apparently not using LateDataDroppingDoFnRunner, which means that it doesn't drop late data in cases where there is no GBK operation involved (dropping in GBK seems to be correct). There is apparently no @Category(ValidatesRunner) test for that behavior (because DirectRunner would fail it), so the question is - should late data dropping be considered part of the model (of which DirectRunner should be a canonical implementation), and therefore should it be fixed there, or is late data dropping an optional feature of a runner?
>>>>>>>>>>
>>>>>>>>>> I'm strongly in favor of the first option, and I think it is likely that all real-world runners would adhere to it (I didn't check, though).
>>>>>>>>>>
>>>>>>>>>> Opinions?
>>>>>>>>>>
>>>>>>>>>> Jan
