I agree. In fact, we recently enabled late data dropping in the Python
direct runner so that we can develop better tests for Dataflow.

It should be noted, however, that in a distributed runner (absent the
quiescence of TestStream) one can't *count* on late data being dropped
at a certain point. In fact, because of delays in fully propagating the
watermark, late data can even become on-time, so the promises about what
happens behind the watermark are necessarily a bit loose.
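
To make that looseness concrete, here is a minimal sketch in plain Python.
It is not Beam's actual code: Window, is_droppable, and the integer
timestamps are names made up for illustration, and the rule it assumes is
that an element is droppable once the input watermark has passed the end of
its window plus the allowed lateness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical fixed window covering [start, end) in integer time units."""
    start: int
    end: int

    def max_timestamp(self) -> int:
        # Last timestamp that still belongs to this window.
        return self.end - 1


def is_droppable(window: Window, input_watermark: int,
                 allowed_lateness: int = 0) -> bool:
    """Assumed rule: the window has expired once the watermark is past
    its max timestamp plus the allowed lateness."""
    return window.max_timestamp() + allowed_lateness < input_watermark


window = Window(start=0, end=10)

# A worker whose watermark has advanced to 15 drops the element.
dropped_with_fresh_watermark = is_droppable(window, input_watermark=15)

# A worker that has not yet seen the watermark advance (still at 5)
# keeps the very same element, so it is effectively on-time there.
dropped_with_stale_watermark = is_droppable(window, input_watermark=5)

# Allowed lateness also keeps the element alive at watermark 15.
dropped_with_lateness = is_droppable(window, input_watermark=15,
                                     allowed_lateness=10)
```

The same element is dropped or kept depending only on which view of the
watermark the worker happens to have, which is why a distributed runner
can't promise a precise drop point.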

On Fri, Jan 3, 2020 at 9:15 AM Luke Cwik <lc...@google.com> wrote:

> I agree that the DirectRunner should drop late data. Late data dropping is
> optional but the DirectRunner is used by many for testing and we should
> have the same behaviour they would get on other runners or users may be
> surprised.
>
> On Fri, Jan 3, 2020 at 3:33 AM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi,
>>
>> I just found out that DirectRunner is apparently not using
>> LateDataDroppingDoFnRunner, which means that it doesn't drop late data
>> in cases where there is no GBK operation involved (dropping in GBK seems
>> to be correct). There is apparently no @Category(ValidatesRunner) test
>> for that behavior (because DirectRunner would fail it), so the question
>> is - should late data dropping be considered part of the model (of which
>> DirectRunner should be a canonical implementation) and therefore that
>> should be fixed there, or is the late data dropping an optional feature
>> of a runner?
>>
>> I'm strongly in favor of the first option, and I think it is likely that
>> all real-world runners would probably adhere to that (I didn't check
>> that, though).
>>
>> Opinions?
>>
>>   Jan
>>
>>
