I agree. In fact, we just recently enabled late data dropping in the Python direct runner so that we can develop better tests for Dataflow.
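For concreteness, here is a minimal sketch of the drop rule under discussion, assuming the usual definition that an element is droppable once its window has expired, i.e. the window's maximum timestamp plus the allowed lateness is behind the input watermark. The names `Window`, `is_droppable` and `drop_late` are hypothetical and purely illustrative; this is plain Python, not the code of any runner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    # Hypothetical stand-in for a window; only its max timestamp matters here.
    max_timestamp: int


def is_droppable(window, allowed_lateness, input_watermark):
    # Assumption: data is droppable once the window has expired, i.e.
    # max_timestamp + allowed_lateness is strictly before the watermark.
    return window.max_timestamp + allowed_lateness < input_watermark


def drop_late(elements, allowed_lateness, input_watermark):
    # Keep only (value, window) pairs whose window has not yet expired.
    return [
        (value, window)
        for value, window in elements
        if not is_droppable(window, allowed_lateness, input_watermark)
    ]
```

Note that the outcome depends on the watermark value the step observes at the moment the element arrives, which is the quantity a distributed runner cannot pin down exactly.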
It should be noted, however, that in a distributed runner (absent the quiescence of TestStream) one can't *count* on late data being dropped at a certain point. In fact, due to delays in fully propagating the watermark, late data can even become on-time, so the promises about what happens behind the watermark are necessarily a bit loose.

On Fri, Jan 3, 2020 at 9:15 AM Luke Cwik <lc...@google.com> wrote:

> I agree that the DirectRunner should drop late data. Late data dropping is
> optional but the DirectRunner is used by many for testing and we should
> have the same behaviour they would get on other runners or users may be
> surprised.
>
> On Fri, Jan 3, 2020 at 3:33 AM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi,
>>
>> I just found out that DirectRunner is apparently not using
>> LateDataDroppingDoFnRunner, which means that it doesn't drop late data
>> in cases where there is no GBK operation involved (dropping in GBK seems
>> to be correct). There is apparently no @Category(ValidatesRunner) test
>> for that behavior (because DirectRunner would fail it), so the question
>> is - should late data dropping be considered part of model (of which
>> DirectRunner should be a canonical implementation) and therefore that
>> should be fixed there, or is the late data dropping an optional feature
>> of a runner?
>>
>> I'm strongly in favor of the first option, and I think it is likely that
>> all real-world runners would probably adhere to that (I didn't check
>> that, though).
>>
>> Opinions?
>>
>> Jan
>>
>>