On Fri, May 21, 2021 at 11:47 AM Kenneth Knowles <[email protected]> wrote:
> +dev <[email protected]> +Reuven Lax <[email protected]>
>
> Advancing the watermark to infinity does have an effect on the
> GlobalWindow. The GlobalWindow ends a little bit before infinity :-).
> That is why this works to cause the output even for unbounded
> aggregations.
>

I'm definitely glad to hear that GlobalWindow is supposed to close on
Drain, so it sounds like the FixedWindows workaround is not necessary.

If the watermark advances with the intention of causing windows to close,
then it's unclear to me in what cases droppedDueToLateness would be
expected, and whether it would be expected in our case. Is it possible
that the watermark is advanced while there are still messages working
their way through the pipeline, so that by the time they hit the
aggregation, they're considered late? If so, is there a way to prevent
that?

> On Fri, May 21, 2021 at 5:10 AM Jeff Klukas <[email protected]> wrote:
>
>> Beam users,
>>
>> We're attempting to write a Java pipeline that uses Count.perKey() to
>> collect event counts, and then flush those to an HTTP API every ten
>> minutes based on processing time.
>>
>> We've tried expressing this using GlobalWindows with an
>> AfterProcessingTime trigger, but we find that when we drain the
>> pipeline we end up with entries in the droppedDueToLateness metric.
>> This was initially surprising, but may be in line with documented
>> behavior [0]:
>>
>> > When you issue the Drain command, Dataflow immediately closes any
>> > in-process windows and fires all triggers. The system does not wait
>> > for any outstanding time-based windows to finish. Dataflow causes
>> > open windows to close by advancing the system watermark to infinity
>>
>> Perhaps advancing the watermark to infinity has no effect on
>> GlobalWindows, so we attempted to get around this by using a fixed but
>> arbitrarily-long window:
>>
>> FixedWindows.of(Duration.standardDays(36500))
>>
>> The first few tests with this configuration came back clean, but the
>> third test again showed droppedDueToLateness after calling Drain. You
>> can see this current configuration in [1].
>>
>> Is there a pattern for reliably flushing on Drain when doing
>> processing-time-based aggregates like this?
>>
>> [0]
>> https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#effects_of_draining_a_job
>> [1]
>> https://github.com/mozilla/gcp-ingestion/pull/1689/files#diff-1d75ce2cbda625465d5971a83d842dd35e2eaded2c2dd2b6c7d0d7cdfd459115R58-R71
>>
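For reference, the configuration under discussion looks roughly like the
sketch below. This is only an illustration, not the exact code from [1]:
the method name tenMinuteCounts, the input element type, and the
withAllowedLateness(Duration.ZERO) setting are assumptions.

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Counts per key in the GlobalWindow, with a repeating processing-time
    // trigger that emits a pane roughly every ten minutes. An allowed
    // lateness of zero (assumed here) means elements arriving after a
    // window has closed are dropped and show up in droppedDueToLateness.
    static PCollection<KV<String, Long>> tenMinuteCounts(
        PCollection<KV<String, String>> keyedEvents) {
      return keyedEvents
          .apply(Window.<KV<String, String>>into(new GlobalWindows())
              .triggering(Repeatedly.forever(
                  AfterProcessingTime.pastFirstElementInPane()
                      .plusDelayOf(Duration.standardMinutes(10))))
              .withAllowedLateness(Duration.ZERO)
              .discardingFiredPanes())
          .apply(Count.perKey());
    }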
