The withAllowedLateness controls when data can enter the system and still
be considered valid. The timestamp of the data is always relative to the
watermark.
timestamp is before watermark - withAllowedLateness -> data can be dropped
timestamp is after watermark - withAllowedLatness -> data can not be dropped

Since in your case your using event time (and not processing time), the
watermark should not be moving forward when the source is not available.

But, when there is no data being read from the source because there are no
records, it is up to the source to choose how the watermark advances and
may move the watermark forward based upon some estimate of where it thinks
the watermark should be at. Since some sources may not be able to tell
whether the source is truly unavailable or that there is just no incoming
data, it may move the watermark forward irregardless and thus
withAllowedLateness becomes important again.

On Fri, Dec 23, 2016 at 10:58 AM, Xu, Mingmin <[email protected]> wrote:

> Hello,
>
>
>
> I’m working on a POC project with Apache Beam. The rough pipeline reads
> from a checkout Kafka topic, and generate hourly summary data on different
> dimensions. I suppose a Fixed Time Window, with Time-Based Trigger could
> handle the case. –EventTime is the checkout timestamp.
>
>
>
> However, when the job, or the source is down for some time, like several
> hours, it would have problems to run the recovery. Data will be dropped,
> unless I set a large value for *withAllowedLateness, *large *allowedLateness+
> accumulatingFiredPanes* also leads to lots of pane data in memory. Is
> this the right way to handle a recovery scenario? Appreciate for any
> suggestion.
>
>
>
> Thank you!
>
> Mingmin
>
>
>

Reply via email to