I'm evaluating Flink for a reporting application that will keep
various aggregates updated in a database. It will be consuming from
Kafka queues that are replicated from remote data centers, so in case
there is a long outage in replication, I need to decide what to do
about windowing and late data.

If I use Flink's built-in windows and watermarks, late data will
arrive in 1-element windows, which could overwhelm the database if a
large batch of late data comes in and each element is mapped to an
individual database update.
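To make the concern concrete, here is a toy simulation (plain Python, not Flink code, with an idealized ascending watermark) of the behavior described above: once the watermark has passed a window's end, each late arrival fires on its own instead of being batched with the rest of its window, so N late elements become N separate updates.

```python
# Toy simulation: tumbling event-time windows with a watermark.
# Elements arriving at or before the watermark ("late") each fire
# immediately as their own 1-element result.

WINDOW_SIZE = 10  # tumbling windows: [0,10), [10,20), ...

def window_start(ts):
    return ts - (ts % WINDOW_SIZE)

def process(events):
    """events: (timestamp, value) pairs in arrival order.
    Returns the list of emitted (window_start, [values]) firings."""
    watermark = -1
    buffers = {}   # window_start -> values buffered ahead of the watermark
    firings = []
    for ts, value in events:
        start = window_start(ts)
        if ts <= watermark:
            # late: fires on its own -> one update per late element
            firings.append((start, [value]))
        else:
            buffers.setdefault(start, []).append(value)
            watermark = max(watermark, ts)  # idealized ascending watermark
            # fire any buffered window whose end the watermark has passed
            for s in sorted(buffers):
                if s + WINDOW_SIZE <= watermark:
                    firings.append((s, buffers.pop(s)))
    return firings

# On-time data fills the [0,10) window; a replication outage then delivers
# three late elements for that window, each producing a separate firing.
events = [(1, "a"), (5, "b"), (12, "c"), (2, "x"), (3, "y"), (4, "z")]
print(process(events))
# -> [(0, ['a', 'b']), (0, ['x']), (0, ['y']), (0, ['z'])]
```

The on-time elements for [0,10) are emitted together, but each of the three late elements produces its own firing, i.e. its own database update.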

As far as I can tell, I have two options:

1. Ignore late data: mark it as late in an
AssignerWithPunctuatedWatermarks function, then discard it in a
flatMap operator. In this scenario, I would rely on a batch process to
backfill the missing data later, in the lambda architecture style.

2. Implement my own watermark logic to allow full windows of late
data. It seems like I could, for example, emit a "tick" message that
is replicated to all partitions every n messages, and then a custom
Trigger could decide when to purge each window based on the ticks and
a timeout duration. The system would never emit a real Watermark.

My questions are:
- Am I mistaken about either of these, or are there any other options
I'm not seeing for avoiding 1-element windows?
- For option 2, are there any problems with not emitting actual
watermarks, as long as the windows are eventually purged by a trigger?

Thanks,
Mike
