Hi devs,

I’ve got a question about how to design a Flink app that can handle the
reprocessing of historical data. As quick background, I’m building an ETL /
data-aggregation application, and one of the requirements is that if we
later extend the system with new metrics or dimensions, we have the option
to reprocess historical data to backfill them (assuming the historical
events already contain the fields we’d be adding to our aggregations, of
course).

The data source for my application is Kafka, so my naive approach here
would just be to re-ingest the older events into Kafka and let the stream
processor handle the rest. But if I’m understanding Flink’s windowing and
watermarks correctly for an event-time-based app, there doesn’t really seem
to be a way of revisiting older data. With event-time windowing, any window
for which (window end + allowed lateness + the watermark strategy's
out-of-orderness) <= max_timestamp is permanently closed, since
max_timestamp only ever increases.
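
For concreteness, here's roughly the kind of event-time setup I mean (the
class names, key/timestamp accessors, and durations below are just
placeholders, not my actual job):

    // env, kafkaSource, Event, and MetricAggregator are placeholders for my real
    // environment, Kafka source, event type, and AggregateFunction implementation.
    DataStream<Event> events = env.fromSource(
            kafkaSource,
            WatermarkStrategy
                    .<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                    .withTimestampAssigner((event, ts) -> event.getEventTime()),
            "events");

    events
            .keyBy(Event::getDimensionKey)
            .window(TumblingEventTimeWindows.of(Time.hours(1)))
            .allowedLateness(Time.minutes(10))
            .aggregate(new MetricAggregator());

    // With this config, a window only accepts data until the watermark passes
    // window end + 10 min allowed lateness, i.e. until max_timestamp reaches
    // window end + 10 min + 5 min out-of-orderness; anything older is dropped.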

And that all makes sense, but I’m not sure where it leaves me with my
requirement to support backfill. Is the alternative just to recreate the
app every time and try to ensure that all the historical data gets
processed before any new, live data arrives?

Best,
Ty
