Hi devs, I’ve got a question about how to design a Flink app that can handle reprocessing of historical data. As a quick background, I’m building an ETL / data aggregator application, and one of the requirements is that if we later extend the system with new metrics or dimensions, we have the option to reprocess historical data in order to generate backfill (assuming, of course, that the historical data already contains the fields we’d be adding to our aggregations).
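
For reference, here is a simplified sketch of the kind of job I have in mind — the topic name, window size, out-of-orderness bound, allowed lateness, and the helper methods are all just placeholders, not the real values:

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class MetricsAggregationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka source for the raw events (topic / brokers are placeholders).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Bounded out-of-orderness watermarks: the watermark trails the max
        // observed event timestamp by 30 seconds.
        WatermarkStrategy<String> watermarks = WatermarkStrategy
                .<String>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                .withTimestampAssigner((event, recordTs) -> extractEventTime(event));

        DataStream<String> events = env.fromSource(source, watermarks, "events");

        // Event-time tumbling windows with a small allowed lateness; once the
        // watermark passes window_end + allowed_lateness, the window state is
        // purged and any further events for that window are dropped.
        events
            .keyBy(MetricsAggregationJob::extractDimensionKey)
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            .allowedLateness(Time.minutes(1))
            .reduce((a, b) -> a)  // stand-in for the real metric aggregation
            .print();

        env.execute("metrics-aggregation");
    }

    private static long extractEventTime(String event) { /* parse event timestamp */ return 0L; }
    private static String extractDimensionKey(String event) { /* parse dimension key */ return ""; }
}

The real job has more dimensions and metrics than this, but the watermark / window / allowed-lateness setup is the part my question is about.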
The data source for my application is Kafka, so my naive approach would be to re-ingest the older events into Kafka and let the stream processor handle the rest. But if I’m understanding Flink’s windowing and watermarks correctly for an event-time app, there doesn’t really seem to be a way to revisit older data: with event-time windowing, any window for which (window_end + allowed_lateness + watermark_out_of_orderness) <= max_timestamp is permanently closed, since max_timestamp only ever increases.

That all makes sense, but I’m not sure where it leaves me with my requirement to support backfill. Is the alternative to just recreate the app every time and try to ensure that all the historical data gets processed before any new, live data?

Best,
Ty
