Hi everyone,

What is the typical architectural approach with Flink SQL for processing recent 
events from Kafka and older events from a separate, cheaper storage layer?

I currently have the following situation in mind (roughly sketched in Flink SQL 
below):

* events arrive in Kafka and are retained there for, say, 1 month
* events are also regularly forwarded to AWS S3 into Parquet files
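
To make this concrete, here is roughly how I picture the two sources as Flink 
SQL tables (the schema, topic, bucket and formats below are just placeholders 
for illustration):

-- recent events, retained in Kafka for about 1 month
-- (schema, topic and format are made-up placeholders)
CREATE TABLE events_kafka (
  user_id     STRING,
  event_type  STRING,
  event_time  TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'events',
  'properties.bootstrap.servers' = 'broker:9092',
  'properties.group.id' = 'events-backfill-demo',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- the same events, archived to S3 as Parquet
-- (no WATERMARK declared here since, as mentioned below, the FileSystem
--  connector apparently does not emit watermarks yet)
CREATE TABLE events_archive (
  user_id     STRING,
  event_type  STRING,
  event_time  TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path' = 's3://my-bucket/events/',
  'format' = 'parquet'
);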

Given a use case, or a bug fix, that requires (re)processing events older than 
1 month, how do we design a Flink SQL pipeline that produces correct results?

* one possibility I envisaged is simply to first start the Flink SQL 
application with a FileSystem connector reading the events from Parquet, then 
to shut it down while triggering a savepoint, and finally to resume it from 
that savepoint, this time using the Kafka connector (a rough sketch of that 
hand-off follows the second option below). I saw some presentations where 
engineers from Lyft discussed that approach. From what I understand though, 
the FileSystem connector currently does not emit watermarks (I understand 
https://issues.apache.org/jira/browse/FLINK-21871 is addressing just that), so 
if I get things right that would imply that none of my Flink SQL code can 
depend on event time, which seems very restrictive.

* another option is to keep infinite retention in Kafka, which is expensive, 
or to copy old events from S3 back into Kafka whenever we need to reprocess 
them (also sketched below).
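
For the first option, this is roughly the hand-off I have in mind, reusing the 
placeholder tables above (my_sink, the savepoint path and the job id are again 
made up):

-- phase 1: backfill the pipeline from the Parquet archive
INSERT INTO my_sink
SELECT user_id, event_type, event_time
FROM events_archive;

-- once the backfill has caught up, stop the job with a savepoint, e.g. from
-- the CLI: flink stop --savepointPath s3://my-bucket/savepoints/ <jobId>

-- phase 2: resume the same query from that savepoint, now reading from Kafka
SET 'execution.savepoint.path' = 's3://my-bucket/savepoints/savepoint-xxxx';

INSERT INTO my_sink
SELECT user_id, event_type, event_time
FROM events_kafka;

I have not verified how the savepoint state maps across the two different 
sources, so the above is only how I imagine the hand-off working.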
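
And for the copy-back variant of the second option, I suppose the re-ingestion 
itself could be a one-off Flink SQL job along these lines (again with the 
placeholder tables from above):

-- one-off job republishing archived events to the Kafka topic, so the regular
-- pipeline can consume them from Kafka as usual (filtering by date omitted)
INSERT INTO events_kafka
SELECT user_id, event_type, event_time
FROM events_archive;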

Since streaming with Flink, data retention in Kafka and pipeline backfilling 
are such common concerns, I imagine that many teams are already addressing the 
situation I describe above.

What is the usual way of approaching this?

Thanks a lot in advance



