Hi,

We have an event processing pipeline that populates various reports from different Kafka topics, and we would like to centralize that processing in Flink. My team is new to Flink, but we did some prototyping using Kinesis.
To enable new reporting based on past events, we'd like the ability to replay those Kafka events when creating new reports, a capability we don't have today. We ingest the same topics from many Kafka clusters in different datacenters, and keeping long retention on these topics is not practical, both for technical reasons and because of practical issues around GDPR compliance given Kafka's immutability (this isn't a problem today only because our Kafka retention is short).

So we'd like to archive events into files that we push to AWS S3, along with some metadata to help implement GDPR deletion more efficiently. I've looked into Avro object container files and they seem like a good fit. The idea is a dedicated Flink job that reads the topics and archives them to S3, and some way to plug those S3 files back into a FileSource whenever a replay is needed to backfill new reporting views.

S3 would contain Avro container files named with a pattern like sourceDC__topicName__YYYYMMDDHHMM__NN.data, where files are rolled over every hour or so and "rekeyed" into NN slots based on the event key, so that logical per-key order is retained while file sizes stay reasonable.

I presume someone has already done something similar. Any pointers would be great!

-- Simon Paradis
paradissi...@gmail.com
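P.S. To make the naming/rekeying idea concrete, here is a minimal sketch (not Flink code, just the slot and filename logic). The function names, the 16-slot count, and the sample DC/topic values are all hypothetical; the only fixed points from the proposal are the sourceDC__topicName__YYYYMMDDHHMM__NN.data pattern and the requirement that the same event key always maps to the same slot so per-key order survives the split:

```python
import hashlib
from datetime import datetime, timezone

def slot_for(key: bytes, num_slots: int) -> int:
    """Map an event key to a stable slot number.

    Uses a content-based hash (not Python's randomized hash()) so the
    same key lands in the same slot across processes and restarts,
    which is what preserves per-key ordering within one slot's files.
    """
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_slots

def archive_file_name(source_dc: str, topic: str,
                      event_time: datetime, slot: int) -> str:
    """Build a name following sourceDC__topicName__YYYYMMDDHHMM__NN.data."""
    stamp = event_time.strftime("%Y%m%d%H%M")
    return f"{source_dc}__{topic}__{stamp}__{slot:02d}.data"

# Hypothetical example values:
t = datetime(2024, 5, 1, 13, 0, tzinfo=timezone.utc)
slot = slot_for(b"user-42", 16)
print(archive_file_name("dc-east", "orders", t, slot))
```

A replay job can then list S3 by the sourceDC__topicName prefix, filter on the timestamp component to pick the backfill window, and feed one slot's files to each reader to keep key order.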