Hi,

We have an event-processing pipeline that populates various reports from
different Kafka topics, and we would like to centralize that processing in
Flink. My team is new to Flink, but we did some prototyping using Kinesis.

To enable new reporting based on past events, we'd like the ability to
replay those Kafka events when creating new reports, a capability we don't
have today.

We ingest the same topics from many Kafka clusters in different datacenters,
and keeping long retention on those Kafka topics is not practical, both for
technical reasons and because of GDPR compliance issues stemming from
Kafka's immutability (this isn't a problem today only because our Kafka
retention is short).

So we'd like to archive events into files that we push to AWS S3, along
with some metadata to make GDPR handling more efficient. I've looked into
Avro object container files and they seem like they would work for us.
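
One thing that attracted me to the container format is that it supports
file-level key/value metadata, which is where I'd put the GDPR-related
hints. A minimal sketch of what I mean (the schema and metadata keys are
just placeholders, nothing is settled yet):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class OcfMetadataSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder schema -- ours would be the real event schema.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"ArchivedEvent\",\"fields\":["
              + "{\"name\":\"eventKey\",\"type\":\"string\"},"
              + "{\"name\":\"payload\",\"type\":\"string\"}]}");

        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.deflateCodec(6));
            // File-level metadata must be set before create(); this is where the
            // GDPR hints would go, e.g. which subject ids appear in this file.
            writer.setMeta("source.dc", "dc1");
            writer.setMeta("subject.ids", "id1,id2,id3");
            writer.create(schema, new File("dc1__topicName__202401011200__00.data"));
            writer.append(new GenericRecordBuilder(schema)
                    .set("eventKey", "id1").set("payload", "...").build());
        }
    }
}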

I was thinking of having a dedicated Flink job that reads from Kafka and
archives to S3, and then somehow plugging those S3 files back into a
FileSource when a replay is needed to backfill new reporting views (rough
sketches below). S3 would contain Avro object container files named with a
pattern like

sourceDC__topicName__YYYYMMDDHHMM__NN.data

where files are rolled over roughly every hour and "rekeyed" into NN slots
based on the event key, so we keep per-key logical order while still
getting reasonably sized files.
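
To make the question more concrete, here is roughly the sink I was
picturing for the archiving job. Everything named here is a placeholder;
ArchivedEvent would be an Avro-generated record carrying sourceDc / topic /
eventKey fields, and I'm sure there are nicer ways to wire this up:

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.formats.avro.AvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.OutputFileConfig;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

public class EventArchiveSink {

    /** Buckets events as sourceDC__topicName__yyyyMMddHHmm__NN (truncated to the hour). */
    static class ArchiveBucketAssigner implements BucketAssigner<ArchivedEvent, String> {
        private static final DateTimeFormatter HOUR_BUCKET =
                DateTimeFormatter.ofPattern("yyyyMMddHHmm").withZone(ZoneOffset.UTC);
        private final int slots;

        ArchiveBucketAssigner(int slots) {
            this.slots = slots;
        }

        @Override
        public String getBucketId(ArchivedEvent event, Context context) {
            // "Rekey" into a fixed number of slots so a given event key always
            // lands in the same file series and keeps its logical order.
            int slot = Math.floorMod(event.getEventKey().hashCode(), slots);
            long ts = context.timestamp() != null
                    ? context.timestamp() : context.currentProcessingTime();
            String hour = HOUR_BUCKET.format(
                    Instant.ofEpochMilli(ts).truncatedTo(ChronoUnit.HOURS));
            return event.getSourceDc() + "__" + event.getTopic()
                    + "__" + hour + String.format("__%02d", slot);
        }

        @Override
        public SimpleVersionedSerializer<String> getSerializer() {
            return SimpleVersionedStringSerializer.INSTANCE;
        }
    }

    public static FileSink<ArchivedEvent> build() {
        return FileSink
                .forBulkFormat(new Path("s3://our-archive-bucket/events"),
                        AvroWriters.forSpecificRecord(ArchivedEvent.class))
                .withBucketAssigner(new ArchiveBucketAssigner(16))
                // Bulk (Avro) formats can only roll on checkpoint; the hourly
                // cut-over comes from the time component of the bucket id.
                .withRollingPolicy(OnCheckpointRollingPolicy.build())
                .withOutputFileConfig(
                        OutputFileConfig.builder().withPartSuffix(".data").build())
                .build();
    }
}

The archiving job itself would then just be KafkaSource -> a light mapping
into ArchivedEvent -> stream.sinkTo(EventArchiveSink.build()), with
checkpointing enabled so in-progress files get committed. One thing I
noticed while sketching this is that with FileSink the
sourceDC__topicName__YYYYMMDDHHMM__NN pattern naturally becomes a bucket
directory rather than a literal file name, with checkpoint-rolled part
files inside it; I think that would still work for us.

For the replay side, the only thing I've found so far is the legacy
AvroInputFormat; I haven't found a ready-made Avro format for the new
FileSource, so that part is still an open question for me:

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroInputFormat;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ArchiveReplayJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // The bucket ids from the archiving job become nested directories
        // under the base path, hence the nested enumeration.
        AvroInputFormat<ArchivedEvent> format = new AvroInputFormat<>(
                new Path("s3://our-archive-bucket/events"), ArchivedEvent.class);
        format.setNestedFileEnumeration(true);

        DataStream<ArchivedEvent> replayed = env.createInput(format);
        // ... feed "replayed" into the same reporting operators as the live job
        replayed.print();

        env.execute("archive-replay");
    }
}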

I presume someone has already done something similar. Any pointers would be
great!

-- 
Simon Paradis
paradissi...@gmail.com
