Hi Simon,

Off the top of my head, I do not see a reason why this shouldn't work in
Flink, though I'm not sure what your specific question is here.

For reading from both the FileSource and Kafka in a single job, you might
want to take a look at the Hybrid Source [1], which reads the archived files
first and then switches over to Kafka. Apart from that there are
FileSource/FileSink and KafkaSource, which I presume you have already found :)
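
For the archive layout you describe (sourceDC__topicName__YYYYMMDDHHMM__NN.data),
here is a minimal sketch in plain Java, with no Flink APIs involved, of how the
NN slot and the file name could be derived. The class and method names
(ArchiveNaming, slotFor, fileName) are hypothetical, and hashing the key with
String.hashCode() is only an assumption. Kafka's default partitioner uses
murmur2, so these slots would not line up with Kafka partitions.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Flink or Kafka. */
final class ArchiveNaming {
    // Timestamp layout assumed from the YYYYMMDDHHMM part of the pattern.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmm", Locale.ROOT);

    /** Maps an event key to a slot in [0, slots); the same key always gets the same slot. */
    static int slotFor(String eventKey, int slots) {
        if (slots <= 0) {
            throw new IllegalArgumentException("slots must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(eventKey.hashCode(), slots);
    }

    /** Builds a name following sourceDC__topicName__YYYYMMDDHHMM__NN.data. */
    static String fileName(String sourceDc, String topic, LocalDateTime windowStart, int slot) {
        return String.format(Locale.ROOT, "%s__%s__%s__%02d.data",
                sourceDc, topic, STAMP.format(windowStart), slot);
    }

    public static void main(String[] args) {
        int slot = slotFor("a", 16);
        System.out.println(
                fileName("dc1", "orders", LocalDateTime.of(2021, 11, 8, 22, 0), slot));
    }
}
```

Since every event with a given key lands in the same slot, per-key order within
a file is preserved as long as each slot is written by a single writer.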

Best,
Piotrek

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/hybridsource/

On Mon, 8 Nov 2021 at 22:22, Simon Paradis <paradissi...@gmail.com> wrote:

> Hi,
>
> We have an event processing pipeline that populates various reports from
> different Kafka topics and would like to centralize processing in Flink. My
> team is new to Flink but we did some prototyping using Kinesis.
>
> To enable new reporting based on past events, we'd like the ability to
> replay those Kafka events when creating new reports; a capability we don't
> have today.
>
> We ingest the same topics from many Kafka clusters in different
> datacenters and it is not practical to have enough retention on these Kafka
> topics for technical reasons and also practical issues around GDPR
> compliance and Kafka's immutability (it's not an issue today because our
> Kafka retention is short).
>
> So we'd like to archive events into files that we push to AWS S3 along
> with some metadata to help implement GDPR more efficiently. I've looked
> into Avro object container files and it seems like it would work for us.
>
> I was thinking of having a dedicated Flink job reading and archiving to S3
> and somehow plug these S3 files back into a FileSource when a replay is
> needed to backfill new reporting views. S3 would contain Avro container
> files with a pattern like
>
> sourceDC__topicName__YYYYMMDDHHMM__NN.data
>
> where files are rolled over every hour or so and "rekeyed" into NN slots
> as per the event key to retain logical order while having reasonable file
> sizes.
>
> I presume someone has already done something similar. Any pointer would be
> great!
>
> --
> Simon Paradis
> paradissi...@gmail.com
>
