I have an architectural question.

I am planning to build a data pipeline for document transformation. Each
component will send processing events to the Kafka 'events' topic.

It will have the following steps:

1) Upload data to the repository (S3 or other storage). Get a public URL
for the uploaded document. Create a 'received' event with the document URL
and send the event to the Kafka 'events' topic (see the producer sketch
after this list).

2) The Transformer process will listen to the Kafka 'events' topic. It
will react to each 'received' event in the 'events' topic, download the
document, transform it, push the transformed document to the repository (S3
or other storage), create a 'transformed' event, and send that event to the
same 'events' topic (see the consumer sketch after this list).
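
For concreteness, step 1 would look roughly like the following Java
sketch. The class and method names, the 'localhost:9092' broker address,
and the JSON event shape are placeholders of mine, not a settled design.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ReceivedEventProducer {
    public static void sendReceivedEvent(String documentId, String documentUrl) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by document id keeps the 'received' and 'transformed'
            // events for one document in the same partition, in order.
            String event = "{\"type\":\"received\",\"url\":\"" + documentUrl + "\"}";
            producer.send(new ProducerRecord<>("events", documentId, event));
        }
    }
}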
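
And step 2's consumer loop, under the same assumptions (string JSON
events, placeholder broker and group id); the naive contains() type check
and the manual commit after processing are illustrative choices, not
requirements of the design:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class TransformerLoop {
    public static void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "transformer");             // placeholder group id
        props.put("enable.auto.commit", "false");         // commit manually, after the work
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value().contains("\"type\":\"received\"")) {
                        // download, transform, upload, then send 'transformed'
                    }
                }
                // Committing only after processing means a crash mid-batch
                // causes the uncommitted 'received' events to be redelivered
                // on restart.
                consumer.commitSync();
            }
        }
    }
}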

The Transformer process can break in the middle (exception, crash, killed
process, etc.). Upon startup, the Transformer process needs to check the
'events' topic for documents that were received but not transformed.

Should it re-read all events from the 'events' topic? Should it somehow
join the 'received' and 'transformed' events to work out what was received
but not yet transformed? The sketch below shows the kind of join I mean.
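
Roughly: rescan the topic from the beginning with a throwaway consumer
group and take the set difference of document keys. The class name, group
id, and event fields are placeholders, consistent with the sketches above.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import java.util.UUID;

public class StartupReconciler {
    // Returns document keys with a 'received' event but no 'transformed' one.
    public static Set<String> findPending() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");         // placeholder
        props.put("group.id", "reconciler-" + UUID.randomUUID()); // throwaway group
        props.put("auto.offset.reset", "earliest");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Set<String> received = new HashSet<>();
        Set<String> transformed = new HashSet<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            consumer.poll(Duration.ofSeconds(2));            // triggers partition assignment
            consumer.seekToBeginning(consumer.assignment()); // rewind to the start
            ConsumerRecords<String, String> records;
            // Crude end-of-topic check: stop on the first empty poll.
            while (!(records = consumer.poll(Duration.ofSeconds(2))).isEmpty()) {
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value().contains("\"type\":\"received\"")) {
                        received.add(record.key());
                    } else if (record.value().contains("\"type\":\"transformed\"")) {
                        transformed.add(record.key());
                    }
                }
            }
        }
        received.removeAll(transformed); // received but never transformed
        return received;
    }
}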

I don't have a clear idea of how it should behave.

Please help.

*Pavel Molchanov*
