I have an architectural question. I am planning to build a data transformation pipeline for documents. Each component will send processing events to a Kafka 'events' topic.
It will have the following steps:

1) Upload the document to the repository (S3 or other storage), get a public URL for it, create a 'received' event with the document URL, and send the event to the Kafka 'events' topic.

2) A Transformer process listens to the 'events' topic. When it sees a 'received' event, it downloads the document, transforms it, pushes the transformed document to the repository (S3 or other storage), creates a 'transformed' event, and sends that event to the same 'events' topic.

The Transformer process can break in the middle (exception, crash, etc.). On startup, it needs to check the 'events' topic for documents that were received but not transformed. Should it read all events from the 'events' topic? Should it join the 'received' and 'transformed' events somehow to work out what was received but not transformed? I don't have a clear idea of how it should behave. Please help. *Pavel Molchanov*
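For what it's worth, the "join" I have in mind would look roughly like the sketch below: replay the topic from the beginning and diff the two event sets. This is only an illustration of the logic, not a working consumer; the event shape (a dict with `type` and `doc_url` fields) is my assumption, and in reality the events would come from a Kafka consumer rather than an in-memory list.

```python
def pending_documents(events):
    """Given all events replayed from the 'events' topic (oldest first),
    return the document URLs that were received but never transformed.

    Assumed (hypothetical) event shape: {"type": ..., "doc_url": ...}.
    """
    received = set()
    transformed = set()
    for event in events:
        if event["type"] == "received":
            received.add(event["doc_url"])
        elif event["type"] == "transformed":
            transformed.add(event["doc_url"])
    # Set difference = documents whose transformation never completed.
    return received - transformed

# Example: doc2 got a 'received' event but no matching 'transformed' event.
events = [
    {"type": "received", "doc_url": "s3://bucket/doc1"},
    {"type": "received", "doc_url": "s3://bucket/doc2"},
    {"type": "transformed", "doc_url": "s3://bucket/doc1"},
]
print(pending_documents(events))  # {'s3://bucket/doc2'}
```

My worry is that this requires re-reading the whole topic on every restart, which seems wasteful as the topic grows.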