Hi all,

We have a data processing system where a daily batch process publishes some data to a Kafka topic. This then goes through several other components that enrich the data; these are also integrated via Kafka. So overall we have something like:
Batch job -> topic A -> streaming app 2 -> topic B -> streaming app 3

We would like to know when all the data generated onto topic A has finally been processed by streaming app 3, as we may trigger some other processes from this (e.g. notifying customers that their data is processed for that day). We've come up with a possible solution, and it would be great to get feedback to see what we missed.

Assumptions:
- Consumers all track their offsets using Kafka, committing once they've done all required processing for a message
- We have some "batch-monitor" component which will track progress, described below
- It isn't important to us to know exactly when the batch finished processing; sometime soon after it finished is good enough

Broad flow:
- Batch job reads some input data and publishes output to topic A
- Batch job sends data to our "batch-monitor" component about the offsets on each partition at the time it finishes its processing
- "batch-monitor" subscribes to the topic containing the committed offsets for topic A for the streaming app 2 consumer
- "batch-monitor" can therefore see when streaming app 2 has committed all the offsets that were in the batch
- Once "batch-monitor" detects that streaming app 2 has finished its processing for the batch, it records the max offsets for all partitions of topic B; these can be used to know when streaming app 3 has finished processing the batch
- "batch-monitor" subscribes to the topic containing the committed offsets for topic B for the streaming app 3 consumer
- "batch-monitor" can therefore see when streaming app 3 has committed all the offsets that were in the batch
- Once that happens, "batch-monitor" can send some notification somewhere else

Any thoughts gratefully received

Tim
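
P.S. A minimal sketch of the completion check in the flow above, in Python, to make the offset comparison concrete. It assumes the batch job reports each partition's log end offset (the offset the next message would receive) and relies on the Kafka convention that a committed offset points at the next message to consume, so a partition is done once committed >= end. If the batch job instead reports the offset of the last message it produced, the comparison would need to be committed >= last + 1. The names batch_done and lagging_partitions are hypothetical, and plain dicts stand in for whatever "batch-monitor" would actually read from Kafka:

```python
from typing import Dict, Tuple

# (topic, partition); a stand-in for a Kafka client's TopicPartition type
TopicPartition = Tuple[str, int]


def lagging_partitions(
    end_offsets: Dict[TopicPartition, int],
    committed: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return the remaining lag for each partition that has not caught up.

    A partition with no committed offset yet is treated as committed at 0
    (an assumption made for this sketch).
    """
    return {
        tp: end - committed.get(tp, 0)
        for tp, end in end_offsets.items()
        if committed.get(tp, 0) < end
    }


def batch_done(
    end_offsets: Dict[TopicPartition, int],
    committed: Dict[TopicPartition, int],
) -> bool:
    """True once every partition's committed offset has reached its end offset."""
    return not lagging_partitions(end_offsets, committed)


# Offsets reported by the batch job when it finished writing to topic A
end_offsets = {("A", 0): 100, ("A", 1): 50}

# Latest commits seen for streaming app 2's consumer group
committed = {("A", 0): 100, ("A", 1): 42}
print(batch_done(end_offsets, committed))          # False
print(lagging_partitions(end_offsets, committed))  # {('A', 1): 8}

committed[("A", 1)] = 50
print(batch_done(end_offsets, committed))          # True
```

The same check would be reused for topic B and streaming app 3, with end_offsets captured at the moment streaming app 2 is seen to be done.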