Hi all,

We have a data processing system where a daily batch process writes
some data to a Kafka topic. This then goes through several other
components that enrich the data; these are also integrated via Kafka.
So overall we have something like:

Batch job -> topic A -> streaming app 2 -> topic B -> streaming app 3

We would like to know when all the data published to topic A has finally
been processed by streaming app 3, as we may trigger some other
processes from this (e.g. notifying customers their data is processed
for that day). We've come up with a possible solution, and it would be
great to get feedback to see what we missed.

Assumptions:
- Consumers all track their offsets using Kafka, committing once
they've done all required processing for a message
- We have some "batch-monitor" component which will track progress,
described below
- We don't need to know exactly when the batch finished processing;
finding out soon after it has finished is good enough

Broad flow:
- Batch job reads some input data and publishes output to topic A
- Batch job sends data to our "batch-monitor" component about the
offsets on each partition at the time it finishes its processing
- "batch-monitor" subscribes to the topic containing the committed
offsets for topic A for streaming app 2 consumer
- "batch-monitor" can therefore see when streaming app 2 has committed
all the offsets that were in the batch
- Once "batch-monitor" detects that streaming app 2 has finished its
processing for the batch, it records the max offsets for all partitions
of topic B; these can be used to know when streaming app 3 has finished
processing the batch
- "batch-monitor" subscribes to the topic containing the committed
offsets for topic B for streaming app 3 consumer
- "batch-monitor" can therefore see when streaming app 3 has committed
all the offsets that were in the batch
- Once that happens "batch-monitor" can send some notification somewhere else
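
To make the "has committed all the offsets that were in the batch" check
concrete, here is a minimal sketch in Python of how we imagine it working.
The names (StageTracker, on_commit, done) and the offset values are made up
for illustration, and it deliberately leaves out how "batch-monitor"
actually receives the committed offsets. It assumes the recorded offsets are
end offsets (the position of the next record to be written), so a partition
counts as finished once the committed offset reaches that value.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


class StageTracker:
    """Tracks one consumer group's progress through one batch on one topic."""

    def __init__(self, targets: Dict[TopicPartition, int]) -> None:
        # targets: end offset per partition, captured when the upstream
        # stage finished writing the batch (assumption, see above).
        self.targets = dict(targets)
        self.committed: Dict[TopicPartition, int] = {}

    def on_commit(self, tp: TopicPartition, offset: int) -> None:
        # Keep the highest committed offset seen for each partition.
        self.committed[tp] = max(offset, self.committed.get(tp, 0))

    def done(self) -> bool:
        # A committed offset is the position of the next record to read, so
        # the batch is fully processed once committed >= end offset on
        # every partition.
        return all(self.committed.get(tp, 0) >= end
                   for tp, end in self.targets.items())


# Stage 1: streaming app 2 consuming topic A (offsets are made-up examples).
app2 = StageTracker({("topic-A", 0): 120, ("topic-A", 1): 95})
app2.on_commit(("topic-A", 0), 120)
print(app2.done())  # False: partition 1 has no commit yet
app2.on_commit(("topic-A", 1), 95)
print(app2.done())  # True: both partitions have reached their targets
```

Once app2.done() returns True, "batch-monitor" would capture the end offsets
of topic B and build a second StageTracker for streaming app 3. Those topic B
offsets may include a few records written after the batch, which would make
the notification slightly late rather than early; that seems fine given the
third assumption above.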

Any thoughts gratefully received

Tim
