Assuming you know how many items are in a batch ahead of time, could you
just add a batch ID and the message's position within the batch to each
message you send to topic A? Then your end application (streaming app 3)
could check if every message in that batch has been processed, and trigger
events when that condition is true. This would require some kind of
tracking (perhaps in another database), but would get rid of the need for a
batch monitoring program that tracks offsets.
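
As a rough sketch of what I mean (Python, in-memory only; `BatchTracker`,
`record`, and the argument names are made up for illustration, and this
assumes streaming app 2 carries the batch ID, position, and batch size
through to topic B unchanged, one output message per input message):

```python
class BatchTracker:
    """Tracks which positions of each batch streaming app 3 has processed."""

    def __init__(self):
        self.seen = {}          # batch_id -> set of positions processed so far
        self.completed = set()  # batch_ids already reported as complete

    def record(self, batch_id, position, batch_size):
        """Record one processed message.

        Returns True exactly once per batch, on the message that completes it.
        Positions are kept in a set so redelivered messages are not counted twice.
        """
        if batch_id in self.completed:
            return False  # late duplicate for a batch that already finished
        positions = self.seen.setdefault(batch_id, set())
        positions.add(position)
        if len(positions) == batch_size:
            self.completed.add(batch_id)
            del self.seen[batch_id]  # free the per-message state
            return True
        return False
```

Streaming app 3 would call `record(...)` after processing each message and
fire the notification when it returns True. In a real deployment the state
would need to live somewhere durable and shared across app instances (hence
the database), not in process memory.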

Kafka seems like an awkward fit for batch processing. Is it possible
there's another datastore that's better suited for your use case?

On Fri, Mar 22, 2019 at 11:04 AM Matthias J. Sax <matth...@confluent.io>
wrote:

> Sounds reasonable to me.
>
> -Matthias
>
> On 3/22/19 9:50 AM, Tim Gent wrote:
> > Hi all,
> >
> > We have a data processing system where a daily batch process generates
> > some data into a Kafka topic. This then goes through several other
> > components that enrich the data; these are also integrated via Kafka.
> > So overall we have something like:
> >
> > Batch job -> topic A -> streaming app 2 -> topic B -> streaming app 3
> >
> > We would like to know when all the data generated onto topic A finally
> > gets processed by streaming app 3, as we may trigger some other
> > processes from this (e.g. notifying customers their data is processed
> > for that day). We've come up with a possible solution, and it would be
> > great to get feedback to see what we missed.
> >
> > Assumptions:
> > - Consumers all track their offsets using Kafka, committing once
> > they've done all required processing for a message
> > - We have some "batch-monitor" component which will track progress,
> > described below
> > - It isn't important to us to know exactly when the batch finished
> > processing; some time soon after the batch has finished is good
> > enough
> >
> > Broad flow:
> > - Batch job reads some input data and publishes output to topic A
> > - Batch job sends data to our "batch-monitor" component about the
> > offsets on each partition at the time it finishes its processing
> > - "batch-monitor" subscribes to the topic containing the committed
> > offsets for topic A for streaming app 2 consumer
> > - "batch-monitor" can therefore see when streaming app 2 has committed
> > all the offsets that were in the batch
> > - Once "batch-monitor" detects that streaming app 2 has finished its
> > processing for the batch it records max offsets for all partitions for
> > messages in topic B -> these can be used to know when streaming app 3
> > has finished processing the batch
> > - "batch-monitor" subscribes to the topic containing the committed
> > offsets for topic B for streaming app 3 consumer
> > - "batch-monitor" can therefore see when streaming app 3 has committed
> > all the offsets that were in the batch
> > - Once that happens "batch-monitor" can send some notification somewhere
> > else
> >
> > Any thoughts gratefully received
> >
> > Tim
> >
>
>
