Thanks all for the feedback and suggestions.

@Henn I did consider something like your suggestion, though it brings a
couple of challenges:
1) Not all messages emitted by the batch job will make it to streaming app
3; some will be filtered out (granted, you could change all the
applications to pass on some kind of empty message instead of filtering)
2) You could still solve the problem, but you would need some tracking at
each stage. As you say, that means some other datastore for tracking, and
a call to it when processing each message. Overall I think it would be
quite invasive, requiring code changes in each component, and further code
changes if the pipeline had additional stages added. Definitely still
doable, but I was leaning towards a monitoring component as it could be
more generic, and extending it to other pipelines or apps would require
only configuration changes (rough sketch of that check below).
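
For illustration, here's a rough sketch (Java, not code from our system)
of the kind of check the monitor would run for streaming app 2. It polls
committed offsets with the AdminClient rather than subscribing to the
committed-offsets topic as described further down the thread, and the
group id, topic name and offsets are all made up:

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class BatchMonitorCheck {

    // batchEndOffsets: per-partition offsets of topic A captured by the
    // batch job when it finished producing (i.e. the next offset to be
    // written on each partition).
    static boolean batchConsumed(AdminClient admin,
                                 String groupId,
                                 Map<TopicPartition, Long> batchEndOffsets)
            throws Exception {
        Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId)
                     .partitionsToOffsetAndMetadata()
                     .get();
        // The batch is done once every partition's committed offset has
        // reached the offset the batch job recorded for that partition.
        return batchEndOffsets.entrySet().stream().allMatch(e -> {
            OffsetAndMetadata om = committed.get(e.getKey());
            return om != null && om.offset() >= e.getValue();
        });
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                  "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // The batch job would supply these, e.g. via a control topic
            // or a small datastore (hypothetical values here).
            Map<TopicPartition, Long> batchEndOffsets = Map.of(
                    new TopicPartition("topic-A", 0), 1000L,
                    new TopicPartition("topic-A", 1), 950L);
            System.out.println(batchConsumed(admin, "streaming-app-2",
                                             batchEndOffsets));
        }
    }
}

The same check would then be repeated for streaming app 3 against the
topic B offsets recorded once app 2 is done, so adding another stage is
mostly configuration rather than new code in each component.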

@星月夜 I did also consider your suggestion, but you would need to add the
metadata to the last message on each partition (there is no guarantee the
"last" message from the batch job would be the last message streaming app
2 processes, as ordering is only guaranteed within a partition). There
would then be more complexity around situations like increasing the number
of partitions, or different numbers of partitions on topic A and topic B.
Also note that my assumption is the streaming apps each run multiple
consumer instances per topic, which means you need some co-ordination
across those consumers to know when the whole batch is truly finished.
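
To make the co-ordination point concrete, the batch job would have to
send an explicit end-of-batch marker to every partition of topic A,
because ordering is only guaranteed within a partition. A hypothetical
sketch (topic name and marker value made up):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.serialization.StringSerializer;

public class EndOfBatchMarkers {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer =
                     new KafkaProducer<>(props)) {
            // One end-of-batch marker per partition of topic A.
            for (PartitionInfo p : producer.partitionsFor("topic-A")) {
                producer.send(new ProducerRecord<>("topic-A", p.partition(),
                        "control", "END_OF_BATCH:2019-03-25"));
            }
            producer.flush();
        }
    }
}

Even then, each consumer instance in the group only sees the markers for
the partitions it owns, so the instances would still need some shared
state to decide the whole batch is finished, which is the co-ordination
I'd rather not push into every app.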

I agree things would be much more straightforward if the whole system were
batch-oriented. At the moment, though, the reasons we're not going down
that path are:
- Effort required - there are a number of components (more than shown in
the example) so it would require a lot of work with no immediate business
benefit
- There are other inputs to the flow that actually are streams and we get
some benefits from processing those immediately rather than waiting for a
batch process
- There are likely to be more stream inputs in the future

We could have separate stream and batch processing, though it's more work
than I think there's appetite for right now.

On Mon, 25 Mar 2019 at 17:16, 星月夜 <1095193...@qq.com> wrote:

> Just add some metadata to the last message to indicate whether the
> current batch has ended.
> Agreed @Henn. If the work is batch-oriented, Spark or another batch
> processing system is a more suitable solution.
>
>
>
> ---Original---
> From: "Harper Henn"<harper.h...@datto.com>
> Date: Tue, Mar 26, 2019 00:36 AM
> To: "users"<users@kafka.apache.org>;
> Subject: Re: Tracking progress for messages generated by a batch process
>
>
> Assuming you know how many items are in a batch ahead of time, could you
> just add a batch ID and position of a message within a batch to each
> message you send to topic A? Then your end application (streaming app 3)
> could check if every message in that batch has been processed, and trigger
> events when that condition is true. This would require some kind of
> tracking (perhaps in another database), but would get rid of the need for a
> batch monitoring program that tracks offsets.
>
> Kafka seems like an awkward fit for batch processing. Is it possible
> there's another datastore that's better suited for your use case?
>
> On Fri, Mar 22, 2019 at 11:04 AM Matthias J. Sax <matth...@confluent.io>
> wrote:
>
> > Sounds reasonable to me.
> >
> > -Matthias
> >
> > On 3/22/19 9:50 AM, Tim Gent wrote:
> > > Hi all,
> > >
> > > We have a data processing system where a daily batch process generates
> > > some data into a Kafka topic. This then goes through several other
> > > components that enrich the data, these are also integrated via Kafka.
> > > So overall we have something like:
> > >
> > > Batch job -> topic A -> streaming app 2 -> topic B -> streaming app 3
> > >
> > > We would like to know when all the data generated onto topic A finally
> > > gets processed by streaming app 3, as we may trigger some other
> > > processes from this (e.g. notifying customers their data is processed
> > > for that day). We've come up with a possible solution, and it would be
> > > great to get feedback to see what we missed.
> > >
> > > Assumptions:
> > > - Consumers all track their offsets using Kafka, committing once
> > > they've done all required processing for a message
> > > - We have some "batch-monitor" component which will track progress,
> > > described below
> > > - It isn't important to us to know exactly when the batch finished
> > > processing; sometime soon after it finishes is good enough
> > >
> > > Broad flow:
> > > - Batch job reads some input data and publishes output to topic A
> > > - Batch job sends data to our "batch-monitor" component about the
> > > offsets on each partition at the time it finishes its processing
> > > - "batch-monitor" subscribes to the topic containing the committed
> > > offsets for topic A for streaming app 2 consumer
> > > - "batch-monitor" can therefore see when streaming app 2 has committed
> > > all the offsets that were in the batch
> > > - Once "batch-monitor" detects that streaming app 2 has finished it's
> > > processing for the batch it records max offsets for all partitions for
> > > messages in topic b -> these can be used to know when streaming app 3
> > > has finished processing the batch
> > > - "batch-monitor" subscribes to the topic containing the committed
> > > offsets for topic B for streaming app 3 consumer
> > > - "batch-monitor" can therefore see when streaming app 3 has committed
> > > all the offsets that were in the batch
> > > - Once that happens "batch-monitor" can send some notification
> > > somewhere else
> > >
> > > Any thoughts gratefully received
> > >
> > > Tim
> > >
> >
> >
