Hi Yi,

Being able to detect both scenarios would be useful.

Scenario 1: To detect that the reprocessing job has "caught up" with the
stream, as described by Jay Kreps here
<http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html>,
so that we can make the app point to the new DB and tear down the old
version of the jobs and DB.
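For what it's worth, one way we've considered approximating "caught up" is a consumer-lag check: treat the reprocessing job as caught up once the gap between each partition's newest offset and the job's committed offset stays below a small threshold. This is just a sketch under that assumption (offsets are passed in by the caller; in practice they'd come from Kafka's offset APIs, and the class/threshold names are made up):

```java
// Sketch: lag-based "caught up" check for scenario 1 (hypothetical helper,
// not a Samza API). Offsets are supplied by the caller.
public class CatchUpCheck {
    private final long maxLag;

    public CatchUpCheck(long maxLag) {
        this.maxLag = maxLag;
    }

    /** True when the consumer is within maxLag messages of the log end. */
    public boolean isCaughtUp(long logEndOffset, long committedOffset) {
        return logEndOffset - committedOffset <= maxLag;
    }

    public static void main(String[] args) {
        CatchUpCheck check = new CatchUpCheck(100);
        System.out.println(check.isCaughtUp(50_000, 49_950)); // true: lag 50
        System.out.println(check.isCaughtUp(50_000, 10_000)); // false: lag 40,000
    }
}
```

The threshold has to be nonzero because the producer keeps writing while we check, so "caught up" can only ever mean "within some bounded lag".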

Scenario 2: We want to reuse our stream infrastructure/code for bounded
datasets, so that we don't have to write the same jobs on Hadoop (which is
the way it is right now). So, detecting the end of the stream is required
for shutting down the jobs.
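For scenario 2, since Samza has no built-in end-of-stream notion here, we've been thinking of an application-level sentinel: the producer writes one EOS marker per partition after the bounded dataset, and the job shuts itself down once it has seen a marker from every partition. A minimal sketch of that bookkeeping, with the marker value and class name purely illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: application-level end-of-stream detection (hypothetical helper,
// not a Samza API). The producer appends one "__EOS__" marker per partition
// after the bounded dataset; the consumer is done once every partition's
// marker has arrived.
public class EosDetector {
    private final int numPartitions;
    private final Set<Integer> eosSeen = new HashSet<>();

    public EosDetector(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** Returns true once EOS markers have arrived from all partitions. */
    public boolean onMessage(int partition, String payload) {
        if ("__EOS__".equals(payload)) { // illustrative sentinel value
            eosSeen.add(partition);
        }
        return eosSeen.size() == numPartitions;
    }

    public static void main(String[] args) {
        EosDetector d = new EosDetector(2);
        System.out.println(d.onMessage(0, "record-1")); // false
        System.out.println(d.onMessage(0, "__EOS__"));  // false: partition 1 pending
        System.out.println(d.onMessage(1, "__EOS__"));  // true: safe to shut down
    }
}
```

Inside a Samza task the final "true" would presumably translate into a TaskCoordinator shutdown request, but I'd want your input on whether that's the intended mechanism.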

Regards,

Kishore.



On Wed, Oct 14, 2015 at 11:55 PM, Yi Pan <nickpa...@gmail.com> wrote:

> Hi, Kishore,
>
> First I want some clarification on your use case.
> 1) Scenario 1: you still want the Samza jobs continuously running, while
> simply want to detect the end of a certain stream. On detection, do you
> need to unsubscribe from the stream? Or you are still OK receiving more
> messages from the stream?
> 2) Scenario 2: you want the Samza jobs to shutdown when detecting the end
> of a certain stream.
>
> Which scenario are you targeting?
>
> Thanks!
>
> -Yi
>
> On Wed, Oct 14, 2015 at 9:33 AM, Kishore N C <kishor...@gmail.com> wrote:
>
> > Hi,
> >
> > Our data processing pipeline consists of a set of Samza jobs that form a
> > DAG. Sometimes, we have to throw finite datasets into the Kafka topic
> > that acts as the entry point to the pipeline. Given that different Samza
> > jobs in the DAG could have varying latencies in terms of processing the
> > records (or could even temporarily fail or be stuck), how do I detect
> > that my assembly of jobs has finished processing all records? It's not
> > as simple as tallying the input and output record counts, as some jobs
> > could be filtering data, and others could be grouping records etc.
> >
> > Thanks,
> >
> > Kishore.
> >
>



-- 
It is our choices that show what we truly are,
far more than our abilities.
