Hi Sachin,

What you have suggested will never happen. If there is only 1 partition,
there will only ever be one consumer of that partition. So if you had 2
instances of your streams application, and only a single input partition,
only 1 instance would be processing the data.
If you are running like this, then you might want to set
StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG to 1 - this will mean that the
State Store that is generated by the aggregation is kept up to date on the
instance that is not processing the data. So in the event that the active
instance fails, the standby instance should be able to continue without too
much of a gap in processing time.
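
For example, a minimal sketch of what that looks like (the application id,
broker address and class name below are placeholders; the setting that
matters here is NUM_STANDBY_REPLICAS_CONFIG):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class StandbyExample {
    public static void main(String[] args) {
        KStreamBuilder builder = new KStreamBuilder();
        // ... build the aggregation topology on builder ...

        Properties props = new Properties();
        // same application id on every instance, so they form one group
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        // keep a standby copy of each state store on another instance
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

        KafkaStreams streams = new KafkaStreams(builder, props);
        streams.start();
    }
}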

Thanks,
Damian

On Fri, 9 Dec 2016 at 04:55 Sachin Mittal <sjmit...@gmail.com> wrote:

> Hi,
> I followed the document and I have a few questions.
> Say I have a keyed input topic with a single partition and say I run 2
> streams applications, one from machine1 and one from machine2.
> Both applications have the same application id and have identical code.
> Say topic1 has messages like
> (k1, v11)
> (k1, v12)
> (k1, v13)
> (k2, v21)
> (k2, v22)
> (k2, v23)
> When I was running single application I was getting results like
> (k1, agg(v11, v12, v13))
> (k2, agg(v21, v22, v23))
>
> Now say the 2 applications are run and the messages are read in round-robin
> fashion:
> v11 v13 v22 - machine 1
> v12 v21 v23 - machine 2
>
> The aggregation at machine 1 would be
> (k1, agg(v11, v13))
> (k2, agg(v22))
>
> The aggregation at machine 2 would be
> (k1, agg(v12))
> (k2, agg(v21, v23))
>
> So now, where do I join the independent results of these 2 aggregations to
> get the final result expected when a single instance was running?
>
> Note my high level DSL is something like
> srcSTopic.aggregate(...).foreach((key, aggregation) -> {
>     // process aggregated value and push it to some external storage
> });
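>
> Spelled out a bit more, I mean roughly this (just a sketch, using the
> 0.10.1 DSL; the topic name, serdes, store name and the aggregation logic
> are placeholders):
>
> import org.apache.kafka.common.serialization.Serdes;
> import org.apache.kafka.streams.kstream.KStream;
> import org.apache.kafka.streams.kstream.KStreamBuilder;
> import org.apache.kafka.streams.kstream.KTable;
>
> KStreamBuilder builder = new KStreamBuilder();
> KStream<String, String> srcSTopic =
>         builder.stream(Serdes.String(), Serdes.String(), "topic1");
> KTable<String, String> aggregated = srcSTopic
>         .groupByKey()
>         .aggregate(() -> "",                       // initializer
>                 (key, value, agg) -> agg + value,  // aggregator (placeholder logic)
>                 Serdes.String(),                   // serde for the aggregated value
>                 "agg-store");                      // state store name
> aggregated.toStream().foreach((key, aggregation) -> {
>     // process aggregated value and push it to some external storage
> });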
>
> So I want this foreach to run against the final set of aggregated values.
> Do I need to add another step before the foreach to make sure the different
> results from the 2 machines are joined to get the final one as expected?
> If yes, what would that step be?
>
> Thanks
> Sachin
>
>
> On Fri, Dec 9, 2016 at 9:42 AM, Mathieu Fenniak <
> mathieu.fenn...@replicon.com> wrote:
>
> > Hi Sachin,
> >
> > Some quick answers, and a link to some documentation to read more:
> >
> > - If you restart the application, it will start from the point it crashed
> > (possibly reprocessing a small window of records).
> >
> > - You can run more than one instance of the application.  They'll
> > coordinate by virtue of being part of a Kafka consumer group; if one
> > crashes, the partitions that it was reading from will be picked up by
> > other instances.
> >
> > - When running more than one instance, the tasks will be distributed
> > between the instances.
> >
> > Confluent's docs on the Kafka Streams architecture go into a lot more
> > detail: http://docs.confluent.io/3.0.0/streams/architecture.html
> >
> > On Thu, Dec 8, 2016 at 9:05 PM, Sachin Mittal <sjmit...@gmail.com> wrote:
> >
> > > Hi All,
> > > We were able to run a stream processing application against a fairly
> > > decent load of messages in a production environment.
> > >
> > > To make the system robust, say the stream processing application crashes:
> > > is there a way to make it automatically restart from the point where it
> > > crashed?
> > >
> > > Also, is there any concept like running the same application in a cluster,
> > > where if one fails, the other takes over until we bring back up the failed
> > > node of the streams application?
> > >
> > > If yes, are there any guidelines or some knowledge base we can look at to
> > > understand how this would work?
> > >
> > > Is there a way, like in Spark, where the driver program distributes the
> > > tasks across various nodes in a cluster? Is there something similar in
> > > Kafka Streams too?
> > >
> > > Thanks
> > > Sachin
> > >
> >
>
