About failure and restart. Kafka Streams does not provide any tooling for this. It's a library.
However, because it is a library it is also agnostic to whatever tooling you want to use. You can, for example, use a resource manager like Mesos or YARN, containerize your application, or use deployment tools like Chef. And there are a bunch more -- pick whatever fits your needs best.

-Matthias

On 12/9/16 12:04 AM, Damian Guy wrote:
> Hi Sachin,
>
> What you have suggested will never happen. If there is only 1 partition
> there will only ever be one consumer of that partition. So if you had 2
> instances of your streams application, and only a single input partition,
> only 1 instance would be processing the data.
> If you are running like this, then you might want to set
> StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG to 1 - this will mean that the
> state store that is generated by the aggregation is kept up to date on the
> instance that is not processing the data. So in the event that the active
> instance fails, the standby instance should be able to continue without too
> much of a gap in processing time.
>
> Thanks,
> Damian
>
> On Fri, 9 Dec 2016 at 04:55 Sachin Mittal <sjmit...@gmail.com> wrote:
>
>> Hi,
>> I followed the document and I have a few questions.
>> Say I have a single-partition input topic, and say I run 2 streams
>> applications, one from machine1 and one from machine2.
>> Both applications have the same application id and identical code.
>> Say topic1 has messages like
>> (k1, v11)
>> (k1, v12)
>> (k1, v13)
>> (k2, v21)
>> (k2, v22)
>> (k2, v23)
>> When I was running a single application I was getting results like
>> (k1, agg(v11, v12, v13))
>> (k2, agg(v21, v22, v23))
>>
>> Now when 2 applications are run, say the messages are read in round-robin
>> fashion.
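[Editor's note] Damian's suggestion above amounts to one extra configuration entry. A minimal, dependency-free sketch of such a configuration, assuming the standard property keys (`StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG` resolves to the key `num.standby.replicas`; the application id and broker address are placeholders):

```java
import java.util.Properties;

public class StandbyConfigSketch {
    public static Properties buildConfig() {
        Properties props = new Properties();
        // Placeholder values -- substitute your own application id and brokers.
        props.put("application.id", "my-streams-app");
        props.put("bootstrap.servers", "localhost:9092");
        // Keep one warm replica of each state store on another instance, so a
        // standby can take over quickly if the active instance fails. This is
        // the key behind StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG.
        props.put("num.standby.replicas", "1");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(buildConfig().getProperty("num.standby.replicas"));
    }
}
```

These properties would then be passed to the `KafkaStreams` constructor alongside the topology.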
>> v11 v13 v22 - machine 1
>> v12 v21 v23 - machine 2
>>
>> The aggregation at machine 1 would be
>> (k1, agg(v11, v13))
>> (k2, agg(v22))
>>
>> The aggregation at machine 2 would be
>> (k1, agg(v12))
>> (k2, agg(v21, v23))
>>
>> So where do I join the independent results of these 2 aggregations to
>> get the final result expected when a single instance was running?
>>
>> Note my high-level DSL is something like
>> srcTopic.aggregate(...).foreach((key, aggregation) -> {
>>     // process aggregated value and push it to some external storage
>> });
>>
>> So I want this foreach to run against the final set of aggregated
>> values. Do I need to add another step before the foreach to make sure the
>> different results from the 2 machines are joined to get the final one as
>> expected? If yes, what would that step be?
>>
>> Thanks
>> Sachin
>>
>> On Fri, Dec 9, 2016 at 9:42 AM, Mathieu Fenniak <
>> mathieu.fenn...@replicon.com> wrote:
>>
>>> Hi Sachin,
>>>
>>> Some quick answers, and a link to some documentation to read more:
>>>
>>> - If you restart the application, it will start from the point it crashed
>>> (possibly reprocessing a small window of records).
>>>
>>> - You can run more than one instance of the application. They'll
>>> coordinate by virtue of being part of a Kafka consumer group; if one
>>> crashes, the partitions that it was reading from will be picked up by
>>> other instances.
>>>
>>> - When running more than one instance, the tasks will be distributed
>>> between the instances.
>>>
>>> Confluent's docs on the Kafka Streams architecture go into a lot more
>>> detail: http://docs.confluent.io/3.0.0/streams/architecture.html
>>>
>>> On Thu, Dec 8, 2016 at 9:05 PM, Sachin Mittal <sjmit...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>> We were able to run a stream processing application against a fairly
>>>> decent load of messages in a production environment.
>>>>
>>>> To make the system robust, say the stream processing application
>>>> crashes: is there a way to make it automatically restart from the point
>>>> where it crashed?
>>>>
>>>> Also, is there any concept like running the same application in a
>>>> cluster, where if one instance fails, another takes over until we bring
>>>> the failed streams node back up?
>>>>
>>>> If yes, are there any guidelines or a knowledge base we can look at to
>>>> understand how this would work?
>>>>
>>>> In Spark, the driver program distributes the tasks across various
>>>> nodes in a cluster; is there something similar in Kafka Streams too?
>>>>
>>>> Thanks
>>>> Sachin
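[Editor's note] The key point in Damian's reply is that records are assigned to partitions by key, not round-robin, so all values for a given key are aggregated on a single instance and no cross-instance join step is needed. A minimal, Kafka-free sketch of that idea (the hash function here is illustrative; Kafka's actual default partitioner uses murmur2, but the same-key-same-partition property is what matters):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitioningSketch {
    // Toy stand-in for key-based partitioning: records with the same key
    // always land in the same partition, hence on the same streams instance.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode()) % numPartitions;
    }

    public static void main(String[] args) {
        // The records from Sachin's example.
        List<String[]> records = List.of(
                new String[]{"k1", "v11"}, new String[]{"k1", "v12"},
                new String[]{"k1", "v13"}, new String[]{"k2", "v21"},
                new String[]{"k2", "v22"}, new String[]{"k2", "v23"});

        int numPartitions = 2;
        // partition -> (key -> concatenated "aggregate")
        Map<Integer, Map<String, String>> aggs = new HashMap<>();
        for (String[] r : records) {
            int p = partitionFor(r[0], numPartitions);
            aggs.computeIfAbsent(p, x -> new HashMap<>())
                .merge(r[0], r[1], (a, b) -> a + "," + b);
        }
        // Each key's aggregate is complete within a single partition --
        // the split described in the question cannot occur.
        System.out.println(aggs);
    }
}
```

With a single input partition, as in the question, the situation is even simpler: one instance owns the partition and does all the processing, while the second instance sits idle (or serves as a standby replica).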