We've only started using Kafka-based group coordination for small and
simple use cases at LinkedIn so far.

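Given that you kill -9 your process, your explanation for the long
stabilization time makes sense. I'd recommend shutting down gracefully and
calling KafkaConsumer.close(). As noted earlier in the thread, close() tries
to send a LeaveGroupRequest while a kill -9 does not, so the group learns
about the departure right away. That should speed up the rebalances.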
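
To make that concrete, here is a rough sketch of one way to guarantee
close() runs on shutdown. It uses only the JDK so it runs standalone: the
consumer is passed in as a plain AutoCloseable (KafkaConsumer is Closeable,
so it fits), and the class and method names are made up for illustration.
It uses a "running" flag with a short poll timeout, which is an assumption
on my part; KafkaConsumer.wakeup() is another way to break out of poll().
Note that kill -9 (SIGKILL) cannot be caught, so JVM shutdown hooks never
run for it. The deploy tooling would have to send SIGTERM instead.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GracefulConsumerShutdown {

    /**
     * Runs pollOnce until `running` is cleared, then always closes the consumer.
     * With a real client, `consumer` would be the KafkaConsumer and `pollOnce`
     * would call consumer.poll(...) with a short timeout and process the records.
     */
    public static void runUntilStopped(AutoCloseable consumer, Runnable pollOnce,
                                       AtomicBoolean running) throws Exception {
        try {
            while (running.get()) {
                pollOnce.run();
            }
        } finally {
            // close() is the step that lets the client leave the group explicitly.
            consumer.close();
        }
    }

    /** Registers a JVM shutdown hook that stops the loop and waits for close() to finish. */
    public static void stopOnShutdown(AtomicBoolean running, Thread pollingThread) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            running.set(false);
            try {
                pollingThread.join(); // give the polling thread time to reach close()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean running = new AtomicBoolean(true);
        AtomicBoolean closed = new AtomicBoolean(false);
        int[] polls = {0};

        stopOnShutdown(running, Thread.currentThread());

        // Stand-in consumer for the demo: "close" just records that it was called,
        // and the fake poll clears the flag after three iterations.
        runUntilStopped(
                () -> closed.set(true),
                () -> { if (++polls[0] == 3) running.set(false); },
                running);

        System.out.println("polls=" + polls[0] + " closed=" + closed.get());
    }
}
```

With the real client, the poll timeout bounds how long shutdown waits before
close() is reached, so it should stay well below whatever grace period the
deploy tooling allows between SIGTERM and SIGKILL.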

Another idea: it sounds like you sequentially deploy changes to your
consumers. Is this required? If not, then adding some parallelism to the
deployment would reduce the number of rebalances and therefore cause the
group to stabilize sooner.
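
As a back-of-the-envelope illustration of why batching helps, the sketch
below counts rebalances under a deliberately simple model. The model is my
assumption, not a description of how the coordinator actually behaves: each
batch causes one rebalance when it stops and one when it rejoins, and
without close() each batch's departure is only noticed after one session
timeout. The 32 consumers and the 97 second value come from the thread
quoted below; treating 97 seconds as the session timeout is also an
assumption. The class and method names are made up.

```java
public class RebalanceEstimate {

    /** Number of deployment batches needed to cover every consumer. */
    public static int batches(int consumers, int batchSize) {
        return (consumers + batchSize - 1) / batchSize; // ceiling division
    }

    /** Assumed model: one rebalance when a batch stops, one when it rejoins. */
    public static int rebalanceRounds(int consumers, int batchSize) {
        return 2 * batches(consumers, batchSize);
    }

    /**
     * Assumed model: without close(), each batch's departure is detected
     * only after one full session timeout.
     */
    public static int ungracefulDetectionSeconds(int consumers, int batchSize,
                                                 int sessionTimeoutSeconds) {
        return batches(consumers, batchSize) * sessionTimeoutSeconds;
    }

    public static void main(String[] args) {
        int consumers = 32;             // from the thread
        int sessionTimeoutSeconds = 97; // assumed to be the session timeout
        for (int batchSize : new int[] {1, 8, 32}) {
            System.out.printf("batch=%d rebalances=%d detection=%ds%n",
                    batchSize,
                    rebalanceRounds(consumers, batchSize),
                    ungracefulDetectionSeconds(consumers, batchSize, sessionTimeoutSeconds));
        }
    }
}
```

Under these assumptions a one-at-a-time deploy means 64 rebalances and
3104 seconds (about 52 minutes) spent just waiting for timeouts, while
batches of 8 bring that down to 8 rebalances and 388 seconds. Real numbers
will differ, but the direction should hold.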

On Fri, Feb 17, 2017 at 10:55 PM, Praveen <praveev...@gmail.com> wrote:

> Hey Onur,
>
> I was just watching your talk on rebalancing from last year -
> https://www.youtube.com/watch?v=QaeXDh12EhE
> Nice talk!
>
> I think I have an idea as to why it takes 1 hr in my case based on the
> talk in the video. In my case with 32 boxes / consumers from the same
> group, I believe the current state of the group coordinator's state machine
> gets messed up each time a new one is added until the very last consumer.
> Also I have a heartbeat set to 97 seconds (97 secs b/c normal processing
> could take that long and we don't want coordinator to think consumer is
> dead). I think these two factors together are why the cluster restart
> takes > 1hr. I'm curious how LinkedIn does clean cluster restarts. How do
> you handle the scenario described above?
>
> Praveen
>
>
> On Wed, Feb 15, 2017 at 10:22 AM, Praveen <praveev...@gmail.com> wrote:
>
>> I still think a clean cluster start should not take > 1 hr for balancing
>> though. Is this expected or am I doing something different?
>>
>> I thought this would be a common use case.
>>
>> Praveen
>>
>> On Fri, Feb 10, 2017 at 10:26 AM, Onur Karaman <
>> okara...@linkedin.com.invalid> wrote:
>>
>>> Pradeep is right.
>>>
>>> close() will try and send out a LeaveGroupRequest while a kill -9 will
>>> not.
>>>
>>> On Fri, Feb 10, 2017 at 10:19 AM, Pradeep Gollakota
>>> <pradeep...@gmail.com> wrote:
>>>
>>> > I believe if you're calling the .close() method on shutdown, then the
>>> > LeaveGroupRequest will be made. If you're doing a kill -9, I'm not
>>> > sure if that request will be made.
>>> >
>>> > On Fri, Feb 10, 2017 at 8:47 AM, Praveen <praveev...@gmail.com> wrote:
>>> >
>>> > > @Pradeep - I just read your thread. The 1hr pause was when all the
>>> > > consumers were shut down simultaneously. I'm testing out a rolling
>>> > > restart to get the actual numbers. The initial numbers are promising.
>>> > >
>>> > > `STOP (1) (1min later kicks off) -> REBALANCE -> START (1) ->
>>> > > REBALANCE (takes 1min to get a partition)`
>>> > >
>>> > > In your thread, Ewen says -
>>> > >
>>> > > "The LeaveGroupRequest is only sent on a graceful shutdown. If a
>>> > > consumer knows it is going to shutdown, it is good to proactively
>>> > > make sure the group knows it needs to rebalance work because some
>>> > > of the partitions that were handled by the consumer need to be
>>> > > handled by some other group members."
>>> > >
>>> > > So does this mean that the consumer should inform the group ahead of
>>> > > time before it goes down? Currently, I just shut down the process.
>>> > >
>>> > >
>>> > > On Fri, Feb 10, 2017 at 8:35 AM, Pradeep Gollakota
>>> > > <pradeep...@gmail.com> wrote:
>>> > >
>>> > > > I asked a similar question a while ago. There doesn't appear to be
>>> > > > a way to avoid triggering the rebalance. But I'm not sure why it
>>> > > > would be taking > 1hr in your case. For us it was pretty fast.
>>> > > >
>>> > > > https://www.mail-archive.com/users@kafka.apache.org/msg23925.html
>>> > > >
>>> > > >
>>> > > >
>>> > > > On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <
>>> > > > krzysztof.lesniew...@nexiot.ch> wrote:
>>> > > >
>>> > > > > Would be great to get some input on it.
>>> > > > >
>>> > > > > - Krzysztof Lesniewski
>>> > > > >
>>> > > > >
>>> > > > > On 06.02.2017 08:27, Praveen wrote:
>>> > > > >
>>> > > > >> I have a 16 broker kafka cluster. There is a topic with 32
>>> > > > >> partitions containing real time data and on the other side, I
>>> > > > >> have 32 boxes w/ 1 consumer reading from these partitions.
>>> > > > >>
>>> > > > >> Today our deployment strategy is stop, deploy and start the
>>> > > > >> processes on all the 32 consumers. This triggers re-balancing and
>>> > > > >> takes a long period of time (> 1hr). Such a long pause isn't good
>>> > > > >> for real time processing.
>>> > > > >>
>>> > > > >> I was thinking of rolling deploy but I think that will still
>>> > > > >> cause re-balancing b/c we will still have consumers go down and
>>> > > > >> come up.
>>> > > > >>
>>> > > > >> How do you deploy to consumers without triggering re-balancing
>>> > > > >> (or triggering one that doesn't affect your SLA) when doing real
>>> > > > >> time processing?
>>> > > > >>
>>> > > > >> Thanks,
>>> > > > >> Praveen
>>> > > > >>
>>> > > > >>
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>
