[
https://issues.apache.org/jira/browse/KAFKA-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantine Karantasis resolved KAFKA-9849.
-------------------------------------------
Resolution: Fixed
> Fix issue with worker.unsync.backoff.ms creating zombie workers when
> incremental cooperative rebalancing is used
> ----------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-9849
> URL: https://issues.apache.org/jira/browse/KAFKA-9849
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Affects Versions: 2.3.1, 2.5.0, 2.4.1
> Reporter: Konstantine Karantasis
> Assignee: Konstantine Karantasis
> Priority: Major
> Fix For: 2.3.2, 2.6.0, 2.4.2, 2.5.1
>
>
> {{worker.unsync.backoff.ms}} is a property that was introduced a while ago
> when eager (stop-the-world) rebalancing was the only option for Connect
> workers. The goal of this property is to avoid triggering consecutive
> rebalances when a worker fails to catch up with the config topic in time and
> therefore voluntarily leaves the group with a {{LeaveGroupRequest}}.
> With incremental cooperative rebalancing, this backoff
> ({{worker.unsync.backoff.ms}}), which has a default value equal to the default
> value of {{scheduled.rebalance.max.delay.ms}} (5 min), might end up turning a
> worker into a zombie worker that retains its tasks but stays out of the
> group. By backing off from rebalancing, this worker leaves the leader of the
> group no option but to reassign the missing tasks, which it considers lost,
> to other members of the group if the worker that backs off does not return
> before {{scheduled.rebalance.max.delay.ms}} expires.
> Clearly, {{worker.unsync.backoff.ms}} was introduced to avoid rebalancing
> storms in the presence of intermittent connectivity issues with eager
> rebalancing. However, when incremental cooperative rebalancing is used, this
> property might inadvertently make workers operate as zombie workers that keep
> running tasks while they are out of the group.
> Of course, a good tradeoff needs to be made between keeping the protocol
> from becoming too eager again and, at the same time, keeping workers from
> turning into zombies when the connection to the broker coordinator is lost
> only briefly.
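
The timing condition described above can be illustrated with a minimal sketch. This is a simplified model, not Kafka Connect code: the function name, the constant names, and the `rejoin_overhead_ms` parameter are hypothetical, and the only values taken from the description are the two property names and their shared 5 minute default.

```python
# Hypothetical model of the timing in the description; not Kafka Connect code.
# Both defaults are 5 minutes according to the issue description.
DEFAULT_WORKER_UNSYNC_BACKOFF_MS = 300_000
DEFAULT_SCHEDULED_REBALANCE_MAX_DELAY_MS = 300_000


def tasks_reassigned_during_backoff(backoff_ms, max_delay_ms, rejoin_overhead_ms=0):
    """Return True if the leader's delay expires before the worker is back.

    Assumption: the worker leaves the group at time 0, stays out for
    backoff_ms, and needs rejoin_overhead_ms more to rejoin. If it is not
    back strictly before max_delay_ms, the leader reassigns its tasks while
    the worker may still be running them (the "zombie" case).
    """
    return backoff_ms + rejoin_overhead_ms >= max_delay_ms


# With both defaults, the worker does not return before the delay expires.
zombie_with_defaults = tasks_reassigned_during_backoff(
    DEFAULT_WORKER_UNSYNC_BACKOFF_MS,
    DEFAULT_SCHEDULED_REBALANCE_MAX_DELAY_MS,
)

# A much shorter backoff (30 s, chosen only for illustration) avoids the overlap.
zombie_with_short_backoff = tasks_reassigned_during_backoff(
    30_000,
    DEFAULT_SCHEDULED_REBALANCE_MAX_DELAY_MS,
)
```

Under these assumptions, a backoff equal to or longer than the scheduled rebalance delay always lands in the zombie case, which is why the two defaults being equal is the problem.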
--
This message was sent by Atlassian Jira
(v8.3.4#803005)