[jira] [Created] (KAFKA-8626) Group will fall into constant incremental rebalancing with a long non-responsive static member

Boyang Chen (JIRA) Wed, 03 Jul 2019 09:36:17 -0700

Boyang Chen created KAFKA-8626:
----------------------------------

             Summary: Group will fall into constant incremental rebalancing with a long non-responsive static member
                 Key: KAFKA-8626
                 URL: https://issues.apache.org/jira/browse/KAFKA-8626
             Project: Kafka
          Issue Type: Bug
            Reporter: Boyang Chen
            Assignee: Boyang Chen



Currently, when a group rebalances, static members have until the rebalance 
timeout expires to rejoin. If they do not rejoin in time, the coordinator 
rejoins them virtually: it simply reuses their old subscription. This behavior 
may be a problem for cooperative reassignment. The issue is that the old 
subscription may contain a set of owned partitions. The assignor will respect 
the owned set of partitions, but that won't stop it from trying to move them to 
another consumer. In this case, we will set the NEED_REJOIN error code. The idea 
is that consumers observe this error, revoke any needed partitions, and 
immediately rejoin. But if the static member just continues using its old 
subscription, then we'll be stuck in the rebalance state until the static member 
comes back online, because the non-responsive static member won't give up the 
partitions it owns.
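
To make the failure mode concrete, here is a rough toy model of the loop 
described above. It is not Kafka code: the class, the method, the member ids 
and the partition names are all hypothetical, and the coordinator and assignor 
are reduced to a fixed target assignment. It only shows that a member which is 
rejoined with its old owned set never completes the revocation, so every round 
asks for another rejoin.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of the scenario in this ticket; not the Kafka implementation.
public class StuckRebalanceSketch {

    /**
     * Returns the round in which no revocation is pending any more,
     * or -1 if the group has not stabilized within maxRounds.
     */
    static int roundsUntilStable(boolean staticMemberResponsive, int maxRounds) {
        // Owned partitions as reported in each member's (possibly stale) subscription.
        Map<String, Set<String>> owned = new HashMap<>();
        owned.put("static-1", new HashSet<>(Set.of("p0", "p1")));
        owned.put("dynamic-1", new HashSet<>());

        // Assumption: the assignor wants to move p1 from static-1 to dynamic-1.
        Map<String, Set<String>> target =
                Map.of("static-1", Set.of("p0"), "dynamic-1", Set.of("p1"));

        for (int round = 1; round <= maxRounds; round++) {
            boolean rejoinNeeded = false;
            for (Map.Entry<String, Set<String>> e : owned.entrySet()) {
                // Partitions the member owns but should no longer own.
                Set<String> toRevoke = new HashSet<>(e.getValue());
                toRevoke.removeAll(target.get(e.getKey()));
                if (toRevoke.isEmpty()) {
                    continue;
                }
                rejoinNeeded = true; // the "rejoin needed" error would be set here
                boolean responsive =
                        !e.getKey().equals("static-1") || staticMemberResponsive;
                if (responsive) {
                    // A live consumer observes the error, revokes, and rejoins.
                    e.getValue().removeAll(toRevoke);
                }
                // A non-responsive static member is rejoined virtually with its
                // old subscription, so its owned set stays exactly the same.
            }
            if (!rejoinNeeded) {
                return round;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(roundsUntilStable(true, 10));  // prints 2
        System.out.println(roundsUntilStable(false, 10)); // prints -1
    }
}
```

With a responsive static member the model settles in the second round. With a 
non-responsive one it never settles, no matter how large maxRounds is.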

Some ideas proposed by Jason:

1. Make revocation optional. Basically, get rid of the internal REJOIN_NEEDED 
error code. Consumers only rebalance if they revoke partitions themselves or 
detect the group rebalancing. In this case, the static member would just 
decline to give up its partitions until it is back online.
2. Make the assignor aware of which members are active in the current 
rebalance. If a static member is not active, then the assignor can just not 
reassign any of its owned partitions. It might be a good idea to have this 
anyway, because rebalances are often used as a (clumsy) way to collect 
information from the group members. For example, when Connect rebalances a 
group, it is looking for consistency among the members on the config offset 
that they have read. If one member is just reporting old state, then this 
protocol won't work.
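
Idea 2 could look roughly like the sketch below. This is only an illustration 
under stated assumptions: the class name, the assign signature and the set of 
active member ids are hypothetical and not part of any existing Kafka assignor 
interface, and the balancing step is a naive "fewest partitions first" loop 
that ignores stickiness for active members. The one point it makes is that 
partitions owned by an inactive member are pinned and never offered to anyone 
else.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of an assignor that knows which members are active.
public class ActiveAwareAssignorSketch {

    /**
     * @param owned      owned partitions per member, taken from the subscriptions
     * @param active     ids of members that actually rejoined in this rebalance
     * @param partitions all partitions to be assigned
     */
    static Map<String, Set<String>> assign(Map<String, Set<String>> owned,
                                           Set<String> active,
                                           List<String> partitions) {
        Map<String, Set<String>> result = new TreeMap<>();
        Set<String> pinned = new HashSet<>();

        // Inactive members keep exactly what they own; nothing is moved away.
        for (Map.Entry<String, Set<String>> e : owned.entrySet()) {
            Set<String> assigned = new TreeSet<>();
            if (!active.contains(e.getKey())) {
                assigned.addAll(e.getValue());
                pinned.addAll(e.getValue());
            }
            result.put(e.getKey(), assigned);
        }

        // Make sure every active member has an entry, even if it owned nothing.
        List<String> activeSorted = new ArrayList<>(new TreeSet<>(active));
        for (String member : activeSorted) {
            result.computeIfAbsent(member, k -> new TreeSet<>());
        }

        // Spread the remaining partitions over the active members only.
        for (String partition : partitions) {
            if (pinned.contains(partition)) {
                continue;
            }
            String least = null;
            for (String member : activeSorted) {
                if (least == null
                        || result.get(member).size() < result.get(least).size()) {
                    least = member;
                }
            }
            if (least != null) {
                result.get(least).add(partition);
            }
        }
        return result;
    }
}
```

Because the inactive member's partitions are never part of a move, no 
revocation is requested from it and the group does not need a follow-up 
rebalance on its account.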



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
