Boyang Chen created KAFKA-8626:
----------------------------------

             Summary: Group will fall into constant incremental rebalancing with a long non-responsive static member
                 Key: KAFKA-8626
                 URL: https://issues.apache.org/jira/browse/KAFKA-8626
             Project: Kafka
          Issue Type: Bug
            Reporter: Boyang Chen
            Assignee: Boyang Chen

Currently, when a group rebalances, static members have until the rebalance timeout expires to rejoin. If they do not rejoin in time, the coordinator rejoins them virtually; that is, it simply reuses their old subscription.

This behavior may be a problem for cooperative reassignment. The old subscription may contain a set of owned partitions. The assignor will respect the owned set of partitions, but that won't stop it from trying to move them to another consumer. In that case, we set the NEED_REJOIN error code. The idea is that consumers observe this error, revoke any needed partitions, and immediately rejoin. But if the static member just keeps using its old subscription, we'll be stuck in the rebalance state until the static member comes back online, because the non-responsive static member never gives up the partitions listed in that subscription.

Some ideas proposed by Jason:

1. Make revocation optional. Basically, get rid of the internal NEED_REJOIN error code. Consumers only rebalance if they revoke partitions themselves or detect that the group is rebalancing. In this case, the static member would just decline to give up its partitions until it is back online.

2. Make the assignor aware of which members are active in the current rebalance. If a static member is not active, the assignor can simply not reassign any of its owned partitions. It might be a good idea to have this anyway, because rebalances are often used as a (clumsy) way to collect information from the group members. For example, when Connect rebalances a group, it is looking for consistency among the members on the config offset that they have read. If one member is just reporting old state, this protocol won't work.
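
To make idea 2 concrete, here is a minimal sketch of an assignor that knows which members are active. It is written in Python for brevity and is not Kafka's assignor API: the `Member` type, its `active` flag, and the `assign` function are all hypothetical names invented for this illustration. It also ignores the two-phase revocation that a real cooperative assignor would need among the active members; it only shows that partitions owned by an inactive static member are pinned and never handed to someone else.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Hypothetical view of one group member as seen by the assignor."""
    member_id: str
    owned: list = field(default_factory=list)  # partitions listed in the subscription
    active: bool = True  # False if the coordinator rejoined the member virtually


def assign(members, partitions):
    """Balance partitions over active members; pin what inactive members own."""
    assignment = {m.member_id: [] for m in members}
    pinned = set()

    # An inactive member cannot revoke anything, so leave its partitions alone.
    for m in members:
        if not m.active:
            kept = [p for p in m.owned if p in partitions]
            assignment[m.member_id] = kept
            pinned.update(kept)

    active = [m for m in members if m.active]
    if not active:
        return assignment

    free = [p for p in partitions if p not in pinned]
    quota = -(-len(free) // len(active))  # ceiling of the fair share

    # Sticky pass: active members keep what they own, up to their quota.
    taken = set()
    for m in active:
        for p in m.owned:
            if p in free and p not in taken and len(assignment[m.member_id]) < quota:
                assignment[m.member_id].append(p)
                taken.add(p)

    # Give each remaining partition to the least-loaded active member.
    for p in free:
        if p not in taken:
            target = min(active, key=lambda m: len(assignment[m.member_id]))
            assignment[target.member_id].append(p)
    return assignment


members = [
    Member("A", ["t-0"]),
    Member("B", ["t-1", "t-2"], active=False),  # non-responsive static member
    Member("C"),                                # newly joined consumer
]
print(assign(members, ["t-0", "t-1", "t-2", "t-3"]))
# {'A': ['t-0'], 'B': ['t-1', 't-2'], 'C': ['t-3']}
```

The result is deliberately unbalanced: B keeps two partitions while A and C get one each. In this sketch the group accepts that skew until B comes back, instead of asking B to revoke partitions it cannot revoke.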