Boyang Chen created KAFKA-8626:
----------------------------------
Summary: Group will fall into constant incremental rebalancing
with a long non-responsive static member
Key: KAFKA-8626
URL: https://issues.apache.org/jira/browse/KAFKA-8626
Project: Kafka
Issue Type: Bug
Reporter: Boyang Chen
Assignee: Boyang Chen
Currently, when a group rebalances, static members have until the rebalance
timeout expires to rejoin. If they do not rejoin in time, the coordinator
rejoins them virtually; basically, it just reuses their old subscription.
This behavior may be a problem for cooperative reassignment. The issue is that
the old subscription may contain a set of owned partitions. The assignor will
respect the owned set of partitions, but that won't stop it from trying to move
them to another consumer. In this case, we will set the NEED_REJOIN error code.
The idea is that consumers observe this error, revoke any needed partitions,
and immediately rejoin. But if the static member just continues using its old
subscription, then we'll be stuck in a rebalance state until the static member
comes back online, because the non-responsive static member never gives up the
partitions listed in its old subscription.
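To make the failure mode concrete, here is a minimal toy model of the loop described above. It is not Kafka code: `sticky_target`, `rebalance_round`, and `rounds_until_stable` are hypothetical names, and the "assignor" is a simplified sticky balancer assumed only for illustration. A responsive member revokes what it is asked to give up, while a non-responsive static member is rejoined virtually with its old owned partitions.

```python
from typing import Dict, List, Optional, Set, Tuple

Ownership = Dict[str, Set[str]]


def sticky_target(partitions: List[str], owned: Ownership) -> Ownership:
    """Toy sticky assignor: keep owned partitions up to an even quota, hand out the rest."""
    members = sorted(owned)
    quota = -(-len(partitions) // len(members))  # ceiling division
    target = {m: set(sorted(owned[m])[:quota]) for m in members}
    taken = {p for ps in target.values() for p in ps}
    spare = [p for p in sorted(partitions) if p not in taken]
    for m in members:
        while spare and len(target[m]) < quota:
            target[m].add(spare.pop(0))
    return target


def rebalance_round(
    partitions: List[str], owned: Ownership, responsive: Set[str]
) -> Tuple[Ownership, bool]:
    """Run one cooperative round; return the new ownership and whether a rejoin is needed."""
    target = sticky_target(partitions, owned)
    owner_of = {p: m for m, ps in owned.items() for p in ps}
    next_owned: Ownership = {}
    for m in owned:
        if m not in responsive:
            # Virtual rejoin (assumed model): the old subscription is replayed unchanged.
            next_owned[m] = set(owned[m])
        else:
            # A partition can only be handed over after its previous owner revoked it.
            next_owned[m] = {p for p in target[m] if owner_of.get(p, m) == m}
    # Someone still holds a partition the assignor wants to move: another round is needed.
    need_rejoin = any(owned[m] - target[m] for m in owned)
    return next_owned, need_rejoin


def rounds_until_stable(
    partitions: List[str], owned: Ownership, responsive: Set[str], max_rounds: int = 10
) -> Optional[int]:
    """Return the round in which the group stabilizes, or None if it never does."""
    for round_no in range(1, max_rounds + 1):
        owned, need_rejoin = rebalance_round(partitions, owned, responsive)
        if not need_rejoin:
            return round_no
    return None


if __name__ == "__main__":
    parts = ["p0", "p1", "p2", "p3"]
    start = {"static-1": set(parts), "c2": set()}
    print(rounds_until_stable(parts, start, {"static-1", "c2"}))  # 2
    print(rounds_until_stable(parts, start, {"c2"}))  # None
```

With both members responsive, the model settles after two rounds (one to revoke, one to hand over). With `static-1` offline, every round asks for the same revocation and the group never stabilizes within the round limit.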
Some ideas proposed by Jason:
1. Make revocation optional. Basically, get rid of the internal REJOIN_NEEDED
error code. Consumers only rebalance if they revoke partitions themselves or
detect the group rebalancing. In this case, the static member would just
decline to give up its partitions until it is back online.
2. Make the assignor aware of which members are active in the current
rebalance. If a static member is not active, then the assignor can just not
reassign any of its owned partitions. It might be a good idea to have this
anyway, because rebalances are often used as a (clumsy) way to collect
information from the group members. For example, when Connect rebalances a
group, it is looking for consistency among the members on the config offset
that they have read. If one member is just reporting old state, then this
protocol won't work.
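The second idea can be sketched as follows. This is only an illustration under assumed names: `assign_with_liveness` and its `active` parameter are hypothetical and do not correspond to any existing Kafka assignor API. The sketch pins every partition owned by an inactive member and balances only the remaining partitions across the active members, so no revocation is ever requested from a member that cannot respond.

```python
from typing import Dict, List, Set


def assign_with_liveness(
    partitions: List[str], owned: Dict[str, Set[str]], active: Set[str]
) -> Dict[str, Set[str]]:
    """Toy assignor that never moves partitions owned by inactive members."""
    # Inactive members keep exactly what they own.
    target = {m: set(ps) for m, ps in owned.items() if m not in active}
    frozen = {p for ps in target.values() for p in ps}
    movable = [p for p in sorted(partitions) if p not in frozen]
    live = sorted(m for m in owned if m in active)
    if not live:
        return target
    quota = -(-len(movable) // len(live))  # ceiling division
    for m in live:
        # Sticky step: an active member keeps its own movable partitions up to the quota.
        target[m] = set(sorted(owned[m] & set(movable))[:quota])
    taken = {p for m in live for p in target[m]}
    spare = [p for p in movable if p not in taken]
    for m in live:
        while spare and len(target[m]) < quota:
            target[m].add(spare.pop(0))
    return target


if __name__ == "__main__":
    parts = ["p0", "p1", "p2", "p3", "p4", "p5"]
    owned = {"static-1": {"p0", "p1", "p2", "p3"}, "c2": {"p4", "p5"}, "c3": set()}
    result = assign_with_liveness(parts, owned, active={"c2", "c3"})
    for member in sorted(result):
        print(member, sorted(result[member]))
```

In this example the offline `static-1` keeps all four of its partitions, and only `p4` and `p5` are balanced between `c2` and `c3`. The group is temporarily unbalanced, but it can stabilize without waiting for the static member to return.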
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)