Sophie Blee-Goldman created KAFKA-10078:
-------------------------------------------
Summary: Partition may skip assignment with static members and
incremental rebalances
Key: KAFKA-10078
URL: https://issues.apache.org/jira/browse/KAFKA-10078
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 2.4.0
Reporter: Sophie Blee-Goldman
Assignee: Sophie Blee-Goldman
Fix For: 2.6.0
When static membership (KIP-345) and incremental rebalancing (KIP-429) are
enabled at the same time, a member failure can leave some partitions
assigned to no one. The sequence of events is as follows:
1. An assignment (partition1) from a rebalance is sent to an existing static
member whose owned list is (partition1, partition2). Upon receiving the
assignment, the static member is supposed to revoke partition2 and then
re-join the group to trigger another rebalance.
2. The member crashes before re-joining the group and loses all of its
assigned partitions. However, since this member is static with a long session
timeout, it has not yet been kicked out of the group on the coordinator side.
3. The member resumes and re-joins with a known instance.id. The coordinator
does not trigger a rebalance in this case and just gives it the previous
assignment (partition1). Since the member has forgotten its previously owned
partitions, it just takes partition1 and does not re-join.
4. As a result, partition2 is no longer owned by this member but is not
re-assigned to anyone else; until the next rebalance, it will not be fetched
by any member of the group.
The key here is that today we rely on the member's local memory to
calculate the added / revoked diff based on (owned, assigned). But if the
member crashes and loses all of its owned partitions, AND it is a static
member whose re-join does not trigger a new rebalance, this breaks.
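
The sequence and the diff calculation described above can be illustrated
with a small toy model. This is only a sketch, not the actual consumer or
Streams coordinator code: the class name and the helper methods (revoked,
added, needsRejoin) are hypothetical, and partitions are modeled as plain
strings. It assumes, as in the description, that a member re-joins only when
its local diff says something has to be revoked.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the (owned, assigned) diff; all names here are hypothetical.
public class StaticMemberRebalanceSketch {

    // Partitions the member believes it owns but that are not in the new assignment.
    static Set<String> revoked(Set<String> owned, Set<String> assigned) {
        Set<String> result = new HashSet<>(owned);
        result.removeAll(assigned);
        return result;
    }

    // Partitions in the new assignment that the member does not believe it owns.
    static Set<String> added(Set<String> owned, Set<String> assigned) {
        Set<String> result = new HashSet<>(assigned);
        result.removeAll(owned);
        return result;
    }

    // Assumption of this sketch: a follow-up rebalance is requested only
    // when the local diff contains revoked partitions.
    static boolean needsRejoin(Set<String> owned, Set<String> assigned) {
        return !revoked(owned, assigned).isEmpty();
    }

    public static void main(String[] args) {
        // The coordinator hands out the same assignment before and after the crash.
        Set<String> assigned = Set.of("partition1");

        // Step 1: the healthy member still remembers both partitions.
        Set<String> ownedBeforeCrash = Set.of("partition1", "partition2");
        System.out.println("before crash:  revoked=" + revoked(ownedBeforeCrash, assigned)
                + " rejoin=" + needsRejoin(ownedBeforeCrash, assigned));

        // Steps 2-3: the crash wipes local memory, so the owned set is empty.
        Set<String> ownedAfterRestart = Set.of();
        System.out.println("after restart: revoked=" + revoked(ownedAfterRestart, assigned)
                + " added=" + added(ownedAfterRestart, assigned)
                + " rejoin=" + needsRejoin(ownedAfterRestart, assigned));

        // Step 4: with no re-join, nothing prompts a rebalance that would
        // hand partition2 to another member.
    }
}
```

In this model the restarted member sees an empty revoked set, so nothing
tells it that partition2 still has to be given up and re-assigned, which is
the gap described above.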
--
This message was sent by Atlassian Jira
(v8.3.4#803005)