Sean Quah created KAFKA-19862:
---------------------------------
Summary: Group coordinator loading may fail when there is
concurrent compaction
Key: KAFKA-19862
URL: https://issues.apache.org/jira/browse/KAFKA-19862
Project: Kafka
Issue Type: Bug
Components: group-coordinator
Reporter: Sean Quah
Assignee: Sean Quah
For consumer and streams groups, we reject replay of
{{Consumer/StreamsGroupCurrentMemberAssignment}} records when we detect that a
partition/task is already owned by another member.
During group coordinator load, we replay the records in
{{__consumer_offsets}}. When compaction is running concurrently, the loader can
read uncompacted data, followed by a newly swapped-in compacted segment,
followed by the uncompacted head of the log. As a result, the record
unassigning a partition/task can be missed during loading.
For example, we can load the record { Member A is assigned partition X },
then miss the record { Member A is unassigned partition X },
then load the record { Member B is assigned partition X }, which fails with an
exception like:
{{[GroupCoordinator id=2] Failed to load metadata from __consumer_offsets-4
with epoch 10 due to java.lang.RuntimeException: Replaying record
CoordinatorRecord(key=ConsumerGroupCurrentMemberAssignmentKey(groupId='...',
memberId='ZxHk7W53S_aHFdpxYc-_Jw'),
value=ApiMessageAndVersion(ConsumerGroupCurrentMemberAssignmentValue(memberEpoch=854659,
previousMemberEpoch=854633, state=0,
assignedPartitions=[TopicPartitions(topicId=9lL1aTMuSC22QAXsHgzhew,
partitions=[1, 2]), TopicPartitions(topicId=RHKM682KQYyOfF1XsOSF1A,
partitions=[0]), TopicPartitions(topicId=rKx9q1JmS1uP-ug_cj56ug,
partitions=[0]), TopicPartitions(topicId=I7EtFwesTRubnj-VHClqbQ,
partitions=[2]), TopicPartitions(topicId=ydAln6IUTZe-od9UUkn3rg,
partitions=[2])], partitionsPendingRevocation=[]) at version 0)) from
__consumer_offsets-4 at offset 3889549 with producer id -1 and producer epoch
-1 failed..}}
{{java.lang.IllegalStateException: Cannot set the epoch of
RHKM682KQYyOfF1XsOSF1A-0 to 854659 because the partition is still owned at
epoch 853490}}
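
The failure mode can be reproduced in miniature with the sketch below. It is not the group coordinator's code: {{ReplaySketch}}, {{Assignment}}, {{load}}, {{fullLog}} and {{logMissingUnassign}} are hypothetical names, and the ownership check is a simplified assumption modelled on the exception message above. Each record carries the member's full current assignment, and replay rejects a partition that another member still owns.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ReplaySketch {
    /** Hypothetical, simplified stand-in for a current-member-assignment record. */
    static final class Assignment {
        final String memberId;
        final int memberEpoch;
        final Set<String> partitions;

        Assignment(String memberId, int memberEpoch, Set<String> partitions) {
            this.memberId = memberId;
            this.memberEpoch = memberEpoch;
            this.partitions = partitions;
        }
    }

    // Partitions currently held by each member, and the epoch at which each partition is owned.
    private final Map<String, Set<String>> partitionsByMember = new HashMap<>();
    private final Map<String, Integer> ownerEpochByPartition = new HashMap<>();

    void replay(Assignment record) {
        // The new record supersedes the member's previous assignment, so release it first.
        for (String p : partitionsByMember.getOrDefault(record.memberId, Set.of())) {
            ownerEpochByPartition.remove(p);
        }
        // Reject the record if any partition it claims is still owned.
        for (String p : record.partitions) {
            Integer ownedAt = ownerEpochByPartition.get(p);
            if (ownedAt != null) {
                throw new IllegalStateException("Cannot set the epoch of " + p + " to "
                    + record.memberEpoch + " because the partition is still owned at epoch " + ownedAt);
            }
        }
        for (String p : record.partitions) {
            ownerEpochByPartition.put(p, record.memberEpoch);
        }
        partitionsByMember.put(record.memberId, record.partitions);
    }

    /** Replays a sequence of records into fresh state and reports the outcome. */
    static String load(List<Assignment> log) {
        ReplaySketch state = new ReplaySketch();
        try {
            for (Assignment r : log) {
                state.replay(r);
            }
            return "loaded";
        } catch (IllegalStateException e) {
            return "failed: " + e.getMessage();
        }
    }

    /** The sequence as written: A takes X, A gives up X, B takes X. */
    static List<Assignment> fullLog() {
        return List.of(
            new Assignment("A", 1, Set.of("X")),
            new Assignment("A", 2, Set.of()),
            new Assignment("B", 3, Set.of("X")));
    }

    /** The sequence as seen by a loader that missed A's unassignment record. */
    static List<Assignment> logMissingUnassign() {
        return List.of(
            new Assignment("A", 1, Set.of("X")),
            new Assignment("B", 3, Set.of("X")));
    }

    public static void main(String[] args) {
        System.out.println(load(fullLog()));
        System.out.println(load(logMissingUnassign()));
    }
}
```

Replaying the full sequence succeeds, while the sequence without the unassignment record fails with "Cannot set the epoch of X to 3 because the partition is still owned at epoch 1", which mirrors the exception reported above.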