[
https://issues.apache.org/jira/browse/KAFKA-19181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chirag Wadhwa resolved KAFKA-19181.
-----------------------------------
Resolution: Fixed
The system test failures have been resolved.
> Investigate system test failures
> --------------------------------
>
> Key: KAFKA-19181
> URL: https://issues.apache.org/jira/browse/KAFKA-19181
> Project: Kafka
> Issue Type: Sub-task
> Reporter: Chirag Wadhwa
> Assignee: Chirag Wadhwa
> Priority: Major
>
> The nightly system test runs have reported failures in the following 2
> tests -
> 1) test_share_multiple_partitions
> 2) test_broker_failure
> Investigation analysis -
> 1) For the first test, 3 brokers are run with 3 share consumers (all part of
> the same group). A million messages are produced to a topic with 3 partitions.
> Once the messages are produced and consumed, the assertions check that every
> consumer has consumed some messages from every share partition. But with
> the new SimpleAssignor algorithm in place, some consumers are not assigned
> some partitions, so they don't consume from those share partitions, resulting
> in assertion failures.
> Fix - change the test to first find the assignment of the consumers using the
> kafka-share-groups.sh --describe command and only include assertions for
> the share partitions assigned to each consumer.
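
A minimal sketch of the assignment-aware assertion described in the fix above. It assumes the assignment has already been obtained (for example, parsed from the kafka-share-groups.sh --describe output) into a plain mapping. The function name, the consumer ids, and the record counts are hypothetical and are not taken from the actual system test.

```python
from typing import Dict, Set


def assert_consumed_from_assigned(
    assignment: Dict[str, Set[int]],
    consumed: Dict[str, Dict[int, int]],
) -> None:
    """Check that each consumer read at least one record from every
    partition assigned to it. Unassigned partitions are ignored."""
    for consumer, partitions in assignment.items():
        counts = consumed.get(consumer, {})
        for partition in sorted(partitions):
            assert counts.get(partition, 0) > 0, (
                f"{consumer} consumed nothing from assigned partition {partition}"
            )


# Hypothetical data: 3 share consumers, 3 partitions, and an assignment in
# which no consumer owns every partition.
assignment = {"c1": {0, 1}, "c2": {1, 2}, "c3": {0, 2}}
consumed = {
    "c1": {0: 400_000, 1: 100_000},
    "c2": {1: 150_000, 2: 200_000},
    "c3": {0: 50_000, 2: 100_000},
}
assert_consumed_from_assigned(assignment, consumed)
```

With this shape, a consumer that was never assigned a partition no longer trips the check, while a consumer that was assigned a partition but read nothing from it still fails.
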
>
> 2) The bug occurs when the coordinator is not active while an
> initialize request is in progress. The consumer sends heartbeats to the
> broker, but none of them succeed. Initially a few of them fail
> with a {{COORDINATOR_NOT_AVAILABLE}} error. This is expected and should
> be fine because this is a transient error. But during this time, the broker
> keeps updating the memberEpoch for the member, while the response sent back to
> the member has memberEpoch set to 0. Now, I understand that the client depends on
> the broker's response to update its memberEpoch, and thus the subsequent
> requests are also sent with memberEpoch 0. This happens for a couple of
> requests (the broker keeps increasing the memberEpoch but sends back a
> {{COORDINATOR_NOT_AVAILABLE}} error with memberEpoch as 0). Finally, when
> the coordinator becomes active, we get a {{FENCED_MEMBER_EPOCH}}
> exception, as expected. From then on, the member keeps sending heartbeats with the wrong
> memberEpoch and the broker keeps sending back {{FENCED_MEMBER_EPOCH}}
> exceptions.
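
The epoch divergence described in point 2 can be illustrated with a toy model. This is only a sketch of the sequence as reported, not Kafka's broker or client code. The ToyBroker and ToyMember classes, the "NONE" success marker, and the choice to return memberEpoch 0 alongside the fenced error are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

COORDINATOR_NOT_AVAILABLE = "COORDINATOR_NOT_AVAILABLE"
FENCED_MEMBER_EPOCH = "FENCED_MEMBER_EPOCH"


@dataclass
class ToyBroker:
    """Hypothetical stand-in for the broker side of the heartbeat exchange."""
    coordinator_active: bool = False
    member_epoch: int = 0

    def heartbeat(self, request_epoch: int) -> Tuple[str, int]:
        if not self.coordinator_active:
            # As reported: the broker-side epoch advances, but the error
            # response carries memberEpoch 0.
            self.member_epoch += 1
            return COORDINATOR_NOT_AVAILABLE, 0
        if request_epoch != self.member_epoch:
            # Assumption: the fenced response also carries epoch 0.
            return FENCED_MEMBER_EPOCH, 0
        return "NONE", self.member_epoch


@dataclass
class ToyMember:
    """Hypothetical client that copies memberEpoch from every response."""
    member_epoch: int = 0

    def send_heartbeat(self, broker: ToyBroker) -> str:
        error, epoch = broker.heartbeat(self.member_epoch)
        self.member_epoch = epoch
        return error


def simulate(unavailable_rounds: int = 3, active_rounds: int = 3) -> Tuple[List[str], int, int]:
    broker, member = ToyBroker(), ToyMember()
    errors = [member.send_heartbeat(broker) for _ in range(unavailable_rounds)]
    broker.coordinator_active = True  # coordinator becomes active
    errors += [member.send_heartbeat(broker) for _ in range(active_rounds)]
    return errors, broker.member_epoch, member.member_epoch


if __name__ == "__main__":
    errors, broker_epoch, member_epoch = simulate()
    print(errors, broker_epoch, member_epoch)
```

In this model the member never learns the broker-side epoch, so every heartbeat after the coordinator becomes active is fenced, which matches the loop described in the report.
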
--
This message was sent by Atlassian Jira
(v8.20.10#820010)