Hi,

We are upgrading a 3-broker cluster (1001, 1002, 1003) from 3.1.0 to 3.2.0. During the upgrade, we noticed that when broker 1003 is restarted it does not rejoin the ISR list and remains stuck. The same happens with 1002. Only when 1001 is restarted do 1002 and 1003 rejoin the ISR list and start replicating data.
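For context, the inter-broker settings on each broker look roughly like the sketch below. Apart from the broker IDs and the PLAINTEXT listener on port 9092, the values are illustrative rather than copied from our deployment:

  # server.properties (sketch, per broker)
  broker.id=1003                        # 1001 / 1002 / 1003 on the respective brokers
  listeners=PLAINTEXT://:9092
  # The upgrade notes suggest pinning inter.broker.protocol.version to the old version
  # while rolling the brokers, then bumping it to 3.2 once all brokers are upgraded.
  inter.broker.protocol.version=3.1
  unclean.leader.election.enable=false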
{"type":"log", "host":"kf-pl47-me8-2", "level":"INFO", "neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka", "time":"2022-12-06T10:07:30.386", "timezone":"UTC", "log":{"message":"main - kafka.server.KafkaServer - [KafkaServer id=1003] started"}} {"type":"log", "host":"kf-pl47-me8-2", "level":"INFO", "neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka", "time":"2022-12-06T10:07:30.442", "timezone":"UTC", "log":{"message":"data-plane-kafka-request-handler-1 - state.change.logger - [Broker id=1003] Add 397 partitions and deleted 0 partitions from metadata cache in response to UpdateMetadata request sent by controller 1002 epoch 18 with correlation id 0"}} {"type":"log", "host":"kf-pl47-me8-2", "level":"INFO", "neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka", "time":"2022-12-06T10:07:30.448", "timezone":"UTC", "log":{"message":"BrokerToControllerChannelManager broker=1003 name=alterIsr - kafka.server.BrokerToControllerRequestThread - [BrokerToControllerChannelManager broker=1003 name=alterIsr]: Recorded new controller, from now on will use broker kf-pl47-me8-1.kf-pl47-me8-headless.nc0968-admin-ns.svc.cluster.local:9092 (id: 1002 rack: null)"}} {"type":"log", "host":"kf-pl47-me8-2", "level":"ERROR", "neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka", "time":"2022-12-06T10:07:30.451", "timezone":"UTC", "log":{"message":"data-plane-kafka-network-thread-1003-ListenerName(PLAINTEXT)-PLAINTEXT-1 - kafka.network.Processor - Closing socket for 192.168.216.11:9092-192.168.199.100:53778-0 because of error"}} org.apache.kafka.common.errors.InvalidRequestException: Error getting request for apiKey: LEADER_AND_ISR, apiVersion: 6, connectionId: 192.168.216.11:9092-192.168.199.100:53778-0, listenerName: ListenerName(PLAINTEXT), principal: User:ANONYMOUS org.apache.kafka.common.errors.InvalidRequestException: Error getting request for apiKey: LEADER_AND_ISR, apiVersion: 6, connectionId: 192.168.216.11:9092-192.168.235.153:46282-461, listenerName: ListenerName(PLAINTEXT), principal: User:ANONYMOUS Caused by: org.apache.kafka.common.errors.UnsupportedVersionException: Can't read version 6 of LeaderAndIsrTopicState {"type":"log", "host":"kf-pl47-me8-2", "level":"INFO", "neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka", "time":"2022-12-06T10:12:50.916", "timezone":"UTC", "log":{"message":"controller-event-thread - kafka.controller.KafkaController - [Controller id=1003] 1003 successfully elected as the controller. Epoch incremented to 20 and epoch zk version is now 20"}} {"type":"log", "host":"kf-pl47-me8-2", "level":"INFO", "neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka", "time":"2022-12-06T10:12:50.917", "timezone":"UTC", "log":{"message":"controller-event-thread - kafka.controller.KafkaController - [Controller id=1003] Registering handlers"}} Note: Unclean leader election is not enabled. This possibly was introduced by KAFKA-13587 <https://issues.apache.org/jira/browse/KAFKA-13587>. In the below snapshot during the upgrade, at 16:05:15 UTC 2022, 1001 was restarting and both 1002 and 1003 were already up and running (after the upgrade from 3.1.0 to 3.2.0), but did not manage to re-join the ISRs. 
Wed Dec 7 16:05:15 UTC 2022
Topic: test   TopicId: L6Yj_Nf9RrirNhFQzvXODw   PartitionCount: 2   ReplicationFactor: 3   Configs: compression.type=producer,min.insync.replicas=1,cleanup.policy=delete,flush.ms=1000,segment.bytes=100000000,flush.messages=10000,max.message.bytes=1000012,index.interval.bytes=4096,unclean.leader.election.enable=false,retention.bytes=1000000000,segment.index.bytes=10485760
    Topic: test   Partition: 0   Leader: none   Replicas: 1002,1003,1001   Isr: 1001
    Topic: test   Partition: 1   Leader: none   Replicas: 1001,1002,1003   Isr: 1001

Wed Dec 7 16:05:33 UTC 2022
Topic: test   TopicId: L6Yj_Nf9RrirNhFQzvXODw   PartitionCount: 2   ReplicationFactor: 3   Configs: compression.type=producer,min.insync.replicas=1,cleanup.policy=delete,flush.ms=1000,segment.bytes=100000000,flush.messages=10000,max.message.bytes=1000012,index.interval.bytes=4096,unclean.leader.election.enable=false,retention.bytes=1000000000,segment.index.bytes=10485760
    Topic: test   Partition: 0   Leader: 1001   Replicas: 1002,1003,1001   Isr: 1001,1002,1003
    Topic: test   Partition: 1   Leader: 1001   Replicas: 1001,1002,1003   Isr: 1001,1002,1003

Is there anything we need to do explicitly to work around this issue?

Thank you,
Swathi