shilin Lu created KAFKA-15240:
---------------------------------

             Summary: BrokerToControllerChannelManager cache activeController error cause DefaultAlterPartitionManager send AlterPartition request failed
                 Key: KAFKA-15240
                 URL: https://issues.apache.org/jira/browse/KAFKA-15240
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.5.0, 2.8.2, 2.8.1, 2.8.0
         Environment: 2.8.1 kafka version
            Reporter: shilin Lu
            Assignee: shilin Lu
         Attachments: image-2023-07-24-16-35-56-589.png

Since KIP-497, a partition leader no longer uses ZooKeeper to propagate ISR changes; instead it sends an AlterPartition request to the controller. To do so, the broker caches the active controller's node info, which it obtains through the controllerNodeProvider interface.

On 2023-07-12, in a production Kafka environment, we saw a large number of `Broker had a stale broker epoch` errors when brokers sent AlterPartition requests to the controller. In this cluster, many replicas were missing from the ISR even though replica fetching was working correctly, so only the propagation of ISR changes was failing.

!https://iwiki.woa.com/tencent/api/attachments/s3/url?attachmentid=3165506!

Something was odd, though: when a broker's AlterPartition request fails, the controller normally prints a log message like the one below, but no such log entry could be found on the active controller node.

!image-2023-07-24-16-35-56-589.png!

I therefore suspected that this broker was connected to the wrong active controller. A network packet capture showed the broker sending its requests to port 9092 on an unfamiliar broker. According to the cluster's operation history, this unfamiliar broker is a former node of this cluster, and it is now a controller node in a new Kafka cluster.

Currently, BrokerToControllerChannelManager updates its cached active controller only on a disconnect or when the response code is NOT_CONTROLLER. So when the wrongly cached node is the controller of another Kafka cluster, the connection stays up, the node never answers NOT_CONTROLLER, and the failure keeps repeating.
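
The caching behavior described above can be modeled with a small, self-contained sketch. This is not Kafka's actual code: the class `ControllerCacheSketch`, the `Outcome` enum, and the node names `stale-node:9092` and `real-controller:9092` are hypothetical and exist only to illustrate why a `STALE_BROKER_EPOCH` response never causes the cached controller to be refreshed.

```java
import java.util.function.Supplier;

/** Simplified, hypothetical model of the reported caching behavior; not Kafka's real implementation. */
class ControllerCacheSketch {
    /** Hypothetical request outcomes, named after the conditions mentioned in the report. */
    enum Outcome { OK, NOT_CONTROLLER, STALE_BROKER_EPOCH, DISCONNECTED }

    // Stands in for the controllerNodeProvider lookup (assumption: it returns a host:port string).
    private final Supplier<String> controllerNodeProvider;
    private String cachedController; // null means "no controller cached"

    ControllerCacheSketch(Supplier<String> controllerNodeProvider) {
        this.controllerNodeProvider = controllerNodeProvider;
    }

    /** Returns the cached controller, querying the provider only when the cache is empty. */
    String activeController() {
        if (cachedController == null) {
            cachedController = controllerNodeProvider.get();
        }
        return cachedController;
    }

    /** Reported behavior: only a disconnect or NOT_CONTROLLER invalidates the cache. */
    void onResponse(Outcome outcome) {
        if (outcome == Outcome.DISCONNECTED || outcome == Outcome.NOT_CONTROLLER) {
            cachedController = null;
        }
        // Every other outcome, including STALE_BROKER_EPOCH, keeps the cached node.
    }

    public static void main(String[] args) {
        String[] current = {"stale-node:9092"};
        ControllerCacheSketch cache = new ControllerCacheSketch(() -> current[0]);

        System.out.println(cache.activeController()); // stale-node:9092

        // The provider now knows the right controller, but the cache is never asked again.
        current[0] = "real-controller:9092";
        cache.onResponse(Outcome.STALE_BROKER_EPOCH);
        System.out.println(cache.activeController()); // stale-node:9092

        // Only NOT_CONTROLLER (or a disconnect) forces a fresh lookup.
        cache.onResponse(Outcome.NOT_CONTROLLER);
        System.out.println(cache.activeController()); // real-controller:9092
    }
}
```

In this model, a node that keeps the connection open and answers with anything other than `NOT_CONTROLLER` stays cached forever, which matches the repeated failures seen in the cluster.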