Gwen Shapira created KAFKA-12674:
------------------------------------

             Summary: Client failover takes 2-4 seconds on clean broker shutdown
                 Key: KAFKA-12674
                 URL: https://issues.apache.org/jira/browse/KAFKA-12674
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 2.7.0
            Reporter: Gwen Shapira

I ran two perf-producer clients against a 4-broker cluster running on AWS, behind an ELB, and then did a rolling restart, taking down one broker at a time using controlled shutdown.

I got the following errors on every broker shutdown:

{{[2021-04-16 01:31:39,846] WARN [Producer clientId=producer-1] Received invalid metadata error in produce request on partition perf-test-3 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException: For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition.. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)}}

{{[2021-04-16 01:44:22,691] WARN [Producer clientId=producer-1] Connection to node 0 (b0-pkc-7yrmj.us-east-2.aws.confluent.cloud/3.140.123.43:9092) terminated during authentication. This may happen due to any of the following reasons: (1) Authentication failed due to invalid credentials with brokers older than 1.0.0, (2) Firewall blocking Kafka TLS traffic (eg it may only allow HTTPS traffic), (3) Transient network issue. (org.apache.kafka.clients.NetworkClient)}}

The "Connection to node... terminated" error continued for 2-4 seconds on each shutdown. It looks like the metadata request was repeatedly sent to the node that had just gone down. I'd expect it to go out on an existing connection to one of the live nodes instead.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
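
To make the expected behavior in the description concrete, here is a minimal toy model of node selection for a metadata request. It is not Kafka's NetworkClient code; the names `Node`, `pick_metadata_node`, `in_flight`, and `recently_failed` are hypothetical and exist only for this illustration. The sketch assumes the client can tell which connections are established and which node just dropped its connection.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical model of a broker as seen by the client.
    node_id: int
    connected: bool                # an established, ready connection exists
    in_flight: int = 0             # outstanding requests on that connection
    recently_failed: bool = False  # the connection dropped a moment ago


def pick_metadata_node(nodes):
    """Return the node a metadata request should go to, or None."""
    # Prefer nodes with a live connection, fewest in-flight requests first.
    live = [n for n in nodes if n.connected and not n.recently_failed]
    if live:
        return min(live, key=lambda n: n.in_flight)
    # Otherwise fall back to any node that has not just failed.
    candidates = [n for n in nodes if not n.recently_failed]
    return candidates[0] if candidates else None


nodes = [
    Node(0, connected=False, recently_failed=True),  # broker being restarted
    Node(1, connected=True, in_flight=3),
    Node(2, connected=True, in_flight=1),
    Node(3, connected=True, in_flight=2),
]
chosen = pick_metadata_node(nodes)
print(chosen.node_id)  # prints 2
```

In this model node 0 is never chosen while it is marked as recently failed, so the metadata refresh would not be held up by reconnect attempts to the broker that is shutting down.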