Aravind Velamur Srinivasan created KAFKA-7865:
-------------------------------------------------

             Summary: Kafka Constant Consumer Errors for ~30 min after Network Blip
                 Key: KAFKA-7865
                 URL: https://issues.apache.org/jira/browse/KAFKA-7865
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 0.10.2.1
            Reporter: Aravind Velamur Srinivasan

We are running Kafka v0.10.2.1 on AWS, backed by EBS, with 10 brokers and 5 ZooKeeper nodes.

A few days ago we had a network blip lasting ~30-45 seconds. The interesting part is that all consumers coordinated by one of the brokers kept getting error code 16 (NOT_COORDINATOR) for ~30-35 minutes before eventually receiving messages successfully.

The broker itself was up and running, and its resource utilization (CPU, memory, disk, etc.) was fine. In addition, the under-replicated partitions and other metrics recovered within a minute, and all the consumer groups coordinated by other brokers were fine as well.

The broker logged the following error, but only during the blip. Other brokers logged it too, yet they recovered within about a minute:

{noformat}
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
{noformat}

The affected broker eventually recovered after about 30 minutes, but for a real-time messaging bus, 30 minutes is not so real-time :)

Some of the questions we have are:

1. Why was this the only broker affected? Note: it was not the controller, and it did not see any more network issues than the others.
2. What made it recover? We did not change or restart anything.
3. Why did the client retries never work? The client was constantly retrying and kept getting the same error.
4. Why did we not see any error logs on the broker during those 30 minutes?
5. Is this a known issue that has been fixed in later releases?
6. What can we do to mitigate this?
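
As background for question 1: as far as we understand, the coordinator for a consumer group is the broker that leads the __consumer_offsets partition the group id hashes to, so only groups whose partition is led by the affected broker would see error 16. Below is a minimal sketch of that mapping, not Kafka's own code. It assumes the broker default of 50 for offsets.topic.num.partitions, and the class name, method name, and "example-group" id are illustrative only.

```java
public class GroupCoordinatorPartition {

    // Assumption: the broker default for offsets.topic.num.partitions.
    // Use the cluster's actual value if it was overridden.
    static final int DEFAULT_OFFSETS_TOPIC_PARTITIONS = 50;

    // Maps a group id to a __consumer_offsets partition: the non-negative
    // hash of the group id, modulo the partition count.
    static int partitionFor(String groupId, int numPartitions) {
        // Masking the sign bit keeps the result non-negative even when
        // hashCode() is negative (including Integer.MIN_VALUE).
        return (groupId.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // "example-group" is a placeholder group id.
        String groupId = args.length > 0 ? args[0] : "example-group";
        int partition = partitionFor(groupId, DEFAULT_OFFSETS_TOPIC_PARTITIONS);
        System.out.println("group " + groupId
                + " maps to __consumer_offsets-" + partition);
    }
}
```

Running this for each affected group id and comparing the resulting partitions with the leaders of __consumer_offsets should show whether all affected groups map to partitions led by the one broker.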

Are we running into something like this:

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

Note: Some of the other settings we have:

zookeeper.connection.timeout.ms=10000 // server.properties
zookeeper.connection.timeout.ms=6000 // consumer.properties



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)