Aravind Velamur Srinivasan created KAFKA-7865:
-------------------------------------------------

             Summary: Kafka Constant Consumer Errors for ~30 min after Network 
Blip
                 Key: KAFKA-7865
                 URL: https://issues.apache.org/jira/browse/KAFKA-7865
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 0.10.2.1
            Reporter: Aravind Velamur Srinivasan


We are running Kafka v0.10.2.1 on AWS, backed by EBS, with 10 brokers (and 5 
ZooKeeper nodes). A few days ago we had a network blip lasting ~30-45 seconds. 
The interesting part is that the consumers coordinated by one of the brokers 
all kept getting error code 16 (NOT_COORDINATOR) for ~30-35 minutes before 
eventually receiving messages successfully.
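
For context on why only the groups tied to one broker would be affected: each consumer group is mapped to one partition of the internal __consumer_offsets topic, and the leader of that partition acts as the group's coordinator. The sketch below shows that mapping as we understand it, assuming the default offsets.topic.num.partitions=50 and a hash based on Java's String.hashCode(). The function names and the group id "my-consumer-group" are hypothetical, and the logic has not been checked against the 0.10.2.1 source specifically.

```python
def java_string_hash(s: str) -> int:
    """Replicate Java's String.hashCode() as a signed 32-bit integer."""
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF  # emulate 32-bit overflow
    # Reinterpret the unsigned value as a signed 32-bit int
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    """Return the __consumer_offsets partition a group id maps to.

    Assumption: the broker takes the absolute value of the hash (mapping
    Integer.MIN_VALUE to 0) modulo the partition count. The leader of the
    returned partition would be the group's coordinator.
    """
    h = java_string_hash(group_id)
    positive = 0 if h == -2**31 else abs(h)
    return positive % num_partitions


if __name__ == "__main__":
    # Hypothetical group id, used only for illustration
    print(offsets_partition_for("my-consumer-group"))
```

Running this for each affected group id and comparing the result with the partition leaders reported for __consumer_offsets would show whether all affected groups map to partitions led by the same broker.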

The broker itself was up and running, and its resource utilization (CPU, 
memory, disk, etc.) was fine. In addition, the under-replicated partitions and 
other metrics recovered within a minute, and all the other consumer groups 
(CGs) coordinated by other brokers were fine as well. The broker logged errors 
like the following, but only during the blip (other brokers logged them too, 
yet recovered in about a minute):
{noformat}
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
{noformat}

Eventually, after 30 minutes, it recovered, but for a real-time messaging bus, 
30 minutes is not so real-time :)

Some of the questions we have are:
1. Why was this the only broker that was affected? Note: it was not the 
controller, and it didn't see any more network issues than the others.
2. What made it recover? We didn't change or restart anything.
3. Why did the client retries never work? The client was constantly retrying 
and kept getting the same error.
4. Why didn't we see any error logs?
5. Is this a known issue that is solved in later releases?
6. What can we do to mitigate this?
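
To make the symptom behind question 3 concrete, here is a toy model of the retry loop. It is not Kafka client code and does not claim to explain the root cause. It only assumes that, on NOT_COORDINATOR, a client looks up the coordinator again and retries. All names (fetch_with_rediscovery, find_coordinator, send_to, broker id 3) are hypothetical. If the lookup keeps pointing at the same broker and that broker keeps rejecting the group, no number of retries helps until the broker's own state changes.

```python
import time

NONE = 0              # success
NOT_COORDINATOR = 16  # error code reported in this issue


def fetch_with_rediscovery(find_coordinator, send_to, max_attempts=5, backoff_s=0.0):
    """Retry a group request, looking up the coordinator before each attempt.

    find_coordinator: callable returning a broker id (hypothetical lookup)
    send_to: callable taking a broker id and returning an error code
    Returns (attempts_used, succeeded).
    """
    for attempt in range(1, max_attempts + 1):
        broker = find_coordinator()
        code = send_to(broker)
        if code == NONE:
            return attempt, True
        if code != NOT_COORDINATOR:
            raise RuntimeError(f"unexpected error code {code}")
        time.sleep(backoff_s)  # back off, then rediscover and retry
    return max_attempts, False


if __name__ == "__main__":
    state = {"recovered": False}

    def find():
        # The lookup keeps answering "broker 3"
        return 3

    def send(broker):
        # Broker 3 rejects the group until its own state recovers
        return NONE if state["recovered"] else NOT_COORDINATOR

    print(fetch_with_rediscovery(find, send))  # (5, False)
    state["recovered"] = True
    print(fetch_with_rediscovery(find, send))  # (1, True)
```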

Are we running into something like this: 
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

Note: Some of the other settings we have:
zookeeper.connection.timeout.ms=10000 // server.properties
zookeeper.connection.timeout.ms=6000 // consumer.properties




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
