Takao Kobayashi created KAFKA-6113:
--------------------------------------

             Summary: broker failure leads to under replicated partitions
                 Key: KAFKA-6113
                 URL: https://issues.apache.org/jira/browse/KAFKA-6113
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.10.1.1
            Reporter: Takao Kobayashi
         Attachments: Screen Shot 2017-10-20 at 10.57.28 AM.png, kafka1.csv, 
kafka2.csv, kafka3.csv, kafka4.csv, kafka5.csv, zookeeper2.csv

This is similar to https://issues.apache.org/jira/browse/KAFKA-2729, with some 
differences. We run 5 Kafka brokers and 3 ZooKeeper nodes on Kubernetes on AWS. 
One broker (5.kafka.production1) suddenly failed and was offline for ~13 min. 
During the outage many partitions were under-replicated. As soon as the failed 
broker came back online, all brokers recovered. 
Looking through the logs, many partitions failed to shrink their ISR (to remove 
the failed broker) because the cached zkVersion on the Kafka broker was not 
equal to the version in ZooKeeper (a screenshot of one such example is 
attached). 
I've attached the logs for all the Kafka brokers and one of the ZooKeeper nodes. 
Any advice or insight would be much appreciated.
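
For readers unfamiliar with the zkVersion check mentioned above, here is a rough 
sketch of a versioned conditional write, the general pattern behind it. This is 
a toy in-memory model, not Kafka or ZooKeeper code: VersionedNode, 
conditional_set and shrink_isr are hypothetical names, and the broker IDs are 
made up for illustration. It only shows why a writer holding a stale cached 
version has its ISR update rejected until it refreshes that version.

```python
class VersionedNode:
    """Toy stand-in for a ZooKeeper znode: data plus a version counter."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_set(self, data, expected_version):
        """Write only if expected_version matches; return (ok, current_version)."""
        if expected_version != self.version:
            # Stale writer: reject the update and leave the data untouched.
            return False, self.version
        self.data = data
        self.version += 1
        return True, self.version


def shrink_isr(node, cached_version, failed_broker):
    """Try to drop failed_broker from the ISR using the caller's cached version."""
    new_isr = [b for b in node.data if b != failed_broker]
    return node.conditional_set(new_isr, cached_version)


# Hypothetical partition state: ISR holds brokers 1..5, version starts at 0.
node = VersionedNode([1, 2, 3, 4, 5])
cached = node.version  # a writer caches version 0

# Some other writer updates the node, bumping the version to 1.
node.conditional_set([1, 2, 3, 4, 5], node.version)

# The first writer still holds version 0, so its shrink is skipped.
stale_ok, current = shrink_isr(node, cached, failed_broker=5)

# After refreshing the cached version, the same shrink goes through.
fresh_ok, _ = shrink_isr(node, current, failed_broker=5)
```

In this model the stale attempt leaves broker 5 in the ISR, which matches the 
symptom described: the partition stays under-replicated until the writer's 
cached version is refreshed or the failed broker returns.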



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
