Takao Kobayashi created KAFKA-6113:
--------------------------------------

Summary: broker failure leads to under replicated partitions
Key: KAFKA-6113
URL: https://issues.apache.org/jira/browse/KAFKA-6113
Project: Kafka
Issue Type: Bug
Affects Versions: 0.10.1.1
Reporter: Takao Kobayashi
Attachments: Screen Shot 2017-10-20 at 10.57.28 AM.png, kafka1.csv, kafka2.csv, kafka3.csv, kafka4.csv, kafka5.csv, zookeeper2.csv

This is similar to https://issues.apache.org/jira/browse/KAFKA-2729, with some slight differences.

We're running five Kafka brokers and three ZooKeeper nodes on Kubernetes on AWS. One node (5.kafka.production1) suddenly failed and was offline for ~13 min. During the outage many partitions were under-replicated. As soon as the node came back online, all brokers recovered.

Looking through the logs, a number of partitions failed to shrink their ISR (to remove the failed broker) because the cached zkVersion on the Kafka node was not equal to the one in ZooKeeper. A screenshot of one such example is attached.

I've attached the logs for all the Kafka nodes and one of the ZooKeeper nodes. Any advice or insight would be much appreciated.
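
To make the symptom concrete, here is a toy model of a version-checked ISR update. It is only an illustrative sketch, not Kafka's or ZooKeeper's actual code: `ToyZnode`, `BadVersionError`, and `try_shrink_isr` are hypothetical names, and the broker IDs are made up. It assumes the ISR write is a compare-and-set against the znode version, so a leader holding a stale cached version has its shrink rejected until it re-reads the current version.

```python
class BadVersionError(Exception):
    """Raised when the caller's expected version does not match the stored one."""


class ToyZnode:
    """In-memory stand-in for a versioned znode (hypothetical, not a real client)."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_set(self, data, expected_version):
        # Compare-and-set: write only if the caller saw the latest version.
        if expected_version != self.version:
            raise BadVersionError(
                f"expected version {expected_version}, znode is at {self.version}"
            )
        self.data = data
        self.version += 1
        return self.version


def try_shrink_isr(znode, cached_version, current_isr, failed_broker):
    """Try to drop failed_broker from the ISR using the cached version.

    Returns (succeeded, version_to_cache).
    """
    new_isr = [b for b in current_isr if b != failed_broker]
    try:
        return True, znode.conditional_set(new_isr, cached_version)
    except BadVersionError:
        # Stale cache: the update is skipped and the stored ISR is unchanged.
        return False, cached_version


state = ToyZnode([1, 2, 5])
leader_cached_version = state.version  # the leader caches version 0

# Another writer rewrites the node, bumping the version to 1.
state.conditional_set([1, 2, 5], expected_version=0)

# The shrink with the stale cached version is rejected.
ok_stale, leader_cached_version = try_shrink_isr(
    state, leader_cached_version, [1, 2, 5], failed_broker=5
)
isr_after_stale = list(state.data)

# After re-reading the current version, the same shrink succeeds.
ok_fresh, leader_cached_version = try_shrink_isr(
    state, state.version, [1, 2, 5], failed_broker=5
)
print(ok_stale, isr_after_stale, ok_fresh, state.data)
```

In this model the partition stays under-replicated for as long as the leader keeps retrying with the stale version, which matches the behaviour described above; whether that is the actual cause here would need to be confirmed from the attached logs.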