[ https://issues.apache.org/jira/browse/KAFKA-8702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrey Falko updated KAFKA-8702:
--------------------------------
    Description:
We first started seeing this with 2.1.1 version of Kafka. We are currently on 2.3.0.

We were able to actively reproduce this today on one of our staging environments. There are three brokers in this environment, 0, 1, and 2.

The reproduction steps are as follows:

1) Push some traffic to a topic that looks like this:

$ bin/kafka-topics.sh --describe --zookeeper $(grep zookeeper.connect= /kafka/config/server.properties | awk -F= '{print $2}') --topic test
Topic:test PartitionCount:6 ReplicationFactor:3 Configs:cleanup.policy=delete,retention.ms=86400000
    Topic: test Partition: 0 Leader: 0 Replicas: 2,0,1 Isr: 0,1,2
    Topic: test Partition: 1 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
    Topic: test Partition: 2 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
    Topic: test Partition: 3 Leader: 2 Replicas: 2,1,0 Isr: 1,2,0
    Topic: test Partition: 4 Leader: 0 Replicas: 0,2,1 Isr: 0,1,2
    Topic: test Partition: 5 Leader: 1 Replicas: 1,0,2 Isr: 1,2,0

2) We proceed to run the following on broker 0:

iptables -D INPUT -j DROP -p tcp --destination-port 9093 && iptables -D OUTPUT -j DROP -p tcp --destination-port 9093

Note: our replication and traffic from clients comes in on TLS protected port 9093 only.

3) Leadership doesn't change b/c Zookeeper connection is unaffected. However, we start seeing URP.

4) We reboot broker 0. We see offline partitions. Leadership never changes and the cluster only recovers when broker 0 comes back online.

Best regards,
Andrey Falko

  was:
We first started seeing this with 2.1.1 version of Kafka. We are currently on 2.3.0.

We were able to actively reproduce this today on one of our staging environments.
The reproduction steps are as follows:

1) Push some traffic to a topic that looks like this:

$ bin/kafka-topics.sh --describe --zookeeper $(grep zookeeper.connect= /kafka/config/server.properties | awk -F= '{print $2}') --topic test
Topic:test PartitionCount:6 ReplicationFactor:3 Configs:cleanup.policy=delete,retention.ms=86400000
    Topic: test Partition: 0 Leader: 0 Replicas: 2,0,1 Isr: 0,1,2
    Topic: test Partition: 1 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
    Topic: test Partition: 2 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
    Topic: test Partition: 3 Leader: 2 Replicas: 2,1,0 Isr: 1,2,0
    Topic: test Partition: 4 Leader: 0 Replicas: 0,2,1 Isr: 0,1,2
    Topic: test Partition: 5 Leader: 1 Replicas: 1,0,2 Isr: 1,2,0

2) We proceed to run the following on broker 0:

iptables -D INPUT -j DROP -p tcp --destination-port 9093 && iptables -D OUTPUT -j DROP -p tcp --destination-port 9093

Note: our replication and traffic from clients comes in on TLS protected port 9093 only.

3) Leadership doesn't change b/c Zookeeper connection is unaffected. However, we start seeing URP.

4) We reboot broker 0. We see offline partitions. Leadership never changes and the cluster only recovers when broker 0 comes back online.

Best regards,
Andrey Falko


> Kafka leader election doesn't happen when leader broker port is partitioned off the network
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8702
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8702
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.1.0
>            Reporter: Andrey Falko
>            Priority: Major
>
> We first started seeing this with 2.1.1 version of Kafka. We are currently on 2.3.0.
> We were able to actively reproduce this today on one of our staging environments. There are three brokers in this environment, 0, 1, and 2.
> The reproduction steps are as follows:
> 1) Push some traffic to a topic that looks like this:
> $ bin/kafka-topics.sh --describe --zookeeper $(grep zookeeper.connect= /kafka/config/server.properties | awk -F= '{print $2}') --topic test
> Topic:test PartitionCount:6 ReplicationFactor:3 Configs:cleanup.policy=delete,retention.ms=86400000
>     Topic: test Partition: 0 Leader: 0 Replicas: 2,0,1 Isr: 0,1,2
>     Topic: test Partition: 1 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
>     Topic: test Partition: 2 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
>     Topic: test Partition: 3 Leader: 2 Replicas: 2,1,0 Isr: 1,2,0
>     Topic: test Partition: 4 Leader: 0 Replicas: 0,2,1 Isr: 0,1,2
>     Topic: test Partition: 5 Leader: 1 Replicas: 1,0,2 Isr: 1,2,0
> 2) We proceed to run the following on broker 0:
> iptables -D INPUT -j DROP -p tcp --destination-port 9093 && iptables -D OUTPUT -j DROP -p tcp --destination-port 9093
> Note: our replication and traffic from clients comes in on TLS protected port 9093 only.
> 3) Leadership doesn't change b/c Zookeeper connection is unaffected. However, we start seeing URP.
> 4) We reboot broker 0. We see offline partitions. Leadership never changes and the cluster only recovers when broker 0 comes back online.
> Best regards,
> Andrey Falko
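
As a rough illustration of steps 3 and 4 in the report, the sketch below parses the --describe listing quoted there and reports which partitions are led by a given broker (presumably the ones that show up as offline while broker 0 is rebooting) and which have an ISR smaller than their replica list. The helper names parse_describe, led_by, and under_replicated are hypothetical and invented for this example; they are not part of Kafka. The regular expression assumes the output layout shown in the report.

```python
import re

# Sample copied from the `kafka-topics.sh --describe` output in the report.
DESCRIBE_OUTPUT = """\
Topic:test PartitionCount:6 ReplicationFactor:3 Configs:cleanup.policy=delete,retention.ms=86400000
Topic: test Partition: 0 Leader: 0 Replicas: 2,0,1 Isr: 0,1,2
Topic: test Partition: 1 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
Topic: test Partition: 2 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
Topic: test Partition: 3 Leader: 2 Replicas: 2,1,0 Isr: 1,2,0
Topic: test Partition: 4 Leader: 0 Replicas: 0,2,1 Isr: 0,1,2
Topic: test Partition: 5 Leader: 1 Replicas: 1,0,2 Isr: 1,2,0
"""

# Matches one per-partition line. The summary line is skipped because it
# carries "PartitionCount:" rather than a "Partition:" field.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_describe(text):
    """Turn the describe listing into a list of per-partition dicts."""
    partitions = []
    for line in text.splitlines():
        m = PARTITION_LINE.search(line)
        if m is None:
            continue  # summary line or unrelated output
        partitions.append({
            "topic": m["topic"],
            "partition": int(m["partition"]),
            "leader": int(m["leader"]),
            "replicas": [int(b) for b in m["replicas"].split(",")],
            "isr": [int(b) for b in m["isr"].split(",") if b],
        })
    return partitions


def led_by(partitions, broker):
    """Partition ids whose current leader is the given broker id."""
    return [p["partition"] for p in partitions if p["leader"] == broker]


def under_replicated(partitions):
    """Partition ids whose ISR is smaller than the assigned replica list."""
    return [p["partition"] for p in partitions
            if len(p["isr"]) < len(p["replicas"])]


if __name__ == "__main__":
    parts = parse_describe(DESCRIBE_OUTPUT)
    print("led by broker 0:", led_by(parts, 0))
    print("under-replicated:", under_replicated(parts))
```

With the sample from the report, led_by(parts, 0) gives partitions 0, 1, and 4, and under_replicated(parts) is empty, since every ISR in that listing is still full. Running the same check on a listing captured after step 2 would be one way to confirm the URP the report mentions.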