Hi, we recently saw split-brain behavior happen in production.
There was a controller switch, followed by unclean leader elections. As a result, 24 partitions (out of 70) had 2 leaders for 2 days, until I restarted the broker. Producers and consumers were talking to different "leaders", so records were produced to a broker that no consumer was reading from. The total active controller count stayed at 1 throughout the event.

I did a rough search of Kafka's JIRA and found no similar report, so I'd like to know whether this is a known bug. If it is a new one, although I cannot reproduce it, I'd like to provide more details (logs, Datadog dashboard screenshots) for the Kafka team to investigate.

[env]
Kafka version: 0.11.0.2
ZooKeeper version: 3.4.12
OS: CentOS 7
We have 3 servers, each hosting 1 Kafka broker + 1 ZooKeeper node.
Unclean leader election is enabled.
Topics: there were 2 topics: __consumer_offsets => 50 partitions; postfix => 20 partitions.

[details]
The split brain was triggered by server reboots. We were doing system maintenance and rebooted the 3 nodes one by one. (The Kafka/ZooKeeper stop scripts are hooked to OS shutdown.)

1. When server #3 was rebooted, a controller election happened and the controller switched from #1 to #2.
2. After Kafka #3 restarted, unclean leader elections were visible in the JMX metrics:
   - The controller had changed from #1 to #2.
   - The leader count changed from (23, 24, 23) to (43, 24, 27).
   - Under-replicated partitions changed from (0, 0, 0) to (20, 0, 4).
   - The ISR shrink rate on #2 grew and stayed at around 4.8.
3. message.in for each broker was the same as before the reboot.
4. bytes.out was 0 for broker #2.

The cluster stayed in this weird state for 30 hours. Can anyone explain how this could happen? I'd like to provide logs and metrics if needed.
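In case it helps anyone reproduce the check (or in case I'm misreading the symptom), this is roughly how I would confirm the disagreement from the client side: ask each broker individually who it thinks the leader of every partition is, and compare the answers. It is only a sketch against the 0.11 AdminClient; the broker addresses (broker1:9092 etc.) are placeholders for our three hosts, not real hostnames.

import java.util.*;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class LeaderDisagreementCheck {

    // Placeholder addresses for our three brokers (#1, #2, #3).
    private static final List<String> BROKERS =
            Arrays.asList("broker1:9092", "broker2:9092", "broker3:9092");
    private static final List<String> TOPICS =
            Arrays.asList("__consumer_offsets", "postfix");

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        // partition -> (broker we asked -> leader id it reported)
        Map<String, Map<String, Integer>> views = new TreeMap<>();

        for (String broker : BROKERS) {
            Properties props = new Properties();
            // Bootstrap against a single broker so the metadata reflects that broker's own view.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, broker);
            try (AdminClient admin = AdminClient.create(props)) {
                Map<String, TopicDescription> descriptions =
                        admin.describeTopics(TOPICS).all().get();
                for (TopicDescription td : descriptions.values()) {
                    for (TopicPartitionInfo p : td.partitions()) {
                        String tp = td.name() + "-" + p.partition();
                        int leaderId = (p.leader() == null) ? -1 : p.leader().id();
                        views.computeIfAbsent(tp, k -> new LinkedHashMap<>())
                             .put(broker, leaderId);
                    }
                }
            }
        }

        // Print only the partitions where the brokers disagree about who the leader is.
        for (Map.Entry<String, Map<String, Integer>> e : views.entrySet()) {
            if (new HashSet<>(e.getValue().values()).size() > 1) {
                System.out.println("Leader disagreement for " + e.getKey() + ": " + e.getValue());
            }
        }
    }
}

Given what we saw (bytes.out stuck at 0 on broker #2 while message.in stayed normal), I would expect this kind of check to show #2 reporting itself as leader for partitions that #1/#3 assign to another broker, but I haven't been able to re-run it against the broken state since the restart.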