Hi,

We recently observed split-brain behavior in production.

There was a controller switch, followed by unclean leader elections.
This left 24 partitions (out of 70) with 2 leaders for 2 days, until I
restarted the broker. Producers and consumers were talking to different
"leaders", so records were produced to a broker that no consumer was
reading from.
The total active controller count stayed at 1 throughout the event.
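
By "total controller count" I mean the per-broker ActiveControllerCount
JMX gauge summed across the 3 brokers. For anyone who wants to check it
live, here is a minimal sketch (not what we ran; the JMX host:port values
are placeholders):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ControllerCountCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder JMX endpoints, one per broker.
            String[] jmxEndpoints = {"broker1:9999", "broker2:9999", "broker3:9999"};
            ObjectName gauge = new ObjectName(
                "kafka.controller:type=KafkaController,name=ActiveControllerCount");
            int total = 0;
            for (String hostPort : jmxEndpoints) {
                JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + hostPort + "/jmxrmi");
                try (JMXConnector c = JMXConnectorFactory.connect(url)) {
                    MBeanServerConnection conn = c.getMBeanServerConnection();
                    int value = ((Number) conn.getAttribute(gauge, "Value")).intValue();
                    System.out.println(hostPort + " ActiveControllerCount=" + value);
                    total += value;
                }
            }
            // A healthy cluster should report exactly 1 in total.
            System.out.println("Cluster total=" + total);
        }
    }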

I did a rough search of Kafka's JIRA and could not find a similar report.
I would like to know whether this is a known bug.
If it is a new one, I cannot reproduce it, but I can provide more details
(logs, Datadog dashboard screenshots) for the Kafka team to investigate.

[env]
Kafka version: 0.11.0.2
ZooKeeper version: 3.4.12
OS: CentOS 7
Servers: 3, each hosting 1 Kafka broker + 1 ZooKeeper node
Unclean leader election: enabled (see the note below)
Topics: 2 in total: __consumer_offsets => 50 partitions; postfix => 20
partitions
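
Note: since unclean leader election is central here, for context: its
default is false in 0.11, so it is enabled explicitly. The broker-level
setting looks like this (a per-topic override also exists):

    # server.properties
    unclean.leader.election.enable=true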

[details]
The split brain was triggered by a server reboot. We were doing system
maintenance and rebooting the 3 nodes one by one. (The Kafka/ZooKeeper
stop scripts are hooked into OS shutdown.)

1. While server #3 was rebooting, a controller election happened and the
controller switched from #1 to #2.
2. After Kafka #3 restarted, unclean leader elections were visible in the
JMX metrics:
  - The controller changed from #1 to #2.
  - The leader count changed from (23, 24, 23) to (43, 24, 27), i.e. 94
    leaders for 70 partitions, which matches the 24 double-led partitions
    (see the sketch after this list).
  - Under-replicated partitions changed from (0, 0, 0) to (20, 0, 4).
  - The ISR shrink rate on #2 grew and stayed at around 4.8.
3. message.in for each broker was the same as before the reboot.
4. bytes.out for broker #2 was 0.
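
For anyone who wants to inspect the 2-leaders situation directly, here is
a minimal sketch (not what we ran; the bootstrap address is a placeholder)
that prints one broker's view of the controller and of the per-partition
leader/ISR, using the 0.11 AdminClient. The metadata it returns comes from
the contacted broker's cache, so pointing it at each broker in turn should
show the divergent views:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class LeaderIsrDump {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder address; point at each broker in turn to compare views.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // The node this broker currently believes is the controller.
                System.out.println("Controller: "
                    + admin.describeCluster().controller().get());
                // Leader and ISR for every partition of the two topics.
                for (TopicDescription t : admin.describeTopics(
                        Arrays.asList("__consumer_offsets", "postfix"))
                        .all().get().values()) {
                    for (TopicPartitionInfo p : t.partitions()) {
                        System.out.printf("%s-%d leader=%s isr=%s%n",
                            t.name(), p.partition(), p.leader(), p.isr());
                    }
                }
            }
        }
    }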

The cluster stayed in this strange state for 30 hours. Can anyone explain
how this could happen?

I'd like to provide logs and metrics if needed.
