Hi,

We recently observed split-brain behavior in production.

There was a controller switch, followed by unclean leader elections.
This left 24 partitions (out of 70) with 2 leaders for 2 days, until I
restarted the broker. Producers and consumers were talking to different
"leaders", so records were produced to a broker that no consumer was
reading from.
The total active controller count stayed at 1 throughout the event.
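
By "total controller count" I mean the per-broker ActiveControllerCount
JMX gauge summed across the 3 brokers. For anyone who wants to check it
live, here is a minimal sketch (not what we ran; the JMX host:port values
are placeholders):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ControllerCountCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder JMX endpoints, one per broker.
            String[] jmxEndpoints = {"broker1:9999", "broker2:9999", "broker3:9999"};
            ObjectName gauge = new ObjectName(
                "kafka.controller:type=KafkaController,name=ActiveControllerCount");
            int total = 0;
            for (String hostPort : jmxEndpoints) {
                JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + hostPort + "/jmxrmi");
                try (JMXConnector c = JMXConnectorFactory.connect(url)) {
                    MBeanServerConnection conn = c.getMBeanServerConnection();
                    int value = ((Number) conn.getAttribute(gauge, "Value")).intValue();
                    System.out.println(hostPort + " ActiveControllerCount=" + value);
                    total += value;
                }
            }
            // A healthy cluster should report exactly 1 in total.
            System.out.println("Cluster total=" + total);
        }
    }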

I did a rough search of Kafka's JIRA and could not find a similar report.
I would like to know whether this is a known bug.
If it is a new one, I cannot reproduce it, but I can provide more details
(logs, Datadog dashboard screenshots) for the Kafka team to investigate.

[env]
Kafka version: 0.11.0.2
ZooKeeper version: 3.4.12
OS: CentOS 7
Servers: 3, each hosting 1 Kafka broker + 1 ZooKeeper node
Unclean leader election: enabled (see the note below)
Topics: 2 in total: __consumer_offsets => 50 partitions; postfix => 20
partitions
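
Note: since unclean leader election is central here, for context: its
default is false in 0.11, so it is enabled explicitly. The broker-level
setting looks like this (a per-topic override also exists):

    # server.properties
    unclean.leader.election.enable=true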

[details]
The split brain was triggered by a server reboot. We were doing system
maintenance and rebooting the 3 nodes one by one. (The Kafka/ZooKeeper
stop scripts are hooked into OS shutdown.)

1. While server #3 was rebooting, a controller election happened and the
controller switched from #1 to #2.
2. After Kafka #3 restarted, unclean leader elections were visible in the
JMX metrics:
  - The controller changed from #1 to #2.
  - The leader count changed from (23, 24, 23) to (43, 24, 27), i.e. 94
    leaders for 70 partitions, which matches the 24 double-led partitions
    (see the sketch after this list).
  - Under-replicated partitions changed from (0, 0, 0) to (20, 0, 4).
  - The ISR shrink rate on #2 grew and stayed at around 4.8.
3. message.in for each broker was the same as before the reboot.
4. bytes.out for broker #2 was 0.
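
For anyone who wants to inspect the 2-leaders situation directly, here is
a minimal sketch (not what we ran; the bootstrap address is a placeholder)
that prints one broker's view of the controller and of the per-partition
leader/ISR, using the 0.11 AdminClient. The metadata it returns comes from
the contacted broker's cache, so pointing it at each broker in turn should
show the divergent views:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class LeaderIsrDump {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder address; point at each broker in turn to compare views.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // The node this broker currently believes is the controller.
                System.out.println("Controller: "
                    + admin.describeCluster().controller().get());
                // Leader and ISR for every partition of the two topics.
                for (TopicDescription t : admin.describeTopics(
                        Arrays.asList("__consumer_offsets", "postfix"))
                        .all().get().values()) {
                    for (TopicPartitionInfo p : t.partitions()) {
                        System.out.printf("%s-%d leader=%s isr=%s%n",
                            t.name(), p.partition(), p.leader(), p.isr());
                    }
                }
            }
        }
    }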

The cluster stayed in this strange state for 30 hours. Can anyone explain
how this could happen?

I'd like to provide logs and metrics if needed.
