Luke Chen created KAFKA-15489:
---------------------------------
Summary: split brain in KRaft cluster
Key: KAFKA-15489
URL: https://issues.apache.org/jira/browse/KAFKA-15489
Project: Kafka
Issue Type: Bug
Components: kraft
Affects Versions: 3.5.1
Reporter: Luke Chen
Assignee: Luke Chen
I found that in the current KRaft implementation, when a network partition happens
between the current controller leader and the other controller nodes, a
"split brain" issue can occur: two leaders can exist in the controller cluster,
and two inconsistent sets of metadata can be returned to clients.
*Root cause*
In
[KIP-595|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote],
we said that a voter will begin a new election under three conditions:
1. If it fails to receive a FetchResponse from the current leader before
expiration of quorum.fetch.timeout.ms
2. If it receives an EndQuorumEpoch request from the current leader
3. If it fails to receive a majority of votes before expiration of
quorum.election.timeout.ms after declaring itself a candidate.
And that is exactly what the current KRaft implementation does.
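The follower-side triggers above can be sketched as follows. This is a hypothetical illustration, not the actual KafkaRaftClient code; the class and method names are made up for this sketch:

```java
// Hypothetical sketch of the three follower-side election triggers from KIP-595.
// Not the real KafkaRaftClient implementation; names are illustrative only.
public class FollowerElectionTrigger {
    private final long fetchTimeoutMs;   // quorum.fetch.timeout.ms
    private long lastFetchResponseMs;    // time of last FetchResponse from the leader

    public FollowerElectionTrigger(long fetchTimeoutMs, long nowMs) {
        this.fetchTimeoutMs = fetchTimeoutMs;
        this.lastFetchResponseMs = nowMs;
    }

    // Record a successful FetchResponse from the current leader.
    public void onFetchResponse(long nowMs) {
        lastFetchResponseMs = nowMs;
    }

    // Condition 1: no FetchResponse from the leader within quorum.fetch.timeout.ms.
    public boolean fetchTimedOut(long nowMs) {
        return nowMs - lastFetchResponseMs >= fetchTimeoutMs;
    }

    // Condition 2: the leader explicitly asked us to start an election.
    public boolean shouldElectOnEndQuorumEpoch() {
        return true; // an EndQuorumEpoch request always triggers a new election
    }

    public static void main(String[] args) {
        FollowerElectionTrigger t = new FollowerElectionTrigger(2000, 0);
        t.onFetchResponse(1000);
        System.out.println(t.fetchTimedOut(1500)); // false: 500ms since last response
        System.out.println(t.fetchTimedOut(3500)); // true: 2500ms since last response
    }
}
```

(Condition 3, the election timeout while a candidate, is analogous but keyed on quorum.election.timeout.ms instead.)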
However, when the leader is isolated by a network partition, there is no way
for it to resign from leadership and start a new election. So it will remain
the leader even if all the other nodes are unreachable, and this makes the
split brain issue possible.
Reading further in KIP-595, I found that we did consider this situation and
proposed a solution for it. In [this
section|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-LeaderProgressTimeout],
it says:
{quote}In the pull-based model, however, say a new leader has been elected with
a new epoch and everyone has learned about it except the old leader (e.g. that
leader was not in the voters anymore and hence not receiving the
BeginQuorumEpoch as well), then that old leader would not be notified by anyone
about the new leader / epoch and become a pure "zombie leader", as there is no
regular heartbeats being pushed from leader to the follower. This could lead to
stale information being served to the observers and clients inside the cluster.
{quote}
{quote}To resolve this issue, we will piggy-back on the
"quorum.fetch.timeout.ms" config, such that if the leader did not receive Fetch
requests from a majority of the quorum for that amount of time, it would begin
a new election and start sending VoteRequest to voter nodes in the cluster to
understand the latest quorum. If it couldn't connect to any known voter, the
old leader shall keep starting new elections and bump the epoch.
{quote}
But this part was never implemented in the current KRaft.
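The missing leader-side check from the quoted KIP-595 text could look roughly like the sketch below. This is an assumption-laden illustration, not Kafka code; the class, method names, and bookkeeping are all hypothetical:

```java
// Hypothetical sketch of the leader-side check described in KIP-595:
// if the leader has not received Fetch requests from a majority of the
// voters within quorum.fetch.timeout.ms, it should resign and start a
// new election. Not the real Kafka implementation.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class LeaderFetchTimeoutCheck {
    private final long fetchTimeoutMs;                 // quorum.fetch.timeout.ms
    private final Set<Integer> voters;                 // voter node ids
    private final int localId;                         // this leader's id
    private final Map<Integer, Long> lastFetchByVoter = new HashMap<>();

    public LeaderFetchTimeoutCheck(Set<Integer> voters, int localId,
                                   long fetchTimeoutMs, long nowMs) {
        this.voters = voters;
        this.localId = localId;
        this.fetchTimeoutMs = fetchTimeoutMs;
        for (int v : voters) lastFetchByVoter.put(v, nowMs);
    }

    // Record a Fetch request received from a follower voter.
    public void onFetchRequest(int voterId, long nowMs) {
        lastFetchByVoter.put(voterId, nowMs);
    }

    // The leader counts itself as active; it should resign when fewer than
    // a majority of voters (itself included) have fetched within the timeout.
    public boolean shouldResign(long nowMs) {
        int active = 1; // the leader itself
        for (int v : voters) {
            if (v == localId) continue;
            if (nowMs - lastFetchByVoter.get(v) < fetchTimeoutMs) active++;
        }
        int majority = voters.size() / 2 + 1;
        return active < majority;
    }
}
```

With this check in place, an isolated leader would stop serving stale metadata after one fetch-timeout interval instead of remaining a "zombie leader" indefinitely.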
--
This message was sent by Atlassian Jira
(v8.20.10#820010)