Luke Chen created KAFKA-15489:
---------------------------------

             Summary: split brain in KRaft cluster 
                 Key: KAFKA-15489
                 URL: https://issues.apache.org/jira/browse/KAFKA-15489
             Project: Kafka
          Issue Type: Bug
          Components: kraft
    Affects Versions: 3.5.1
            Reporter: Luke Chen
            Assignee: Luke Chen


I found that in the current KRaft implementation, when a network partition happens 
between the current controller leader and the other controller nodes, a 
"split brain" issue can occur: two leaders can exist in the controller cluster at 
the same time, and two inconsistent sets of metadata can be returned to the 
clients.


*Root cause*
In 
[KIP-595|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote],
 we said a voter will begin a new election under three conditions:

1. If it fails to receive a FetchResponse from the current leader before 
expiration of quorum.fetch.timeout.ms
2. If it receives an EndQuorumEpoch request from the current leader
3. If it fails to receive a majority of votes before expiration of 
quorum.election.timeout.ms after declaring itself a candidate.

And that's exactly what the current KRaft implementation does.
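
For illustration, here is a minimal, hypothetical sketch in Java of what those three election triggers amount to. The class and method names are made up for this example (they are not the real KafkaRaftClient code), and it assumes the two timeouts are passed in as plain millisecond values:
{code:java}
// Hypothetical sketch of the three KIP-595 election triggers on a voter.
// Not the actual KafkaRaftClient implementation; names are illustrative only.
public class VoterElectionTriggerSketch {

    private final long fetchTimeoutMs;     // quorum.fetch.timeout.ms
    private final long electionTimeoutMs;  // quorum.election.timeout.ms

    private long lastFetchResponseTimeMs;    // last FetchResponse seen from the current leader
    private long candidacyStartTimeMs = -1L; // -1 means this node is not currently a candidate
    private boolean receivedEndQuorumEpoch = false;

    public VoterElectionTriggerSketch(long fetchTimeoutMs, long electionTimeoutMs, long nowMs) {
        this.fetchTimeoutMs = fetchTimeoutMs;
        this.electionTimeoutMs = electionTimeoutMs;
        this.lastFetchResponseTimeMs = nowMs;
    }

    // Called whenever a FetchResponse arrives from the current leader (resets condition 1).
    public void onFetchResponse(long nowMs) {
        lastFetchResponseTimeMs = nowMs;
    }

    // Called when the current leader sends an EndQuorumEpoch request (condition 2).
    public void onEndQuorumEpoch() {
        receivedEndQuorumEpoch = true;
    }

    // Called when this voter declares itself a candidate (starts the clock for condition 3).
    public void onBecomeCandidate(long nowMs) {
        candidacyStartTimeMs = nowMs;
    }

    // True if any of the three conditions listed above is met.
    public boolean shouldStartNewElection(long nowMs) {
        boolean fetchTimedOut = nowMs - lastFetchResponseTimeMs >= fetchTimeoutMs;   // condition 1
        boolean endQuorumReceived = receivedEndQuorumEpoch;                          // condition 2
        boolean electionTimedOut = candidacyStartTimeMs >= 0
                && nowMs - candidacyStartTimeMs >= electionTimeoutMs;                // condition 3
        return fetchTimedOut || endQuorumReceived || electionTimedOut;
    }
}
{code}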


However, when the leader is isolated by a network partition, there's no way 
for it to resign the leadership and start a new election. So it will stay the 
leader even though all the other nodes are unreachable from it, and this makes 
the split brain issue possible.

Reading further in KIP-595, I found we did consider this situation and have a 
solution for it. In [this 
section|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-LeaderProgressTimeout],
 it says:
{quote}In the pull-based model, however, say a new leader has been elected with 
a new epoch and everyone has learned about it except the old leader (e.g. that 
leader was not in the voters anymore and hence not receiving the 
BeginQuorumEpoch as well), then that old leader would not be notified by anyone 
about the new leader / epoch and become a pure "zombie leader", as there is no 
regular heartbeats being pushed from leader to the follower. This could lead to 
stale information being served to the observers and clients inside the cluster.
{quote}
{quote}To resolve this issue, we will piggy-back on the 
"quorum.fetch.timeout.ms" config, such that if the leader did not receive Fetch 
requests from a majority of the quorum for that amount of time, it would begin 
a new election and start sending VoteRequest to voter nodes in the cluster to 
understand the latest quorum. If it couldn't connect to any known voter, the 
old leader shall keep starting new elections and bump the epoch.
{quote}

But this part is missing from the current KRaft implementation.
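
To make the missing piece concrete, here is a minimal, hypothetical sketch in Java of the leader-side check the KIP describes: track the last Fetch time per voter and resign once fewer than a majority have fetched within quorum.fetch.timeout.ms. Again, the names are made up for this example and are not the real KafkaRaftClient API; in the real fix, the resignation would also bump the epoch and send VoteRequests to the other voters, as quoted above.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the leader-side "fetch timeout" check from KIP-595:
// if the leader has not handled Fetch requests from a majority of the voters
// within quorum.fetch.timeout.ms, it should give up leadership and start a
// new election. Not the actual KafkaRaftClient code; names are illustrative.
public class LeaderProgressCheckSketch {

    private final long fetchTimeoutMs;       // quorum.fetch.timeout.ms
    private final Set<Integer> voterIds;     // all voter ids, including the leader itself
    private final int localId;               // id of this (leader) node
    private final Map<Integer, Long> lastFetchTimeMs = new HashMap<>();

    public LeaderProgressCheckSketch(long fetchTimeoutMs, Set<Integer> voterIds, int localId, long nowMs) {
        this.fetchTimeoutMs = fetchTimeoutMs;
        this.voterIds = voterIds;
        this.localId = localId;
        // Start the clock for every voter when this node becomes leader.
        for (int voterId : voterIds) {
            lastFetchTimeMs.put(voterId, nowMs);
        }
    }

    // Called whenever the leader handles a Fetch request from another voter.
    public void onFetchFromVoter(int voterId, long nowMs) {
        lastFetchTimeMs.put(voterId, nowMs);
    }

    // True if fewer than a majority of voters (counting the leader itself) have
    // fetched within the last quorum.fetch.timeout.ms; in that case the leader
    // should resign and begin a new election, as described in KIP-595.
    public boolean shouldResignAndStartElection(long nowMs) {
        int majority = voterIds.size() / 2 + 1;
        long activeVoters = voterIds.stream()
                .filter(id -> id == localId
                        || nowMs - lastFetchTimeMs.getOrDefault(id, 0L) < fetchTimeoutMs)
                .count();
        return activeVoters < majority;
    }
}
{code}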



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
