[ https://issues.apache.org/jira/browse/KAFKA-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luke Chen reopened KAFKA-15489:
-------------------------------

> split brain in KRaft cluster
> -----------------------------
>
>                 Key: KAFKA-15489
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15489
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>    Affects Versions: 3.5.1
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>             Fix For: 3.6.0
>
>
> I found that in the current KRaft implementation, when a network partition
> happens between the current controller leader and the other controller
> nodes, a "split brain" issue occurs: 2 leaders will exist in the controller
> cluster, and 2 inconsistent sets of metadata will be returned to the
> clients.
>
> *Root cause*
> In [KIP-595|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote],
> we said a voter will begin a new election under three conditions (the first
> sketch at the end of this description illustrates these triggers):
> 1. If it fails to receive a FetchResponse from the current leader before
> expiration of quorum.fetch.timeout.ms
> 2. If it receives an EndQuorumEpoch request from the current leader
> 3. If it fails to receive a majority of votes before expiration of
> quorum.election.timeout.ms after declaring itself a candidate.
> And that is exactly what the current KRaft implementation does.
>
> However, when the leader is isolated by the network partition, it has no
> way to resign from the leadership and start a new election. So it will stay
> leader even though all the other nodes are unreachable, and this makes the
> split brain issue possible.
> Reading further in KIP-595, I found we did consider this situation and have
> a solution for it. In [this section|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-LeaderProgressTimeout],
> it says:
> {quote}In the pull-based model, however, say a new leader has been elected
> with a new epoch and everyone has learned about it except the old leader
> (e.g. that leader was not in the voters anymore and hence not receiving the
> BeginQuorumEpoch as well), then that old leader would not be notified by
> anyone about the new leader / epoch and become a pure "zombie leader", as
> there is no regular heartbeats being pushed from leader to the follower. This
> could lead to stale information being served to the observers and clients
> inside the cluster.
> {quote}
> {quote}To resolve this issue, we will piggy-back on the
> "quorum.fetch.timeout.ms" config, such that if the leader did not receive
> Fetch requests from a majority of the quorum for that amount of time, it
> would begin a new election and start sending VoteRequest to voter nodes in
> the cluster to understand the latest quorum. If it couldn't connect to any
> known voter, the old leader shall keep starting new elections and bump the
> epoch.
> {quote}
>
> But this check was never implemented in the current KRaft (the second
> sketch at the end of this description illustrates it).
>
> *The flow is like this:*
> 1. There are 3 controller nodes: A (leader), B (follower), C (follower).
> 2. A network partition happens between [A] and [B, C].
> 3. B and C start a new election since the fetch timeout expires before they
> receive a fetch response from leader A.
> 4. B (or C) is elected as leader in a new epoch, while A is still the leader
> in the old epoch.
> 5. Broker D creates a topic "new", and the update goes to leader B.
> 6. Broker E describes topic "new" but gets nothing, because it is connected
> to the old leader A.
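>
> To make the trigger conditions concrete, here is a minimal, self-contained
> sketch of the three follower-side triggers listed above. This is not the
> actual KafkaRaftClient code; the class and method names
> (VoterElectionSketch, onEndQuorumEpoch, becomeCandidate, poll) are
> hypothetical:
> {code:java}
> /**
>  * Sketch of the three election triggers KIP-595 defines for a voter.
>  * All names are hypothetical; the timeouts map to quorum.fetch.timeout.ms
>  * and quorum.election.timeout.ms.
>  */
> public class VoterElectionSketch {
>     enum State { FOLLOWER, CANDIDATE, LEADER }
>
>     private final long fetchTimeoutMs;      // quorum.fetch.timeout.ms
>     private final long electionTimeoutMs;   // quorum.election.timeout.ms
>     private State state = State.FOLLOWER;
>     private long lastLeaderContactMs;       // last FetchResponse from the leader
>     private long candidacyStartMs;
>     private int epoch;
>
>     public VoterElectionSketch(long fetchTimeoutMs, long electionTimeoutMs, long nowMs) {
>         this.fetchTimeoutMs = fetchTimeoutMs;
>         this.electionTimeoutMs = electionTimeoutMs;
>         this.lastLeaderContactMs = nowMs;
>     }
>
>     // Reset the fetch timer whenever the leader answers a Fetch request.
>     public void onFetchResponse(long nowMs) {
>         lastLeaderContactMs = nowMs;
>     }
>
>     // Condition 2: the current leader asked us to start a new election.
>     public void onEndQuorumEpoch(long nowMs) {
>         becomeCandidate(nowMs);
>     }
>
>     // Called periodically; covers conditions 1 and 3.
>     public void poll(long nowMs) {
>         if (state == State.FOLLOWER && nowMs - lastLeaderContactMs >= fetchTimeoutMs) {
>             becomeCandidate(nowMs);   // condition 1: fetch timeout expired
>         } else if (state == State.CANDIDATE && nowMs - candidacyStartMs >= electionTimeoutMs) {
>             becomeCandidate(nowMs);   // condition 3: election timed out, retry
>         }
>     }
>
>     private void becomeCandidate(long nowMs) {
>         state = State.CANDIDATE;
>         candidacyStartMs = nowMs;
>         epoch++;   // bump the epoch; the real client would now send VoteRequests
>     }
> }
> {code}
> Note that every trigger above requires the node to receive something, or to
> fail to get a response to a request it sent. An isolated leader sends no
> Fetch requests at all, so none of these triggers ever fire for it.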
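>
> And here is a minimal sketch of the missing leader-side check from the
> "Leader Progress Timeout" section quoted above: the leader records when
> each follower voter last fetched from it and resigns once a majority of the
> quorum has been silent for quorum.fetch.timeout.ms. Again, the names
> (LeaderProgressSketch, onFetchRequest, shouldResign) are hypothetical, not
> the actual KafkaRaftClient API:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> /**
>  * Sketch of the leader progress timeout from KIP-595. Names are hypothetical.
>  */
> public class LeaderProgressSketch {
>     private final int voterCount;        // size of the voter set, leader included
>     private final long fetchTimeoutMs;   // quorum.fetch.timeout.ms
>     private final Map<Integer, Long> lastFetchTimeMs = new HashMap<>();
>
>     public LeaderProgressSketch(int voterCount, long fetchTimeoutMs) {
>         this.voterCount = voterCount;
>         this.fetchTimeoutMs = fetchTimeoutMs;
>     }
>
>     // Record that follower voter `voterId` sent the leader a Fetch request.
>     public void onFetchRequest(int voterId, long nowMs) {
>         lastFetchTimeMs.put(voterId, nowMs);
>     }
>
>     // True when the leader has not seen Fetch requests from a majority of
>     // the quorum (counting itself) within quorum.fetch.timeout.ms. The real
>     // fix would then resign, bump the epoch, and send VoteRequests to the
>     // other voters to learn the latest quorum state.
>     public boolean shouldResign(long nowMs) {
>         long recentFollowers = lastFetchTimeMs.values().stream()
>                 .filter(t -> nowMs - t < fetchTimeoutMs)
>                 .count();
>         int majority = voterCount / 2 + 1;
>         return recentFollowers + 1 < majority;   // +1: the leader counts itself
>     }
> }
> {code}
> In the flow above, once the partition separates A from [B, C], A stops
> receiving Fetch requests, shouldResign flips to true after
> quorum.fetch.timeout.ms, and A would step down instead of serving stale
> metadata to broker E in step 6.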
--
This message was sent by Atlassian Jira
(v8.20.10#820010)