[
https://issues.apache.org/jira/browse/KAFKA-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luke Chen reopened KAFKA-15489:
-------------------------------
> split brain in KRaft cluster
> -----------------------------
>
> Key: KAFKA-15489
> URL: https://issues.apache.org/jira/browse/KAFKA-15489
> Project: Kafka
> Issue Type: Bug
> Components: kraft
> Affects Versions: 3.5.1
> Reporter: Luke Chen
> Assignee: Luke Chen
> Priority: Major
> Fix For: 3.6.0
>
>
> I found that in the current KRaft implementation, when a network partition
> happens between the current controller leader and the other controller nodes,
> a "split brain" issue will occur: 2 leaders will exist in the controller
> cluster, and 2 inconsistent sets of metadata will be returned to the clients.
>
> *Root cause*
> In
> [KIP-595|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-Vote],
> we said a voter will begin a new election under three conditions:
> 1. If it fails to receive a FetchResponse from the current leader before
> expiration of quorum.fetch.timeout.ms
> 2. If it receives an EndQuorumEpoch request from the current leader
> 3. If it fails to receive a majority of votes before expiration of
> quorum.election.timeout.ms after declaring itself a candidate.
> And that is exactly what the current KRaft implementation does.
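>
> A minimal sketch of condition (1) on the follower side, assuming illustrative
> class and field names rather than the actual KafkaRaftClient internals:
> {code:java}
> // Sketch: a follower tracks the last successful FetchResponse from the
> // leader and becomes a candidate once quorum.fetch.timeout.ms has elapsed.
> // All names here are illustrative, not the real KRaft implementation.
> public class FollowerFetchTimeout {
>     private final long fetchTimeoutMs;   // quorum.fetch.timeout.ms
>     private long lastFetchResponseMs;    // time of the last FetchResponse from the leader
>
>     public FollowerFetchTimeout(long fetchTimeoutMs, long nowMs) {
>         this.fetchTimeoutMs = fetchTimeoutMs;
>         this.lastFetchResponseMs = nowMs;
>     }
>
>     // Called whenever a FetchResponse from the current leader arrives.
>     public void onFetchResponse(long nowMs) {
>         lastFetchResponseMs = nowMs;
>     }
>
>     // Condition (1): transition to candidate once the fetch timeout expired.
>     public boolean shouldBecomeCandidate(long nowMs) {
>         return nowMs - lastFetchResponseMs >= fetchTimeoutMs;
>     }
> }
> {code}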
>
> However, when the leader is isolated by a network partition, there is no way
> for it to resign from the leadership and start a new election. So the leader
> will remain the leader even if all the other nodes are down or unreachable,
> and this makes the split brain issue possible.
> Reading further in KIP-595, I found we had indeed considered this situation
> and have a solution for it. In [this
> section|https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-LeaderProgressTimeout],
> it says:
> {quote}In the pull-based model, however, say a new leader has been elected
> with a new epoch and everyone has learned about it except the old leader
> (e.g. that leader was not in the voters anymore and hence not receiving the
> BeginQuorumEpoch as well), then that old leader would not be notified by
> anyone about the new leader / epoch and become a pure "zombie leader", as
> there is no regular heartbeats being pushed from leader to the follower. This
> could lead to stale information being served to the observers and clients
> inside the cluster.
> {quote}
> {quote}To resolve this issue, we will piggy-back on the
> "quorum.fetch.timeout.ms" config, such that if the leader did not receive
> Fetch requests from a majority of the quorum for that amount of time, it
> would begin a new election and start sending VoteRequest to voter nodes in
> the cluster to understand the latest quorum. If it couldn't connect to any
> known voter, the old leader shall keep starting new elections and bump the
> epoch.
> {quote}
>
> But we missed implementing this part in the current KRaft.
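>
> A rough sketch of the missing leader-side check described in the quoted
> section above, again with illustrative names only (not the actual
> KafkaRaftClient / LeaderState API): the leader remembers when each voter last
> fetched from it and resigns once the voters that fetched recently (counting
> itself) no longer form a majority.
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> import java.util.Set;
>
> // Sketch of the KIP-595 "leader progress timeout": if the leader has not
> // received Fetch requests from a majority of voters within
> // quorum.fetch.timeout.ms, it should resign and start a new election.
> // Names are illustrative only.
> public class LeaderProgressTimeout {
>     private final long fetchTimeoutMs;     // quorum.fetch.timeout.ms
>     private final int leaderId;
>     private final Set<Integer> voterIds;   // all voters, including the leader
>     private final Map<Integer, Long> lastFetchMs = new HashMap<>();
>
>     public LeaderProgressTimeout(long fetchTimeoutMs, int leaderId,
>                                  Set<Integer> voterIds, long nowMs) {
>         this.fetchTimeoutMs = fetchTimeoutMs;
>         this.leaderId = leaderId;
>         this.voterIds = voterIds;
>         for (Integer voterId : voterIds) {
>             lastFetchMs.put(voterId, nowMs);
>         }
>     }
>
>     // Called when the leader handles a Fetch request from a voter.
>     public void onFetchFromVoter(int voterId, long nowMs) {
>         if (voterIds.contains(voterId)) {
>             lastFetchMs.put(voterId, nowMs);
>         }
>     }
>
>     // The leader counts itself as having made progress. If fewer than a
>     // majority of voters fetched within the timeout, the leader should
>     // resign, bump the epoch, and send VoteRequests to learn the latest quorum.
>     public boolean shouldResign(long nowMs) {
>         int active = 0;
>         for (Integer voterId : voterIds) {
>             if (voterId == leaderId || nowMs - lastFetchMs.get(voterId) < fetchTimeoutMs) {
>                 active++;
>             }
>         }
>         return active < voterIds.size() / 2 + 1;
>     }
> }
> {code}
> In the flow below, a check like this would fire on node A once B and C stop
> fetching from it, so A could step down instead of keeping serving stale
> metadata.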
>
> *The flow is like this:*
> 1. 3 controller nodes: A (leader), B (follower), C (follower).
> 2. A network partition happens between [A] and [B, C].
> 3. B and C start a new election since the fetch timeout expires before
> receiving a fetch response from leader A.
> 4. B (or C) is elected as leader in a new epoch, while A is still the leader
> in the old epoch.
> 5. Broker D creates a topic "new", and the update goes to leader B.
> 6. Broker E describes topic "new", but gets nothing because it is connected
> to the old leader A.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)