[ https://issues.apache.org/jira/browse/KAFKA-14077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jose Armando Garcia Sancio updated KAFKA-14077: ----------------------------------------------- Fix Version/s: 3.4.0 (was: 3.3.0) > KRaft should support recovery from failed disk > ---------------------------------------------- > > Key: KAFKA-14077 > URL: https://issues.apache.org/jira/browse/KAFKA-14077 > Project: Kafka > Issue Type: Bug > Reporter: Jason Gustafson > Priority: Blocker > Fix For: 3.4.0 > > > If one of the nodes in the metadata quorum has a disk failure, there is no > way currently to safely bring the node back into the quorum. When we lose > disk state, we are at risk of losing committed data even if the failure only > affects a minority of the cluster. > Here is an example. Suppose that a metadata quorum has 3 members: v1, v2, and > v3. Initially, v1 is the leader and writes a record at offset 1. After v2 > acknowledges replication of the record, it becomes committed. Suppose that v1 > fails before v3 has a chance to replicate this record. As long as v1 remains > down, the raft protocol guarantees that only v2 can become leader, so the > record cannot be lost. The raft protocol expects that when v1 returns, it > will still have that record, but what if there is a disk failure, the state > cannot be recovered and v1 participates in leader election? Then we would > have committed data on a minority of the voters. The main problem here > concerns how we recover from this impaired state without risking the loss of > this data. > Consider a naive solution which brings v1 back with an empty disk. Since the > node has lost is prior knowledge of the state of the quorum, it will vote for > any candidate that comes along. If v3 becomes a candidate, then it will vote > for itself and it just needs the vote from v1 to become leader. If that > happens, then the committed data on v2 will become lost. > This is just one scenario. In general, the invariants that the raft protocol > is designed to preserve go out the window when disk state is lost. For > example, it is also possible to contrive a scenario where the loss of disk > state leads to multiple leaders. There is a good reason why raft requires > that any vote cast by a voter is written to disk since otherwise the voter > may vote for different candidates in the same epoch. > Many systems solve this problem with a unique identifier which is generated > automatically and stored on disk. This identifier is then committed to the > raft log. If a disk changes, we would see a new identifier and we can prevent > the node from breaking raft invariants. Then recovery from a failed disk > requires a quorum reconfiguration. We need something like this in KRaft to > make disk recovery possible. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)