Jason Gustafson created KAFKA-14077:
---------------------------------------
Summary: KRaft should support recovery from failed disk
Key: KAFKA-14077
URL: https://issues.apache.org/jira/browse/KAFKA-14077
Project: Kafka
Issue Type: Bug
Reporter: Jason Gustafson
Fix For: 3.3.0
If one of the nodes in the metadata quorum has a disk failure, there is no way
currently to safely bring the node back into the quorum. When we lose disk
state, we are at risk of losing committed data even if the failure only affects
a minority of the cluster.
Here is an example. Suppose that a metadata quorum has 3 members: v1, v2, and
v3. Initially, v1 is the leader and writes a record at offset 1. After v2
acknowledges replication of the record, it becomes committed. Suppose that v1
fails before v3 has a chance to replicate this record. As long as v1 remains
down, the raft protocol guarantees that only v2 can become leader, so the
record cannot be lost. The raft protocol expects that when v1 returns, it will
still have that record. But what if a disk failure makes the state unrecoverable and v1 nevertheless participates in leader election? Then committed data would exist on only a minority of the voters. The main problem is how to recover from this impaired state without risking the loss of that data.
Consider a naive solution which brings v1 back with an empty disk. Since the node has lost its prior knowledge of the state of the quorum, it will vote for any candidate that comes along. If v3 becomes a candidate, it will vote for itself and needs only the vote from v1 to win the election. If that happens, the committed data on v2 will be lost.
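The scenario above can be made concrete with a small sketch. This is not Kafka code; the class, field, and method names (Voter, grant_vote, etc.) are invented for illustration, and the vote-granting rule is a simplified version of the standard raft check:

```python
# Minimal model of the data-loss scenario: v1 loses its disk, restarts
# empty, and its vote hands leadership to a candidate missing committed data.

class Voter:
    def __init__(self, name, log, epoch, voted_for=None):
        self.name = name
        self.log = log            # replicated records, normally on disk
        self.epoch = epoch        # latest epoch seen, normally on disk
        self.voted_for = voted_for

    def grant_vote(self, candidate, candidate_epoch, candidate_log_end):
        # Refuse if we already voted in this epoch (or a later one).
        if candidate_epoch <= self.epoch and self.voted_for is not None:
            return False
        # Refuse a candidate whose log is behind our own.
        if candidate_log_end < len(self.log):
            return False
        self.epoch = candidate_epoch
        self.voted_for = candidate
        return True

# Initial state: the record at offset 1 is committed on v1 (leader) and v2,
# but v3 has not replicated it yet.
v2 = Voter("v2", log=["r0", "r1"], epoch=1)
v3 = Voter("v3", log=["r0"], epoch=1)

# v1 suffers a disk failure and is naively restarted with an empty disk:
v1 = Voter("v1", log=[], epoch=0)

# v3 becomes a candidate in epoch 2. v2 refuses because v3's log is behind,
# but the amnesiac v1 happily grants its vote.
assert not v2.grant_vote("v3", 2, len(v3.log))
assert v1.grant_vote("v3", 2, len(v3.log))

# v3's own vote plus v1's vote is 2 of 3 voters: v3 wins the election
# despite missing the committed record "r1", which it can now overwrite.
```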
This is just one scenario. In general, the invariants that the raft protocol is
designed to preserve go out the window when disk state is lost. For example, it
is also possible to contrive a scenario where the loss of disk state leads to
multiple leaders. There is a good reason why raft requires that any vote cast by a voter be written to disk: otherwise the voter may vote for different candidates in the same epoch.
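The double-voting hazard can be shown with a few lines. Again this is an illustrative sketch, not Kafka code, and the names are invented:

```python
# Why raft persists votes: a voter that keeps voted_for only in memory can
# grant two votes in the same epoch after a crash and restart.

votes_granted = {}  # epoch -> set of candidates this voter has voted for

def grant_vote(epoch, candidate, voted_for):
    """voted_for is this voter's record of its vote in the given epoch."""
    if voted_for is not None:
        return False, voted_for
    votes_granted.setdefault(epoch, set()).add(candidate)
    return True, candidate

# In epoch 5 the voter votes for candidate A ...
granted, voted_for = grant_vote(5, "A", None)
assert granted

# ... then crashes and restarts. Because the vote was held only in memory,
# voted_for is None again, and the voter also votes for candidate B:
granted, voted_for = grant_vote(5, "B", None)
assert granted

# Both A and B now hold this voter's vote in epoch 5; in a 3-node quorum
# each could assemble a majority, i.e. two leaders in the same epoch.
assert votes_granted[5] == {"A", "B"}
```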
Many systems solve this problem with a unique identifier which is generated automatically and stored on disk. This identifier is then committed to the raft log. If a disk is replaced, the node comes up with a new identifier, so we can detect the change and prevent the node from breaking raft invariants. Recovery from a failed disk then requires a quorum reconfiguration. We need something like this in KRaft to make disk recovery possible.
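A rough sketch of the identifier approach might look like the following. The file name, function names, and recovery behavior here are assumptions for illustration, not Kafka's actual design:

```python
# Sketch: generate a disk identifier on first format, store it in the log
# dir, and fence the node if it no longer matches the id committed to the
# raft log. All names here are hypothetical.
import os
import uuid

META_FILE = "disk.id"  # hypothetical location of the stored identifier

def load_or_create_disk_id(log_dir):
    """Return the id stored in log_dir, generating one on first format."""
    path = os.path.join(log_dir, META_FILE)
    if os.path.exists(path):
        with open(path) as f:
            return f.read().strip()
    disk_id = str(uuid.uuid4())
    with open(path, "w") as f:
        f.write(disk_id)
    return disk_id

def may_participate(local_disk_id, committed_disk_id):
    """A node whose local disk id no longer matches the id committed to
    the raft log must not vote or become leader until a quorum
    reconfiguration re-adds it with its new id."""
    return local_disk_id == committed_disk_id
```

On a healthy restart the stored id matches the committed one and the node participates normally; after a disk replacement the ids differ, so the node is fenced until the quorum reconfigures.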