[
https://issues.apache.org/jira/browse/KAFKA-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jose Armando Garcia Sancio updated KAFKA-14139:
-----------------------------------------------
Fix Version/s: 3.4.0
> Replaced disk can lead to loss of committed data even with non-empty ISR
> ------------------------------------------------------------------------
>
> Key: KAFKA-14139
> URL: https://issues.apache.org/jira/browse/KAFKA-14139
> Project: Kafka
> Issue Type: Bug
> Reporter: Jason Gustafson
> Priority: Major
> Fix For: 3.4.0
>
>
> We have been thinking about disk failure cases recently. Suppose that a disk
> has failed and the user needs to restart the broker from an empty state. The
> concern is whether this can lead to the unnecessary loss of committed data.
> For normal topic partitions, removal from the ISR during controlled shutdown
> buys us some protection. After the replica is restarted, it must prove its
> state to the leader before it can be added back to the ISR. And it cannot
> become a leader until it does so.
> An obvious exception to this is when the replica is the last member in the
> ISR. In this case, the disk failure itself has compromised the committed
> data, so some amount of loss must be expected.
> We have been considering other scenarios in which the loss of one disk can
> lead to data loss even when there are replicas remaining which have all of
> the committed entries. One such scenario is this:
> Suppose we have a partition with two replicas: A and B. Initially A is the
> leader and it is the only member of the ISR.
> # Broker B catches up to A, so A attempts to send an AlterPartition request
> to the controller to add B into the ISR.
> # Before the AlterPartition request is received, replica B has a hard
> failure.
> # The current controller successfully fences broker B. It takes no action on
> this partition since B is already out of the ISR.
> # Before the controller receives the AlterPartition request to add B, it
> also fails.
> # While the new controller is initializing, suppose that replica B finishes
> startup, but the disk has been replaced (all of the previous state has been
> lost).
> # The new controller sees the registration from broker B first.
> # Finally, the AlterPartition request from A arrives and adds B back into
> the ISR, even though B now has an empty log.
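
The sequence above can be replayed with a toy model. This is only a sketch under stated assumptions: `ToyController`, its fields, and its validation rule (matching leader epoch plus broker registration, nothing else) are hypothetical stand-ins for illustration, not Kafka's controller code. Controller failover is modeled by copying the persisted state into a fresh object.

```python
from dataclasses import dataclass, field


@dataclass
class ToyController:
    """Hypothetical controller; its state is assumed to survive failover."""
    leader_epoch: int = 0
    isr: set = field(default_factory=lambda: {"A"})
    registered: set = field(default_factory=lambda: {"A", "B"})

    def fence(self, broker):
        # In this scenario B is not in the ISR, so partition state is unchanged.
        self.registered.discard(broker)

    def register(self, broker):
        self.registered.add(broker)

    def alter_partition(self, leader_epoch, new_isr):
        # Assumed validation: leader epoch matches and every member is registered.
        if leader_epoch != self.leader_epoch or not new_isr <= self.registered:
            return False
        self.isr = set(new_isr)
        return True


logs = {"A": ["m1", "m2"], "B": ["m1", "m2"]}
controller = ToyController()

# Step 1: A builds the request to add B; it is still in flight.
inflight = (controller.leader_epoch, {"A", "B"})
# Steps 2-3: B fails hard and the controller fences it.
controller.fence("B")
# Step 4: the controller fails; the new one starts from the same persisted state.
controller = ToyController(
    leader_epoch=controller.leader_epoch,
    isr=set(controller.isr),
    registered=set(controller.registered),
)
# Step 5: B comes back with a replaced (empty) disk.
logs["B"] = []
# Step 6: the new controller sees B's registration first.
controller.register("B")
# Step 7: the stale request finally arrives.
accepted = controller.alter_partition(*inflight)

print(accepted, sorted(controller.isr), logs["B"])
```

In this model the request is accepted and B sits in the ISR with an empty log, so a later failure of A would leave B eligible for leadership without any of the committed entries.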
> (Credit for coming up with this scenario goes to [~junrao] .)
> I tested this in KRaft and confirmed that this sequence is possible (even if
> perhaps unlikely). There are a few ways we could have potentially detected
> the issue. First, perhaps the controller should have bumped the leader epoch on
> all partitions when B was fenced. Then the inflight AlterPartition would be
> doomed no matter when it arrived.
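
A minimal sketch of the first option, assuming a hypothetical `ToyController` that bumps the leader epoch whenever a replica is fenced, whether or not that replica is in the ISR. The class and method names are illustrative only.

```python
class ToyController:
    """Hypothetical controller that bumps the leader epoch on every fence."""

    def __init__(self):
        self.leader_epoch = 0
        self.isr = {"A"}

    def fence(self, broker):
        # Assumed policy: fencing any replica bumps the epoch,
        # even when that replica is not currently in the ISR.
        self.leader_epoch += 1

    def alter_partition(self, leader_epoch, new_isr):
        # A request built under an older leader epoch is rejected.
        if leader_epoch != self.leader_epoch:
            return False
        self.isr = set(new_isr)
        return True


controller = ToyController()

# A builds the request under epoch 0, then B is fenced before it arrives.
inflight = (controller.leader_epoch, {"A", "B"})
controller.fence("B")
stale_accepted = controller.alter_partition(*inflight)
isr_after_stale = set(controller.isr)

# Once A has learned the new epoch and B has caught up again, a retry succeeds.
retry_accepted = controller.alter_partition(controller.leader_epoch, {"A", "B"})

print(stale_accepted, sorted(isr_after_stale), retry_accepted)
```

The trade-off in this sketch is that every fencing event touches every partition the broker replicates, not only those where it is in the ISR.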
> Alternatively, we could have relied on the broker epoch to distinguish the
> dead broker's state from that of the restarted broker. This could be done by
> including the broker epoch in both the `Fetch` request and in
> `AlterPartition`.
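
The broker-epoch option could look roughly like the following. This is a sketch under assumptions: the follower is assumed to send its broker epoch in each `Fetch`, the leader is assumed to echo the epoch it last saw in `AlterPartition`, and the `ToyController` class with its epoch counter is hypothetical.

```python
class ToyController:
    """Hypothetical controller that tracks one broker epoch per registration."""

    def __init__(self):
        self._next_epoch = 0
        self.broker_epochs = {}
        self.isr = {"A"}

    def register(self, broker):
        # Every (re)registration hands out a fresh, larger broker epoch.
        self._next_epoch += 1
        self.broker_epochs[broker] = self._next_epoch
        return self._next_epoch

    def alter_partition(self, isr_with_epochs):
        # Reject if any proposed member's epoch differs from its current registration.
        for broker, epoch in isr_with_epochs.items():
            if self.broker_epochs.get(broker) != epoch:
                return False
        self.isr = set(isr_with_epochs)
        return True


controller = ToyController()
epoch_a = controller.register("A")
epoch_b = controller.register("B")

# The leader learned B's broker epoch from B's Fetch requests and echoes it back.
inflight = {"A": epoch_a, "B": epoch_b}

# B restarts on an empty disk and registers again, which yields a new epoch.
new_epoch_b = controller.register("B")
stale_accepted = controller.alter_partition(inflight)
isr_after_stale = set(controller.isr)

# After the restarted B has caught up, the leader retries with the epoch it now sees.
retry_accepted = controller.alter_partition({"A": epoch_a, "B": new_epoch_b})

print(stale_accepted, sorted(isr_after_stale), retry_accepted)
```

Here the stale request is rejected regardless of when it arrives relative to the re-registration, because it names the epoch of the dead incarnation of B.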
> Finally, perhaps even normal Kafka replication should be using a unique
> identifier for each disk so that we can reliably detect when it has changed.
> For example, something like what was proposed for the metadata quorum here:
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Voter+Changes].
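
One way to picture the per-disk identifier idea, as a sketch only: the follower is assumed to generate a random id when its disk is first formatted and to present it with every fetch. `ToyLeader`, its methods, and the use of a UUID are hypothetical and not taken from the KIP.

```python
import uuid


class ToyLeader:
    """Hypothetical leader-side tracking of follower progress keyed by disk id."""

    def __init__(self, log_end_offset):
        self.log_end_offset = log_end_offset
        self.followers = {}  # broker -> (disk_id, fetch_offset)

    def on_fetch(self, broker, disk_id, fetch_offset):
        # Returns True when the broker shows up with a different disk than before.
        known = self.followers.get(broker)
        disk_changed = known is not None and known[0] != disk_id
        self.followers[broker] = (disk_id, fetch_offset)
        return disk_changed

    def is_caught_up(self, broker, disk_id):
        # Progress only counts if it was reported from the same disk.
        return self.followers.get(broker) == (disk_id, self.log_end_offset)


leader = ToyLeader(log_end_offset=100)

old_disk = uuid.uuid4()
leader.on_fetch("B", old_disk, 100)
eligible_before = leader.is_caught_up("B", old_disk)

# The disk is replaced: a new id is generated and B fetches from offset 0.
new_disk = uuid.uuid4()
disk_changed = leader.on_fetch("B", new_disk, 0)
eligible_after = leader.is_caught_up("B", new_disk)
stale_claim = leader.is_caught_up("B", old_disk)

print(eligible_before, disk_changed, eligible_after, stale_claim)
```

With an identity per disk, progress reported by the old disk cannot be credited to the replacement, which is the property the scenario above lacks.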
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)