[GitHub] [kafka] hachikuji commented on a change in pull request #10309: KAFKA-12181; Loosen raft fetch offset validation of remote replicas

GitBox Mon, 22 Mar 2021 11:46:24 -0700


hachikuji commented on a change in pull request #10309:
URL: https://github.com/apache/kafka/pull/10309#discussion_r598986869




##########
File path: raft/src/main/java/org/apache/kafka/raft/LeaderState.java
##########
@@ -170,36 +183,38 @@ public boolean updateReplicaState(int replicaId,
             .collect(Collectors.toList());
     }
 
-    private List<VoterState> followersByDescendingFetchOffset() {
-        return new ArrayList<>(this.voterReplicaStates.values()).stream()
+    private List<ReplicaState> followersByDescendingFetchOffset() {
+        return new ArrayList<>(this.voterStates.values()).stream()
             .sorted()
             .collect(Collectors.toList());
     }
 
     private boolean updateEndOffset(ReplicaState state,
                                     LogOffsetMetadata endOffsetMetadata) {
         state.endOffset.ifPresent(currentEndOffset -> {
-            if (currentEndOffset.offset > endOffsetMetadata.offset)
-                throw new IllegalArgumentException("Non-monotonic update to end offset for nodeId " + state.nodeId);
+            if (currentEndOffset.offset > endOffsetMetadata.offset) {
+                if (state.nodeId == localId) {
+                    throw new IllegalStateException("Detected non-monotonic update of local " +
+                        "end offset: " + currentEndOffset.offset + " -> " + endOffsetMetadata.offset);
+                } else {
+                    log.warn("Detected non-monotonic update of fetch offset from nodeId {}: {} -> {}",

Review comment:
       The situation we are trying to handle is a follower losing its disk. 
By the time we receive the Fetch, the damage is already done, and the 
only thing we can do is let the follower try to catch back up. The problem with 
the old logic is that it prevented this even in situations that would not 
violate any guarantees. I am planning to file a follow-up JIRA to think through 
ways of handling disk loss more generally. We would like to at least 
detect the situation and see if we can limit the damage it causes.
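
For illustration only, here is a minimal standalone sketch of the loosened check discussed above. It is not the code in LeaderState: the class EndOffsetTracker, its methods, and the warning counter are hypothetical names, and it assumes the leader still records the lower offset after warning so that the follower can catch back up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

/**
 * Hypothetical stand-in for the leader's per-replica end-offset bookkeeping.
 * A backwards move of the local offset is treated as a bug; a backwards move
 * of a remote replica's fetch offset is only logged and then accepted.
 */
final class EndOffsetTracker {
    private final int localId;
    private final Map<Integer, Long> endOffsets = new HashMap<>();
    private int warnings = 0;

    EndOffsetTracker(int localId) {
        this.localId = localId;
    }

    /** Records the new end offset and returns true if it moved backwards. */
    boolean update(int nodeId, long newOffset) {
        Long current = endOffsets.get(nodeId);
        boolean regressed = current != null && current > newOffset;
        if (regressed) {
            if (nodeId == localId) {
                // The local log should never go backwards while we are leader.
                throw new IllegalStateException("Detected non-monotonic update of local "
                    + "end offset: " + current + " -> " + newOffset);
            }
            // A remote replica may have lost its disk; warn instead of failing.
            warnings++;
            System.err.printf("Detected non-monotonic update of fetch offset "
                + "from nodeId %d: %d -> %d%n", nodeId, current, newOffset);
        }
        // Assumption: the lower offset is accepted so the follower can re-fetch.
        endOffsets.put(nodeId, newOffset);
        return regressed;
    }

    /** Returns the last recorded end offset for the node, if any. */
    OptionalLong endOffset(int nodeId) {
        Long value = endOffsets.get(nodeId);
        return value == null ? OptionalLong.empty() : OptionalLong.of(value);
    }

    /** Number of remote regressions seen so far. */
    int warningCount() {
        return warnings;
    }
}
```

In this sketch, a remote replica that reports offset 10 and then offset 3 produces a warning and is tracked at 3, while the same sequence for the local node throws IllegalStateException.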




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
