[ https://issues.apache.org/jira/browse/RATIS-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943815#comment-17943815 ]
Xinhao GU edited comment on RATIS-2278 at 4/12/25 1:40 PM: ----------------------------------------------------------- !image-2025-04-12-21-39-10-485.png! Although we changed the the Preconditions to validate startIndex, the ERROR appears again. So I think the the Preconditions should be removed completely. [~szetszwo] {code:java} --- a/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerImplUtils.java +++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerImplUtils.java @@ -138,13 +138,6 @@ public final class ServerImplUtils { synchronized void append(List<ConsecutiveIndices> entriesTermIndices) { for(ConsecutiveIndices indices : entriesTermIndices) { - // validate startIndex - final Map.Entry<Long, ConsecutiveIndices> lastEntry = map.lastEntry(); - if (lastEntry != null) { - final long nextIndex = lastEntry.getValue().getNextIndex(); - Preconditions.assertTrue(indices.startIndex >= nextIndex, - () -> "startIndex = " + indices.startIndex + " < nextIndex = " + nextIndex); - } map.put(indices.startIndex, indices); } } {code} was (Author: JIRAUSER308993): !image-2025-04-12-21-39-10-485.png! {code:java} {code} Although we changed the the Preconditions to validate startIndex, the ERROR appears again. So I think the the Preconditions should be removed completely. [~szetszwo] {code:java} --- a/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerImplUtils.java +++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerImplUtils.java @@ -138,13 +138,6 @@ public final class ServerImplUtils { synchronized void append(List<ConsecutiveIndices> entriesTermIndices) { for(ConsecutiveIndices indices : entriesTermIndices) { - // validate startIndex - final Map.Entry<Long, ConsecutiveIndices> lastEntry = map.lastEntry(); - if (lastEntry != null) { - final long nextIndex = lastEntry.getValue().getNextIndex(); - Preconditions.assertTrue(indices.startIndex >= nextIndex, - () -> "startIndex = " + indices.startIndex + " < nextIndex = " + nextIndex); - } map.put(indices.startIndex, indices); } } {code} > Follower Fails to Append Entries Due to Index Validation Race Condition in > NavigableIndices > ------------------------------------------------------------------------------------------- > > Key: RATIS-2278 > URL: https://issues.apache.org/jira/browse/RATIS-2278 > Project: Ratis > Issue Type: Bug > Components: Leader > Reporter: Xinhao GU > Assignee: Xinhao GU > Priority: Major > Fix For: 3.2.0 > > Attachments: image-2025-04-06-22-56-34-207.png, > image-2025-04-12-21-39-10-485.png > > Time Spent: 40m > Remaining Estimate: 0h > > h2. *Problem Description* > When a follower receives overlapping {{appendEntries}} requests from the > leader, it may throw an {{IllegalStateException}} during log index > validation. Specifically, the {{NavigableIndices}} class fails to handle > concurrent log appends properly, causing the follower to incorrectly reject > valid entries with an "index mismatch" error. > h2. *Steps to Reproduce* > # {*}Cluster Setup{*}: Deploy a Raft group with at least 1 leader and 2 > followers. > # {*}High Concurrency Scenario{*}: Trigger frequent {{appendEntries}} > requests from the leader to a follower. > # {*}Overlapping Requests{*}: Ensure two threads ({{{}thread-1{}}} and > {{{}thread-2{}}}) process {{appendEntries}} requests simultaneously: > ** {{thread-1}} appends logs with indices {{16093–16185}} and enters > {{state.getLog().append()}} but hasn’t invoked {{{}removeExisting(){}}}. > ** {{thread-2}} attempts to append logs starting at index {{16093}} (from a > new request) while {{thread-1}} is still processing. > # {*}Validation Failure{*}: The {{NavigableIndices.append()}} method detects > a gap between {{{}thread-1{}}}'s last index ({{{}16185{}}}) and > {{{}thread-2{}}}'s start index ({{{}16093{}}}), throwing an error. > h2. *Expected Behavior* > Concurrent {{appendEntries}} requests should be serialized or validated > atomically to prevent index gaps. Followers should accept logs as long as > they pass the initial {{ConsecutiveIndices.convert()}} check. > h2. *Actual Behavior* > > {code:java} > 2025-03-04 05:44:14,946 [7-server-thread2] ERROR > o.a.r.s.i.RaftServerImpl:1488 - 7@group-000200000000: Failed appendEntries > 8->7#73-t49,previous=(t:3, i:16092),leaderCommit=16184,initializing? > false,entries: size=93, first=(t:4, i:16093), > CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER, > id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER, > id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old > java.lang.IllegalStateException: startIndex: expected == 16186 but computed > == 16093 at > org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77) > at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87) > at > org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144) > at > org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631) > at > org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596) > at > org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486) > at > org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646) > at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118) > at > org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906) > at > java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:833) > 2025-03-04 05:44:14,950 [grpc-default-executor-8] WARN > o.a.ratis.util.LogUtils:129 - 7: Failed APPEND_ENTRIES request > 8->7#73-t49,previous=(t:3, i:16092),leaderCommit=16184,initializing? > false,entries: size=93, first=(t:4, i:16093), > CONFIGURATIONENTRY(current:id:"6"address:"172.16.2.14:10750"startupRole:FOLLOWER, > id:"7"address:"172.16.2.15:10750"startupRole:FOLLOWER, > id:"8"address:"172.16.2.11:10750"startupRole:FOLLOWER, old > java.util.concurrent.CompletionException: java.io.IOException: > java.lang.IllegalStateException: startIndex: expected == 16186 but computed > == 16093 > at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:122) > at > org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitServerRequestAsync$10(RaftServerImpl.java:906) > at > java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:833) > Caused by: java.io.IOException: java.lang.IllegalStateException: startIndex: > expected == 16186 but computed == 16093 > at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56) > at > org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1490) > at > org.apache.ratis.server.impl.RaftServerProxy.lambda$null$28(RaftServerProxy.java:646) > at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118) > ... 5 common frames omitted > Caused by: java.lang.IllegalStateException: startIndex: expected == 16186 but > computed == 16093 > at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77) > at org.apache.ratis.util.Preconditions.assertSame(Preconditions.java:87) > at > org.apache.ratis.server.impl.ServerImplUtils$NavigableIndices.append(ServerImplUtils.java:144) > at > org.apache.ratis.server.impl.RaftServerImpl.appendLog(RaftServerImpl.java:1631) > at > org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1596) > at > org.apache.ratis.server.impl.RaftServerImpl.appendEntriesAsync(RaftServerImpl.java:1486) > ... 7 common frames omitted*{code} > *!image-2025-04-06-22-56-34-207.png|width=1143,height=288!* > > h2. *Root Cause Analysis* > # {*}Thread Race Condition{*}: > The {{NavigableIndices}} class uses {{synchronized}} methods for {{append()}} > and {{{}removeExisting(){}}}, but *inter-method concurrency* is not handled. > If two threads interleave execution (e.g., {{thread-1}} appends logs but > hasn’t cleaned up {{map}} via {{{}removeExisting(){}}}, while {{thread-2}} > starts a new append), the index continuity check in {{append()}} fails > erroneously. > # {*}Redundant Validation{*}: > The {{append()}} method revalidates log continuity using the {{{}map{}}}’s > last entry, even though {{ConsecutiveIndices.convert()}} already ensures > intra-request index continuity. This creates unnecessary > concurrency-sensitive checks. > h2. *Proposed Solutions* > h3. {*}Option 1{*}: Remove Redundant Validation in {{append()}} > Modify {{NavigableIndices.append()}} to skip the startIndex check, relying > solely on {{convert()}} for validation. > {*}Justification{*}: > * {{convert()}} already ensures all entries in a single request are > consecutive. > * {{removeExisting()}} guarantees cleanup after appending, making > cross-request checks redundant and race-prone. > h3. {*}Option 2{*}: Global Lock for {{NavigableIndices'map}} > Add a global lock for map to serialize all {{append()}} and > {{removeExisting()}} operations. However, this could introduce performance > bottlenecks. > -- This message was sent by Atlassian Jira (v8.20.10#820010)