[ https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873786#comment-17873786 ]
Tsz-wo Sze edited comment on RATIS-2137 at 8/15/24 6:35 PM: ------------------------------------------------------------ bq. ... leader decreaseNextIndex repeatedly, even if the follower's index is greater than the value, meets your expectation? Unless the follower has sent a suggested nextIndex, the leader only can decrease nextIndex and retry bq. Should leader adjust the index smarter or let follower rolling segment log in this situation? How? BTW, rolling segment log is nothing to do with leader sending appendEntries. Log is logically a continuous list of records (i.e. there are no segments, logically). was (Author: szetszwo): bq. ... leader decreaseNextIndex repeatedly, even if the follower's index is greater than the value, meets your expectation? Unless the follower has sent a suggested nextIndex, the leader only can decrease nextIndex and retry bq. Should leader adjust the index smarter or let follower rolling segment log in this situation? How? BTW, rolling segment log is nothing to do with leader sending appendEntries. Log is logically a continuous records (i.e. there are no segments, logically). > Leader fails to send correct index to follower after timeout exception > ---------------------------------------------------------------------- > > Key: RATIS-2137 > URL: https://issues.apache.org/jira/browse/RATIS-2137 > Project: Ratis > Issue Type: Bug > Affects Versions: 2.5.1 > Reporter: Kevin Liu > Priority: Major > Attachments: image-2024-08-13-11-28-16-250.png > > > *I found that after the following log, the follower became unavailable. The > follower received incorrect entries repeatedly for about 10min, then got > installSnapshot failed and started to election. After two hours, it succeed > to install snapshot, but failed to updateLastAppliedTermIndex. After that, it > repeated 'receive installSnapshot and installSnapshot failed' for several > hours until I restarted the server.* > 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. > Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. > Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > *....(repeat 'Failed appendEntries')* > 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: Failed appendEntries as the first entry (index > 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893) > 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: inconsistency entries. > Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1 > 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: > 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0 > 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed > java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is > 34795893, last included index in snapshot is 34670057 > 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An > exceptionCaught() event was fired, and it reached at the tail of the > pipeline. It usually means the last handler in the pipeline did not handle > the exception. > java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is > 34795893, last included index in snapshot is 34670057 > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] > FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, > lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: > shutdown 1@group-47BEDE733167-FollowerState > 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] > RaftServer$Division: 1@group-47BEDE733167: changes role from FOLLOWER to > CANDIDATE at term 59 for changeToCandidate > 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] > RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default) > 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: > start 1@group-47BEDE733167-LeaderElection5 > 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] > RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at > term 59 for PRE_VOTE > 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] > LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit > vote requests at term 59 for 34233595: > peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, > > 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, > > 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[], > old=null > 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] > LeaderElection: 1@group-47BEDE733167-LeaderElection5: PRE_VOTE REJECTED > received 2 response(s) and 0 exception(s): > 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] > LeaderElection: Response 0: 1<-2#0:FAIL-t59 > 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] > LeaderElection: Response 1: 1<-3#0:FAIL-t59 > 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] > LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: result > REJECTED > 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] > RaftServer$Division: 1@group-47BEDE733167: changes role from CANDIDATE to > FOLLOWER at term 59 for REJECTED > 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo: > 1: shutdown 1@group-47BEDE733167-LeaderElection5 > 24/08/11 09:15:50,111 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo: > 1: start 1@group-47BEDE733167-FollowerState > 24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: > 3->1#0-t59,chunk:802b8106-076b-4ffc-8cf1-1ace04952c7a,0 > 24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: change Leader from null to 3 at term 59 for > installSnapshot, leader elected after 1967ms > 24/08/11 09:15:52,075 ERROR [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed > java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is > 34795893, last included index in snapshot is 34670057 > *(repeat 'election, fail, install snapshot fail' for more than 2 hours)* > *(still election fail, but this time install snapshot successfully, but > failed to updateLastAppliedTermIndex)* > 24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: > 3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0 > 24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 1@group-47BEDE733167: change Leader from null to 3 at term 59 for > installSnapshot, leader elected after 3610ms > 24/08/11 11:52:39,434 INFO [nioEventLoopGroup-3-3] SnapshotManager: > Installing snapshot:3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0, > to tmp > dir:/data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7 > 24/08/11 11:52:39,455 INFO [nioEventLoopGroup-3-3] SnapshotManager: Installed > snapshot, renaming temporary dir > /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7 > to /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm > 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] > StateMachine: Reinitializing state machine. > 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] RaftLog: > 1@group-47BEDE733167-SegmentedRaftLog: snapshotIndex: updateIncreasingly > 33870802 -> 34870063 > 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] > StateMachine: obsolete snapshot provided: (t:59, i:34670809) > 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: > 1@group-47BEDE733167-SegmentedRaftLogWorker: flushIndex: setUnconditionally > 34795893 -> 34870063 > 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: > 1@group-47BEDE733167-SegmentedRaftLogWorker: safeCacheEvictIndex: > setUnconditionally 34783743 -> 34870063 > 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] > StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater: appliedIndex: > setUnconditionally 34795893 -> 34670809 > 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: > 1@group-47BEDE733167-SegmentedRaftLogWorker: Closing segment > log_inprogress_34783744 to index: 34795893 > 24/08/11 11:52:40,837 INFO [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: successfully install the > entire snapshot-34870063 > 24/08/11 11:52:40,877 INFO [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: reply installSnapshot: > 3<-1#0:OK-t59,SUCCESS,requestIndex=0 > 24/08/11 11:52:40,902 INFO [1@group-47BEDE733167-SegmentedRaftLogWorker] > SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: Rolled > log segment from > /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_inprogress_34783744 > to > /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_34783744-34795893 > 24/08/11 11:52:41,160 ERROR [1@group-47BEDE733167-StateMachineUpdater] > StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater caught a > Throwable. > java.lang.IllegalStateException: 1: Failed updateLastAppliedTermIndex: newTI > = (t:59, i:34670810) < oldTI = (t:59, i:34795893) > 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] > RaftServer$Division: 1@group-47BEDE733167: shutdown > 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] > JmxRegister: Successfully un-registered JMX Bean with object name > Ratis:service=RaftServer,group=group-47BEDE733167,id=1 > 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] > RoleInfo: 1: shutdown 1@group-47BEDE733167-FollowerState > 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-FollowerState] > FollowerState: 1@group-47BEDE733167-FollowerState was interrupted > 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] > RaftServer$Division: 1@group-47BEDE733167: closes. applyIndex: 34670810 > 24/08/11 11:52:42,340 INFO [1@group-47BEDE733167-StateMachineUpdater] > SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker close() > 24/08/11 11:52:48,170 INFO [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: > 3->1#0-t59,chunk:7179265b-d007-4835-9b5f-fa9e4d853b72,0 > 24/08/11 11:52:48,170 ERROR [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed > org.apache.ratis.protocol.exceptions.ServerNotReadyException: > 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED > 24/08/11 11:52:58,171 INFO [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: > 3->1#0-t59,chunk:7f56f35f-20f3-4b42-bdb6-c2d52db646fc,0 > 24/08/11 11:52:58,171 ERROR [nioEventLoopGroup-3-3] > SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed > org.apache.ratis.protocol.exceptions.ServerNotReadyException: > 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED > *(repeat 'receive installSnapshot and installSnapshot failed' for serval > hours until I restarted the server)* > *Here is what I found in the leader's log* > 24/08/11 09:03:10,130 WARN > [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] LogAppender: > 3@group-47BEDE733167->1-LogAppenderDefault: Failed to appendEntries > (retry=1): org.apache.ratis.protocol.exceptions.TimeoutIOException > 24/08/11 09:03:13,714 INFO > [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo: > 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally > 34795876 -> 34795875 > 24/08/11 09:03:13,715 INFO > [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo: > 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally > 34795875 -> 34795874 > *I guess when the leader called appendEntries for the first time, the > follower had been executed successfully, but the leader did not receive the > follower's response and times out. When it resent, it found that it could not > match the change and started to decreaseNextIndex.* > *Sometimes it can be fixed automatically by rolling segment log, but not > always.* > 24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 2@group-47BEDE733167: Failed appendEntries as the first entry (index > 35049128) already exists (snapshotIndex: 35070063, commitIndex: 35259248) > 24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division: > 2@group-47BEDE733167: inconsistency entries. > Reply:3<-2#3583083:FAIL-t59,INCONSISTENCY,nextIndex=35259249,followerCommit=35259248,matchIndex=-1 > 24/08/12 11:47:55,132 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: > 2@group-47BEDE733167-SegmentedRaftLogWorker: Rolling segment > log-35250411_35261308 to index:35261308 -- This message was sent by Atlassian Jira (v8.20.10#820010)