[ 
https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879663#comment-17879663
 ] 

Tsz-wo Sze commented on RATIS-2137:
-----------------------------------

[~lemony], are you going to build it yourself?  Or do you need a release based 
on 2.5.1?  Please feel free to share your thought.

> Leader fails to send correct index to follower after timeout exception
> ----------------------------------------------------------------------
>
>                 Key: RATIS-2137
>                 URL: https://issues.apache.org/jira/browse/RATIS-2137
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 2.5.1
>            Reporter: Kevin Liu
>            Assignee: Kevin Liu
>            Priority: Major
>             Fix For: 3.2.0
>
>         Attachments: image-2024-08-13-11-28-16-250.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> *I found that after the following log, the follower became unavailable. The 
> follower received incorrect entries repeatedly for about 10min, then got 
> installSnapshot failed and started to election. After two hours, it succeed 
> to install snapshot, but failed to updateLastAppliedTermIndex. After that, it 
> repeated 'receive installSnapshot and installSnapshot failed' for several 
> hours until I restarted the server.*
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> *....(repeat 'Failed appendEntries')*
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0
> 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An 
> exceptionCaught() event was fired, and it reached at the tail of the 
> pipeline. It usually means the last handler in the pipeline did not handle 
> the exception.
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServer$Division: 1@group-47BEDE733167: changes role from  FOLLOWER to 
> CANDIDATE at term 59 for changeToCandidate
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default)
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> start 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at 
> term 59 for PRE_VOTE
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit 
> vote requests at term 59 for 34233595: 
> peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=null
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5: PRE_VOTE REJECTED 
> received 2 response(s) and 0 exception(s):
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection:   Response 0: 1<-2#0:FAIL-t59
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection:   Response 1: 1<-3#0:FAIL-t59
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: result 
> REJECTED
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> RaftServer$Division: 1@group-47BEDE733167: changes role from CANDIDATE to 
> FOLLOWER at term 59 for REJECTED
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo: 
> 1: shutdown 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,111 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo: 
> 1: start 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:802b8106-076b-4ffc-8cf1-1ace04952c7a,0
> 24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: change Leader from null to 3 at term 59 for 
> installSnapshot, leader elected after 1967ms
> 24/08/11 09:15:52,075 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> *(repeat 'election, fail, install snapshot fail' for more than 2 hours)*
> *(still election fail, but this time install snapshot successfully, but 
> failed to updateLastAppliedTermIndex)*
> 24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0
> 24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: change Leader from null to 3 at term 59 for 
> installSnapshot, leader elected after 3610ms
> 24/08/11 11:52:39,434 INFO [nioEventLoopGroup-3-3] SnapshotManager: 
> Installing snapshot:3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0, 
> to tmp 
> dir:/data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7
> 24/08/11 11:52:39,455 INFO [nioEventLoopGroup-3-3] SnapshotManager: Installed 
> snapshot, renaming temporary dir 
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7
>  to /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm
> 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> StateMachine: Reinitializing state machine.
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] RaftLog: 
> 1@group-47BEDE733167-SegmentedRaftLog: snapshotIndex: updateIncreasingly 
> 33870802 -> 34870063
> 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> StateMachine: obsolete snapshot provided: (t:59, i:34670809)
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 
> 1@group-47BEDE733167-SegmentedRaftLogWorker: flushIndex: setUnconditionally 
> 34795893 -> 34870063
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 
> 1@group-47BEDE733167-SegmentedRaftLogWorker: safeCacheEvictIndex: 
> setUnconditionally 34783743 -> 34870063
> 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater: appliedIndex: 
> setUnconditionally 34795893 -> 34670809
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 
> 1@group-47BEDE733167-SegmentedRaftLogWorker: Closing segment 
> log_inprogress_34783744 to index: 34795893
> 24/08/11 11:52:40,837 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: successfully install the 
> entire snapshot-34870063
> 24/08/11 11:52:40,877 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: reply installSnapshot: 
> 3<-1#0:OK-t59,SUCCESS,requestIndex=0
> 24/08/11 11:52:40,902 INFO [1@group-47BEDE733167-SegmentedRaftLogWorker] 
> SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: Rolled 
> log segment from 
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_inprogress_34783744
>  to 
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_34783744-34795893
> 24/08/11 11:52:41,160 ERROR [1@group-47BEDE733167-StateMachineUpdater] 
> StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater caught a 
> Throwable.
> java.lang.IllegalStateException: 1: Failed updateLastAppliedTermIndex: newTI 
> = (t:59, i:34670810) < oldTI = (t:59, i:34795893)
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> RaftServer$Division: 1@group-47BEDE733167: shutdown
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> JmxRegister: Successfully un-registered JMX Bean with object name 
> Ratis:service=RaftServer,group=group-47BEDE733167,id=1
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> RoleInfo: 1: shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-FollowerState] 
> FollowerState: 1@group-47BEDE733167-FollowerState was interrupted
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> RaftServer$Division: 1@group-47BEDE733167: closes. applyIndex: 34670810
> 24/08/11 11:52:42,340 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker close()
> 24/08/11 11:52:48,170 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:7179265b-d007-4835-9b5f-fa9e4d853b72,0
> 24/08/11 11:52:48,170 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
> 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED
> 24/08/11 11:52:58,171 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:7f56f35f-20f3-4b42-bdb6-c2d52db646fc,0
> 24/08/11 11:52:58,171 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
> 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED
> *(repeat 'receive installSnapshot and installSnapshot failed' for serval 
> hours until I restarted the server)*
> *Here is what I found in the leader's log*
> 24/08/11 09:03:10,130 WARN 
> [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] LogAppender: 
> 3@group-47BEDE733167->1-LogAppenderDefault: Failed to appendEntries 
> (retry=1): org.apache.ratis.protocol.exceptions.TimeoutIOException
> 24/08/11 09:03:13,714 INFO 
> [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo: 
> 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally 
> 34795876 -> 34795875
> 24/08/11 09:03:13,715 INFO 
> [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo: 
> 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally 
> 34795875 -> 34795874
> *I guess when the leader called appendEntries for the first time, the 
> follower had been executed successfully, but the leader did not receive the 
> follower's response and times out. When it resent, it found that it could not 
> match the change and started to decreaseNextIndex.*
> *Sometimes it can be fixed automatically by rolling segment log, but not 
> always.*
> 24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 2@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 35049128) already exists (snapshotIndex: 35070063, commitIndex: 35259248)
> 24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 2@group-47BEDE733167: inconsistency entries. 
> Reply:3<-2#3583083:FAIL-t59,INCONSISTENCY,nextIndex=35259249,followerCommit=35259248,matchIndex=-1
> 24/08/12 11:47:55,132 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 
> 2@group-47BEDE733167-SegmentedRaftLogWorker: Rolling segment 
> log-35250411_35261308 to index:35261308
> 24/08/12 11:54:46,402 INFO [2@group-47BEDE733167-StateMachineUpdater] 
> StateMachine: Taking a snapshot to file 
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1723434886330_4407392949522239377.dat.
> 24/08/12 11:54:46,403 INFO [2@group-47BEDE733167-StateMachineUpdater] 
> StateMachine: Saving digest 272e2c87a8e62785588ab383c0b8c290 for snapshot 
> file 
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_35270066.
> 24/08/12 11:54:46,458 INFO [2@group-47BEDE733167-StateMachineUpdater] 
> StateMachine: Renaming a snapshot file 
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1723434886330_4407392949522239377.dat
>  to 
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_35270066.
> 24/08/12 11:54:46,487 INFO [2@group-47BEDE733167-StateMachineUpdater] 
> SimpleStateMachineStorage: Deleting old snapshot at 
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_34670063
> 24/08/12 11:54:46,530 INFO [2@group-47BEDE733167-StateMachineUpdater] 
> StateMachineUpdater: 2@group-47BEDE733167-StateMachineUpdater: Took a 
> snapshot at index 35270066
> 24/08/12 11:54:46,530 INFO [2@group-47BEDE733167-StateMachineUpdater] 
> StateMachineUpdater: 2@group-47BEDE733167-StateMachineUpdater: snapshotIndex: 
> updateIncreasingly 35070063 -> 35270066
> 24/08/12 11:57:16,592 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 
> 2@group-47BEDE733167-SegmentedRaftLogWorker: Rolling segment 
> log-35261309_35271592 to index:35271592



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to