[ 
https://issues.apache.org/jira/browse/RATIS-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873389#comment-17873389
 ] 

Kevin Liu edited comment on RATIS-2137 at 8/14/24 3:26 AM:
-----------------------------------------------------------

[~szetszwo] , yes, NETTY, and raft.client.rpc.request.timeout=10s. I am not 
sure why the follower did not response in time. There are no info between  
09:00:40,405 and 09:03:13,714(the follower got the first Failed appendEntries), 
and no info indicated GC pause by JvmPauseMonitor in the follower's log file. 
So maybe network packet loss when the follower response to the leader?Anyway, 
is this phenomenon, leader decreaseNextIndex repeatedly, even if the follower's 
index is greater than the value, meets your expectation? Should leader adjust 
the index smarter or let follower rolling segment log in this situation?


was (Author: lemony):
[~szetszwo] , yes, NETTY, and raft.client.rpc.request.timeout=10s. I am not 
sure why the follower did not response in time. There are no info between  
09:00:40,405 and 09:03:13,714(the follower got the first Failed appendEntries), 
and no info indicated GC pause by JvmPauseMonitor in the follower's log file. 
So maybe network packet loss?Anyway, is this phenomenon, leader 
decreaseNextIndex repeatedly, even if the follower's index is greater than the 
value, meets your expectation? Should leader adjust the index smarter or let 
follower rolling segment log in this situation?

> Leader fails to send correct index to follower after timeout exception
> ----------------------------------------------------------------------
>
>                 Key: RATIS-2137
>                 URL: https://issues.apache.org/jira/browse/RATIS-2137
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 2.5.1
>            Reporter: Kevin Liu
>            Priority: Major
>         Attachments: image-2024-08-13-11-28-16-250.png
>
>
> *I found that after the following log, the follower became unavailable. The 
> follower received incorrect entries repeatedly for about 10min, then got 
> installSnapshot failed and started to election. After two hours, it succeed 
> to install snapshot, but failed to updateLastAppliedTermIndex. After that, it 
> repeated 'receive installSnapshot and installSnapshot failed' for several 
> hours until I restarted the server.*
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> *....(repeat 'Failed appendEntries')*
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
> 24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: inconsistency entries. 
> Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
> 24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0
> 24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An 
> exceptionCaught() event was fired, and it reached at the tail of the 
> pipeline. It usually means the last handler in the pipeline did not handle 
> the exception.
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServer$Division: 1@group-47BEDE733167: changes role from  FOLLOWER to 
> CANDIDATE at term 59 for changeToCandidate
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] 
> RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default)
> 24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: 
> start 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at 
> term 59 for PRE_VOTE
> 24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit 
> vote requests at term 59 for 34233595: 
> peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER,
>  
> 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[],
>  old=null
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5: PRE_VOTE REJECTED 
> received 2 response(s) and 0 exception(s):
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection:   Response 0: 1<-2#0:FAIL-t59
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection:   Response 1: 1<-3#0:FAIL-t59
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: result 
> REJECTED
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] 
> RaftServer$Division: 1@group-47BEDE733167: changes role from CANDIDATE to 
> FOLLOWER at term 59 for REJECTED
> 24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo: 
> 1: shutdown 1@group-47BEDE733167-LeaderElection5
> 24/08/11 09:15:50,111 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo: 
> 1: start 1@group-47BEDE733167-FollowerState
> 24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:802b8106-076b-4ffc-8cf1-1ace04952c7a,0
> 24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: change Leader from null to 3 at term 59 for 
> installSnapshot, leader elected after 1967ms
> 24/08/11 09:15:52,075 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 
> 34795893, last included index in snapshot is 34670057
> *(repeat 'election, fail, install snapshot fail' for more than 2 hours)*
> *(still election fail, but this time install snapshot successfully, but 
> failed to updateLastAppliedTermIndex)*
> 24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0
> 24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 1@group-47BEDE733167: change Leader from null to 3 at term 59 for 
> installSnapshot, leader elected after 3610ms
> 24/08/11 11:52:39,434 INFO [nioEventLoopGroup-3-3] SnapshotManager: 
> Installing snapshot:3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0, 
> to tmp 
> dir:/data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7
> 24/08/11 11:52:39,455 INFO [nioEventLoopGroup-3-3] SnapshotManager: Installed 
> snapshot, renaming temporary dir 
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7
>  to /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm
> 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> StateMachine: Reinitializing state machine.
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] RaftLog: 
> 1@group-47BEDE733167-SegmentedRaftLog: snapshotIndex: updateIncreasingly 
> 33870802 -> 34870063
> 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> StateMachine: obsolete snapshot provided: (t:59, i:34670809)
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 
> 1@group-47BEDE733167-SegmentedRaftLogWorker: flushIndex: setUnconditionally 
> 34795893 -> 34870063
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 
> 1@group-47BEDE733167-SegmentedRaftLogWorker: safeCacheEvictIndex: 
> setUnconditionally 34783743 -> 34870063
> 24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater: appliedIndex: 
> setUnconditionally 34795893 -> 34670809
> 24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 
> 1@group-47BEDE733167-SegmentedRaftLogWorker: Closing segment 
> log_inprogress_34783744 to index: 34795893
> 24/08/11 11:52:40,837 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: successfully install the 
> entire snapshot-34870063
> 24/08/11 11:52:40,877 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: reply installSnapshot: 
> 3<-1#0:OK-t59,SUCCESS,requestIndex=0
> 24/08/11 11:52:40,902 INFO [1@group-47BEDE733167-SegmentedRaftLogWorker] 
> SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: Rolled 
> log segment from 
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_inprogress_34783744
>  to 
> /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_34783744-34795893
> 24/08/11 11:52:41,160 ERROR [1@group-47BEDE733167-StateMachineUpdater] 
> StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater caught a 
> Throwable.
> java.lang.IllegalStateException: 1: Failed updateLastAppliedTermIndex: newTI 
> = (t:59, i:34670810) < oldTI = (t:59, i:34795893)
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> RaftServer$Division: 1@group-47BEDE733167: shutdown
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> JmxRegister: Successfully un-registered JMX Bean with object name 
> Ratis:service=RaftServer,group=group-47BEDE733167,id=1
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> RoleInfo: 1: shutdown 1@group-47BEDE733167-FollowerState
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-FollowerState] 
> FollowerState: 1@group-47BEDE733167-FollowerState was interrupted
> 24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> RaftServer$Division: 1@group-47BEDE733167: closes. applyIndex: 34670810
> 24/08/11 11:52:42,340 INFO [1@group-47BEDE733167-StateMachineUpdater] 
> SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker close()
> 24/08/11 11:52:48,170 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:7179265b-d007-4835-9b5f-fa9e4d853b72,0
> 24/08/11 11:52:48,170 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
> 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED
> 24/08/11 11:52:58,171 INFO [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 
> 3->1#0-t59,chunk:7f56f35f-20f3-4b42-bdb6-c2d52db646fc,0
> 24/08/11 11:52:58,171 ERROR [nioEventLoopGroup-3-3] 
> SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
> org.apache.ratis.protocol.exceptions.ServerNotReadyException: 
> 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED
> *(repeat 'receive installSnapshot and installSnapshot failed' for serval 
> hours until I restarted the server)*
> *Here is what I found in the leader's log*
> 24/08/11 09:03:10,130 WARN 
> [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] LogAppender: 
> 3@group-47BEDE733167->1-LogAppenderDefault: Failed to appendEntries 
> (retry=1): org.apache.ratis.protocol.exceptions.TimeoutIOException
> 24/08/11 09:03:13,714 INFO 
> [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo: 
> 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally 
> 34795876 -> 34795875
> 24/08/11 09:03:13,715 INFO 
> [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo: 
> 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally 
> 34795875 -> 34795874
> *I guess when the leader called appendEntries for the first time, the 
> follower had been executed successfully, but the leader did not receive the 
> follower's response and times out. When it resent, it found that it could not 
> match the change and started to decreaseNextIndex.*
> *Sometimes it can be fixed automatically by rolling segment log, but not 
> always.*
> 24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 2@group-47BEDE733167: Failed appendEntries as the first entry (index 
> 35049128) already exists (snapshotIndex: 35070063, commitIndex: 35259248)
> 24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 
> 2@group-47BEDE733167: inconsistency entries. 
> Reply:3<-2#3583083:FAIL-t59,INCONSISTENCY,nextIndex=35259249,followerCommit=35259248,matchIndex=-1
> 24/08/12 11:47:55,132 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 
> 2@group-47BEDE733167-SegmentedRaftLogWorker: Rolling segment 
> log-35250411_35261308 to index:35261308



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to