[ https://issues.apache.org/jira/browse/RATIS-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016437#comment-17016437 ]
Tsz-wo Sze commented on RATIS-794: ---------------------------------- Thanks [~shashikant]. The follower should reply INCONSISTENCY when receiving appendEntries with an index <= commitIndex. Then, the leader will resend starting at index (commitIndex + 1). > Ratils leader should retry append requests based on follower commit info in > case of intermittent append failures > ---------------------------------------------------------------------------------------------------------------- > > Key: RATIS-794 > URL: https://issues.apache.org/jira/browse/RATIS-794 > Project: Ratis > Issue Type: Bug > Components: server > Reporter: Shashikant Banerjee > Assignee: Tsz-wo Sze > Priority: Major > Fix For: 0.5.0 > > > During Ozone testing, it was observed that a leader election happens in > between the test , where a follower has caught to a certain index 313. The > new leader starts sends an append request to the follower which fails with > grpc Exception. This leads to leader reset the connection and start from the > beginning (index 1). > > > {code:java} > 2020-01-13 14:56:32,995 INFO org.apache.ratis.server.impl.RaftServerImpl: > 0.0.0.0:9858@group-4F125BF42C14: changes role from CANDIDATE to LEADER at > term 7 for changeToLeader > 2020-01-13 14:56:32,995 INFO > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: > Leader change notification received for group: group-4F125BF42C14 with new > leaderId: ed90869c-317e-4303-8922-9fa83a3983cb > 2020-01-13 14:56:33,042 WARN org.apache.ratis.grpc.server.GrpcLogAppender: > 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858-AppendLogResponseHandler: > Failed appendEntries: > org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io > exception > 2020-01-13 14:56:33,043 DEBUG org.apache.ratis.util.PeerProxyMap: > ed90869c-317e-4303-8922-9fa83a3983cb: reset proxy for > b65b0b6c-b0bb-429f-a23d-467c72d4b85c > 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: > b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: RUNNING -> CLOSING > 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: > b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: CLOSING -> CLOSED > 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: > b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: NEW > 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.TimeoutScheduler: new > ScheduledThreadPoolExecutor > 2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.PeerProxyMap: > ed90869c-317e-4303-8922-9fa83a3983cb: Closing proxy for peer > b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858 > 2020-01-13 14:56:33,045 DEBUG org.apache.ratis.util.TimeoutScheduler: > schedule a task: timeout 6000ms, sid 1 > 2020-01-13 14:56:33,047 INFO org.apache.ratis.server.impl.FollowerInfo: > 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: > updateUnconditionally 314 -> 1 ---------------------> set the next index for > the follower back to 1 and starts from 1) > 2020-01-13 14:56:35,840 DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: > 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858-AppendLogResponseHandler: > received the first reply > ed90869c-317e-4303-8922-9fa83a3983cb<-b65b0b6c-b0bb-429f-a23d-467c72d4b85c#2:OK,SUCCESS,nextIndex:314,term:5,followerCommit:313, > request=AppendEntriesRequest:cid=2,entriesCount=0,lastEntry=null . > -------------------> (Receives the response from follower indficating > follower is at 312) > Although the follower is at 313, the leader keeps on sending the > appendRequests from index 1. > 2020-01-13 14:56:35,841 DEBUG org.apache.ratis.server.impl.FollowerInfo: > 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: > updateIncreasingly 1 -> 2 > 2020-01-13 14:56:35,841 DEBUG org.apache.ratis.util.TimeoutScheduler: > schedule a task: timeout 6000ms, sid 7 > 2020-01-13 14:56:35,843 DEBUG org.apache.ratis.server.impl.FollowerInfo: > 0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: > updateIncreasingly 2 -> 3 > 2020-01-13 14:56:35,843 DEBUG org.apache.ratis.util.TimeoutScheduler: > schedule a task: timeout 6000ms, sid 8 > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)