Shashikant Banerjee created RATIS-794:
-----------------------------------------

             Summary: Ratils leader should retry append requests based on 
follower commit info in case of intermittent append failures
                 Key: RATIS-794
                 URL: https://issues.apache.org/jira/browse/RATIS-794
             Project: Ratis
          Issue Type: Bug
          Components: server
            Reporter: Shashikant Banerjee
            Assignee: Tsz-wo Sze
             Fix For: 0.5.0


During Ozone testing, it was observed that a leader election happens in between 
the test , where a follower has caught to a certain index 313. The new leader 
starts sends an append request to the follower which fails with grpc Exception. 
This leads to leader reset the connection and start from the beginning (index 
1). 
 
 
{code:java}
2020-01-13 14:56:32,995 INFO org.apache.ratis.server.impl.RaftServerImpl: 
0.0.0.0:9858@group-4F125BF42C14: changes role from CANDIDATE to LEADER at term 
7 for changeToLeader
2020-01-13 14:56:32,995 INFO 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
 Leader change notification received for group: group-4F125BF42C14 with new 
leaderId: ed90869c-317e-4303-8922-9fa83a3983cb
2020-01-13 14:56:33,042 WARN org.apache.ratis.grpc.server.GrpcLogAppender: 
0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858-AppendLogResponseHandler: 
Failed appendEntries: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2020-01-13 14:56:33,043 DEBUG org.apache.ratis.util.PeerProxyMap: 
ed90869c-317e-4303-8922-9fa83a3983cb: reset proxy for 
b65b0b6c-b0bb-429f-a23d-467c72d4b85c
2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: 
b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: RUNNING -> CLOSING
2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: 
b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: CLOSING -> CLOSED
2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.LifeCycle: 
b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858: NEW
2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.TimeoutScheduler: new 
ScheduledThreadPoolExecutor
2020-01-13 14:56:33,044 DEBUG org.apache.ratis.util.PeerProxyMap: 
ed90869c-317e-4303-8922-9fa83a3983cb: Closing proxy for peer 
b65b0b6c-b0bb-429f-a23d-467c72d4b85c:10.120.139.111:9858
2020-01-13 14:56:33,045 DEBUG org.apache.ratis.util.TimeoutScheduler: schedule 
a task: timeout 6000ms, sid 1 
2020-01-13 14:56:33,047 INFO org.apache.ratis.server.impl.FollowerInfo: 
0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: 
updateUnconditionally 314 -> 1 ---------------------> set the next index for 
the follower back to 1 and  starts from 1)
2020-01-13 14:56:35,840 DEBUG org.apache.ratis.grpc.server.GrpcLogAppender: 
0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858-AppendLogResponseHandler: 
received the first reply 
ed90869c-317e-4303-8922-9fa83a3983cb<-b65b0b6c-b0bb-429f-a23d-467c72d4b85c#2:OK,SUCCESS,nextIndex:314,term:5,followerCommit:313,
 request=AppendEntriesRequest:cid=2,entriesCount=0,lastEntry=null .  
-------------------> (Receives the response from follower indficating follower 
is at 312)
Although the follower is at 313, the leader keeps on sending the appendRequests 
from index 1. 
2020-01-13 14:56:35,841 DEBUG org.apache.ratis.server.impl.FollowerInfo: 
0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: 
updateIncreasingly 1 -> 2
2020-01-13 14:56:35,841 DEBUG org.apache.ratis.util.TimeoutScheduler: schedule 
a task: timeout 6000ms, sid 7
2020-01-13 14:56:35,843 DEBUG org.apache.ratis.server.impl.FollowerInfo: 
0.0.0.0:9858@group-4F125BF42C14->10.120.139.111:9858: nextIndex: 
updateIncreasingly 2 -> 3
2020-01-13 14:56:35,843 DEBUG org.apache.ratis.util.TimeoutScheduler: schedule 
a task: timeout 6000ms, sid 8
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to