[ 
https://issues.apache.org/jira/browse/RATIS-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated RATIS-1074:
-----------------------------
    Summary:  GrpcLogAppender improperly decrease nextIndex to 1, which may 
trigger a installSnapshot request to follower.  (was:  GrpcLogAppender 
improperly decrease nextIndex to 1, which will trigger installSnapshot request 
to followers.)

>  GrpcLogAppender improperly decrease nextIndex to 1, which may trigger a 
> installSnapshot request to follower.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: RATIS-1074
>                 URL: https://issues.apache.org/jira/browse/RATIS-1074
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Major
>              Labels: MultiRaft, dense-storage
>             Fix For: 1.1.0
>
>
> GrpcLogAppender improperly decrease nextIndex to 1, which will trigger 
> installSnapshot request to followers, and make pipeline be vulnerable.
>  
> *Below is the interaction log between leader and the follower.*
> Follower side, received an install snapshot request, and closed the pipeline.
> {code:java}
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719 
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: receive 
> installSnapshot: 
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb->591f38ce-0b11-4003-8528-7a5802f89ea8#0-t514,notify:(t:392,
>  i:992920)
>  ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719 
> [grpc-default-executor-628] INFO 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
>  Leader change notification received for group: group-622E07D885D8 with new 
> leaderId: 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb
>  ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719 
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: change Leader from 
> null to 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb at term 514 for installSnapshot, 
> leader elected after 50593ms
>  ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719 
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: 
> notifyInstallSnapshot: nextIndex is 2248004 but the leader's first available 
> index is 992920.
>  ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719 
> [grpc-default-executor-628] WARN 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
>  Install snapshot notification received from Leader with termIndex: (t:392, 
> i:992920), terminating pipeline: group-622E07D885D8
>  ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720 
> [grpc-default-executor-628] ERROR 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
>  pipeline Action CLOSE on pipeline 
> PipelineID=433c67af-f834-4959-bcf7-622e07d885d8.Reason : 
> 591f38ce-0b11-4003-8528-7a5802f89ea8 closes pipeline when installSnapshot 
> from leader because leader snapshot doesn't contain any data to replay, all 
> the log entries prior to the snapshot might have been purged.So follower 
> should not try to install snapshot from leader butcan close the pipeline 
> here. It's in follower state for 1761ms
>  ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720 
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: StateMachine 
> successfully installed snapshot index 992920. Reloading the StateMachine.
>  ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720 
> [grpc-default-executor-628] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8-SegmentedRaftLogWorker:
>  flushIndex: setUnconditionally 2248003 -> 992920
>  ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720 
> [grpc-default-executor-628] INFO 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8-SegmentedRaftLogWorker:
>  safeCacheEvictIndex: setUnconditionally 2248001 -> 992920
>  ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720 
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: reply 
> installSnapshot: 
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb<-591f38ce-0b11-4003-8528-7a5802f89ea8#0:FAIL-t514,IN_PROGRESS{code}
>  Leader side, just take charge, got vote from another follower, but miss vote 
> from this follower
> {code:java}
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:21,499 
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143: 
> begin an election at term 510 for 2248004: 
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb:172.16.90.54:9858:0, 
> 591f38ce-0b11-4003-8528-7a5802f89ea8:172.16.90.69:9858:1, 
> e5596d61-c021-4a4b-89bc-6d8e6d4f1baf:172.16.90.56:9858:0], old=null
>  ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580 
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143] 
> INFO org.apache.ratis.server.impl.LeaderElection: 
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143: 
> Election PASSED; received 1 response(s) 
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb<-e5596d61-c021-4a4b-89bc-6d8e6d4f1baf#0:OK-t510]
>  and 0 exception(s); 
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8:t510, leader=null, 
> voted=6769b0ef-fb4a-4235-bcf0-f1623fcd74cb, 
> raftlog=6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-SegmentedRaftLog:OPENED:c2248003,f2248004,i2248004,
>  conf=2248004: [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb:172.16.90.54:9858:0, 
> 591f38ce-0b11-4003-8528-7a5802f89ea8:172.16.90.69:9858:1, 
> e5596d61-c021-4a4b-89bc-6d8e6d4f1baf:172.16.90.56:9858:0], old=null
>  ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580 
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143] 
> INFO org.apache.ratis.server.impl.RoleInfo: 
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb: shutdown LeaderElection
>  ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580 
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143] 
> INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8: changes role from 
> CANDIDATE to LEADER at term 510 for changeToLeader
>  ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580 
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143] 
> INFO 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
>  Leader change notification received for group: group-622E07D885D8 with new 
> leaderId: 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb
>  ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588 
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143] 
> INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8: change Leader from 
> null to 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb at term 510 for becomeLeader, 
> leader elected after 20419ms
>  ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588 
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143] 
> INFO org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.staging.catchup.gap = 1000 (default)
>  ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588 
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143] 
> INFO org.apache.ratis.server.RaftServerConfigKeys: raft.server.rpc.sleep.time 
> = 25ms (default){code}
>  
> Due to Grpc exception, leader decrease next index of follower to 1. Since 
> leader has purged log, it has no choice but send out an installSnapshot 
> request to follower.
> {code:java}
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:51,531 
> [grpc-default-executor-345] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8->591f38ce-0b11-4003-8528-7a5802f89ea8-AppendLogResponseHandler:
>  Failed appendEntries: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
>  ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:51,531 
> [grpc-default-executor-345] INFO org.apache.ratis.server.impl.FollowerInfo: 
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8->591f38ce-0b11-4003-8528-7a5802f89ea8:
>  nextIndex: updateUnconditionally 2248006 -> 1
>  ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:51,672 
> [org.apache.ratis.server.impl.LogAppender$AppenderDaemon$$Lambda$601/1151550579@71fbee17]
>  INFO org.apache.ratis.grpc.server.GrpcLogAppender: 
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8->591f38ce-0b11-4003-8528-7a5802f89ea8-GrpcLogAppender:
>  followerNextIndex = 1 but logStartIndex = 992920, notify follower to install 
> snapshot-(t:392, i:992920){code}
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to