[
https://issues.apache.org/jira/browse/RATIS-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200746#comment-17200746
]
Shashikant Banerjee commented on RATIS-1074:
--------------------------------------------
cc ~ [~msingh], [~szetszwo]
> GrpcLogAppender improperly decrease nextIndex to 1, which may trigger a
> installSnapshot request to follower.
> -------------------------------------------------------------------------------------------------------------
>
> Key: RATIS-1074
> URL: https://issues.apache.org/jira/browse/RATIS-1074
> Project: Ratis
> Issue Type: Bug
> Reporter: Glen Geng
> Assignee: Glen Geng
> Priority: Major
> Labels: MultiRaft, dense-storage
> Fix For: 1.1.0
>
>
> GrpcLogAppender improperly decrease nextIndex to 1, which will trigger
> installSnapshot request to followers, and make pipeline be vulnerable.
>
> *Below is the interaction log between leader and the follower.*
> Follower side, received an install snapshot request, and closed the pipeline.
> {code:java}
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: receive
> installSnapshot:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb->591f38ce-0b11-4003-8528-7a5802f89ea8#0-t514,notify:(t:392,
> i:992920)
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
> [grpc-default-executor-628] INFO
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
> Leader change notification received for group: group-622E07D885D8 with new
> leaderId: 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: change Leader from
> null to 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb at term 514 for installSnapshot,
> leader elected after 50593ms
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8:
> notifyInstallSnapshot: nextIndex is 2248004 but the leader's first available
> index is 992920.
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
> [grpc-default-executor-628] WARN
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
> Install snapshot notification received from Leader with termIndex: (t:392,
> i:992920), terminating pipeline: group-622E07D885D8
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
> [grpc-default-executor-628] ERROR
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
> pipeline Action CLOSE on pipeline
> PipelineID=433c67af-f834-4959-bcf7-622e07d885d8.Reason :
> 591f38ce-0b11-4003-8528-7a5802f89ea8 closes pipeline when installSnapshot
> from leader because leader snapshot doesn't contain any data to replay, all
> the log entries prior to the snapshot might have been purged.So follower
> should not try to install snapshot from leader butcan close the pipeline
> here. It's in follower state for 1761ms
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: StateMachine
> successfully installed snapshot index 992920. Reloading the StateMachine.
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
> [grpc-default-executor-628] INFO
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8-SegmentedRaftLogWorker:
> flushIndex: setUnconditionally 2248003 -> 992920
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
> [grpc-default-executor-628] INFO
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8-SegmentedRaftLogWorker:
> safeCacheEvictIndex: setUnconditionally 2248001 -> 992920
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: reply
> installSnapshot:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb<-591f38ce-0b11-4003-8528-7a5802f89ea8#0:FAIL-t514,IN_PROGRESS{code}
>
> Leader side, just take charge, got vote from another follower, but miss vote
> from this follower
> {code:java}
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:21,499
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.impl.LeaderElection:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143:
> begin an election at term 510 for 2248004:
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb:172.16.90.54:9858:0,
> 591f38ce-0b11-4003-8528-7a5802f89ea8:172.16.90.69:9858:1,
> e5596d61-c021-4a4b-89bc-6d8e6d4f1baf:172.16.90.56:9858:0], old=null
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.impl.LeaderElection:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143:
> Election PASSED; received 1 response(s)
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb<-e5596d61-c021-4a4b-89bc-6d8e6d4f1baf#0:OK-t510]
> and 0 exception(s);
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8:t510, leader=null,
> voted=6769b0ef-fb4a-4235-bcf0-f1623fcd74cb,
> raftlog=6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-SegmentedRaftLog:OPENED:c2248003,f2248004,i2248004,
> conf=2248004: [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb:172.16.90.54:9858:0,
> 591f38ce-0b11-4003-8528-7a5802f89ea8:172.16.90.69:9858:1,
> e5596d61-c021-4a4b-89bc-6d8e6d4f1baf:172.16.90.56:9858:0], old=null
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.impl.RoleInfo:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb: shutdown LeaderElection
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.impl.RaftServerImpl:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8: changes role from
> CANDIDATE to LEADER at term 510 for changeToLeader
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
> Leader change notification received for group: group-622E07D885D8 with new
> leaderId: 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.impl.RaftServerImpl:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8: change Leader from
> null to 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb at term 510 for becomeLeader,
> leader elected after 20419ms
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.RaftServerConfigKeys:
> raft.server.staging.catchup.gap = 1000 (default)
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.RaftServerConfigKeys: raft.server.rpc.sleep.time
> = 25ms (default){code}
>
> Due to Grpc exception, leader decrease next index of follower to 1. Since
> leader has purged log, it has no choice but send out an installSnapshot
> request to follower, which will trigger follower to close the pipeline.
> {code:java}
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:51,531
> [grpc-default-executor-345] WARN
> org.apache.ratis.grpc.server.GrpcLogAppender:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8->591f38ce-0b11-4003-8528-7a5802f89ea8-AppendLogResponseHandler:
> Failed appendEntries:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:51,531
> [grpc-default-executor-345] INFO org.apache.ratis.server.impl.FollowerInfo:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8->591f38ce-0b11-4003-8528-7a5802f89ea8:
> nextIndex: updateUnconditionally 2248006 -> 1
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:51,672
> [org.apache.ratis.server.impl.LogAppender$AppenderDaemon$$Lambda$601/1151550579@71fbee17]
> INFO org.apache.ratis.grpc.server.GrpcLogAppender:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8->591f38ce-0b11-4003-8528-7a5802f89ea8-GrpcLogAppender:
> followerNextIndex = 1 but logStartIndex = 992920, notify follower to install
> snapshot-(t:392, i:992920){code}
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)