[
https://issues.apache.org/jira/browse/RATIS-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17954269#comment-17954269
]
JiangHua Zhu edited comment on RATIS-2261 at 5/27/25 9:48 AM:
--------------------------------------------------------------
I found at least 2 related issues.
1. When s0 starts ahead of schedule and becomes the Leader role. After starting
s3 and s4 subsequently, s0 will gradually become a follower, because s3 or s4
itself carries a higher term and sends appendEntries to s0.
Here are some logs:
{code:java}
2025-05-27 15:02:05,769 [s0-impl-thread1] INFO server.RaftServer$Division
(RaftServerImpl.java:setRole(400)) - s0@group-298C3567E6A7: changes role from
null to FOLLOWER at term 0 for startAsFollower
{code}
{code:java}
2025-05-27 15:02:05,988 [s0@group-298C3567E6A7-FollowerState] INFO
impl.FollowerState (FollowerState.java:runImpl(165)) -
s0@group-298C3567E6A7-FollowerState: change to CANDIDATE,
lastRpcElapsedTime:218784250ns, electionTimeout:214ms
{code}
{code:java}
2025-05-27 15:02:05,989 [s0@group-298C3567E6A7-FollowerState] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s0@group-298C3567E6A7: changes role from FOLLOWER to CANDIDATE at term 0 for
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:05,995 [s0@group-298C3567E6A7-LeaderElection1] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s0@group-298C3567E6A7: changes role from CANDIDATE to LEADER at term 1 for
changeToLeader
{code}
{code:java}
2025-05-27 15:02:06,010 [s0@group-298C3567E6A7-LeaderElection1] INFO
server.RaftServer$Division (ServerState.java:setLeader(272)) -
s0@group-298C3567E6A7: change Leader from null to s0 at term 1 for
becomeLeader, leader elected after 333ms
{code}
{code:java}
2025-05-27 15:02:06,647 [s3-impl-thread1] INFO server.RaftServer$Division
(RaftServerImpl.java:setRole(400)) - s3@group-298C3567E6A7: changes role from
null to FOLLOWER at term 0 for startAsFollower
{code}
{code:java}
2025-05-27 15:02:06,656 [s4-impl-thread1] INFO server.RaftServer$Division
(RaftServerImpl.java:setRole(400)) - s4@group-298C3567E6A7: changes role from
null to FOLLOWER at term 0 for startAsFollower
{code}
{code:java}
2025-05-27 15:02:06,869 [s3@group-298C3567E6A7-FollowerState] INFO
impl.FollowerState (FollowerState.java:runImpl(165)) -
s3@group-298C3567E6A7-FollowerState: change to CANDIDATE,
lastRpcElapsedTime:222172416ns, electionTimeout:219ms
{code}
{code:java}
2025-05-27 15:02:06,870 [s3@group-298C3567E6A7-FollowerState] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s3@group-298C3567E6A7: changes role from FOLLOWER to CANDIDATE at term 0 for
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:06,873 [s4@group-298C3567E6A7-FollowerState] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s4@group-298C3567E6A7: changes role from FOLLOWER to CANDIDATE at term 0 for
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:06,910 [s4@group-298C3567E6A7-LeaderElection3] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s4@group-298C3567E6A7: changes role from CANDIDATE to FOLLOWER at term 0 for
DISCOVERED_A_NEW_TERM (term=1)
{code}
{code:java}
2025-05-27 15:02:06,911 [s3@group-298C3567E6A7-LeaderElection2] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s3@group-298C3567E6A7: changes role from CANDIDATE to FOLLOWER at term 0 for
DISCOVERED_A_NEW_TERM (term=1)
{code}
{code:java}
2025-05-27 15:02:07,115 [s3@group-298C3567E6A7-FollowerState] INFO
impl.FollowerState (FollowerState.java:runImpl(165)) -
s3@group-298C3567E6A7-FollowerState: change to CANDIDATE,
lastRpcElapsedTime:204282541ns, electionTimeout:199ms
{code}
{code:java}
2025-05-27 15:02:07,116 [s3@group-298C3567E6A7-FollowerState] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s3@group-298C3567E6A7: changes role from FOLLOWER to CANDIDATE at term 1 for
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:07,124 [grpc-default-executor-0] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s4@group-298C3567E6A7: changes role from FOLLOWER to FOLLOWER at term 1 for
candidate:s3
{code}
{code:java}
2025-05-27 15:02:07,125 [s3@group-298C3567E6A7-LeaderElection4] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s3@group-298C3567E6A7: changes role from CANDIDATE to LEADER at term 2 for
changeToLeader
{code}
{code:java}
2025-05-27 15:02:07,138 [s3@group-298C3567E6A7-LeaderElection4] INFO
server.RaftServer$Division (ServerState.java:setLeader(272)) -
s3@group-298C3567E6A7: change Leader from null to s3 at term 2 for
becomeLeader, leader elected after 497ms
{code}
{code:java}
2025-05-27 15:02:07,168 [grpc-default-executor-7] INFO
server.RaftServer$Division (ServerState.java:setLeader(272)) -
s4@group-298C3567E6A7: change Leader from null to s3 at term 2 for
appendEntries, leader elected after 525ms
{code}
{code:java}
2025-05-27 15:02:07,169 [grpc-default-executor-6] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s0@group-298C3567E6A7: changes role from LEADER to FOLLOWER at term 1 for
appendEntries
{code}
[link
testInstallSnapshotDuringBootstrap.log|https://issues.apache.org/jira/secure/attachment/13076705/testInstallSnapshotDuringBootstrap.log]
2. When the client sends the setConfiguration() command, it cannot establish a
connection with s3 or s4. Here are some timeout logs.
{code:java}
2025-05-27 17:30:36,663 [grpc-default-executor-9] WARN server.GrpcLogAppender
(LogUtils.java:warn(127)) -
s0@group-817E46CF4DDB->s3-InstallSnapshotResponseHandler: Failed
InstallSnapshot: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
Caused by:
org.apache.ratis.thirdparty.io.netty.channel.ConnectTimeoutException:
connection timed out after 30000 ms: /0.0.0.0:15004
{code}
was (Author: jianghuazhu):
I found at least 2 related issues.
1. When s0 starts ahead of schedule and becomes the Leader role. After starting
s3 and s4 subsequently, s0 will gradually become a follower, because s3 or s4
itself carries a higher term and sends appendEntries to s0.
Here are some logs:
{code:java}
2025-05-27 15:02:05,769 [s0-impl-thread1] INFO server.RaftServer$Division
(RaftServerImpl.java:setRole(400)) - s0@group-298C3567E6A7: changes role from
null to FOLLOWER at term 0 for startAsFollower
{code}
{code:java}
2025-05-27 15:02:05,988 [s0@group-298C3567E6A7-FollowerState] INFO
impl.FollowerState (FollowerState.java:runImpl(165)) -
s0@group-298C3567E6A7-FollowerState: change to CANDIDATE,
lastRpcElapsedTime:218784250ns, electionTimeout:214ms
{code}
{code:java}
2025-05-27 15:02:05,989 [s0@group-298C3567E6A7-FollowerState] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s0@group-298C3567E6A7: changes role from FOLLOWER to CANDIDATE at term 0 for
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:05,995 [s0@group-298C3567E6A7-LeaderElection1] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s0@group-298C3567E6A7: changes role from CANDIDATE to LEADER at term 1 for
changeToLeader
{code}
{code:java}
2025-05-27 15:02:06,010 [s0@group-298C3567E6A7-LeaderElection1] INFO
server.RaftServer$Division (ServerState.java:setLeader(272)) -
s0@group-298C3567E6A7: change Leader from null to s0 at term 1 for
becomeLeader, leader elected after 333ms
{code}
{code:java}
2025-05-27 15:02:06,647 [s3-impl-thread1] INFO server.RaftServer$Division
(RaftServerImpl.java:setRole(400)) - s3@group-298C3567E6A7: changes role from
null to FOLLOWER at term 0 for startAsFollower
{code}
{code:java}
2025-05-27 15:02:06,656 [s4-impl-thread1] INFO server.RaftServer$Division
(RaftServerImpl.java:setRole(400)) - s4@group-298C3567E6A7: changes role from
null to FOLLOWER at term 0 for startAsFollower
{code}
{code:java}
2025-05-27 15:02:06,869 [s3@group-298C3567E6A7-FollowerState] INFO
impl.FollowerState (FollowerState.java:runImpl(165)) -
s3@group-298C3567E6A7-FollowerState: change to CANDIDATE,
lastRpcElapsedTime:222172416ns, electionTimeout:219ms
{code}
{code:java}
2025-05-27 15:02:06,870 [s3@group-298C3567E6A7-FollowerState] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s3@group-298C3567E6A7: changes role from FOLLOWER to CANDIDATE at term 0 for
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:06,873 [s4@group-298C3567E6A7-FollowerState] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s4@group-298C3567E6A7: changes role from FOLLOWER to CANDIDATE at term 0 for
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:06,910 [s4@group-298C3567E6A7-LeaderElection3] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s4@group-298C3567E6A7: changes role from CANDIDATE to FOLLOWER at term 0 for
DISCOVERED_A_NEW_TERM (term=1)
{code}
{code:java}
2025-05-27 15:02:06,911 [s3@group-298C3567E6A7-LeaderElection2] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s3@group-298C3567E6A7: changes role from CANDIDATE to FOLLOWER at term 0 for
DISCOVERED_A_NEW_TERM (term=1)
{code}
{code:java}
2025-05-27 15:02:07,115 [s3@group-298C3567E6A7-FollowerState] INFO
impl.FollowerState (FollowerState.java:runImpl(165)) -
s3@group-298C3567E6A7-FollowerState: change to CANDIDATE,
lastRpcElapsedTime:204282541ns, electionTimeout:199ms
{code}
{code:java}
2025-05-27 15:02:07,116 [s3@group-298C3567E6A7-FollowerState] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s3@group-298C3567E6A7: changes role from FOLLOWER to CANDIDATE at term 1 for
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:07,124 [grpc-default-executor-0] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s4@group-298C3567E6A7: changes role from FOLLOWER to FOLLOWER at term 1 for
candidate:s3
{code}
{code:java}
2025-05-27 15:02:07,125 [s3@group-298C3567E6A7-LeaderElection4] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s3@group-298C3567E6A7: changes role from CANDIDATE to LEADER at term 2 for
changeToLeader
{code}
{code:java}
2025-05-27 15:02:07,138 [s3@group-298C3567E6A7-LeaderElection4] INFO
server.RaftServer$Division (ServerState.java:setLeader(272)) -
s3@group-298C3567E6A7: change Leader from null to s3 at term 2 for
becomeLeader, leader elected after 497ms
{code}
{code:java}
2025-05-27 15:02:07,168 [grpc-default-executor-7] INFO
server.RaftServer$Division (ServerState.java:setLeader(272)) -
s4@group-298C3567E6A7: change Leader from null to s3 at term 2 for
appendEntries, leader elected after 525ms
{code}
{code:java}
2025-05-27 15:02:07,169 [grpc-default-executor-6] INFO
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) -
s0@group-298C3567E6A7: changes role from LEADER to FOLLOWER at term 1 for
appendEntries
{code}
[link
testInstallSnapshotDuringBootstrap.log|https://issues.apache.org/jira/secure/attachment/13076705/testInstallSnapshotDuringBootstrap.log]
2. When the client sends the setConfiguration() command, it cannot establish a
connection with s0. Here are some timeout logs.
> Intermittent failure in
> TestRaftSnapshotWithGrpc.testInstallSnapshotDuringBootstrap
> -----------------------------------------------------------------------------------
>
> Key: RATIS-2261
> URL: https://issues.apache.org/jira/browse/RATIS-2261
> Project: Ratis
> Issue Type: Bug
> Components: gRPC, test
> Reporter: Attila Doroszlai
> Priority: Major
> Attachments:
> org.apache.ratis.grpc.TestRaftSnapshotWithGrpc-output.txt,
> testInstallSnapshotDuringBootstrap.log
>
>
> {code}
> Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 114.816 s <<<
> FAILURE! - in org.apache.ratis.grpc.TestRaftSnapshotWithGrpc
> org.apache.ratis.grpc.TestRaftSnapshotWithGrpc.testInstallSnapshotDuringBootstrap
> Time elapsed: 101.468 s <<< ERROR!
> java.util.concurrent.TimeoutException: testInstallSnapshotDuringBootstrap()
> timed out after 100 seconds
> at java.util.ArrayList.forEach(ArrayList.java:1259)
> at java.util.ArrayList.forEach(ArrayList.java:1259)
> Suppressed: java.io.InterruptedIOException: retry
> policy=RetryForeverWithSleep(sleepTime = 100ms)
> at
> org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:138)
> at
> org.apache.ratis.client.impl.AdminImpl.setConfiguration(AdminImpl.java:46)
> at
> org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:51)
> at
> org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:45)
> at
> org.apache.ratis.server.impl.MiniRaftCluster.setConfiguration(MiniRaftCluster.java:836)
> at
> org.apache.ratis.statemachine.RaftSnapshotBaseTest.lambda$testInstallSnapshotDuringBootstrap$6(RaftSnapshotBaseTest.java:309)
> at
> org.apache.ratis.server.impl.RaftServerTestUtil.runWithMinorityPeers(RaftServerTestUtil.java:231)
> at
> org.apache.ratis.statemachine.RaftSnapshotBaseTest.testInstallSnapshotDuringBootstrap(RaftSnapshotBaseTest.java:308)
> {code}
> Faeild in 2/100 runs:
> https://github.com/adoroszlai/ratis/actions/runs/13901407901
--
This message was sent by Atlassian Jira
(v8.20.10#820010)