[ 
https://issues.apache.org/jira/browse/RATIS-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17954269#comment-17954269
 ] 

JiangHua Zhu edited comment on RATIS-2261 at 5/27/25 1:31 PM:
--------------------------------------------------------------

I found at least 2 related issues.
1. When s0 starts ahead of schedule and becomes the Leader role. After starting 
s3 and s4 subsequently, s0 will gradually become a follower, because s3 or s4 
itself carries a higher term and sends appendEntries to s0.
Here are some logs:
{code:java}
2025-05-27 15:02:05,769 [s0-impl-thread1] INFO  server.RaftServer$Division 
(RaftServerImpl.java:setRole(400)) - s0@group-298C3567E6A7: changes role from   
   null to FOLLOWER at term 0 for startAsFollower 
{code}
{code:java}
2025-05-27 15:02:05,988 [s0@group-298C3567E6A7-FollowerState] INFO  
impl.FollowerState (FollowerState.java:runImpl(165)) - 
s0@group-298C3567E6A7-FollowerState: change to CANDIDATE, 
lastRpcElapsedTime:218784250ns, electionTimeout:214ms
{code}
{code:java}
2025-05-27 15:02:05,989 [s0@group-298C3567E6A7-FollowerState] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s0@group-298C3567E6A7: changes role from  FOLLOWER to CANDIDATE at term 0 for 
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:05,995 [s0@group-298C3567E6A7-LeaderElection1] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s0@group-298C3567E6A7: changes role from CANDIDATE to LEADER at term 1 for 
changeToLeader
{code}
{code:java}
2025-05-27 15:02:06,010 [s0@group-298C3567E6A7-LeaderElection1] INFO  
server.RaftServer$Division (ServerState.java:setLeader(272)) - 
s0@group-298C3567E6A7: change Leader from null to s0 at term 1 for 
becomeLeader, leader elected after 333ms
{code}
{code:java}
2025-05-27 15:02:06,647 [s3-impl-thread1] INFO  server.RaftServer$Division 
(RaftServerImpl.java:setRole(400)) - s3@group-298C3567E6A7: changes role from   
   null to FOLLOWER at term 0 for startAsFollower
{code}
{code:java}
2025-05-27 15:02:06,656 [s4-impl-thread1] INFO  server.RaftServer$Division 
(RaftServerImpl.java:setRole(400)) - s4@group-298C3567E6A7: changes role from   
   null to FOLLOWER at term 0 for startAsFollower
{code}
{code:java}
2025-05-27 15:02:06,869 [s3@group-298C3567E6A7-FollowerState] INFO  
impl.FollowerState (FollowerState.java:runImpl(165)) - 
s3@group-298C3567E6A7-FollowerState: change to CANDIDATE, 
lastRpcElapsedTime:222172416ns, electionTimeout:219ms
{code}
{code:java}
2025-05-27 15:02:06,870 [s3@group-298C3567E6A7-FollowerState] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s3@group-298C3567E6A7: changes role from  FOLLOWER to CANDIDATE at term 0 for 
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:06,873 [s4@group-298C3567E6A7-FollowerState] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s4@group-298C3567E6A7: changes role from  FOLLOWER to CANDIDATE at term 0 for 
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:06,910 [s4@group-298C3567E6A7-LeaderElection3] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s4@group-298C3567E6A7: changes role from CANDIDATE to FOLLOWER at term 0 for 
DISCOVERED_A_NEW_TERM (term=1)
{code}
{code:java}
2025-05-27 15:02:06,911 [s3@group-298C3567E6A7-LeaderElection2] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s3@group-298C3567E6A7: changes role from CANDIDATE to FOLLOWER at term 0 for 
DISCOVERED_A_NEW_TERM (term=1)
{code}
{code:java}
2025-05-27 15:02:07,115 [s3@group-298C3567E6A7-FollowerState] INFO  
impl.FollowerState (FollowerState.java:runImpl(165)) - 
s3@group-298C3567E6A7-FollowerState: change to CANDIDATE, 
lastRpcElapsedTime:204282541ns, electionTimeout:199ms
{code}
{code:java}
2025-05-27 15:02:07,116 [s3@group-298C3567E6A7-FollowerState] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s3@group-298C3567E6A7: changes role from  FOLLOWER to CANDIDATE at term 1 for 
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:07,124 [grpc-default-executor-0] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s4@group-298C3567E6A7: changes role from  FOLLOWER to FOLLOWER at term 1 for 
candidate:s3
{code}
{code:java}
2025-05-27 15:02:07,125 [s3@group-298C3567E6A7-LeaderElection4] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s3@group-298C3567E6A7: changes role from CANDIDATE to LEADER at term 2 for 
changeToLeader
{code}
{code:java}
2025-05-27 15:02:07,138 [s3@group-298C3567E6A7-LeaderElection4] INFO  
server.RaftServer$Division (ServerState.java:setLeader(272)) - 
s3@group-298C3567E6A7: change Leader from null to s3 at term 2 for 
becomeLeader, leader elected after 497ms
{code}
{code:java}
2025-05-27 15:02:07,168 [grpc-default-executor-7] INFO  
server.RaftServer$Division (ServerState.java:setLeader(272)) - 
s4@group-298C3567E6A7: change Leader from null to s3 at term 2 for 
appendEntries, leader elected after 525ms
{code}
{code:java}
2025-05-27 15:02:07,169 [grpc-default-executor-6] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s0@group-298C3567E6A7: changes role from    LEADER to FOLLOWER at term 1 for 
appendEntries
{code}

[testInstallSnapshotDuringBootstrap.log|https://issues.apache.org/jira/secure/attachment/13076707/testInstallSnapshotDuringBootstrap.log]


2. When the client sends the setConfiguration() command, it cannot establish a 
connection with s3 or s4. Here are some timeout logs.
{code:java}
2025-05-27 17:30:36,663 [grpc-default-executor-9] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(127)) - 
s0@group-817E46CF4DDB->s3-InstallSnapshotResponseHandler: Failed 
InstallSnapshot: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
  Caused by: 
org.apache.ratis.thirdparty.io.netty.channel.ConnectTimeoutException: 
connection timed out after 30000 ms: /0.0.0.0:15004
{code}

cc [~adoroszlai] [~szetszwo]


was (Author: jianghuazhu):
I found at least 2 related issues.
1. When s0 starts ahead of schedule and becomes the Leader role. After starting 
s3 and s4 subsequently, s0 will gradually become a follower, because s3 or s4 
itself carries a higher term and sends appendEntries to s0.
Here are some logs:
{code:java}
2025-05-27 15:02:05,769 [s0-impl-thread1] INFO  server.RaftServer$Division 
(RaftServerImpl.java:setRole(400)) - s0@group-298C3567E6A7: changes role from   
   null to FOLLOWER at term 0 for startAsFollower 
{code}
{code:java}
2025-05-27 15:02:05,988 [s0@group-298C3567E6A7-FollowerState] INFO  
impl.FollowerState (FollowerState.java:runImpl(165)) - 
s0@group-298C3567E6A7-FollowerState: change to CANDIDATE, 
lastRpcElapsedTime:218784250ns, electionTimeout:214ms
{code}
{code:java}
2025-05-27 15:02:05,989 [s0@group-298C3567E6A7-FollowerState] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s0@group-298C3567E6A7: changes role from  FOLLOWER to CANDIDATE at term 0 for 
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:05,995 [s0@group-298C3567E6A7-LeaderElection1] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s0@group-298C3567E6A7: changes role from CANDIDATE to LEADER at term 1 for 
changeToLeader
{code}
{code:java}
2025-05-27 15:02:06,010 [s0@group-298C3567E6A7-LeaderElection1] INFO  
server.RaftServer$Division (ServerState.java:setLeader(272)) - 
s0@group-298C3567E6A7: change Leader from null to s0 at term 1 for 
becomeLeader, leader elected after 333ms
{code}
{code:java}
2025-05-27 15:02:06,647 [s3-impl-thread1] INFO  server.RaftServer$Division 
(RaftServerImpl.java:setRole(400)) - s3@group-298C3567E6A7: changes role from   
   null to FOLLOWER at term 0 for startAsFollower
{code}
{code:java}
2025-05-27 15:02:06,656 [s4-impl-thread1] INFO  server.RaftServer$Division 
(RaftServerImpl.java:setRole(400)) - s4@group-298C3567E6A7: changes role from   
   null to FOLLOWER at term 0 for startAsFollower
{code}
{code:java}
2025-05-27 15:02:06,869 [s3@group-298C3567E6A7-FollowerState] INFO  
impl.FollowerState (FollowerState.java:runImpl(165)) - 
s3@group-298C3567E6A7-FollowerState: change to CANDIDATE, 
lastRpcElapsedTime:222172416ns, electionTimeout:219ms
{code}
{code:java}
2025-05-27 15:02:06,870 [s3@group-298C3567E6A7-FollowerState] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s3@group-298C3567E6A7: changes role from  FOLLOWER to CANDIDATE at term 0 for 
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:06,873 [s4@group-298C3567E6A7-FollowerState] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s4@group-298C3567E6A7: changes role from  FOLLOWER to CANDIDATE at term 0 for 
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:06,910 [s4@group-298C3567E6A7-LeaderElection3] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s4@group-298C3567E6A7: changes role from CANDIDATE to FOLLOWER at term 0 for 
DISCOVERED_A_NEW_TERM (term=1)
{code}
{code:java}
2025-05-27 15:02:06,911 [s3@group-298C3567E6A7-LeaderElection2] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s3@group-298C3567E6A7: changes role from CANDIDATE to FOLLOWER at term 0 for 
DISCOVERED_A_NEW_TERM (term=1)
{code}
{code:java}
2025-05-27 15:02:07,115 [s3@group-298C3567E6A7-FollowerState] INFO  
impl.FollowerState (FollowerState.java:runImpl(165)) - 
s3@group-298C3567E6A7-FollowerState: change to CANDIDATE, 
lastRpcElapsedTime:204282541ns, electionTimeout:199ms
{code}
{code:java}
2025-05-27 15:02:07,116 [s3@group-298C3567E6A7-FollowerState] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s3@group-298C3567E6A7: changes role from  FOLLOWER to CANDIDATE at term 1 for 
changeToCandidate
{code}
{code:java}
2025-05-27 15:02:07,124 [grpc-default-executor-0] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s4@group-298C3567E6A7: changes role from  FOLLOWER to FOLLOWER at term 1 for 
candidate:s3
{code}
{code:java}
2025-05-27 15:02:07,125 [s3@group-298C3567E6A7-LeaderElection4] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s3@group-298C3567E6A7: changes role from CANDIDATE to LEADER at term 2 for 
changeToLeader
{code}
{code:java}
2025-05-27 15:02:07,138 [s3@group-298C3567E6A7-LeaderElection4] INFO  
server.RaftServer$Division (ServerState.java:setLeader(272)) - 
s3@group-298C3567E6A7: change Leader from null to s3 at term 2 for 
becomeLeader, leader elected after 497ms
{code}
{code:java}
2025-05-27 15:02:07,168 [grpc-default-executor-7] INFO  
server.RaftServer$Division (ServerState.java:setLeader(272)) - 
s4@group-298C3567E6A7: change Leader from null to s3 at term 2 for 
appendEntries, leader elected after 525ms
{code}
{code:java}
2025-05-27 15:02:07,169 [grpc-default-executor-6] INFO  
server.RaftServer$Division (RaftServerImpl.java:setRole(400)) - 
s0@group-298C3567E6A7: changes role from    LEADER to FOLLOWER at term 1 for 
appendEntries
{code}



2. When the client sends the setConfiguration() command, it cannot establish a 
connection with s3 or s4. Here are some timeout logs.
{code:java}
2025-05-27 17:30:36,663 [grpc-default-executor-9] WARN  server.GrpcLogAppender 
(LogUtils.java:warn(127)) - 
s0@group-817E46CF4DDB->s3-InstallSnapshotResponseHandler: Failed 
InstallSnapshot: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
  Caused by: 
org.apache.ratis.thirdparty.io.netty.channel.ConnectTimeoutException: 
connection timed out after 30000 ms: /0.0.0.0:15004
{code}

cc [~adoroszlai] [~szetszwo]

> Intermittent failure in 
> TestRaftSnapshotWithGrpc.testInstallSnapshotDuringBootstrap
> -----------------------------------------------------------------------------------
>
>                 Key: RATIS-2261
>                 URL: https://issues.apache.org/jira/browse/RATIS-2261
>             Project: Ratis
>          Issue Type: Bug
>          Components: gRPC, test
>            Reporter: Attila Doroszlai
>            Priority: Major
>         Attachments: 
> org.apache.ratis.grpc.TestRaftSnapshotWithGrpc-output.txt, 
> testInstallSnapshotDuringBootstrap.log
>
>
> {code}
> Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 114.816 s <<< 
> FAILURE! - in org.apache.ratis.grpc.TestRaftSnapshotWithGrpc
> org.apache.ratis.grpc.TestRaftSnapshotWithGrpc.testInstallSnapshotDuringBootstrap
>   Time elapsed: 101.468 s  <<< ERROR!
> java.util.concurrent.TimeoutException: testInstallSnapshotDuringBootstrap() 
> timed out after 100 seconds
>       at java.util.ArrayList.forEach(ArrayList.java:1259)
>       at java.util.ArrayList.forEach(ArrayList.java:1259)
>       Suppressed: java.io.InterruptedIOException: retry 
> policy=RetryForeverWithSleep(sleepTime = 100ms)
>               at 
> org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:138)
>               at 
> org.apache.ratis.client.impl.AdminImpl.setConfiguration(AdminImpl.java:46)
>               at 
> org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:51)
>               at 
> org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:45)
>               at 
> org.apache.ratis.server.impl.MiniRaftCluster.setConfiguration(MiniRaftCluster.java:836)
>               at 
> org.apache.ratis.statemachine.RaftSnapshotBaseTest.lambda$testInstallSnapshotDuringBootstrap$6(RaftSnapshotBaseTest.java:309)
>               at 
> org.apache.ratis.server.impl.RaftServerTestUtil.runWithMinorityPeers(RaftServerTestUtil.java:231)
>               at 
> org.apache.ratis.statemachine.RaftSnapshotBaseTest.testInstallSnapshotDuringBootstrap(RaftSnapshotBaseTest.java:308)
> {code}
> Faeild in 2/100 runs:
> https://github.com/adoroszlai/ratis/actions/runs/13901407901



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to