[ https://issues.apache.org/jira/browse/RATIS-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Attila Doroszlai updated RATIS-788: ----------------------------------- Description: It seems if any exception happens while becoming leader, a single-node Ratis group can get stuck: {{in LEADER state but not ready yet}}. Or is there some timeout that lets Ratis get out of this state? I guess 3-node Ratis might be able to recover. {{LeaderElection}} ignores the exception because it is already shut down after successful vote. The {{ConcurrentModificationException}} which triggered it in this specific case is being fixed in RATIS-789. {code} 2020-01-09 23:31:35,161 [Thread-95] INFO impl.RaftServerImpl (RaftServerImpl.java:setRole(173)) - 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from FOLLOWER to CANDIDATE at term 0 for changeToCandidate 2020-01-09 23:31:35,165 [Thread-95] INFO impl.RoleInfo (RoleInfo.java:updateAndGet(143)) - 6b60526e-eae6-4f33-854d-fa396187085c: start LeaderElection 2020-01-09 23:31:35,176 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO impl.LeaderElection (LeaderElection.java:askForVotes(206)) - 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: begin an election at term 1 for -1: [6b60526e-eae6-4f33-854d-fa396187085c:localhost:9872], old=null 2020-01-09 23:31:35,177 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO impl.RoleInfo (RoleInfo.java:shutdownLeaderElection(134)) - 6b60526e-eae6-4f33-854d-fa396187085c: shutdown LeaderElection 2020-01-09 23:31:35,178 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO impl.RaftServerImpl (RaftServerImpl.java:setRole(173)) - 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from CANDIDATE to LEADER at term 1 for changeToLeader 2020-01-09 23:31:35,178 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO impl.RaftServerImpl (ServerState.java:setLeader(255)) - 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: change Leader from null to 6b60526e-eae6-4f33-854d-fa396187085c at term 1 for becomeLeader, leader elected after 1269ms ... 2020-01-09 23:31:35,217 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO impl.LeaderElection (LeaderElection.java:run(165)) - 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: ConcurrentModificationException is safely ignored since this is already CLOSING java.util.ConcurrentModificationException at java.util.ArrayList.forEach(ArrayList.java:1260) at org.apache.ratis.metrics.impl.MetricRegistriesImpl.lambda$create$1(MetricRegistriesImpl.java:66) at org.apache.ratis.metrics.impl.RefCountingMap.lambda$put$0(RefCountingMap.java:51) at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853) at org.apache.ratis.metrics.impl.RefCountingMap.put(RefCountingMap.java:46) at org.apache.ratis.metrics.impl.MetricRegistriesImpl.create(MetricRegistriesImpl.java:59) at org.apache.ratis.server.metrics.RatisMetrics.create(RatisMetrics.java:45) at org.apache.ratis.server.metrics.RatisMetrics.getMetricRegistryForLogAppender(RatisMetrics.java:82) at org.apache.ratis.server.metrics.LogAppenderMetrics.<init>(LogAppenderMetrics.java:32) at org.apache.ratis.server.impl.LeaderState.<init>(LeaderState.java:221) at org.apache.ratis.server.impl.RoleInfo.startLeaderState(RoleInfo.java:94) at org.apache.ratis.server.impl.RaftServerImpl.changeToLeader(RaftServerImpl.java:348) at org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:238) at org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:161) at java.lang.Thread.run(Thread.java:748) ... 2020-01-09 23:31:48,567 ... 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E is in LEADER state but not ready yet. {code} was: It seems single-node Ratis can get stuck if exception happens while becoming leader. {{LeaderElection}} ignores the exception because it is already shut down after successful vote. I guess 3-node Ratis might be able to recover. {code} 2020-01-09 23:31:35,160 [Thread-95] INFO impl.FollowerState (FollowerState.java:run(108)) - 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-FollowerState: change to CANDIDATE, lastRpcTime:1117ms, electionTimeout:1103ms 2020-01-09 23:31:35,161 [Thread-95] INFO impl.RoleInfo (RoleInfo.java:shutdownFollowerState(121)) - 6b60526e-eae6-4f33-854d-fa396187085c: shutdown FollowerState 2020-01-09 23:31:35,161 [Thread-95] INFO impl.RaftServerImpl (RaftServerImpl.java:setRole(173)) - 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from FOLLOWER to CANDIDATE at term 0 for changeToCandidate 2020-01-09 23:31:35,165 [Thread-95] INFO impl.RoleInfo (RoleInfo.java:updateAndGet(143)) - 6b60526e-eae6-4f33-854d-fa396187085c: start LeaderElection 2020-01-09 23:31:35,176 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO impl.LeaderElection (LeaderElection.java:askForVotes(206)) - 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: begin an election at term 1 for -1: [6b60526e-eae6-4f33-854d-fa396187085c:localhost:9872], old=null 2020-01-09 23:31:35,177 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO impl.RoleInfo (RoleInfo.java:shutdownLeaderElection(134)) - 6b60526e-eae6-4f33-854d-fa396187085c: shutdown LeaderElection 2020-01-09 23:31:35,178 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO impl.RaftServerImpl (RaftServerImpl.java:setRole(173)) - 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from CANDIDATE to LEADER at term 1 for changeToLeader 2020-01-09 23:31:35,178 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO impl.RaftServerImpl (ServerState.java:setLeader(255)) - 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: change Leader from null to 6b60526e-eae6-4f33-854d-fa396187085c at term 1 for becomeLeader, leader elected after 1269ms 2020-01-09 23:31:35,183 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO server.RaftServerConfigKeys (ConfUtils.java:logGet(43)) - raft.server.staging.catchup.gap = 1000 (default) 2020-01-09 23:31:35,185 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO server.RaftServerConfigKeys (ConfUtils.java:logGet(43)) - raft.server.rpc.sleep.time = 25ms (default) 2020-01-09 23:31:35,217 [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO impl.LeaderElection (LeaderElection.java:run(165)) - 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: ConcurrentModificationException is safely ignored since this is already CLOSING java.util.ConcurrentModificationException at java.util.ArrayList.forEach(ArrayList.java:1260) at org.apache.ratis.metrics.impl.MetricRegistriesImpl.lambda$create$1(MetricRegistriesImpl.java:66) at org.apache.ratis.metrics.impl.RefCountingMap.lambda$put$0(RefCountingMap.java:51) at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853) at org.apache.ratis.metrics.impl.RefCountingMap.put(RefCountingMap.java:46) at org.apache.ratis.metrics.impl.MetricRegistriesImpl.create(MetricRegistriesImpl.java:59) at org.apache.ratis.server.metrics.RatisMetrics.create(RatisMetrics.java:45) at org.apache.ratis.server.metrics.RatisMetrics.getMetricRegistryForLogAppender(RatisMetrics.java:82) at org.apache.ratis.server.metrics.LogAppenderMetrics.<init>(LogAppenderMetrics.java:32) at org.apache.ratis.server.impl.LeaderState.<init>(LeaderState.java:221) at org.apache.ratis.server.impl.RoleInfo.startLeaderState(RoleInfo.java:94) at org.apache.ratis.server.impl.RaftServerImpl.changeToLeader(RaftServerImpl.java:348) at org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:238) at org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:161) at java.lang.Thread.run(Thread.java:748) ... 2020-01-09 23:31:48,567 ... 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E is in LEADER state but not ready yet. {code} > Server stuck due to exception while becoming leader > --------------------------------------------------- > > Key: RATIS-788 > URL: https://issues.apache.org/jira/browse/RATIS-788 > Project: Ratis > Issue Type: Bug > Components: server > Reporter: Attila Doroszlai > Priority: Major > Attachments: > org.apache.hadoop.ozone.scm.TestSCMContainerPlacementPolicyMetrics-output.txt, > org.apache.hadoop.ozone.scm.TestSCMContainerPlacementPolicyMetrics.txt > > > It seems if any exception happens while becoming leader, a single-node Ratis > group can get stuck: {{in LEADER state but not ready yet}}. Or is there some > timeout that lets Ratis get out of this state? I guess 3-node Ratis might be > able to recover. > {{LeaderElection}} ignores the exception because it is already shut down > after successful vote. > The {{ConcurrentModificationException}} which triggered it in this specific > case is being fixed in RATIS-789. > {code} > 2020-01-09 23:31:35,161 [Thread-95] INFO impl.RaftServerImpl > (RaftServerImpl.java:setRole(173)) - > 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from > FOLLOWER to CANDIDATE at term 0 for changeToCandidate > 2020-01-09 23:31:35,165 [Thread-95] INFO impl.RoleInfo > (RoleInfo.java:updateAndGet(143)) - 6b60526e-eae6-4f33-854d-fa396187085c: > start LeaderElection > 2020-01-09 23:31:35,176 > [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] > INFO impl.LeaderElection (LeaderElection.java:askForVotes(206)) - > 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: > begin an election at term 1 for -1: > [6b60526e-eae6-4f33-854d-fa396187085c:localhost:9872], old=null > 2020-01-09 23:31:35,177 > [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] > INFO impl.RoleInfo (RoleInfo.java:shutdownLeaderElection(134)) - > 6b60526e-eae6-4f33-854d-fa396187085c: shutdown LeaderElection > 2020-01-09 23:31:35,178 > [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] > INFO impl.RaftServerImpl (RaftServerImpl.java:setRole(173)) - > 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from > CANDIDATE to LEADER at term 1 for changeToLeader > 2020-01-09 23:31:35,178 > [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] > INFO impl.RaftServerImpl (ServerState.java:setLeader(255)) - > 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: change Leader from > null to 6b60526e-eae6-4f33-854d-fa396187085c at term 1 for becomeLeader, > leader elected after 1269ms > ... > 2020-01-09 23:31:35,217 > [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] > INFO impl.LeaderElection (LeaderElection.java:run(165)) - > 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: > ConcurrentModificationException is safely ignored since this is already > CLOSING > java.util.ConcurrentModificationException > at java.util.ArrayList.forEach(ArrayList.java:1260) > at > org.apache.ratis.metrics.impl.MetricRegistriesImpl.lambda$create$1(MetricRegistriesImpl.java:66) > at > org.apache.ratis.metrics.impl.RefCountingMap.lambda$put$0(RefCountingMap.java:51) > at > java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853) > at > org.apache.ratis.metrics.impl.RefCountingMap.put(RefCountingMap.java:46) > at > org.apache.ratis.metrics.impl.MetricRegistriesImpl.create(MetricRegistriesImpl.java:59) > at > org.apache.ratis.server.metrics.RatisMetrics.create(RatisMetrics.java:45) > at > org.apache.ratis.server.metrics.RatisMetrics.getMetricRegistryForLogAppender(RatisMetrics.java:82) > at > org.apache.ratis.server.metrics.LogAppenderMetrics.<init>(LogAppenderMetrics.java:32) > at org.apache.ratis.server.impl.LeaderState.<init>(LeaderState.java:221) > at > org.apache.ratis.server.impl.RoleInfo.startLeaderState(RoleInfo.java:94) > at > org.apache.ratis.server.impl.RaftServerImpl.changeToLeader(RaftServerImpl.java:348) > at > org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:238) > at > org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:161) > at java.lang.Thread.run(Thread.java:748) > ... > 2020-01-09 23:31:48,567 ... > 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E is in LEADER state > but not ready yet. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)