[ 
https://issues.apache.org/jira/browse/RATIS-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Doroszlai updated RATIS-788:
-----------------------------------
    Description: 
It seems if any exception happens while becoming leader, a single-node Ratis 
group can get stuck: {{in LEADER state but not ready yet}}.  Or is there some 
timeout that lets Ratis get out of this state?  I guess 3-node Ratis might be 
able to recover.

{{LeaderElection}} ignores the exception because it is already shut down after 
successful vote.

The {{ConcurrentModificationException}} which triggered it in this specific 
case is being fixed in RATIS-789.

{code}
2020-01-09 23:31:35,161 [Thread-95] INFO  impl.RaftServerImpl 
(RaftServerImpl.java:setRole(173)) - 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from  
FOLLOWER to CANDIDATE at term 0 for changeToCandidate
2020-01-09 23:31:35,165 [Thread-95] INFO  impl.RoleInfo 
(RoleInfo.java:updateAndGet(143)) - 6b60526e-eae6-4f33-854d-fa396187085c: start 
LeaderElection
2020-01-09 23:31:35,176 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
impl.LeaderElection (LeaderElection.java:askForVotes(206)) - 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: begin 
an election at term 1 for -1: 
[6b60526e-eae6-4f33-854d-fa396187085c:localhost:9872], old=null
2020-01-09 23:31:35,177 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
impl.RoleInfo (RoleInfo.java:shutdownLeaderElection(134)) - 
6b60526e-eae6-4f33-854d-fa396187085c: shutdown LeaderElection
2020-01-09 23:31:35,178 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
impl.RaftServerImpl (RaftServerImpl.java:setRole(173)) - 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from 
CANDIDATE to LEADER at term 1 for changeToLeader
2020-01-09 23:31:35,178 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
impl.RaftServerImpl (ServerState.java:setLeader(255)) - 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: change Leader from 
null to 6b60526e-eae6-4f33-854d-fa396187085c at term 1 for becomeLeader, leader 
elected after 1269ms
...
2020-01-09 23:31:35,217 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
impl.LeaderElection (LeaderElection.java:run(165)) - 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: 
ConcurrentModificationException is safely ignored since this is already CLOSING
java.util.ConcurrentModificationException
        at java.util.ArrayList.forEach(ArrayList.java:1260)
        at 
org.apache.ratis.metrics.impl.MetricRegistriesImpl.lambda$create$1(MetricRegistriesImpl.java:66)
        at 
org.apache.ratis.metrics.impl.RefCountingMap.lambda$put$0(RefCountingMap.java:51)
        at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
        at 
org.apache.ratis.metrics.impl.RefCountingMap.put(RefCountingMap.java:46)
        at 
org.apache.ratis.metrics.impl.MetricRegistriesImpl.create(MetricRegistriesImpl.java:59)
        at 
org.apache.ratis.server.metrics.RatisMetrics.create(RatisMetrics.java:45)
        at 
org.apache.ratis.server.metrics.RatisMetrics.getMetricRegistryForLogAppender(RatisMetrics.java:82)
        at 
org.apache.ratis.server.metrics.LogAppenderMetrics.<init>(LogAppenderMetrics.java:32)
        at org.apache.ratis.server.impl.LeaderState.<init>(LeaderState.java:221)
        at 
org.apache.ratis.server.impl.RoleInfo.startLeaderState(RoleInfo.java:94)
        at 
org.apache.ratis.server.impl.RaftServerImpl.changeToLeader(RaftServerImpl.java:348)
        at 
org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:238)
        at 
org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:161)
        at java.lang.Thread.run(Thread.java:748)
...
2020-01-09 23:31:48,567 ... 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E is in LEADER state but 
not ready yet.
{code}

  was:
It seems single-node Ratis can get stuck if exception happens while becoming 
leader.  {{LeaderElection}} ignores the exception because it is already shut 
down after successful vote.  I guess 3-node Ratis might be able to recover.

{code}
2020-01-09 23:31:35,160 [Thread-95] INFO  impl.FollowerState 
(FollowerState.java:run(108)) - 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-FollowerState: change 
to CANDIDATE, lastRpcTime:1117ms, electionTimeout:1103ms
2020-01-09 23:31:35,161 [Thread-95] INFO  impl.RoleInfo 
(RoleInfo.java:shutdownFollowerState(121)) - 
6b60526e-eae6-4f33-854d-fa396187085c: shutdown FollowerState
2020-01-09 23:31:35,161 [Thread-95] INFO  impl.RaftServerImpl 
(RaftServerImpl.java:setRole(173)) - 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from  
FOLLOWER to CANDIDATE at term 0 for changeToCandidate
2020-01-09 23:31:35,165 [Thread-95] INFO  impl.RoleInfo 
(RoleInfo.java:updateAndGet(143)) - 6b60526e-eae6-4f33-854d-fa396187085c: start 
LeaderElection
2020-01-09 23:31:35,176 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
impl.LeaderElection (LeaderElection.java:askForVotes(206)) - 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: begin 
an election at term 1 for -1: 
[6b60526e-eae6-4f33-854d-fa396187085c:localhost:9872], old=null
2020-01-09 23:31:35,177 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
impl.RoleInfo (RoleInfo.java:shutdownLeaderElection(134)) - 
6b60526e-eae6-4f33-854d-fa396187085c: shutdown LeaderElection
2020-01-09 23:31:35,178 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
impl.RaftServerImpl (RaftServerImpl.java:setRole(173)) - 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from 
CANDIDATE to LEADER at term 1 for changeToLeader
2020-01-09 23:31:35,178 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
impl.RaftServerImpl (ServerState.java:setLeader(255)) - 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: change Leader from 
null to 6b60526e-eae6-4f33-854d-fa396187085c at term 1 for becomeLeader, leader 
elected after 1269ms
2020-01-09 23:31:35,183 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
server.RaftServerConfigKeys (ConfUtils.java:logGet(43)) - 
raft.server.staging.catchup.gap = 1000 (default)
2020-01-09 23:31:35,185 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
server.RaftServerConfigKeys (ConfUtils.java:logGet(43)) - 
raft.server.rpc.sleep.time = 25ms (default)
2020-01-09 23:31:35,217 
[6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] INFO  
impl.LeaderElection (LeaderElection.java:run(165)) - 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: 
ConcurrentModificationException is safely ignored since this is already CLOSING
java.util.ConcurrentModificationException
        at java.util.ArrayList.forEach(ArrayList.java:1260)
        at 
org.apache.ratis.metrics.impl.MetricRegistriesImpl.lambda$create$1(MetricRegistriesImpl.java:66)
        at 
org.apache.ratis.metrics.impl.RefCountingMap.lambda$put$0(RefCountingMap.java:51)
        at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
        at 
org.apache.ratis.metrics.impl.RefCountingMap.put(RefCountingMap.java:46)
        at 
org.apache.ratis.metrics.impl.MetricRegistriesImpl.create(MetricRegistriesImpl.java:59)
        at 
org.apache.ratis.server.metrics.RatisMetrics.create(RatisMetrics.java:45)
        at 
org.apache.ratis.server.metrics.RatisMetrics.getMetricRegistryForLogAppender(RatisMetrics.java:82)
        at 
org.apache.ratis.server.metrics.LogAppenderMetrics.<init>(LogAppenderMetrics.java:32)
        at org.apache.ratis.server.impl.LeaderState.<init>(LeaderState.java:221)
        at 
org.apache.ratis.server.impl.RoleInfo.startLeaderState(RoleInfo.java:94)
        at 
org.apache.ratis.server.impl.RaftServerImpl.changeToLeader(RaftServerImpl.java:348)
        at 
org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:238)
        at 
org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:161)
        at java.lang.Thread.run(Thread.java:748)
...
2020-01-09 23:31:48,567 ... 
6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E is in LEADER state but 
not ready yet.
{code}


> Server stuck due to exception while becoming leader
> ---------------------------------------------------
>
>                 Key: RATIS-788
>                 URL: https://issues.apache.org/jira/browse/RATIS-788
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Attila Doroszlai
>            Priority: Major
>         Attachments: 
> org.apache.hadoop.ozone.scm.TestSCMContainerPlacementPolicyMetrics-output.txt,
>  org.apache.hadoop.ozone.scm.TestSCMContainerPlacementPolicyMetrics.txt
>
>
> It seems if any exception happens while becoming leader, a single-node Ratis 
> group can get stuck: {{in LEADER state but not ready yet}}.  Or is there some 
> timeout that lets Ratis get out of this state?  I guess 3-node Ratis might be 
> able to recover.
> {{LeaderElection}} ignores the exception because it is already shut down 
> after successful vote.
> The {{ConcurrentModificationException}} which triggered it in this specific 
> case is being fixed in RATIS-789.
> {code}
> 2020-01-09 23:31:35,161 [Thread-95] INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:setRole(173)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from  
> FOLLOWER to CANDIDATE at term 0 for changeToCandidate
> 2020-01-09 23:31:35,165 [Thread-95] INFO  impl.RoleInfo 
> (RoleInfo.java:updateAndGet(143)) - 6b60526e-eae6-4f33-854d-fa396187085c: 
> start LeaderElection
> 2020-01-09 23:31:35,176 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  impl.LeaderElection (LeaderElection.java:askForVotes(206)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: 
> begin an election at term 1 for -1: 
> [6b60526e-eae6-4f33-854d-fa396187085c:localhost:9872], old=null
> 2020-01-09 23:31:35,177 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  impl.RoleInfo (RoleInfo.java:shutdownLeaderElection(134)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c: shutdown LeaderElection
> 2020-01-09 23:31:35,178 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  impl.RaftServerImpl (RaftServerImpl.java:setRole(173)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: changes role from 
> CANDIDATE to LEADER at term 1 for changeToLeader
> 2020-01-09 23:31:35,178 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  impl.RaftServerImpl (ServerState.java:setLeader(255)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E: change Leader from 
> null to 6b60526e-eae6-4f33-854d-fa396187085c at term 1 for becomeLeader, 
> leader elected after 1269ms
> ...
> 2020-01-09 23:31:35,217 
> [6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1] 
> INFO  impl.LeaderElection (LeaderElection.java:run(165)) - 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E-LeaderElection1: 
> ConcurrentModificationException is safely ignored since this is already 
> CLOSING
> java.util.ConcurrentModificationException
>       at java.util.ArrayList.forEach(ArrayList.java:1260)
>       at 
> org.apache.ratis.metrics.impl.MetricRegistriesImpl.lambda$create$1(MetricRegistriesImpl.java:66)
>       at 
> org.apache.ratis.metrics.impl.RefCountingMap.lambda$put$0(RefCountingMap.java:51)
>       at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
>       at 
> org.apache.ratis.metrics.impl.RefCountingMap.put(RefCountingMap.java:46)
>       at 
> org.apache.ratis.metrics.impl.MetricRegistriesImpl.create(MetricRegistriesImpl.java:59)
>       at 
> org.apache.ratis.server.metrics.RatisMetrics.create(RatisMetrics.java:45)
>       at 
> org.apache.ratis.server.metrics.RatisMetrics.getMetricRegistryForLogAppender(RatisMetrics.java:82)
>       at 
> org.apache.ratis.server.metrics.LogAppenderMetrics.<init>(LogAppenderMetrics.java:32)
>       at org.apache.ratis.server.impl.LeaderState.<init>(LeaderState.java:221)
>       at 
> org.apache.ratis.server.impl.RoleInfo.startLeaderState(RoleInfo.java:94)
>       at 
> org.apache.ratis.server.impl.RaftServerImpl.changeToLeader(RaftServerImpl.java:348)
>       at 
> org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:238)
>       at 
> org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:161)
>       at java.lang.Thread.run(Thread.java:748)
> ...
> 2020-01-09 23:31:48,567 ... 
> 6b60526e-eae6-4f33-854d-fa396187085c@group-C5BA1605619E is in LEADER state 
> but not ready yet.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to