[ https://issues.apache.org/jira/browse/STORM-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Satish Duggana updated STORM-2400:
----------------------------------
    Description: 
This issue has been reported to Curator as CURATOR-358.

org.apache.curator.framework.recipes.leader.LeaderLatch#getLeader()
intermittently throws a KeeperException with Code#NONODE, as shown in the stack
trace below. The participant's ephemeral ZK node may have been removed because
its connection/session was closed.

The relevant code is at
https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L451

{code}
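public Participant getLeader() throws Exception
{
    // List the current participant znodes under the latch path.
    Collection<String> participantNodes = LockInternals.getParticipantNodes(client, latchPath, LOCK_NAME, sorter);
    // Read the participant data; per the stack trace below, this getData() call
    // is where NoNode is thrown if a listed node has since been removed.
    return LeaderSelector.getLeader(client, participantNodes);
}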
{code}

This looks like a race condition: the participant nodes are listed first, but
by the time LeaderSelector#getLeader() reads a participant's data, that node
may already have been removed because of a session timeout, so the read throws
a KeeperException with the NoNode code. The call is not retried because the
RetryLoop retries only on connection/session timeouts, but in this case NoNode
should have been retried. I could not find any API on CuratorClient to
configure which KeeperException codes are retried. It would be good to have a
way to specify the retryable errors through the
org.apache.curator.framework.CuratorFrameworkFactory.Builder APIs.

The intermittent exception was found with the following stack trace:

{noformat}
2016-11-15 06:09:33.954 o.a.s.d.nimbus [ERROR] Error when processing event
org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /storm/leader-lock/_c_97c09eed-5bba-4ac8-a05f-abdc4e8e95cf-latch-0000000002
    at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:304)
    at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:293)
    at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)
    at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:290)
    at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:281)
    at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:42)
    at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:375)
    at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:346)
    at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:454)
{noformat}
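
The description argues that NoNode should be retried in this situation. As a
rough illustration only, a caller-side retry wrapper might look like the sketch
below. It is not Curator's RetryLoop and not the fix applied for this issue.
The class LeaderLookupRetry, the method callWithRetry, and the nested
NoNodeException are hypothetical names; the exception is a stand-in for
ZooKeeper's KeeperException.NoNodeException so that the example compiles
without ZooKeeper on the classpath.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper: retries a call only for exceptions the caller marks as retryable.
final class LeaderLookupRetry {

    /**
     * Hypothetical stand-in for ZooKeeper's KeeperException.NoNodeException,
     * defined here only so the sketch is self-contained.
     */
    static final class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    private LeaderLookupRetry() {}

    /**
     * Invokes {@code call} up to {@code maxAttempts} times. An exception accepted by
     * {@code retryable} triggers another attempt after {@code sleepMs} milliseconds;
     * any other exception is rethrown immediately.
     */
    static <T> T callWithRetry(Callable<T> call, Predicate<Exception> retryable,
                               int maxAttempts, long sleepMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (!retryable.test(e)) {
                    throw e; // not a transient error: propagate unchanged
                }
                last = e; // remember the failure in case all attempts are used up
            }
            if (attempt < maxAttempts) {
                Thread.sleep(sleepMs);
            }
        }
        throw last; // every attempt failed with a retryable exception
    }
}
```

With Curator on the classpath, the call would presumably look like
callWithRetry(latch::getLeader, e -> e instanceof KeeperException.NoNodeException, 3, 100L),
where latch is a LeaderLatch. Retrying the whole getLeader() call re-lists the
participant nodes, so a node that disappeared between the listing and the read
is not looked up again.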


> Intermittent failure in nimbus because of errors from LeaderLatch#getLeader()
> -----------------------------------------------------------------------------
>
>                 Key: STORM-2400
>                 URL: https://issues.apache.org/jira/browse/STORM-2400
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Satish Duggana
>            Assignee: Satish Duggana
>             Fix For: 2.0.0, 1.1.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
