[ https://issues.apache.org/jira/browse/STORM-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Satish Duggana updated STORM-2400:
----------------------------------
    Description:
This issue has been reported to Curator as CURATOR-358.

org.apache.curator.framework.recipes.leader.LeaderLatch#getLeader() intermittently throws KeeperException with Code#NONODE, as shown in the stack trace below. A participant's ephemeral ZK node may have been removed because its connection/session was closed.

The relevant code is at https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L451
{code}
public Participant getLeader() throws Exception
{
    Collection<String> participantNodes = LockInternals.getParticipantNodes(client, latchPath, LOCK_NAME, sorter);
    return LeaderSelector.getLeader(client, participantNodes);
}
{code}
I suspect a race condition: the participant nodes are retrieved, but by the time LeaderSelector#getLeader() reads one of them, that node has already been removed because of a session timeout, so the read throws KeeperException with the NoNode code. The call is not retried because RetryLoop retries only on connection/session timeouts, although in this case NoNode should have been retried. I could not find any API on CuratorClient to configure which KeeperException codes are retried. It would be good to be able to specify the retryable error codes through the org.apache.curator.framework.CuratorFrameworkFactory.Builder APIs.
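Until such a configuration option exists, a caller could retry the whole getLeader() call itself, so that the participant list is fetched again on each attempt. The following is only a minimal sketch of that idea, not Curator's or Storm's actual fix. The class RetryOnMissingNode, the method callWithRetry, and the nested NoNodeException are hypothetical names; the nested exception is a stand-in for org.apache.zookeeper.KeeperException.NoNodeException so that the sketch compiles without ZooKeeper on the classpath.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries an operation only when it fails with NoNode.
final class RetryOnMissingNode {

    // Stand-in (assumption) for org.apache.zookeeper.KeeperException.NoNodeException.
    public static final class NoNodeException extends Exception {
        public NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    private RetryOnMissingNode() {
    }

    // Runs op up to maxAttempts times, sleeping sleepMs between attempts.
    // Only NoNodeException triggers a retry; any other exception propagates
    // immediately. If every attempt fails, the last NoNodeException is rethrown.
    public static <T> T callWithRetry(Callable<T> op, int maxAttempts, long sleepMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        NoNodeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (NoNodeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMs);
                }
            }
        }
        throw last;
    }
}
```

With the real classes, the catch clause would name KeeperException.NoNodeException and the call site might look like callWithRetry(() -> leaderLatch.getLeader(), 3, 100), where leaderLatch is the application's LeaderLatch instance and the attempt count and delay are arbitrary illustrative values.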
Intermittent Exception found with the stack trace:
{noformat}
2016-11-15 06:09:33.954 o.a.s.d.nimbus [ERROR] Error when processing event
org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /storm/leader-lock/_c_97c09eed-5bba-4ac8-a05f-abdc4e8e95cf-latch-0000000002
        at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:304)
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:293)
        at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:290)
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:281)
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:42)
        at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:375)
        at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:346)
        at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:454)
{noformat}

  was:
org.apache.curator.framework.recipes.leader.LeaderLatch#getLeader() throws KeeperException with Code#NONODE intermittently as mentioned in the stack trace below. It may be possible participant's ephemeral ZK node is removed because its connection/session is closed.
You can see the below code at https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L451
{code}
public Participant getLeader() throws Exception
{
    Collection<String> participantNodes = LockInternals.getParticipantNodes(client, latchPath, LOCK_NAME, sorter);
    return LeaderSelector.getLeader(client, participantNodes);
}
{code}
I guess it hits a race condition where a participant node is retrieved but when it invokes LeaderSelector#getLeader() it would have been removed because of session timeout and it throws KeeperException with NoNode code. It does not retry as the RetryLoop retries only for connection/session timeouts. But in this case, NoNode should have been retried. I could not find any APIs on CuratorClient to configure the kind of KeeperException codes to be retried. It may be good to have a way to take what kind of errors should be retried in org.apache.curator.framework.CuratorFrameworkFactory.Builder APIs.
Intermittent Exception found with the stack trace:
{noformat}
2016-11-15 06:09:33.954 o.a.s.d.nimbus [ERROR] Error when processing event
org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /storm/leader-lock/_c_97c09eed-5bba-4ac8-a05f-abdc4e8e95cf-latch-0000000002
        at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:304)
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:293)
        at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:290)
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:281)
        at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:42)
        at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:375)
        at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:346)
        at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:454)
{noformat}


> Intermittent failure in nimbus because of errors from LeaderLatch#getLeader()
> -----------------------------------------------------------------------------
>
>                 Key: STORM-2400
>                 URL: https://issues.apache.org/jira/browse/STORM-2400
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Satish Duggana
>            Assignee: Satish Duggana
>             Fix For: 2.0.0, 1.1.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)