Ryan Anderson created CURATOR-209:
-------------------------------------

             Summary: Background retry falls into infinite loop of reconnection after connection loss
                 Key: CURATOR-209
                 URL: https://issues.apache.org/jira/browse/CURATOR-209
             Project: Apache Curator
          Issue Type: Bug
          Components: Framework
    Affects Versions: 2.6.0
         Environment: sun java jdk 1.7.0_55, curator 2.6.0, zookeeper 3.3.6 on AWS EC2 in a 3 box ensemble
            Reporter: Ryan Anderson
            Priority: Critical


We've been unable to replicate this in our test environments, but roughly once a week in production (a ~50-machine cluster using Curator/ZooKeeper for service discovery) one machine falls into a loop and spews tens of thousands of errors that look like:

{code}
Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:496) [curator-framework-2.6.0.jar:na]
    at org.apache.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:538) [curator-framework-2.6.0.jar:na]
    at org.apache.curator.framework.imps.CreateBuilderImpl.access$700(CreateBuilderImpl.java:44) [curator-framework-2.6.0.jar:na]
    at org.apache.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:497) [curator-framework-2.6.0.jar:na]
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) [zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) [zookeeper-3.4.6.jar:3.4.6-1569965]
{code}
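
For context, these errors all surface in the callbacks of background create() operations (CreateBuilderImpl in the trace above). Our actual registration code isn't easy to extract, but it's roughly equivalent to the sketch below; the connection string, timeouts, retry policy, and paths are illustrative, not our real values.

{code}
// Illustrative sketch only -- not our production code. It shows the kind of
// background create() with a retry policy that produces the callbacks above.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.api.BackgroundCallback;
import org.apache.curator.framework.api.CuratorEvent;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public class ServiceRegistrationSketch
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical connection string, timeouts, and retry policy
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181", 60000, 15000,
            new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Background create of an ephemeral registration node; the callback
        // below corresponds to the CreateBuilderImpl processResult frame in the trace
        BackgroundCallback callback = new BackgroundCallback()
        {
            @Override
            public void processResult(CuratorFramework c, CuratorEvent event)
            {
                System.out.println("create result code: " + event.getResultCode()
                    + " path: " + event.getPath());
            }
        };

        client.create()
            .creatingParentsIfNeeded()
            .withMode(CreateMode.EPHEMERAL)
            .inBackground(callback)
            .forPath("/services/example/instance-1", new byte[0]);   // hypothetical path

        Thread.sleep(Long.MAX_VALUE);
    }
}
{code}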

The rate at which we get these errors seems to increase linearly until we stop the process: it starts at 10-20/sec, and by the time we kill the box it is typically generating 1,000+/sec.

When the error first occurs, there's a slightly different stack trace:

{code}
Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:813) [curator-framework-2.6.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
{code}

followed very closely by:

{code}
Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:796) [curator-framework-2.6.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
{code}

After that, it begins spewing the first stack trace posted above. We assume some sort of networking hiccup in EC2 is causing the ConnectionLoss, and it appears to be entirely momentary: none of our other boxes see it, and when we check the affected box it can connect to all of the ZooKeeper servers without any issues.
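
We don't have detailed connection-state logging on these boxes yet. If it's useful for triage, a listener along the lines of the sketch below is what we'd add to measure how brief the SUSPENDED/RECONNECTED window actually is; the class name and logging are illustrative, not existing code.

{code}
// Sketch only -- a connection state listener we could add to timestamp how
// long the SUSPENDED window actually lasts on the affected box.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

public class ConnectionBlipLogger implements ConnectionStateListener
{
    private volatile long suspendedAt = -1L;

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState)
    {
        long now = System.currentTimeMillis();
        if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST)
        {
            suspendedAt = now;
            System.out.println(now + " connection " + newState);
        }
        else if (newState == ConnectionState.RECONNECTED)
        {
            long downMs = (suspendedAt > 0) ? (now - suspendedAt) : -1;
            System.out.println(now + " RECONNECTED after " + downMs + " ms");
        }
    }
}

// Registered via: client.getConnectionStateListenable().addListener(new ConnectionBlipLogger());
{code}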



