[ 
https://issues.apache.org/jira/browse/CURATOR-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635571#comment-14635571
 ] 

Jason Separovic commented on CURATOR-209:
-----------------------------------------

We are facing with the same issue in a 3 nodes cluster. we see a retry attempt 
and stack trace every millisecond

> Background retry falls into infinite loop of reconnection after connection 
> loss
> -------------------------------------------------------------------------------
>
>                 Key: CURATOR-209
>                 URL: https://issues.apache.org/jira/browse/CURATOR-209
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 2.6.0
>         Environment: sun java jdk 1.7.0_55, curator 2.6.0, zookeeper 3.3.6 on 
> AWS EC2 in a 3 box ensemble
>            Reporter: Ryan Anderson
>            Priority: Critical
>              Labels: connectionloss, loop, reconnect
>
> We've been unable to replicate this in our test environments, but 
> approximately once a week in production (~50 machine cluster using curator/zk 
> for service discovery) we will get a machine falling into a loop and spewing 
> tens of thousands of errors that look like:
> {code}
> Background operation retry gave 
> uporg.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) 
> ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:496)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:538)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl.access$700(CreateBuilderImpl.java:44)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:497)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) 
> [zookeeper-3.4.6.jar:3.4.6-1569965]
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
> [zookeeper-3.4.6.jar:3.4.6-1569965]
> {code}
> The rate at which we get these errors seems to increase linearly until we 
> stop the process (starts at 10-20/sec, when we kill the box it's typically 
> generating 1,000+/sec)
> When the error first occurs, there's a slightly different stack trace:
> {code}
> Background operation retry gave 
> uporg.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) 
> ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:813)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265)
>  [curator-framework-2.6.0.jar:na]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_55]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_55]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> followed very closely by:
> {code}
> Background retry gave uporg.apache.curator.CuratorConnectionLossException: 
> KeeperErrorCode = ConnectionLoss
> at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:796)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58)
>  [curator-framework-2.6.0.jar:na]
> at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265)
>  [curator-framework-2.6.0.jar:na]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_55]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_55]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> After which it begins spewing the stack trace I first posted above. We're 
> assuming that some sort of networking hiccup is occurring in EC2 that's 
> causing the ConnectionLoss, which seems entirely momentary (none of our other 
> boxes see it, and when we check the box it can connect to all the zk servers 
> without any issues.) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to