[ https://issues.apache.org/jira/browse/CURATOR-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635946#comment-14635946 ]

Ryan Anderson commented on CURATOR-209:
---------------------------------------

As far as we've been able to tell, it's not something we're explicitly calling.
We use PersistentEphemeralNode for tracking services, and what appears to be
happening is that when this momentary network drop occurs, the
PersistentEphemeralNode gets caught in some sort of loop attempting to
reconnect and re-establish the node. For a retry policy we're using
ExponentialBackoffRetry(1000, 3).
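
For context, our setup looks roughly like the following. This is a minimal
sketch rather than our production code; the connect string, node path, and
payload are hypothetical placeholders.
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Client with the retry policy mentioned above: 1000 ms base sleep, 3 retries.
CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",            // hypothetical ensemble
        new ExponentialBackoffRetry(1000, 3));
client.start();

// One ephemeral node per service instance; Curator re-creates it as needed.
PersistentEphemeralNode node = new PersistentEphemeralNode(
        client,
        PersistentEphemeralNode.Mode.EPHEMERAL,
        "/services/my-service/instance-1",       // hypothetical path
        "payload".getBytes());
node.start();
{code}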

Since I originally reported this issue, we've seen it several more times, and
the only additional data I've been able to learn about it is that it appears to
be an exponentially growing loop: one background thread fails and retries,
spawning another retry thread, and then both fail and double again, so the
error happens faster and faster until the machine falls over. In Curator's
logging we can see a sequence of three state changes, "SUSPENDED", "LOST",
"RECONNECTED", repeating over and over, faster and faster.



> Background retry falls into infinite loop of reconnection after connection loss
> -------------------------------------------------------------------------------
>
>                 Key: CURATOR-209
>                 URL: https://issues.apache.org/jira/browse/CURATOR-209
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 2.6.0
>         Environment: sun java jdk 1.7.0_55, curator 2.6.0, zookeeper 3.3.6 on 
> AWS EC2 in a 3 box ensemble
>            Reporter: Ryan Anderson
>            Priority: Critical
>              Labels: connectionloss, loop, reconnect
>
> We've been unable to replicate this in our test environments, but 
> approximately once a week in production (~50 machine cluster using curator/zk 
> for service discovery) we will get a machine falling into a loop and spewing 
> tens of thousands of errors that look like:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:496) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:538) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CreateBuilderImpl.access$700(CreateBuilderImpl.java:44) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:497) [curator-framework-2.6.0.jar:na]
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) [zookeeper-3.4.6.jar:3.4.6-1569965]
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) [zookeeper-3.4.6.jar:3.4.6-1569965]
> {code}
> The rate at which we get these errors seems to increase linearly until we
> stop the process (it starts at 10-20/sec; by the time we kill the box it's
> typically generating 1,000+/sec).
> When the error first occurs, there's a slightly different stack trace:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:813) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> followed very closely by:
> {code}
> Background retry gave up
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:796) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> After that, it begins spewing the stack trace I first posted above. We're
> assuming that some sort of networking hiccup in EC2 is causing the
> ConnectionLoss, and it seems entirely momentary: none of our other boxes
> see it, and when we check the affected box it can connect to all of the zk
> servers without any issues.
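
If the trigger really is a momentary ConnectionLoss, one possible mitigation
(speculative, and not a confirmed fix for this bug) is to stop the node's
retry machinery on LOST and rebuild it once the connection comes back. A
sketch, reusing the hypothetical client and path from above; a closed
PersistentEphemeralNode cannot be restarted, so a fresh instance is created:
{code}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

final AtomicReference<PersistentEphemeralNode> ref =
        new AtomicReference<PersistentEphemeralNode>(node);
final AtomicBoolean closed = new AtomicBoolean(false);

client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
    @Override
    public void stateChanged(CuratorFramework c, ConnectionState newState) {
        if (newState == ConnectionState.LOST && closed.compareAndSet(false, true)) {
            try {
                ref.get().close();  // stop the old node's background retries
            } catch (IOException ignored) {
                // best effort; the session is gone anyway
            }
        } else if (newState == ConnectionState.RECONNECTED
                   && closed.compareAndSet(true, false)) {
            // Recreate only if we closed the previous instance on LOST.
            PersistentEphemeralNode fresh = new PersistentEphemeralNode(
                    c, PersistentEphemeralNode.Mode.EPHEMERAL,
                    "/services/my-service/instance-1",  // hypothetical path
                    "payload".getBytes());
            fresh.start();
            ref.set(fresh);
        }
    }
});
{code}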



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
