[ https://issues.apache.org/jira/browse/CURATOR-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635571#comment-14635571 ]
Jason Separovic commented on CURATOR-209:
-----------------------------------------

We are facing the same issue in a 3-node cluster. We see a retry attempt and stack trace every millisecond.

> Background retry falls into infinite loop of reconnection after connection loss
> -------------------------------------------------------------------------------
>
>                 Key: CURATOR-209
>                 URL: https://issues.apache.org/jira/browse/CURATOR-209
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 2.6.0
>         Environment: sun java jdk 1.7.0_55, curator 2.6.0, zookeeper 3.3.6 on AWS EC2 in a 3 box ensemble
>            Reporter: Ryan Anderson
>            Priority: Critical
>              Labels: connectionloss, loop, reconnect
>
> We've been unable to replicate this in our test environments, but approximately once a week in production (~50 machine cluster using curator/zk for service discovery) we will get a machine falling into a loop and spewing tens of thousands of errors that look like:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:496) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:538) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CreateBuilderImpl.access$700(CreateBuilderImpl.java:44) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:497) [curator-framework-2.6.0.jar:na]
> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) [zookeeper-3.4.6.jar:3.4.6-1569965]
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) [zookeeper-3.4.6.jar:3.4.6-1569965]
> {code}
> The rate at which we get these errors seems to increase linearly until we stop the process (it starts at 10-20/sec; by the time we kill the box it's typically generating 1,000+/sec).
> When the error first occurs, there's a slightly different stack trace:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:813) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> followed very closely by:
> {code}
> Background retry gave up
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:796) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> After which it begins spewing the stack trace I first posted above. We're assuming that some sort of networking hiccup is occurring in EC2 that's causing the ConnectionLoss, which seems entirely momentary (none of our other boxes see it, and when we check the box it can connect to all the zk servers without any issues).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
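The failure mode described above is the background retry path resubmitting the failed operation faster and faster instead of backing off and eventually giving up. The intended behavior is what a bounded exponential-backoff retry policy (such as Curator's `ExponentialBackoffRetry`) provides: each attempt sleeps longer than the last, and after a fixed number of retries the client stops. The sketch below is a stdlib-only illustration of that idea under assumed names and an assumed jitter formula (`BoundedRetrySketch`, `baseSleepMs`, `maxRetries`) — it is not Curator's actual implementation.

```java
import java.util.Random;

// Stdlib-only sketch of a bounded exponential-backoff retry policy,
// similar in spirit to Curator's ExponentialBackoffRetry. The class
// name, fields, and jitter formula are illustrative assumptions.
public class BoundedRetrySketch {
    private final int baseSleepMs;
    private final int maxRetries;
    private final Random random = new Random();

    public BoundedRetrySketch(int baseSleepMs, int maxRetries) {
        this.baseSleepMs = baseSleepMs;
        this.maxRetries = maxRetries;
    }

    // A hard cap on attempts guarantees the client eventually gives up
    // instead of looping on every ConnectionLoss forever.
    public boolean allowRetry(int retryCount) {
        return retryCount < maxRetries;
    }

    // The sleep bound doubles with each attempt (with random jitter),
    // so retries back off instead of firing every millisecond.
    public long sleepMsForRetry(int retryCount) {
        return (long) baseSleepMs * Math.max(1, random.nextInt(1 << (retryCount + 1)));
    }

    public static void main(String[] args) {
        BoundedRetrySketch policy = new BoundedRetrySketch(100, 3);
        int attempt = 0;
        while (policy.allowRetry(attempt)) {
            System.out.println("retry " + attempt + " after up to "
                    + policy.sleepMsForRetry(attempt) + " ms");
            attempt++;
        }
        System.out.println("giving up after " + attempt + " retries");
    }
}
```

The key property for this bug is the loop-termination guarantee in `allowRetry`: if the retry path ever resubmits work without consulting such a bound (or resets the count on each resubmission), the loop never ends, which matches the ever-accelerating error rate reported above.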