[ https://issues.apache.org/jira/browse/ACCUMULO-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Vines updated ACCUMULO-1449:
---------------------------------
    Fix Version/s:     (was: 1.5.1)
                       (was: 1.6.0)
                   1.7.0

> Connector/ZooCache code enters infinite loop when Zookeeper connection lost.
> ----------------------------------------------------------------------------
>
>                 Key: ACCUMULO-1449
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1449
>             Project: Accumulo
>          Issue Type: Sub-task
>          Components: client
>   Affects Versions: 1.5.0
>         Environment: accumulo-1.5.0-RC4, zookeeper-3.4.5, hadoop-1.0.4, CentOS 6.4
>            Reporter: Luke Brassard
>             Fix For: 1.7.0
>
>
> While using 1.5.0-RC4 a long-lived {{Connector}} went into an infinite loop
> of Zookeeper "ConnectionLoss" and "Session expired" failures. In a
> multithreaded application, all using the same {{Connector}}, there were
> errors whenever there were calls to {{conn.createScanner()}} and
> {{conn.createBatchScanner()}}. Here are a couple stacktraces:
> {code}
> 013-05-22 09:12:28,250 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
>     at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
>     at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
>     at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
>     at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
>     at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
>     at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
>     at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
>     at org.apache.accumulo.core.client.impl.ConnectorImpl.createScanner(ConnectorImpl.java:137)
> {code}
> {code}
> 2013-05-22 09:12:23,849 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
>     at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
>     at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
>     at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
>     at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
>     at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
>     at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
>     at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
>     at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
>     at org.apache.accumulo.core.client.impl.ConnectorImpl.createBatchScanner(ConnectorImpl.java:89)
> {code}
> The method {{ZooCache.retry(ZooRunnable op)}} (ZooCache.java:128) has a
> {{while(true)}} loop that should probably have a max retries or timeout that
> will eventually cause the method to throw an exception that can be handled
> appropriately by the client. As it is currently, this loop will never be
> exited when Zookeeper continues to error.
> Note: There may have been a network hiccup that triggered the bug, but there
> was no way to recover without restarting the application.
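
To illustrate the "max retries or timeout" suggestion in the report, here is a minimal sketch of a bounded retry loop with capped exponential backoff. It is not the actual ZooCache code, nor the fix that was eventually applied: the class name BoundedRetry, the RetriesExhaustedException type, and the maxAttempts and sleep parameters are all hypothetical, and a plain java.util.concurrent.Callable stands in for ZooRunnable.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retries an operation a bounded number of times. */
public class BoundedRetry {

    /** Thrown when the operation still fails after the last allowed attempt. */
    public static class RetriesExhaustedException extends Exception {
        public RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs op up to maxAttempts times, sleeping between failed attempts.
     * The sleep doubles after each failure, capped at maxSleepMillis.
     * Unlike a while(true) loop, this always terminates: it either returns
     * a result or throws so the caller can decide how to recover.
     */
    public static <T> T retry(Callable<T> op, int maxAttempts,
                              long initialSleepMillis, long maxSleepMillis)
            throws RetriesExhaustedException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long sleepMillis = initialSleepMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Do not swallow interrupts; let the caller see them.
                throw e;
            } catch (Exception e) {
                // Remember the failure (e.g. a connection-loss error) and retry.
                last = e;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(sleepMillis);
                sleepMillis = Math.min(maxSleepMillis, sleepMillis * 2);
            }
        }
        throw new RetriesExhaustedException(
                "gave up after " + maxAttempts + " attempts", last);
    }
}
```

With something along these lines, a caller of conn.createScanner() would see an exception once the attempts run out and could rebuild its connection instead of hanging. Whether a session-expired error should be retried at all, rather than handled by creating a new ZooKeeper session, is a separate design question that this sketch does not address.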