Luke Brassard created ACCUMULO-1449:
---------------------------------------

             Summary: Connector/ZooCache code enters infinite loop when Zookeeper connection lost.
                 Key: ACCUMULO-1449
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1449
             Project: Accumulo
          Issue Type: Bug
          Components: client
    Affects Versions: 1.5.1
         Environment: accumulo-1.5.0-RC4, zookeeper-3.4.5, hadoop-1.0.4, CentOS 6.4
            Reporter: Luke Brassard


While using 1.5.0-RC4, a long-lived {{Connector}} went into an infinite loop of 
ZooKeeper "ConnectionLoss" and "Session expired" failures. The application is 
multithreaded, with all threads sharing the same {{Connector}}, and every call 
to {{conn.createScanner()}} or {{conn.createBatchScanner()}} produced errors. 
Here are a couple of stack traces:

{code}
2013-05-22 09:12:28,250 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
        at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
        at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
        at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
        at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
        at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
        at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
        at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
        at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
        at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
        at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
        at org.apache.accumulo.core.client.impl.ConnectorImpl.createScanner(ConnectorImpl.java:137)
{code}    

{code}    
2013-05-22 09:12:23,849 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/5e982cc9-6959-4064-9712-2ff3dc1003d8
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
        at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
        at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:130)
        at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:233)
        at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:188)
        at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:151)
        at org.apache.accumulo.core.zookeeper.ZooUtil.getRoot(ZooUtil.java:24)
        at org.apache.accumulo.core.client.impl.Tables.getMap(Tables.java:46)
        at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:78)
        at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:64)
        at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:75)
        at org.apache.accumulo.core.client.impl.ConnectorImpl.createBatchScanner(ConnectorImpl.java:89)
{code}

The method {{ZooCache.retry(ZooRunnable op)}} (ZooCache.java:128) has a 
{{while(true)}} loop that should probably have a maximum retry count or a 
timeout, so that the method eventually throws an exception the client can 
handle appropriately. As written, the loop never exits as long as ZooKeeper 
keeps returning errors.
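
To illustrate the kind of bound suggested above, here is a rough sketch of a retry helper that gives up after a fixed number of attempts. This is not the actual {{ZooCache}} code: the names {{BoundedRetry}}, {{RetriesExhaustedException}}, {{maxAttempts}}, {{initialSleepMillis}} and {{maxSleepMillis}} are made up for the example, and a plain {{Callable}} stands in for {{ZooRunnable}}.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, NOT Accumulo's ZooCache: shows a retry loop that
// terminates instead of spinning forever in while(true).
final class BoundedRetry {

    /** Thrown once every attempt has failed; the last failure is the cause. */
    static class RetriesExhaustedException extends Exception {
        RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs op up to maxAttempts times, sleeping between failed attempts.
     * The sleep doubles after each failure and is capped at maxSleepMillis.
     */
    static <T> T retry(Callable<T> op, int maxAttempts,
                       long initialSleepMillis, long maxSleepMillis)
            throws RetriesExhaustedException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long sleep = initialSleepMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Let the caller decide what to do about interruption.
                throw e;
            } catch (Exception e) {
                // Remember the failure and fall through to the backoff below.
                last = e;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(sleep);
                sleep = Math.min(maxSleepMillis, sleep * 2);
            }
        }
        throw new RetriesExhaustedException(
                "gave up after " + maxAttempts + " attempts", last);
    }
}
```

With something along these lines, a caller of {{conn.createScanner()}} would eventually see an exception rather than hang. A real fix would likely also need to treat the two errors differently: "ConnectionLoss" can clear up once the client reconnects, whereas an expired session cannot be reused and needs a new ZooKeeper handle, so retrying on the same handle will not help in that case.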

Note: There may have been a network hiccup that triggered the bug, but there 
was no way to recover without restarting the application.
