I could not reproduce this issue locally, so I enabled trace logging and
re-tested on Amazon EC2. I now have more information.
The issue seems to be specific to java.net.UnknownHostException.
The first error I see is:
[java] 2013-06-25 19:59:47,883 ERROR c.n.c.f.i.CuratorFrameworkImpl [main] Background exception was not retry-able or retry gave up
[java] java.net.UnknownHostException: ec2-107-21-126-93.compute-1.amazonaws.com
[java] at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
[java] at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
[java] at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
[java] at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
[java] at java.net.InetAddress.getAllByName(InetAddress.java:1084)
[java] at java.net.InetAddress.getAllByName(InetAddress.java:1020)
[java] at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
[java] at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
[java] at com.netflix.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:27)
This error is not escalated to the application code, so when the application
tries performing an operation on the curator client, I get the following
logging in a loop:
BEGIN LOOP
[java] 2013-06-25 20:00:03,131 ERROR c.n.c.ConnectionState [main] Connection timed out for connection string (10.96.214.121:8090,ec2-107-21-126-93.compute-1.amazonaws.com:8090,10.112.81.128:8090) and timeout (15000) / elapsed (15272)
[java] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
...
[java] 2013-06-25 20:00:03,134 TRACE c.d.n.ZKClient$1 [main] addCount() connections-timed-out
[java] 2013-06-25 20:00:03,135 DEBUG c.n.c.RetryLoop [main] Retry-able exception received
[java] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
....
[java] 2013-06-25 20:04:47,518 TRACE c.d.n.ZKClient$1 [main] addCount() retries-allowed
[java] 2013-06-25 20:04:47,519 DEBUG c.n.c.RetryLoop [main] Retrying operation
END LOOP
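
Since the trace above shows the UnknownHostException being thrown while
StaticHostProvider resolves the hosts in the connect string, one possible
workaround is to drop unresolvable hosts before handing the string to Curator.
The following is only a sketch under that assumption, not a confirmed fix.
ConnectStringFilter, resolves and filterResolvable are hypothetical names of
mine and are not part of Curator or ZooKeeper. The sketch assumes plain
host:port entries (host names or IPv4 addresses, no chroot suffix).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper: not part of Curator or ZooKeeper.
public class ConnectStringFilter {

    /** Returns true if the host name (or IP literal) can be resolved right now. */
    public static boolean resolves(String host) {
        try {
            InetAddress.getAllByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    /**
     * Keeps only the host:port entries whose host passes the given check.
     * Assumes entries look like "host:port"; IPv6 literals and a trailing
     * chroot path are not handled in this sketch.
     */
    public static String filterResolvable(String connectString, Predicate<String> hostCheck) {
        List<String> kept = new ArrayList<>();
        for (String entry : connectString.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            String host = colon >= 0 ? trimmed.substring(0, colon) : trimmed;
            if (hostCheck.test(host)) {
                kept.add(trimmed);
            }
        }
        return String.join(",", kept);
    }

    public static void main(String[] args) {
        String connectString = "10.96.214.121:8090,"
                + "ec2-107-21-126-93.compute-1.amazonaws.com:8090,"
                + "10.112.81.128:8090";
        // Prints the connect string with any currently unresolvable hosts removed.
        System.out.println(filterResolvable(connectString, ConnectStringFilter::resolves));
    }
}
```

The filtered string would then be passed to CuratorFrameworkFactory.newClient
in place of the original connect string. The host check is passed in as a
parameter so the filtering can be exercised without real DNS lookups. Note
that a host dropped this way stays out of the client until the client is
rebuilt with a fresh connect string.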
On Jun 25, 2013, at 11:16 AM, Andy Grove <[email protected]> wrote:
> Hi,
>
> I'm using the following code to connect to my zookeeper instances:
>
> client = CuratorFrameworkFactory.newClient(connectString,
> sessionTimeout, connectTimeout, new ExponentialBackoffRetry(1000, 3));
>
> I have three hosts, let's call them host1, host2 and host3. If all hosts are
> running then everything works as expected.
>
> If host1 is down (server shut down) then all operations on the curator client
> fail and I see errors like this:
>
> ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (host1:8090,host2:8090,host3:8090) and timeout (15000) / elapsed (15310)
>
> No matter what order I specify the hosts in, I always get these
> errors and my operation eventually fails with:
>
> [java] Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> [java] at com.netflix.curator.ConnectionState.getZooKeeper(ConnectionState.java:101)
> [java] at com.netflix.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:107)
> [java] at com.netflix.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:445)
> [java] at com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:171)
> [java] at com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:160)
> [java] at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:106)
> [java] at com.netflix.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:156)
> [java] at com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:147)
> [java] at com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35)
> [java] at com.dbshards.nameserver.ZKClient.createPath(ZKClient.java:406)
>
> I would expect Curator/Zookeeper to try this operation with host2 or host3
> after an error connecting to host1 but this is not the case. I even have a
> retry loop in my code that tries the operation 10 times and it fails every
> time if host1 is in the connect string.
>
> I'm hoping I'm missing something obvious here. Any help would be appreciated.
>
> Thanks,
>
> Andy.
>
> --
> Andy Grove
> VP, R&D
> CodeFutures Corporation
>
> Share Nothing, Shard Everything!
> http://www.dbshards.com