After failing to reproduce this issue locally, I enabled trace logging, 
re-tested on Amazon EC2, and now have further information. 

The issue seems to be specific to java.net.UnknownHostException.

The first error I see is:

     [java]  2013-06-25 19:59:47,883 ERROR c.n.c.f.i.CuratorFrameworkImpl [main] Background exception was not retry-able or retry gave up
     [java]  java.net.UnknownHostException: ec2-107-21-126-93.compute-1.amazonaws.com
     [java]     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
     [java]     at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
     [java]     at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
     [java]     at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
     [java]     at java.net.InetAddress.getAllByName(InetAddress.java:1084)
     [java]     at java.net.InetAddress.getAllByName(InetAddress.java:1020)
     [java]     at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
     [java]     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
     [java]     at com.netflix.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:27)

This error is not escalated to the application code, so when the application 
tries to perform an operation on the Curator client, the following is logged 
in a loop:

BEGIN LOOP

[java]  2013-06-25 20:00:03,131 ERROR c.n.c.ConnectionState [main] Connection timed out for connection string (10.96.214.121:8090,ec2-107-21-126-93.compute-1.amazonaws.com:8090,10.112.81.128:8090) and timeout (15000) / elapsed (15272)
[java]  org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss

...

[java] 2013-06-25 20:00:03,134 TRACE c.d.n.ZKClient$1 [main] addCount() connections-timed-out
[java]  2013-06-25 20:00:03,135 DEBUG c.n.c.RetryLoop [main] Retry-able exception received
[java]  org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss

....

[java] 2013-06-25 20:04:47,518 TRACE c.d.n.ZKClient$1 [main] addCount() retries-allowed
[java]  2013-06-25 20:04:47,519 DEBUG c.n.c.RetryLoop [main] Retrying operation

END LOOP
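
Since the stack trace above shows the UnknownHostException being thrown 
from the StaticHostProvider constructor while it resolves the hosts in the 
connect string, one possible workaround is to drop unresolvable hosts from 
the connect string before handing it to Curator. The following is only a 
minimal sketch of that idea, not a confirmed fix. ConnectStringFilter and its 
methods are hypothetical names of my own, not part of Curator or ZooKeeper. 
It assumes host names or IPv4 addresses with an optional port, no chroot 
suffix, and Java 8 or later:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper (not part of Curator or ZooKeeper): removes entries
// whose host cannot be resolved from a comma-separated connect string.
final class ConnectStringFilter {
    private ConnectStringFilter() {}

    // Returns true if the JVM's name service can resolve the host.
    static boolean resolves(String host) {
        try {
            InetAddress.getAllByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Keeps only the host[:port] entries whose host passes the resolver check.
    // Assumes no IPv6 literals and no chroot path in the connect string.
    static String filter(String connectString, Predicate<String> resolver) {
        List<String> kept = new ArrayList<>();
        for (String entry : connectString.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            String host = colon >= 0 ? trimmed.substring(0, colon) : trimmed;
            if (resolver.test(host)) {
                kept.add(trimmed);
            }
        }
        return String.join(",", kept);
    }

    // Convenience overload that uses real DNS resolution.
    static String filter(String connectString) {
        return filter(connectString, ConnectStringFilter::resolves);
    }
}
```

The result of ConnectStringFilter.filter(connectString) could then be passed 
to CuratorFrameworkFactory.newClient in place of the raw connect string. The 
trade-off is that a host that is unresolvable at startup stays excluded until 
the client is rebuilt.
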

On Jun 25, 2013, at 11:16 AM, Andy Grove <[email protected]> wrote:

> Hi,
> 
> I'm using the following code to connect to my zookeeper instances:
> 
>             client = CuratorFrameworkFactory.newClient(connectString, sessionTimeout, connectTimeout, new ExponentialBackoffRetry(1000, 3));
> 
> I have three hosts, let's call them host1, host2 and host3. If all hosts are 
> running then everything works as expected.
> 
> If host1 is down (server shut down) then all operations on the curator client 
> fail and I see errors like this:
> 
> ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (host1:8090,host2:8090,host3:8090) and timeout (15000) / elapsed (15310)
> 
> It doesn't matter what order I specify the hosts in; I always get these 
> errors, and my operation eventually fails with:
> 
>      [java] Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>      [java]   at com.netflix.curator.ConnectionState.getZooKeeper(ConnectionState.java:101)
>      [java]   at com.netflix.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:107)
>      [java]   at com.netflix.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:445)
>      [java]   at com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:171)
>      [java]   at com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:160)
>      [java]   at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:106)
>      [java]   at com.netflix.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:156)
>      [java]   at com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:147)
>      [java]   at com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35)
>      [java]   at com.dbshards.nameserver.ZKClient.createPath(ZKClient.java:406)
> 
> I would expect Curator/ZooKeeper to try this operation with host2 or host3 
> after an error connecting to host1, but this is not the case. I even have a 
> retry loop in my code that tries the operation 10 times, and it fails every 
> time if host1 is in the connect string.
> 
> I'm hoping I'm missing something obvious here. Any help would be appreciated.
> 
> Thanks,
> 
> Andy.
> 
> --
> Andy Grove
> VP, R&D
> CodeFutures Corporation
> 
> Share Nothing, Shard Everything!
> http://www.dbshards.com
> 