[ https://issues.apache.org/jira/browse/CURATOR-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125861#comment-15125861 ]
Wang XiaoTian commented on CURATOR-293:
---------------------------------------

We can work around the issue by calling "client.getZookeeperClient().getZooKeeper()" periodically after receiving the "ConnectionState.LOST" event, and by using a handler thread pool to process the arriving state events concurrently so that event delivery is not blocked. "client.getZookeeperClient().getZooKeeper()" is clearly a thread-safe API.

Actually, the framework could do the same thing itself for the sake of fault tolerance instead of forcing the user to handle it: catch the exception and handle it appropriately rather than putting it in a background exception queue and ignoring it. By the way, I don't think "client.getZookeeperClient().getZooKeeper()" is a user-friendly public API.

Another issue concerns StaticHostProvider.java. It is implemented on top of InetAddress.java, which has an addressCache (see "https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/sun/net/InetAddressCachePolicy.java"). The addressCache caches resolved hostnames, and when an unresolved hostname is passed in, InetAddress first tries to resolve it by querying the address cache. I don't know why the previously resolved hostname is lost from the cache (perhaps because of the cache policy).

> Curator can NOT reconnect after connection lost and session expired when the
> connection come up while the DNS server is not ready yet.(zookeeper
> connection string using domain names)
> ---------------------------------------------------------------------------
>
>                 Key: CURATOR-293
>                 URL: https://issues.apache.org/jira/browse/CURATOR-293
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 2.9.1
>            Reporter: huanhuan li
>            Priority: Critical
>         Attachments: CuratorConnectionLostEventTest.java
>
>
> 1. Add the following lines to /etc/hosts:
>    x.x.x.x zk1.test.com
>    x.x.x.x zk2.test.com
>    x.x.x.x zk3.test.com
> 2. Run the test program
> 3. Shut down the network connection to x.x.x.x
> 4. Wait until the session expires (for example 10 min)
> 5. Remove the 3 added lines from /etc/hosts
> 6. Open the network connection to x.x.x.x
> 7. Observe that Curator cannot reconnect
> 8. Add the 3 lines back to /etc/hosts
> 9. Observe that Curator still cannot reconnect
>
> The log may look like the following:
> [main-SendThread(172.24.2.35:2181)][INFO ]2016-01-26 11:07:45.005 [ClientCnxn.logStartConnect] - Opening socket connection to server 172.24.2.35/172.24.2.35:2181. Will not attempt to authenticate using SASL (unknown error)
> [main-SendThread(172.24.2.35:2181)][INFO ]2016-01-26 11:07:45.050 [ClientCnxn.primeConnection] - Socket connection established to 172.24.2.35/172.24.2.35:2181, initiating session
> [main-EventThread][WARN ]2016-01-26 11:07:45.093 [ConnectionState.handleExpiredSession] - Session expired event received
> [main-EventThread][DEBUG]2016-01-26 11:07:45.093 [ConnectionState.reset] - reset
> [main-SendThread(172.24.2.35:2181)][INFO ]2016-01-26 11:07:45.093 [ClientCnxn.run] - Unable to reconnect to ZooKeeper service, session 0x1525d9593a537af has expired, closing socket connection
> [main-EventThread][INFO ]2016-01-26 11:07:45.095 [ZooKeeper.<init>] - Initiating client connection, connectString=zk1.test.com:2181,zk2.test.com:2181,zk3.test.com:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@7e7d611f
> [main-EventThread][INFO ]2016-01-26 11:07:45.488 [ClientCnxn.run] - EventThread shut down
> [main-SendThread(111.206.227.147:2181)][INFO ]2016-01-26 11:07:45.615 [ClientCnxn.logStartConnect] - Opening socket connection to server 111.206.227.147/111.206.227.147:2181. Will not attempt to authenticate using SASL (unknown error)
> [Curator-ConnectionStateManager-0][DEBUG]2016-01-26 11:07:58.523 [CuratorZookeeperClient.blockUntilConnectedOrTimedOut] - blockUntilConnectedOrTimedOut() end. isConnected: false

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
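
The workaround described in the comment above (keep calling "client.getZookeeperClient().getZooKeeper()" after a LOST event until it stops throwing) might be sketched roughly as follows. This is only an illustration under assumptions, not code from Curator or from the attached test: the class name LostStateProbe and its methods are hypothetical, the Curator call is passed in as a plain Callable so the sketch depends only on the JDK, and the Curator wiring shown in the trailing comment has not been compiled here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper (not part of Curator): after a LOST event, re-invoke a
 * probe on a separate thread at a fixed delay until it stops throwing.
 */
final class LostStateProbe implements AutoCloseable {
    private final Callable<?> probe;   // e.g. () -> client.getZookeeperClient().getZooKeeper()
    private final long period;
    private final TimeUnit unit;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private CompletableFuture<Void> pending; // guarded by "this"

    LostStateProbe(Callable<?> probe, long period, TimeUnit unit) {
        this.probe = probe;
        this.period = period;
        this.unit = unit;
    }

    /**
     * Intended to be called from the connection state listener when LOST
     * arrives. Returns a future that completes once the probe has succeeded.
     * Calling it again while a probe loop is still running reuses that loop.
     */
    synchronized CompletableFuture<Void> onLost() {
        if (pending != null && !pending.isDone()) {
            return pending;
        }
        CompletableFuture<Void> done = new CompletableFuture<>();
        ScheduledFuture<?> task = scheduler.scheduleWithFixedDelay(() -> {
            try {
                probe.call();          // may throw while DNS is still unavailable
                done.complete(null);   // success: stop retrying
            } catch (Exception e) {
                // Swallow and retry on the next tick. A real handler would log this.
            }
        }, 0, period, unit);
        // Cancel the periodic task as soon as one probe has succeeded.
        done.whenComplete((v, t) -> task.cancel(false));
        pending = done;
        return done;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}

// Assumed wiring with Curator (illustrative only, not compiled here):
//
//   LostStateProbe probe = new LostStateProbe(
//           () -> client.getZookeeperClient().getZooKeeper(), 5, TimeUnit.SECONDS);
//   client.getConnectionStateListenable().addListener((c, state) -> {
//       if (state == ConnectionState.LOST) {
//           probe.onLost();
//       }
//   }, handlerPool); // handlerPool: an Executor, so state events are handled off the event thread
```

Passing the probe in as a Callable keeps the retry loop separate from the Curator client, and running it on its own scheduler means the listener callback returns immediately instead of blocking delivery of later state events.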