[ https://issues.apache.org/jira/browse/ZOOKEEPER-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503871#comment-14503871 ]
Michi Mutsuzaki commented on ZOOKEEPER-1506:
--------------------------------------------

Thanks, Raul, for testing this. I'd try replacing calls to getHostName with getHostString. For example, I found another one in QuorumCnxManager.java:

org/apache/zookeeper/server/quorum/QuorumCnxManager.java:
    String addr = self.getElectionAddress().getHostName() + ":" + self.getElectionAddress().getPort();

> Re-try DNS hostname -> IP resolution if node connection fails
> -------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1506
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1506
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 3.4.5
>         Environment: Ubuntu 11.04 64-bit
>            Reporter: Mike Heffner
>            Assignee: Michi Mutsuzaki
>            Priority: Critical
>              Labels: patch
>             Fix For: 3.4.7, 3.5.1, 3.6.0
>
>         Attachments: ZOOKEEPER-1506-fix.patch, ZOOKEEPER-1506.patch,
> ZOOKEEPER-1506.patch, ZOOKEEPER-1506.patch, ZOOKEEPER-1506.patch,
> ZOOKEEPER-1506.patch, ZOOKEEPER-1506.patch, ZOOKEEPER-1506.patch,
> zk-dns-caching-refresh.patch
>
>
> In our zoo.cfg we use hostnames to identify the ZK servers that are part of
> an ensemble. These hostnames are configured with a low (<= 60s) TTL, and the
> IP address they map to can and does change. Our procedure for
> replacing/upgrading a ZK node is to boot an entirely new instance and remap
> the hostname to the new instance's IP address. Our expectation is that when
> the original ZK node is terminated/shut down, the remaining nodes in the
> ensemble would reconnect to the new instance.
> However, what we are noticing is that the remaining ZK nodes do not attempt
> to re-resolve the hostname->IP mapping for the new server. Once the original
> ZK node is terminated, the existing servers keep trying to contact it at the
> old IP address.
> It would be great if the ZK servers could try to re-resolve the hostname
> when attempting to connect to a lost ZK server, instead of caching the
> lookup indefinitely. Currently we must do a rolling restart of the ZK
> ensemble after swapping a node, which with three nodes means we
> periodically lose quorum.
> The exact method we are following is to boot new instances in EC2 and
> attach one of a set of three Elastic IP addresses to each. Outside EC2,
> this IP address remains the same and maps to whatever instance it is
> attached to. Inside EC2, the Elastic IP hostname has a TTL of about 45-60
> seconds and is remapped to the internal (10.x.y.z) address of the instance
> it is attached to. Therefore, in our case we would like ZK to pick up the
> new 10.x.y.z address that the Elastic IP hostname gets mapped to and
> reconnect appropriately.
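The two ideas in this thread (use getHostString instead of getHostName, and re-resolve the hostname when reconnecting) can be sketched in plain Java as below. This is a minimal illustration assuming Java 7+ (where InetSocketAddress.getHostString() exists); the class ReResolveSketch and the helpers formatAddress and reResolve are hypothetical names, not ZooKeeper's API, and the patches attached to the issue may do this differently.

```java
import java.net.InetSocketAddress;

public class ReResolveSketch {

    // Hypothetical helper: formats "host:port" from the name the address was
    // created with. getHostString() does not attempt a reverse DNS lookup,
    // whereas getHostName() may resolve an IP literal back to a name.
    static String formatAddress(InetSocketAddress addr) {
        return addr.getHostString() + ":" + addr.getPort();
    }

    // Hypothetical helper: builds a new InetSocketAddress from the configured
    // hostname. The InetSocketAddress(String, int) constructor attempts to
    // resolve the name again, so a remapped record can be picked up, subject
    // to the JVM's own lookup cache (networkaddress.cache.ttl). If resolution
    // fails, the result is flagged as unresolved instead of throwing.
    static InetSocketAddress reResolve(InetSocketAddress configured) {
        return new InetSocketAddress(configured.getHostString(), configured.getPort());
    }

    public static void main(String[] args) {
        // Hypothetical election address, as it might be read from zoo.cfg;
        // createUnresolved() performs no DNS lookup at all.
        InetSocketAddress configured = InetSocketAddress.createUnresolved("localhost", 3888);
        System.out.println(formatAddress(configured));

        // Hypothetical reconnect step: resolve the hostname again before
        // each connection attempt instead of reusing the startup result.
        InetSocketAddress fresh = reResolve(configured);
        if (fresh.isUnresolved()) {
            System.out.println("could not resolve " + fresh.getHostString());
        } else {
            System.out.println("connecting to "
                    + fresh.getAddress().getHostAddress() + ":" + fresh.getPort());
        }
    }
}
```

The design point is that the hostname, not the resolved IP, is kept as the source of truth; resolution happens at connect time, so a low-TTL record like the Elastic IP hostname described above could be followed without restarting the ensemble.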