[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371425#comment-16371425
 ] 

Andor Molnar commented on ZOOKEEPER-2982:
-----------------------------------------

[~eronwright]

I've tried this on localhost by adding fake DNS names to /etc/hosts like this:
{noformat}
127.0.0.1 one.andor.org
127.0.0.1 two.andor.org
#127.0.0.1 three.andor.org{noformat}
First, with all three entries commented out, I started the ZooKeeper nodes 
with the following server config:
{noformat}
server.1=one.andor.org:2222:2223
server.2=two.andor.org:3333:3334
server.3=three.andor.org:4444:4445
{noformat}
Nodes were unable to connect because of the following resolution error:
{noformat}
2018-02-21 14:33:25,509 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer$QuorumServer@172] - Failed to resolve address: two.andor.org
java.net.UnknownHostException: two.andor.org
at java.net.InetAddress.getAllByName0(InetAddress.java:1273)
at java.net.InetAddress.getAllByName(InetAddress.java:1185)
at java.net.InetAddress.getAllByName(InetAddress.java:1119)
at java.net.InetAddress.getByName(InetAddress.java:1069)
at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:170)
at org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:726)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:686)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:720)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:919)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1171)
{noformat}
Similar entries keep repeating in both server logs. As far as I can see, ZK 
calls recreateSocketAddresses() and re-resolves the address every time it 
tries to connect.

This is the case _without_ your patch.
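
To illustrate what "re-resolving on every connection attempt" means, here is a minimal sketch. It is not the actual ZooKeeper code: the class ReResolveSketch, the Peer holder, and the single-address recreateSocketAddress() method are hypothetical names made up for this example. The only real API it relies on is InetAddress.getByName(), which also appears in the stack trace above.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class ReResolveSketch {

    // Hypothetical holder for one peer; loosely inspired by the idea of
    // QuorumServer, not copied from it.
    static final class Peer {
        final String hostname;
        final int port;
        volatile InetSocketAddress addr;

        Peer(String hostname, int port) {
            this.hostname = hostname;
            this.port = port;
            // createUnresolved() performs no lookup, so startup cannot
            // fail just because the name is not resolvable yet.
            this.addr = InetSocketAddress.createUnresolved(hostname, port);
        }

        // Looks the hostname up again. On failure the previous address is
        // kept, so the caller can simply retry on the next attempt.
        boolean recreateSocketAddress() {
            try {
                InetAddress ip = InetAddress.getByName(hostname);
                addr = new InetSocketAddress(ip, port);
                return true;
            } catch (UnknownHostException e) {
                return false;
            }
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        Peer peer = new Peer(host, 3334);
        System.out.println("before: unresolved=" + peer.addr.isUnresolved());
        boolean ok = peer.recreateSocketAddress();
        System.out.println("after:  resolved=" + ok + " addr=" + peer.addr);
    }
}
```

Calling recreateSocketAddress() before each connection attempt means a name that becomes resolvable later is picked up on a subsequent retry, subject to whatever lookup caching the JVM itself applies.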

When I enabled the entries in /etc/hosts, the following error showed up in the 
logs:
{noformat}
2018-02-21 14:37:07,178 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@663] - Cannot open channel to 2 at election address two.andor.org/127.0.0.1:3334
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:580)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:641)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:692)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:720)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:919)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1171)
{noformat}
The error shows that DNS resolution was successful (127.0.0.1) and that the 
connection issue is a different one (Connection refused). That might be related 
to my silly test environment (the socket has not been created on the other 
side). The key takeaway is that [~abrahamfine] is probably right and the 
re-resolution happens properly.
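
For anyone who wants to separate the two failure modes outside of ZooKeeper, here is a small standalone probe. It is only a sketch: ConnectProbe, Result, and probe() are hypothetical names, and the 1000 ms timeout is an arbitrary assumption. It resolves the name first and only then attempts the TCP connect, so an UnknownHostException and a ConnectException end up in different branches.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

public class ConnectProbe {

    enum Result { CONNECTED, UNRESOLVED, CONNECT_FAILED, OTHER_IO_ERROR }

    static Result probe(String host, int port, int timeoutMs) {
        InetAddress ip;
        try {
            // Step 1: name resolution only.
            ip = InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            return Result.UNRESOLVED;
        }
        // Step 2: TCP connect to the already resolved address.
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(ip, port), timeoutMs);
            return Result.CONNECTED;
        } catch (ConnectException e) {
            // Typically "Connection refused": name resolved, nobody listening.
            return Result.CONNECT_FAILED;
        } catch (IOException e) {
            // Timeouts and other socket errors.
            return Result.OTHER_IO_ERROR;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "two.andor.org";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 3334;
        System.out.println(host + ":" + port + " -> " + probe(host, port, 1000));
    }
}
```

With the /etc/hosts entries enabled and no server listening on the election port, I would expect this to report CONNECT_FAILED rather than UNRESOLVED, which matches the second log above.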

I repeated the test with your patch too and the results are the same. No 
difference.

Maybe I'm missing something and the test might not be relevant at all, but at 
least it's a little bit confusing.

[~eronwright] Would you please attach logs from running the same ensemble 
_without_ your patch too?

> Re-try DNS hostname -> IP resolution
> ------------------------------------
>
>                 Key: ZOOKEEPER-2982
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0, 3.5.1, 3.5.3
>            Reporter: Eron Wright 
>            Priority: Blocker
>             Fix For: 3.5.4, 3.6.0
>
>         Attachments: fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix 
> haven't yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started 
> before all peer addresses are resolvable, that server may cache a negative 
> lookup result and forever fail to resolve the address. For example, 
> deploying ZK 3.5 to Kubernetes using a StatefulSet plus a Service (headless) 
> may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95] - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
>         at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not 
> resolvable when the server started, but became resolvable shortly thereafter. 
> The server should eventually succeed but doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
