Lander Visterin created ZOOKEEPER-3991:
------------------------------------------

             Summary: QuorumCnxManager Listener port bind retry does not retry DNS lookup
                 Key: ZOOKEEPER-3991
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3991
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.6.2
            Reporter: Lander Visterin
         Attachments: RecreateAddress.patch, repro.tar.gz

We run ZooKeeper in a container environment where DNS is not stable. As 
recommended by the documentation, we set _electionPortBindRetry_ to 0 (retry 
forever).

On some instances, the following exception repeats in an infinite loop, even 
though the address has since become resolvable:

 
{noformat}
zk-2_1  | 2020-11-03 10:57:08,407 [myid:3] - ERROR [ListenerHandler-zk-2.test:3888:QuorumCnxManager$Listener$ListenerHandler@1093] - Exception while listening
zk-2_1  | java.net.SocketException: Unresolved address
zk-2_1  |       at java.base/java.net.ServerSocket.bind(Unknown Source)
zk-2_1  |       at java.base/java.net.ServerSocket.bind(Unknown Source)
zk-2_1  |       at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.createNewServerSocket(QuorumCnxManager.java:1140)
zk-2_1  |       at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.acceptConnections(QuorumCnxManager.java:1064)
zk-2_1  |       at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.run(QuorumCnxManager.java:1033)
zk-2_1  |       at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
zk-2_1  |       at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
zk-2_1  |       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
zk-2_1  |       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
zk-2_1  |       at java.base/java.lang.Thread.run(Unknown Source)
{noformat}
ZooKeeper does not actually retry the DNS resolution; it keeps reusing the 
old, failed result on every bind attempt.

This happens because the InetSocketAddress is created only once, and the DNS 
lookup happens at the moment it is created.
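
To illustrate the failure mode without depending on a flaky DNS server, here is 
a minimal sketch. It is not ZooKeeper code: the class and method names are made 
up for this example, and InetSocketAddress.createUnresolved is used as a 
stand-in for an address whose lookup failed at construction time. Binding to 
the same stale object fails on every attempt, no matter what DNS says in the 
meantime.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical demo class, not part of ZooKeeper.
final class UnresolvedBindDemo {

    // Tries to bind a fresh ServerSocket to addr and reports the outcome.
    static String tryBind(InetSocketAddress addr) {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(addr);
            return "bound";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumption: createUnresolved simulates a lookup that failed when
        // the address object was built. The object never re-resolves itself.
        InetSocketAddress stale =
                InetSocketAddress.createUnresolved("zk-2.test", 3888);

        // Retrying the bind with the same object cannot succeed.
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + ": " + tryBind(stale));
        }
    }
}
```

Each attempt reports the same "Unresolved address" message seen in the log 
above, because ServerSocket.bind rejects an unresolved address before doing any 
network work.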

This issue has come up previously in 
https://issues.apache.org/jira/browse/ZOOKEEPER-1506 but it appears to still 
happen here.

I have attached a repro.tar.gz to help reproduce this issue. Steps:
 * Untar repro.tar.gz
 * docker-compose up
 * Observe that the exception keeps occurring for zk-2 but not for the others
 * Open db.test and uncomment the zk-2 line, increment the serial and save
 * Wait a few seconds for the DNS to refresh
 * Verify that you can resolve zk-2.test now (dig @172.16.60.2 zk-2.test) but 
the error keeps appearing

I have also attached a patch that resolves this. Every time it tries to create 
the server socket, the patched code retries DNS resolution if the address is 
still unresolved.
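
For readers without the attachment, the idea can be sketched roughly as 
follows. This is not the contents of RecreateAddress.patch: the class, the 
method names, and the sleep parameter are hypothetical, and the only assumption 
carried over from ZooKeeper is that a retry limit of 0 means "retry forever".

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical helper, not the actual ZooKeeper patch.
final class ReResolvingBinder {

    // Returns addr unchanged if it is already resolved. Otherwise builds a
    // new InetSocketAddress from the same host and port, which triggers a
    // fresh lookup.
    static InetSocketAddress refreshIfUnresolved(InetSocketAddress addr) {
        if (!addr.isUnresolved()) {
            return addr;
        }
        return new InetSocketAddress(addr.getHostString(), addr.getPort());
    }

    // Binds a server socket, re-resolving the address before each attempt.
    // maxRetries == 0 retries forever (assumed to mirror electionPortBindRetry).
    static ServerSocket bindWithRetry(InetSocketAddress addr, int maxRetries,
                                      long sleepMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 0; maxRetries == 0 || attempt < maxRetries; attempt++) {
            addr = refreshIfUnresolved(addr);
            ServerSocket socket = new ServerSocket();
            try {
                socket.bind(addr);
                return socket;
            } catch (IOException e) {
                last = e;
                socket.close();
                Thread.sleep(sleepMillis);
            }
        }
        throw last != null ? last : new IOException("no bind attempts made");
    }
}
```

The key point is that the lookup moves inside the retry loop, so a hostname 
that becomes resolvable later is picked up on the next attempt instead of 
failing forever.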

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
