[ https://issues.apache.org/jira/browse/ZOOKEEPER-3320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897209#comment-16897209 ]
Hudson commented on ZOOKEEPER-3320:
-----------------------------------

FAILURE: Integrated in Jenkins build ZooKeeper-trunk #638 (See [https://builds.apache.org/job/ZooKeeper-trunk/638/])
Revert "ZOOKEEPER-3320: Leader election port stop listen when hostname (andor: rev a89c0942e45bb16e5282eee9d3a56ebbddbaae15)
* (edit) zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md
* (edit) zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/CnxManagerTest.java
* (edit) zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java


> Leader election port stop listen when hostname unresolvable for some time
> --------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3320
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3320
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.10, 3.5.4
> Reporter: Igor Skokov
> Assignee: Igor Skokov
> Priority: Major
> Labels: pull-request-available
> Time Spent: 8h 50m
> Remaining Estimate: 0h
>
> When trying to run a ZooKeeper 3.5.4 cluster on Kubernetes, I found that in
> some circumstances a ZooKeeper node stops listening on the leader election port.
> This makes the ZK cluster unavailable.
> ZooKeeper is deployed as a Kubernetes StatefulSet and has the following
> dynamic configuration:
> {code:java}
> zookeeper-0.zookeeper:2182:2183:participant;2181
> zookeeper-1.zookeeper:2182:2183:participant;2181
> zookeeper-2.zookeeper:2182:2183:participant;2181
> {code}
> The bind address contains the DNS name that Kubernetes generates for each
> StatefulSet pod.
> These DNS names become resolvable after the container starts, but with some
> delay. That delay causes the leader election port listener in the
> QuorumCnxManager.Listener class to stop.
> The error happens in the QuorumCnxManager.Listener "run" method, which tries to
> bind the leader election port to a hostname that is not resolvable at that moment.
> The retry count is hard-coded to 3 (with a backoff of 1 sec).
> The ZooKeeper server log contains the following errors:
> {code:java}
> 2019-03-17 07:56:04,844 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1230] - Unexpected exception
> java.net.SocketException: Unresolved address
>     at java.base/java.net.ServerSocket.bind(ServerSocket.java:374)
>     at java.base/java.net.ServerSocket.bind(ServerSocket.java:335)
>     at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:241)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1023)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1226)
> 2019-03-17 07:56:04,844 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1261] - PeerState set to LOOKING
> 2019-03-17 07:56:04,845 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1136] - LOOKING
> 2019-03-17 07:56:04,845 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FastLeaderElection@893] - New election. My id = 1, proposed zxid=0x0
> 2019-03-17 07:56:04,846 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@687] - Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0xf (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2019-03-17 07:56:04,979 [myid:1] - INFO [zookeeper-0.zookeeper:2183:QuorumCnxManager$Listener@892] - Leaving listener
> 2019-03-17 07:56:04,979 [myid:1] - ERROR [zookeeper-0.zookeeper:2183:QuorumCnxManager$Listener@894] - As I'm leaving the listener thread, I won't be able to participate in leader election any longer: zookeeper-0.zookeeper:2183
> {code}
> This error happens on most nodes at cluster start, so ZooKeeper is unable to
> form a quorum. This leaves the cluster in an unusable state.
> As far as I can see, the error is present on branches 3.4 and 3.5.
> I think this error can be fixed by making the number of retries configurable
> (instead of the hard-coded value of 3).
> Another way to fix this is to remove the retry limit altogether. Currently, the
> ZK server only stops the leader election listener and continues to serve on
> other ports.
> Maybe, if leader election halts, we should abort the process.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
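
As a rough illustration of the configurable-retry idea from the issue description, the sketch below re-resolves the hostname on every bind attempt. It is not the ZooKeeper patch. The class name `ElectionBindRetry`, the method `bindWithRetry`, its parameters, and the convention that a non-positive `maxRetries` means "retry without limit" are all hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical helper, not part of ZooKeeper: binds a listener socket with a
// configurable retry count instead of a hard-coded one.
class ElectionBindRetry {

    /**
     * Tries to bind a server socket to host:port.
     * maxRetries <= 0 is treated as "retry without limit" (assumed convention).
     */
    static ServerSocket bindWithRetry(String host, int port, int maxRetries, long backoffMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; maxRetries <= 0 || attempt <= maxRetries; attempt++) {
            ServerSocket socket = new ServerSocket(); // created unbound
            try {
                // InetSocketAddress(String, int) attempts DNS resolution when it is
                // constructed, so a fresh instance per attempt re-queries the name.
                InetSocketAddress address = new InetSocketAddress(host, port);
                socket.bind(address); // throws SocketException if still unresolved
                return socket;
            } catch (IOException e) {
                last = e;
                socket.close();
                boolean moreAttempts = maxRetries <= 0 || attempt < maxRetries;
                if (moreAttempts) {
                    Thread.sleep(backoffMillis); // fixed backoff between attempts
                }
            }
        }
        throw new IOException(
                "Could not bind " + host + ":" + port + " after " + maxRetries + " attempts", last);
    }

    public static void main(String[] args) throws Exception {
        // Port 0 asks the OS for any free port; used here only for demonstration.
        try (ServerSocket socket = bindWithRetry("localhost", 0, 3, 1000L)) {
            System.out.println("Bound to " + socket.getLocalSocketAddress());
        }
    }
}
```

Under these assumptions, a call such as `bindWithRetry("zookeeper-0.zookeeper", 2183, 0, 1000L)` would keep retrying once per second until the pod's DNS name resolves, while a positive `maxRetries` keeps the retry bounded, as in the behavior described in the issue.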