[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797657#comment-16797657
 ] 

Brian Nixon commented on ZOOKEEPER-3320:
----------------------------------------

A configurable retry seems like a good idea to me. Either something like 
"election port bind time" or "dns unavailable time" if we want to be more 
general. Do you want to contribute a short diff?

This may also be related to ZOOKEEPER-2982 (or may not, making a note to check 
later).

> Leader election port stop listen when hostname unresolvable for some time 
> --------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3320
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3320
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.10, 3.5.4
>            Reporter: Igor Skokov
>            Priority: Major
>
> When running a ZooKeeper 3.5.4 cluster on Kubernetes, I found that in some 
> circumstances a ZooKeeper node stops listening on the leader election port. 
> This makes the ZK cluster unavailable. 
> ZooKeeper is deployed as a Kubernetes StatefulSet and has the following 
> dynamic configuration:
> {code:java}
> zookeeper-0.zookeeper:2182:2183:participant;2181
> zookeeper-1.zookeeper:2182:2183:participant;2181
> zookeeper-2.zookeeper:2182:2183:participant;2181
> {code}
> The bind address contains the DNS name that Kubernetes generates for each 
> StatefulSet pod.
> These DNS names become resolvable only some time after the container starts. 
> That delay causes the leader election port listener in the 
> QuorumCnxManager.Listener class to stop.
> The error happens in the QuorumCnxManager.Listener "run" method, which tries 
> to bind the leader election port to a hostname that is not yet resolvable. 
> The retry count is hard-coded to 3 (with a backoff of 1 sec). 
> The ZooKeeper server log contains the following errors:
> {code:java}
> 2019-03-17 07:56:04,844 [myid:1] - WARN  
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1230] - 
> Unexpected exception
> java.net.SocketException: Unresolved address
>       at java.base/java.net.ServerSocket.bind(ServerSocket.java:374)
>       at java.base/java.net.ServerSocket.bind(ServerSocket.java:335)
>       at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:241)
>       at 
> org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:1023)
>       at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1226)
> 2019-03-17 07:56:04,844 [myid:1] - WARN  
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1261] - 
> PeerState set to LOOKING
> 2019-03-17 07:56:04,845 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1136] - 
> LOOKING
> 2019-03-17 07:56:04,845 [myid:1] - INFO  
> [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FastLeaderElection@893]
>  - New election. My id =  1, proposed zxid=0x0
> 2019-03-17 07:56:04,846 [myid:1] - INFO  
> [WorkerReceiver[myid=1]:FastLeaderElection@687] - Notification: 2 (message 
> format version), 1 (n.leader), 0x0 (n.zxid), 0xf (n.round), LOOKING 
> (n.state), 1 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2019-03-17 07:56:04,979 [myid:1] - INFO  
> [zookeeper-0.zookeeper:2183:QuorumCnxManager$Listener@892] - Leaving listener
> 2019-03-17 07:56:04,979 [myid:1] - ERROR 
> [zookeeper-0.zookeeper:2183:QuorumCnxManager$Listener@894] - As I'm leaving 
> the listener thread, I won't be able to participate in leader election any 
> longer: zookeeper-0.zookeeper:2183
> {code}
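
The "Unresolved address" in the trace above can be reproduced without a
cluster. The following is a minimal sketch, not ZooKeeper code: the class and
method names are made up for illustration, and
InetSocketAddress.createUnresolved stands in for a pod hostname that DNS does
not know yet.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical helper for illustration only; not part of ZooKeeper.
final class UnresolvedBindProbe {

    /**
     * Tries to bind a server socket to the given address.
     * Returns null on success, or the exception message on failure.
     */
    static String tryBind(InetSocketAddress addr) {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(addr);
            return null;
        } catch (IOException e) {
            // An unresolved address is rejected with a SocketException,
            // which is a subclass of IOException.
            return e.getMessage();
        }
    }

    /** Builds an address that skips the DNS lookup, so it is always unresolved. */
    static InetSocketAddress pendingAddress(String host, int port) {
        return InetSocketAddress.createUnresolved(host, port);
    }
}
```

Calling UnresolvedBindProbe.tryBind(UnresolvedBindProbe.pendingAddress(
"zookeeper-0.zookeeper", 2183)) should return an error message rather than
bind, because the address is rejected before any socket is opened on the port.
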
> This error happens on most nodes at cluster start, and ZooKeeper is unable 
> to form a quorum. This leaves the cluster in an unusable state.
> As far as I can see, the error is present on branches 3.4 and 3.5. 
> I think this error can be fixed by making the number of retries configurable 
> (instead of the hard-coded value of 3). 
> Another way to fix this is to remove the retry limit altogether. Currently, 
> the ZK server only stops the leader election listener and continues to serve 
> on other ports. Maybe, if leader election halts, we should abort the process.
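
One way the configurable retry could look is sketched below. This is only an
illustration under assumptions, not the ZooKeeper implementation: the class,
the method, and the property name zookeeper.electionPortBindRetry are assumed
here, as is the convention that 0 means "retry without limit". The default of
3 mirrors the hard-coded value reported in the issue.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical sketch of a configurable bind retry; not ZooKeeper code.
final class ElectionBindRetry {

    // Assumed property name, used here only for illustration.
    static final String RETRY_PROPERTY = "zookeeper.electionPortBindRetry";

    /** Reads the retry limit from a system property, defaulting to 3. */
    static int configuredRetries() {
        return Integer.getInteger(RETRY_PROPERTY, 3);
    }

    /**
     * Tries to bind host:port up to maxRetries times, sleeping backoffMs
     * between attempts. maxRetries == 0 means retry until the bind succeeds.
     */
    static ServerSocket bindWithRetry(String host, int port, int maxRetries, long backoffMs)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; maxRetries == 0 || attempt <= maxRetries; attempt++) {
            if (attempt > 1) {
                Thread.sleep(backoffMs); // back off before every retry
            }
            ServerSocket socket = new ServerSocket();
            try {
                // A new InetSocketAddress triggers a new hostname lookup
                // (subject to the JVM's DNS cache settings).
                socket.bind(new InetSocketAddress(host, port));
                return socket;
            } catch (IOException e) {
                last = e;
                socket.close(); // do not leak the unbound socket
            }
        }
        throw last != null ? last : new IOException("no bind attempts were made");
    }
}
```

A listener could then call ElectionBindRetry.bindWithRetry(host, port,
ElectionBindRetry.configuredRetries(), 1000) and decide what to do if the
final attempt still fails, for example abort the process as suggested above.
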



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
