Hi, We have a 5 node ZK-3.4.9-cluster running in DC1 and a couple of observers across WAN in DC2. There was a network outage between DC1 and DC2 for about 1 minute after which the observers took almost 15 minutes to connect to the leader in DC1. How can I ensure that observer recovers faster? My expectation was that it would reconnect once the network healed.
zookeeper-leader.log ---------------------- zookeeper.log.3:2017-02-02 05:37:22,045 - ERROR [LearnerHandler-/10.110.11.16:55076:LearnerHandler@631] - Unexpected exception causing shutdown while sock still open zookeeper.log.3:2017-02-02 05:37:22,045 - WARN [LearnerHandler-/10.110.11.16:55076:LearnerHandler@644] - ******* GOODBYE / 10.110.11.16:55076 ******** zookeeper.log.3:2017-02-02 05:54:05,069 - INFO [zk-02/10.100.104.82:3888 :QuorumCnxManager$Listener@541] - Received connection request / 10.110.11.16:34573 zookeeper.log.3:2017-02-02 05:54:05,249 - INFO [LearnerHandler-/10.110.11.16:40257:LearnerHandler@329] - Follower sid: 17 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@35bc2f85 zookeeper.log.3:2017-02-02 05:54:05,320 - INFO [LearnerHandler-/10.110.11.16:40257:LearnerHandler@384] - Synchronizing with Follower sid: 17 maxCommittedLog=0x104c66a22 minCommittedLog=0x104c6682e peerLastZxid=0x100919196 zookeeper.log.3:2017-02-02 05:54:05,320 - WARN [LearnerHandler-/10.110.11.16:40257:LearnerHandler@451] - Unhandled proposal scenario zookeeper.log.3:2017-02-02 05:54:05,321 - INFO [LearnerHandler-/10.110.11.16:40257:LearnerHandler@458] - Sending SNAP zookeeper.log.3:2017-02-02 05:54:05,321 - INFO [LearnerHandler-/10.110.11.16:40257:LearnerHandler@482] - Sending snapshot last zxid of peer is 0x100919196 zxid of leader is 0x104c66a22sent zxid of db as 0x104c66a22 zookeeper.log.3:2017-02-02 05:54:06,208 - INFO [LearnerHandler-/10.110.11.16:40257:LearnerHandler@518] - Received NEWLEADER-ACK message from 17 zookeeper-observer.log ---------------------- zookeeper.log.2:2017-02-02 05:37:22,637 - WARN [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:Observer@77] - Exception when observing the leader zookeeper.log.2: at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:833) zookeeper.log.2:2017-02-02 05:37:22,638 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:Observer@135] - shutdown called zookeeper.log.2: at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:837) zookeeper.log.2:2017-02-02 05:37:22,672 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@497] - shutting down zookeeper.log.2:2017-02-02 05:37:22,672 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:ObserverRequestProcessor@121] - Shutting down zookeeper.log.2:2017-02-02 05:37:22,672 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down zookeeper.log.2:2017-02-02 05:37:22,672 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] - shutdown of request processor complete zookeeper.log.2:2017-02-02 05:37:22,672 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING zookeeper.log.2:2017-02-02 05:37:22,673 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 17, proposed zxid=0x8000000000000000 zookeeper.log.2:2017-02-02 05:37:22,874 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400 zookeeper.log.2:2017-02-02 05:37:23,274 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800 zookeeper.log.2:2017-02-02 05:37:24,075 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600 zookeeper.log.2:2017-02-02 05:37:25,675 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 3200 zookeeper.log.2:2017-02-02 05:37:28,876 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 6400 zookeeper.log.2:2017-02-02 05:37:35,276 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 12800 zookeeper.log.2:2017-02-02 05:37:48,076 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 25600 zookeeper.log.2:2017-02-02 05:38:13,677 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 51200 zookeeper.log.2:2017-02-02 05:39:04,877 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.2:2017-02-02 05:40:04,878 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.2:2017-02-02 05:41:04,878 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.2:2017-02-02 05:42:04,879 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.2:2017-02-02 05:43:04,879 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.2:2017-02-02 05:44:04,879 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.1:2017-02-02 05:45:04,880 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.1:2017-02-02 05:46:04,880 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.1:2017-02-02 05:47:04,881 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.1:2017-02-02 05:48:04,881 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.1:2017-02-02 05:49:04,881 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.1:2017-02-02 05:50:04,882 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.1:2017-02-02 05:51:04,882 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.1:2017-02-02 05:52:04,882 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.1:2017-02-02 05:53:04,883 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.1:2017-02-02 05:54:04,883 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 60000 zookeeper.log.1:2017-02-02 05:54:05,103 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:QuorumPeer@831] - OBSERVING zookeeper.log.1:2017-02-02 05:54:05,103 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000 datadir /var/lib/zookeeper/data/version-2 snapdir /var/lib/zookeeper/data/version-2 zookeeper.log.1:2017-02-02 05:54:05,103 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:ObserverZooKeeperServer@64] - syncEnabled =false zookeeper.log.1:2017-02-02 05:54:05,108 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] - Resolved hostname: zk-02 to address: zk-02/10.100.104.82 zookeeper.log.1:2017-02-02 05:54:05,108 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:Observer@65] - Observing zk-02/ 10.100.104.82:2888 zookeeper.log.1:2017-02-02 05:54:05,182 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot /var/lib/zookeeper/data/version-2/snapshot.100919196 zookeeper.log.1:2017-02-02 05:54:05,353 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:Learner@329] - Getting a snapshot from leader zookeeper.log.1:2017-02-02 05:54:06,109 - INFO [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] - Snapshotting: 0x104c66a22 to /var/lib/zookeeper/data/version-2/snapshot.104c66a22 zookeeper.log.1:2017-02-02 05:54:06,169 - WARN [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:Learner@408] - Got zxid 0x104c66a23 expected 0x1 Should I have some monitor process on the observer that checks if "Zookeeper is not serving requests" and restart it? Regards, Swami
