[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mate Szalay-Beko updated ZOOKEEPER-3814:
----------------------------------------
    Summary: ZooKeeper config propagates even with disabled dynamic reconfig  (was: ZooKeeper caching of config)

> ZooKeeper config propagates even with disabled dynamic reconfig
> ---------------------------------------------------------------
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.5.6
> Reporter: Rajkiran Sura
> Assignee: Mate Szalay-Beko
> Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6.
> Encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the
> *FOO*.bar.com domain name due to Kerberos referral issues. We also used a
> different server identifier, i.e., *23*, when we migrated.
> So, here is what the new config looked like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And
> the migrated node joined the quorum successfully and was serving all clients
> directly connected to it, without any issues.
> Recently, when a leader election happened,
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as Leader (as it
> has the highest ID). But then ZooKeeper was unable to serve any clients, and
> *all* the servers were _somehow still_ trying to establish a channel to 22
> (old DNS name: node5.bar.com) and were throwing the error below in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
> {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
> {{ at java.base/java.net.Socket.connect(Socket.java:591)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:650)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:714)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {quote}
> Fetching the config from the live ZooKeeper znode also doesn't show "*22*" as
> a member of the ensemble. It's not clear how "22" is still coming into the
> picture.
> {quote}In [4]: zk.get('/zookeeper/config')
> Out[4]:
> ('server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181\n
> version=0',
> ZnodeStat(czxid=0, mzxid=0, ctime=0, mtime=1588399290245, version=-1,
> cversion=0, aversion=-1, ephemeralOwner=0, dataLength=360, numChildren=0,
> pzxid=0))
> {quote}
> We suspected some weird caching issue and restarted ZooKeeper across all the
> nodes, but that didn't help. So, whenever node5 becomes the Leader, ID:22
> pops up again. We even rebooted node5, and that hasn't helped either.
> We also looked at the '/zookeeper/config' content from snapshot files and did
> not find any reference to ID:22.
> Any help would be greatly appreciated.
> NOTE: dynamic config is disabled.
> Thanks,
> Rajkiran

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
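
The comparison the report describes (old static config versus the membership stored in '/zookeeper/config') can be checked mechanically. The following is a minimal sketch, not part of the original report: the helper names parse_servers and diff_servers are hypothetical, the two config strings are copied from the report, and the regular expression assumes hostname-style addresses (no bracketed IPv6 literals).

```python
import re

# Matches "server.<id>=<host>:<quorum port>:<election port>" and ignores any
# trailing role or client address, so both the static zoo.cfg form and the
# form stored in /zookeeper/config parse the same way.
# Assumption: hosts are plain names or IPv4 addresses, not IPv6 literals.
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=\s*([^:\s]+):(\d+):(\d+)")


def parse_servers(config_text):
    """Return {server id: (host, quorum port, election port)}."""
    servers = {}
    for line in config_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            sid, host, quorum_port, election_port = match.groups()
            servers[int(sid)] = (host, int(quorum_port), int(election_port))
    return servers


def diff_servers(old, new):
    """Return (removed ids, added ids, ids whose address changed)."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(sid for sid in set(old) & set(new) if old[sid] != new[sid])
    return removed, added, changed


# Server lines from the pre-migration zoo.cfg quoted in the report.
OLD_STATIC = """\
server.17=node1.foo.bar.com:2888:3888;2181
server.19=node2.foo.bar.com:2888:3888;2181
server.20=node3.foo.bar.com:2888:3888;2181
server.21=node4.foo.bar.com:2888:3888;2181
server.22=node5.bar.com:2888:3888;2181
"""

# Content of /zookeeper/config as returned by zk.get() in the report.
LIVE_ZNODE = """\
server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.23=node5.foo.bar.com:2888:3888:participant;0.0.0.0:2181
version=0
"""

if __name__ == "__main__":
    removed, added, changed = diff_servers(
        parse_servers(OLD_STATIC), parse_servers(LIVE_ZNODE)
    )
    print("removed:", removed, "added:", added, "changed:", changed)
```

For the two configs quoted in the report, this lists server 22 as removed and server 23 as added, which matches the reporter's observation that the live znode no longer references ID 22. It only compares text and says nothing about where the servers still pick up the old entry.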