[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111298#comment-17111298 ]

Rajkiran Sura commented on ZOOKEEPER-3814:
------------------------------------------

> I created a PR ([https://github.com/apache/zookeeper/pull/1356])

That's great!

> Could you please share the sequence of steps you were executing when you saw
> the original issue?

I used the exact sequence of steps that you described in the above comment. Just one minor correction to the last step: we just restarted the service on server.6, as it had already been started with the new config.

Regards,
Rajkiran

> ZooKeeper caching of config
> ---------------------------
>
>                 Key: ZOOKEEPER-3814
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.5.6
>            Reporter: Rajkiran Sura
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6.
> We encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the
> *FOO*.bar.com domain name due to Kerberos referral issues. We also used a
> different server identifier, i.e., *23*, when we migrated. So, here is what the
> new config looked like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.foo.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config, and
> the migrated node joined the quorum successfully and was serving all clients
> directly connected to it, without any issues.
> Recently, when a leader election happened,
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as Leader (as it has
> the highest ID). But then, ZooKeeper was unable to serve any clients and *all*
> the servers were _somehow still_ trying to establish a channel to 22 (old DNS
> name: node5.bar.com) and were throwing the error below in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
> {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
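
To make the membership change in the issue description concrete, here is a minimal Python sketch that diffs the `server.N` entries of the old and new static configs quoted above. The helper names `parse_members` and `diff_members` are hypothetical and are not part of ZooKeeper; the sketch only illustrates which server id the ensemble should have forgotten, it does not reproduce the bug.

```python
import re

# server.N lines copied from the issue description (old and new static config).
OLD_CONFIG = """\
server.17=node1.foo.bar.com:2888:3888;2181
server.19=node2.foo.bar.com:2888:3888;2181
server.20=node3.foo.bar.com:2888:3888;2181
server.21=node4.foo.bar.com:2888:3888;2181
server.22=node5.bar.com:2888:3888;2181
"""

NEW_CONFIG = """\
server.17=node1.foo.bar.com:2888:3888;2181
server.19=node2.foo.bar.com:2888:3888;2181
server.20=node3.foo.bar.com:2888:3888;2181
server.21=node4.foo.bar.com:2888:3888;2181
server.23=node5.foo.bar.com:2888:3888;2181
"""

# Matches "server.<id>=<host>:<quorum port>:<election port>".
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:]+):(\d+):(\d+)")


def parse_members(config_text):
    """Return {server id: hostname} for every server.N line (hypothetical helper)."""
    members = {}
    for line in config_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            members[int(match.group(1))] = match.group(2)
    return members


def diff_members(old, new):
    """Return (removed, added) membership between two parsed configs."""
    removed = {sid: host for sid, host in old.items() if sid not in new}
    added = {sid: host for sid, host in new.items() if sid not in old}
    return removed, added


removed, added = diff_members(parse_members(OLD_CONFIG), parse_members(NEW_CONFIG))
print("removed:", removed)
print("added:", added)
```

Server id 22 appears only in the old config, which is exactly the id the log lines above still try to reach.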
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111006#comment-17111006 ]

Mate Szalay-Beko commented on ZOOKEEPER-3814:
---------------------------------------------

[~shralex] thanks for your insights!

bq. Another change where a config could change is during the gossip happening in leader election - servers send around their configs peer-to-peer, and update their config to a later one if they see one (FastLeaderElection.java, look for processReconfig). There too, you could require that the reconfigEnable flag is on before calling processReconfig.

I checked this part, and I think the reconfigEnable flag is already validated in the QuorumPeer.processReconfig function, so the changes will not propagate this way (as far as I understand).
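
The guard discussed in this comment can be sketched as a toy Python model. This is not ZooKeeper code: the `Config` and `Peer` classes, the `on_election_message` method and the version numbers are all hypothetical, and the only idea taken from the thread is that a config received during leader-election gossip is adopted through a `processReconfig`-style call that first checks the reconfig flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical quorum config: a version plus the set of member ids."""
    version: int
    members: frozenset


class Peer:
    """Toy peer; stands in for a quorum peer only for illustration."""

    def __init__(self, config, reconfig_enabled):
        self.config = config
        self.reconfig_enabled = reconfig_enabled

    def process_reconfig(self, candidate):
        # The flag check lives here, so every caller is covered by it.
        if not self.reconfig_enabled:
            return False
        # Only move forward to a strictly later config.
        if candidate.version <= self.config.version:
            return False
        self.config = candidate
        return True

    def on_election_message(self, sender_config):
        # Gossip path: a peer sees another peer's config during election.
        return self.process_reconfig(sender_config)


old = Config(version=1, members=frozenset({17, 19, 20, 21, 22}))
new = Config(version=2, members=frozenset({17, 19, 20, 21, 23}))

guarded = Peer(old, reconfig_enabled=False)
open_peer = Peer(old, reconfig_enabled=True)

print(guarded.on_election_message(new), guarded.config.version)
print(open_peer.on_election_message(new), open_peer.config.version)
```

With the flag off, the gossiped config is ignored and the peer keeps the config it started with; with the flag on, the later version is adopted.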
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110998#comment-17110998 ]

Mate Szalay-Beko commented on ZOOKEEPER-3814:
---------------------------------------------

[~rajsura] I created a PR (https://github.com/apache/zookeeper/pull/1356) for ZOOKEEPER-3829, and with this patch I was able to perform the following steps:
* have server.1, server.2, server.3, server.4, server.5 up and running
* stop server.5
* stop server.1
* start server.1 with the new config (removing server.5, adding server.6 with the new hostname)
* stop server.2
* start server.2 with the new config
* stop server.3
* start server.3 with the new config
* stop server.4
* start server.4 with the new config
* start server.6 with the new config (but re-using the data folder of server.5)

During these steps the cluster was up and running and always had at least 3 members. At the end I checked the log files of server.6 and did not see any attempt to connect to server.5.

Could you please verify that this is the sequence of steps you executed?
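
The claim that the cluster "always had at least 3 members" during the steps above can be checked with a small Python sketch. It is a simplification: it only counts running server processes after each step and does not model whether the running members agree on the config. The `simulate` helper is hypothetical.

```python
# The stop/start sequence from the comment above, as (action, server id) pairs.
STEPS = [
    ("stop", 5),
    ("stop", 1), ("start", 1),
    ("stop", 2), ("start", 2),
    ("stop", 3), ("start", 3),
    ("stop", 4), ("start", 4),
    ("start", 6),
]

# Majority of a 5-member ensemble.
QUORUM = 5 // 2 + 1


def simulate(initial, steps):
    """Apply the steps and record how many servers are running after each one."""
    running = set(initial)
    sizes = []
    for action, server_id in steps:
        if action == "stop":
            running.discard(server_id)
        else:
            running.add(server_id)
        sizes.append(len(running))
    return running, sizes


running, sizes = simulate({1, 2, 3, 4, 5}, STEPS)
print("running at the end:", sorted(running))
print("members after each step:", sizes)
print("quorum kept throughout:", all(n >= QUORUM for n in sizes))
```

The count dips to 3 each time one of the original servers is stopped and returns to 4 once it restarts, so a majority of 5 is preserved at every step.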
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110946#comment-17110946 ]

Mate Szalay-Beko commented on ZOOKEEPER-3814:
---------------------------------------------

Two other rolling-restart backward-compatibility issues were raised recently. I think we should solve them in a single fix. I will mark this ticket as a duplicate of ZOOKEEPER-3829 and push a PR there soon. That PR will solve all three issues and will also contain some unit tests to verify these rolling-restart scenarios.
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108866#comment-17108866 ]

Alexander Shraer commented on ZOOKEEPER-3814:
---------------------------------------------

[~symat] the way this works is that usually, config changes happen in two rounds - a proposal sets {{lastSeenQuorumVerifier}}, which writes the .next file, but then a commit calls processReconfig, which calls setQuorumVerifier. The same happens when a learner syncs with the leader - the leader's proposal is now NEW_LEADER and the leader's commit is UPTODATE. The commit / UPTODATE is the thing actually changing the config, not {{lastSeenQuorumVerifier}} (though writing out .next files should also be prevented in this case, I think).

Another change where a config could change is during the gossip happening in leader election - servers send around their configs peer-to-peer, and update their config to a later one if they see one (FastLeaderElection.java, look for processReconfig). There too, you could require that the reconfigEnable flag is on before calling processReconfig.
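
The two-round flow described in this comment can be illustrated with a toy Python model. It is a sketch under stated assumptions, not ZooKeeper's implementation: the `PeerModel` class and its attribute names are hypothetical stand-ins for `lastSeenQuorumVerifier`, the `.next` file and the active quorum verifier, and clearing the `.next` stand-in on commit is a simplification of this model rather than something stated in the thread.

```python
class PeerModel:
    """Toy model of the proposal/commit rounds; all names are hypothetical."""

    def __init__(self, config):
        self.active_config = config      # stands in for the active quorum verifier
        self.last_seen_config = config   # stands in for lastSeenQuorumVerifier
        self.next_file = None            # stands in for the .next file on disk

    def on_proposal(self, new_config):
        # Round 1 (proposal / NEW_LEADER): remember the config and persist it
        # as "next", but do not switch to it yet.
        self.last_seen_config = new_config
        self.next_file = new_config

    def on_commit(self):
        # Round 2 (commit / UPTODATE): this is the step that actually
        # changes the active config.
        self.active_config = self.last_seen_config
        self.next_file = None  # assumption of this sketch only


old_members = (17, 19, 20, 21, 22)
new_members = (17, 19, 20, 21, 23)

peer = PeerModel(old_members)

peer.on_proposal(new_members)
after_proposal = peer.active_config   # still the old membership

peer.on_commit()
after_commit = peer.active_config     # now the new membership

print("after proposal:", after_proposal)
print("after commit:  ", after_commit)
```

The model shows why the commit round is the place to gate: a proposal alone leaves the active membership untouched, although it has already written its "next" state.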
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108856#comment-17108856 ]

Rajkiran Sura commented on ZOOKEEPER-3814:
------------------------------------------

{quote}but in the meanwhile I recommend using dynamic reconfig to change the quorum.
{quote}
Yes, we have started to rely on dynamic reconfig. But I would like to note that dynamic reconfig isn't really dynamic when you have quorum auth enabled with GSSAPI via SASL, i.e., the config is changed but the new member doesn't join the ensemble until all the members are restarted. Thus, it is no longer dynamic. That looks even scarier.

FTR: I have raised https://issues.apache.org/jira/browse/ZOOKEEPER-3824 for this issue.

Thanks Mate.

Regards,
Rajkiran
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108373#comment-17108373 ] Mate Szalay-Beko commented on ZOOKEEPER-3814: - unfortunately I haven't found any trivial fixes yet. I will try more approaches next week, but in the meanwhile I recommend using dynamic reconfig to change the quorum. > ZooKeeper caching of config > --- > > Key: ZOOKEEPER-3814 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection, quorum, server >Affects Versions: 3.5.6 >Reporter: Rajkiran Sura >Assignee: Mate Szalay-Beko >Priority: Major > > Hello, > We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. > Encountered no issues as such. > This is how the ZooKeeper config looks like: > {quote}tickTime=2000 > dataDir=/zookeeper-data/ > initLimit=5 > syncLimit=2 > maxClientCnxns=2048 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > 4lw.commands.whitelist=stat, ruok, conf, isro, mntr > authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider > requireClientAuthScheme=sasl > quorum.cnxn.threads.size=20 > quorum.auth.enableSasl=true > quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST > quorum.auth.learnerRequireSasl=true > quorum.auth.learner.saslLoginContext=QuorumLearner > quorum.auth.serverRequireSasl=true > quorum.auth.server.saslLoginContext=QuorumServer > server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > server.22=node5.bar.com:2888:3888;2181 > {quote} > Post upgrade, we had to migrate server.22 on the same node, but with > *FOO*.bar.com domain name due to kerberos referral issues. And, we used > different server-identifier, i.e., *23* when we migrated. 
So, here is how the > new config looked like: > {quote}server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181* > {quote} > We restarted all the nodes in the ensemble with the above updated config. And > the migrated node joined the quorum successfully and was serving all clients > directly connected to it, without any issues. > Recently, when a leader election happened, > server.*23*=node5.foo.bar.com(migrated node) was chosen as Leader (as it has > highest ID). But then, ZooKeeper was unable to serve any clients and *all* > the servers were _somehow still_ trying to establish a channel to 22 (old DNS > name: node5.bar.com) and were throwing below error in a loop: > {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN > [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve > address: node4.bar.com}} > {{java.net.UnknownHostException: node5.bar.com: Name or service not known}} > {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}} > {{ at > java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}} > {{ at > java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}} > {{ at > java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}} > {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}} > {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}} > {{ at > 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}} > {{ at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}} > {{ at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}} > {{ at java.base/java.lang.Thread.run(Thread.java:834)}} > {{2020-05-02 01:43:03,026 [myid:23] - WARN > [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at > election address node5.bar.com:3888}} > {{java.net.UnknownHostException: node5.bar.com}} > {{ at > java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}} > {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}} > {{ at
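The dynamic reconfig recommendation above can be made concrete with a small sketch. It compares the old and new `server.N` lists from the issue and prints the incremental `reconfig` command one would type into `zkCli.sh`. The helper names (`parse_servers`, `reconfig_command`) are hypothetical, and the `-remove` / `-add` syntax reflects my understanding of the ZooKeeper 3.5 dynamic reconfiguration guide; it also assumes `reconfigEnabled=true` and suitable permissions on the ensemble, which the reporter's setup did not have.

```python
import re

# Matches "server.<id>=<address spec>" lines as they appear in zoo.cfg.
SERVER_LINE = re.compile(r"^server\.(\d+)=(.+)$")


def parse_servers(config_text):
    """Return {server_id: address_spec} for every server.N line."""
    servers = {}
    for line in config_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            servers[int(match.group(1))] = match.group(2).strip()
    return servers


def reconfig_command(old_text, new_text):
    """Build an incremental reconfig command string (hypothetical helper)."""
    old, new = parse_servers(old_text), parse_servers(new_text)
    # IDs that disappear entirely must be removed.
    leaving = sorted(sid for sid in old if sid not in new)
    # IDs that are new, or whose address spec changed, must be (re)added.
    joining = sorted(sid for sid in new if old.get(sid) != new[sid])
    parts = ["reconfig"]
    if leaving:
        parts.append("-remove " + ",".join(str(sid) for sid in leaving))
    if joining:
        parts.append(
            "-add " + ",".join(f"server.{sid}={new[sid]}" for sid in joining)
        )
    return " ".join(parts)


# Server lists copied from the issue description.
OLD = """
server.17=node1.foo.bar.com:2888:3888;2181
server.19=node2.foo.bar.com:2888:3888;2181
server.20=node3.foo.bar.com:2888:3888;2181
server.21=node4.foo.bar.com:2888:3888;2181
server.22=node5.bar.com:2888:3888;2181
"""
NEW = OLD.replace(
    "server.22=node5.bar.com", "server.23=node5.foo.bar.com"
)

print(reconfig_command(OLD, NEW))
# prints: reconfig -remove 22 -add server.23=node5.foo.bar.com:2888:3888;2181
```

The sketch only builds the command text; it does not talk to ZooKeeper, so it is safe to run against copies of the config files before touching a live ensemble.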
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108259#comment-17108259 ] Rajkiran Sura commented on ZOOKEEPER-3814: -- Many thanks Mate, for looking into this. Glad that you could pin-point the problem. Regards, Rajkiran
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106366#comment-17106366 ] Mate Szalay-Beko commented on ZOOKEEPER-3814: - OK, I think I found the root cause of the problem. Since ZOOKEEPER-107, membership configuration is managed automatically by propagating the config within the cluster. ZOOKEEPER-2819 fixed backward compatibility with regular rolling restarts. Still, this error and my tests show that the config is still propagated to the {{lastSeenQuorumVerifier}} variable while the NEWLEADER message is processed. I am testing a change that stops updating {{lastSeenQuorumVerifier}} if dynamic reconfig is enabled. It fixes the issue, but introduces some other problems during rolling restarts. I will keep looking for possible fixes.
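To illustrate the root cause described in the comment above, here is a deliberately simplified toy model. It is not ZooKeeper code: the class `ToyPeer`, its methods, and the `skip_update` flag are hypothetical, and only the name `lastSeenQuorumVerifier` comes from the comment. The model assumes that a peer contacts the union of its current membership and its last-seen membership during an election, which is my reading of the behaviour and should be checked against the actual source.

```python
class ToyPeer:
    """Toy stand-in for a quorum peer; not ZooKeeper's QuorumPeer."""

    def __init__(self, static_members, skip_update=False):
        # Membership parsed from the static config file at startup.
        self.quorum_verifier = set(static_members)
        # Membership view last received from a leader.
        self.last_seen_quorum_verifier = set(static_members)
        # Hypothetical switch standing in for the experimental fix.
        self.skip_update = skip_update

    def on_newleader(self, leader_view):
        """Handle a NEWLEADER-like message carrying the leader's view."""
        if not self.skip_update:
            self.last_seen_quorum_verifier = set(leader_view)

    def election_targets(self):
        """Server IDs this peer would try to contact during an election."""
        return self.quorum_verifier | self.last_seen_quorum_verifier


new_config = {17, 19, 20, 21, 23}    # static config after the migration
stale_view = {17, 19, 20, 21, 22}    # view still containing the old ID

peer = ToyPeer(new_config)
peer.on_newleader(stale_view)
print(sorted(peer.election_targets()))   # prints: [17, 19, 20, 21, 22, 23]

patched = ToyPeer(new_config, skip_update=True)
patched.on_newleader(stale_view)
print(sorted(patched.election_targets()))  # prints: [17, 19, 20, 21, 23]
```

In the unpatched case the removed ID 22 reappears among the election targets, which matches the symptom of servers trying to open a channel to the old hostname.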
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105395#comment-17105395 ] Mate Szalay-Beko commented on ZOOKEEPER-3814: - Update: I was wrong, the order of the rolling restart does not seem to be important. I got the same error simply by:
- have server.1, server.2, server.3 up and running
- stop server.3
- start server.4 with the new config (but re-using the data and config folder of server.3)

I think the problem is that {{server.3}} was somehow committed locally to the last valid view of the quorum, and when {{server.4}} comes up, it gets {{server.3}} from somewhere. Interestingly, it doesn't get it from {{zoo.cfg.dynamic.next}}. When I do the following test, I still get the same problem:
- have server.1, server.2, server.3 up and running
- stop server.3
- delete {{zoo.cfg.dynamic.next}} from the config folder of server 3/4
- start server.4 with the new config (but re-using the data and config folder of server.3)

At this point I still see the same errors in the log, and I also notice that the freshly generated {{zoo.cfg.dynamic.next}} is still wrong. I need to dig into the code now to find the problem. But this really seems to be a bug (or at least something that shouldn't happen when dynamic config is disabled).
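When repeating a test like the one above, it can help to pull the failing election targets out of the server log automatically. The sketch below scans log text for the "Cannot open channel to ... at election address ..." warning quoted in the issue and reports any server ID that is not in the intended config. The function name is hypothetical, and the regular expression assumes the exact message format shown in the issue's log excerpt; other ZooKeeper versions may word it differently.

```python
import re

# Message format taken from the log excerpt in the issue description.
CHANNEL_WARNING = re.compile(
    r"Cannot open channel to (\d+) at election address (\S+?):(\d+)"
)


def unknown_election_targets(log_text, configured_ids):
    """Return sorted (server_id, host) pairs that the log shows connection
    attempts to, but that are absent from the configured ID set."""
    found = set()
    for match in CHANNEL_WARNING.finditer(log_text):
        server_id, host = int(match.group(1)), match.group(2)
        if server_id not in configured_ids:
            found.add((server_id, host))
    return sorted(found)


SAMPLE_LOG = (
    "2020-05-02 01:43:03,026 [myid:23] - WARN "
    "[WorkerSender[myid=23]:QuorumCnxManager@679] - "
    "Cannot open channel to 22 at election address node5.bar.com:3888\n"
)

print(unknown_election_targets(SAMPLE_LOG, {17, 19, 20, 21, 23}))
# prints: [(22, 'node5.bar.com')]
```

An empty result for a given log would suggest that the peer is only contacting servers from the intended config.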
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105384#comment-17105384 ] Mate Szalay-Beko commented on ZOOKEEPER-3814: - OK, using rolling restarts, I successfully reproduced your case by following these steps:
- have server.1, server.2, server.3 up and running
- stop server.1
- start server.1 with the new config (removing server.3, adding server.4 with the new hostname)
- stop server.2
- start server.2 with the new config
- stop server.3
- start server.4 with the new config (but re-using the data folder of server.3)

Now I get the same error as you (in the server.4 logs I see that it tries to connect to the old hostname of server.3 and, obviously, fails). When I get the {{/zookeeper/config}} object, I can see that there is no mention of {{server.3}}. However, the {{zoo.cfg.dynamic.next}} files have not been updated and still contain the old list of servers on all nodes.
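The mismatch described above, where {{/zookeeper/config}} no longer mentions the old server but {{zoo.cfg.dynamic.next}} still does, can be checked mechanically. The following sketch compares two membership texts and lists the server IDs present in the second but not in the first. The function names are hypothetical, the hostnames are placeholders, and the sample texts are made up to mirror the server.1 to server.4 scenario from the steps; in practice the inputs would be the znode content and the file content.

```python
import re

SERVER_ID = re.compile(r"^server\.(\d+)=")


def server_ids(membership_text):
    """Collect the numeric IDs of all server.N entries in a config text."""
    ids = set()
    for line in membership_text.splitlines():
        match = SERVER_ID.match(line.strip())
        if match:
            ids.add(int(match.group(1)))
    return ids


def stale_server_ids(reference_text, candidate_text):
    """IDs listed in candidate_text that reference_text does not contain."""
    return sorted(server_ids(candidate_text) - server_ids(reference_text))


# Placeholder content standing in for the /zookeeper/config znode.
ZNODE_CONFIG = """
server.1=host1.example.com:2888:3888;2181
server.2=host2.example.com:2888:3888;2181
server.4=host4.example.com:2888:3888;2181
"""

# Placeholder content standing in for zoo.cfg.dynamic.next.
DYNAMIC_NEXT = """
server.1=host1.example.com:2888:3888;2181
server.2=host2.example.com:2888:3888;2181
server.3=host3.example.com:2888:3888;2181
"""

print(stale_server_ids(ZNODE_CONFIG, DYNAMIC_NEXT))  # prints: [3]
```

Swapping the two arguments shows the opposite direction, that is, servers the file is missing.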
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105358#comment-17105358 ] Mate Szalay-Beko commented on ZOOKEEPER-3814: -
> Just checking, if you simulated the removal and addition of server via legacy rolling-restarts method? Also, we have quorum authn/authz enabled.

Could you please describe the order of the server restarts you followed? Was there a time when the old server (with {{myid=22}}) was still running while other servers had already been restarted with the new config containing {{server.23}}? This can be important, because in ZooKeeper 3.5+ the leader election protocol changed (see ZOOKEEPER-107) so that the servers send their id/hostname to each other, and this could have caused {{server.22}} to remain in the config of the other servers.
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104188#comment-17104188 ] Rajkiran Sura commented on ZOOKEEPER-3814: -- {quote}The existence of the .next file indicates that a reconfiguration was halted in the middle, before completing. {quote} Also, as I mentioned earlier, we had initially not enabled dynamicReconfig, so I am not sure why "dynamic.next" came into the picture. {quote}The standard way of changing the id would be removing the old id from the cluster and adding the new one using one or more reconfig commands. {quote} FTR: We were trying to achieve this via the legacy rolling-restart method, i.e., first remove the old ID and do a rolling restart, then add the new ID and do another rolling restart. This worked perfectly fine for us (as in, the newly added ID joined the cluster and was serving up-to-date data). But then, when a ZooKeeper failover happened and this newly added ID became leader, we had problems (i.e., none of the ZooKeeper members were serving the clients). Thanks Mate and Alexander for looking into this.
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104120#comment-17104120 ] Mate Szalay-Beko commented on ZOOKEEPER-3814: - {quote}Just checking, if you simulated the removal and addition of server via legacy rolling-restarts method? Also, we have quorum authn/authz enabled. {quote} these are good points, I will try again. Thanks! {quote}The existence of the .next file indicates that a reconfiguration was halted in the middle, before completing. It could be either an explicit command, or that the leader was trying to push its config to others but failed. {quote} This is interesting. Can this be caused by the missing write permission of ZooKeeper on the config folder? (If this missing permission can cause such problems, then we should highlight it in our upgrade documentation and also maybe ZooKeeper should fail to start if this permission is missing.) > ZooKeeper caching of config > --- > > Key: ZOOKEEPER-3814 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection, quorum, server >Affects Versions: 3.5.6 >Reporter: Rajkiran Sura >Assignee: Mate Szalay-Beko >Priority: Major > > Hello, > We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. > Encountered no issues as such. 
> This is how the ZooKeeper config looks like: > {quote}tickTime=2000 > dataDir=/zookeeper-data/ > initLimit=5 > syncLimit=2 > maxClientCnxns=2048 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > 4lw.commands.whitelist=stat, ruok, conf, isro, mntr > authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider > requireClientAuthScheme=sasl > quorum.cnxn.threads.size=20 > quorum.auth.enableSasl=true > quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST > quorum.auth.learnerRequireSasl=true > quorum.auth.learner.saslLoginContext=QuorumLearner > quorum.auth.serverRequireSasl=true > quorum.auth.server.saslLoginContext=QuorumServer > server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > server.22=node5.bar.com:2888:3888;2181 > {quote} > Post upgrade, we had to migrate server.22 on the same node, but with > *FOO*.bar.com domain name due to kerberos referral issues. And, we used > different server-identifier, i.e., *23* when we migrated. So, here is how the > new config looked like: > {quote}server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181* > {quote} > We restarted all the nodes in the ensemble with the above updated config. And > the migrated node joined the quorum successfully and was serving all clients > directly connected to it, without any issues. > Recently, when a leader election happened, > server.*23*=node5.foo.bar.com(migrated node) was chosen as Leader (as it has > highest ID). 
> But then, ZooKeeper was unable to serve any clients and *all*
> the servers were _somehow still_ trying to establish a channel to 22 (old DNS
> name: node5.bar.com) and were throwing below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at
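
The comment at the top of this message asks whether ZooKeeper should refuse to start when it cannot write to its config folder. As a rough illustration of such a fail-fast check (not ZooKeeper's actual behaviour), here is a minimal Python sketch. The function name `config_dir_is_writable` is hypothetical, and the check is only advisory: it reflects the permissions of the user running the script, and a privileged user will usually pass it regardless of the mode bits.

```python
import os
import tempfile


def config_dir_is_writable(config_dir):
    """Return True if config_dir exists and the current user may create files in it.

    Creating a file such as a '.next' config file needs both write and
    search (execute) permission on the containing directory.
    """
    return os.path.isdir(config_dir) and os.access(config_dir, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # The system temp directory stands in for the real config folder here.
    candidate = tempfile.gettempdir()
    if not config_dir_is_writable(candidate):
        raise SystemExit(f"refusing to start: cannot write to {candidate}")
    print(f"{candidate} is writable")
```

Run as the same user that runs the ZooKeeper service, a check like this would have flagged the permission problem mentioned in the thread before any `.next` file needed to be written.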
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103581#comment-17103581 ]

Alexander Shraer commented on ZOOKEEPER-3814:
---

Hi, when a server boots, its config is used to connect to an ensemble, but if the ensemble has already been running (it's not the first time everyone boots), the config of the leader overwrites the starting config of the node. This is usually used to have nodes that aren't part of the ensemble join the cluster. It's possible that this is what happened. What I'm not sure about is how the id 23 got into the /zookeeper/config znode.

The existence of the .next file indicates that a reconfiguration was halted in the middle, before completing. It could be either an explicit command, or that the leader was trying to push its config to others but failed.

The standard way of changing the id would be removing the old id from the cluster and adding the new one using one or more reconfig commands.
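
Alexander's comment above recommends changing a server id with reconfig commands instead of editing the static config. The following is a minimal Python sketch of that idea, assuming dynamic reconfiguration is enabled (`reconfigEnabled=true`) and the client is authorized to run it. The helper names `parse_servers` and `reconfig_commands` are hypothetical and not part of ZooKeeper. The sketch only diffs two `server.N=...` lists and builds the `reconfig -remove` / `reconfig -add` strings one could type into `zkCli.sh`; it does not talk to an ensemble.

```python
def parse_servers(lines):
    """Map server id -> address spec for every 'server.N=spec' line."""
    servers = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("server."):
            continue  # ignore tickTime, dataDir and other settings
        key, _, spec = line.partition("=")
        servers[int(key[len("server."):])] = spec.strip()
    return servers


def reconfig_commands(old_lines, new_lines):
    """Build zkCli 'reconfig' commands that turn the old membership into the new one.

    Ids that disappear are removed first and new ids are added afterwards,
    following the order described in the comment (remove the old id, then
    add the new one). Ids present in both lists are left untouched.
    """
    old, new = parse_servers(old_lines), parse_servers(new_lines)
    commands = [f"reconfig -remove {sid}" for sid in sorted(set(old) - set(new))]
    commands += [
        f"reconfig -add server.{sid}={new[sid]}" for sid in sorted(set(new) - set(old))
    ]
    return commands


if __name__ == "__main__":
    old_cfg = [
        "server.21=node4.foo.bar.com:2888:3888;2181",
        "server.22=node5.bar.com:2888:3888;2181",
    ]
    new_cfg = [
        "server.21=node4.foo.bar.com:2888:3888;2181",
        "server.23=node5.foo.bar.com:2888:3888;2181",
    ]
    for command in reconfig_commands(old_cfg, new_cfg):
        print(command)
```

For the membership change reported in this issue, the sketch yields one removal of id 22 followed by one addition of `server.23`. Because both ids live on the same machine here, the node would also need its `myid` changed and a restart between the two steps.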
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102671#comment-17102671 ]

Rajkiran Sura commented on ZOOKEEPER-3814:
---

Hi Mate,

Many thanks for looking further into this.
{quote}If by any chance you still have (and you have the permission to share) the full server logs from all the servers during the time when you changed the hostname of the last node, I would be happy to take a look.
{quote}
Unfortunately, the logs have already been rotated out. Also, when we changed the hostname, the node was able to join the cluster without any issues after we did a rolling restart, and it served clients without any issues for a week. Then, when the next leader election happened, it got elected as leader and we had trouble serving the clients.
{quote}I have a docker environment ([https://github.com/symat/zookeeper-docker-test]) where I tried to create a cluster and simulating the config change you did.
{quote}
Just checking, if you simulated the removal and addition of server via legacy rolling-restarts method? Also, we have quorum authn/authz enabled.

FWIW: I also tried simulating this afresh using a 3-node v3.5.6 cluster, but wasn't able to reproduce it exactly. Also, not sure if this makes any difference, but the production cluster was upgraded from v3.4.8 to v3.5.6, whereas my reproducer/simulation was initialized directly with v3.5.6.
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102572#comment-17102572 ]

Mate Szalay-Beko commented on ZOOKEEPER-3814:
---

[~shralex], [~hanm] - AFAIK you were working on the dynamic reconfig feature, so maybe you have some idea what might have happened here. Could you please take a look?
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102567#comment-17102567 ]

Mate Szalay-Beko commented on ZOOKEEPER-3814:
---

I have a docker environment ([https://github.com/symat/zookeeper-docker-test]) where I tried to create a cluster and simulating the config change you did. I haven't faced your issue. For me, the new node became the leader without any problem, and the old id/hostname never appeared in the logs of the rest of the nodes. Also, my {{zoo.cfg.dynamic.next}} file always contained the correct server configs.

Regarding some of your previous comments:
{quote}Latest observation, we noticed that ZooKeeper was complaining about dynamic.next file, event though we HAVE NOT ENABLED dynamic-reconfiguration.
{quote}
This is how the newer ZooKeeper works; it is not a bug. I think you can find more info here: [https://zookeeper.apache.org/doc/r3.5.7/zookeeperReconfig.html]
{quote}And zookeeper user did not have perms to that config directory, so we fixed that restarted zookeeper. And then it dumped below dynamic.next, which contains the OLD migrated node as a member
{quote}
This is really strange. It suggests that the nodes somehow "cached" the old config and that their configuration didn't get updated with the server list of the new static config. I was checking the code and cannot find what the root cause might have been. I tried many scenarios manually but never managed to reproduce what you see. I wonder if executing the "{{sync /zookeeper/config}}" command would help in this case.

If by any chance you still have (and you have the permission to share) the full server logs from all the servers during the time when you changed the hostname of the last node, I would be happy to take a look. I might see something that helps us find the root cause of this issue.
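
Mate's comment above contrasts a {{zoo.cfg.dynamic.next}} file that matches the static config with the reporter's file, which still listed the old member. A quick way to spot such a mismatch is to diff the `server.N` entries of the two files. The Python sketch below is one hedged way to do that. The helper names are hypothetical, the sample lines are made up for illustration (the thread does not show the real file contents), and only server ids and hostnames are compared.

```python
def parse_members(lines):
    """Map server id -> hostname from 'server.N=host:port:port[...]' lines."""
    members = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("server."):
            continue  # skips 'version=...' and other non-membership settings
        key, _, spec = line.partition("=")
        members[int(key[len("server."):])] = spec.strip().split(":")[0]
    return members


def membership_diff(static_lines, dynamic_lines):
    """Report ids and hostnames on which the two configs disagree."""
    static, dynamic = parse_members(static_lines), parse_members(dynamic_lines)
    shared = set(static) & set(dynamic)
    return {
        "only_in_static": sorted(set(static) - set(dynamic)),
        "only_in_dynamic": sorted(set(dynamic) - set(static)),
        "host_changed": sorted(i for i in shared if static[i] != dynamic[i]),
    }


if __name__ == "__main__":
    # Illustrative contents only; read the real files with open(...).readlines().
    static_cfg = [
        "tickTime=2000",
        "server.21=node4.foo.bar.com:2888:3888;2181",
        "server.23=node5.foo.bar.com:2888:3888;2181",
    ]
    dynamic_next = [
        "server.21=node4.foo.bar.com:2888:3888:participant;2181",
        "server.22=node5.bar.com:2888:3888:participant;2181",
        "version=100000000",
    ]
    print(membership_diff(static_cfg, dynamic_next))
```

An id that shows up under `only_in_dynamic` corresponds to the symptom described in this issue: a member that the static config no longer mentions but that the servers still try to contact.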
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102383#comment-17102383 ]

Mate Szalay-Beko commented on ZOOKEEPER-3814:
---

OK, my bad... I must have misread something.

I will try to reproduce this issue locally; let's hope I can succeed. If I can't, then I will ask you to share the full logs around the leader election when server 23 was elected. It would be great to have the server logs from server 23 and from all the other servers. There are quite a few warning / info messages in the code that can help us investigate the issue.
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102321#comment-17102321 ]

Rajkiran Sura commented on ZOOKEEPER-3814:
---

Hi Mate,

Yes, as mentioned in the first update, we have kept both the myid and the config in sync with the changes.
{quote}server.17=node1.foo.bar.com:2888:3888;2181
server.19=node2.foo.bar.com:2888:3888;2181
server.20=node3.foo.bar.com:2888:3888;2181
server.21=node4.foo.bar.com:2888:3888;2181
*server.{color:#FF}23{color}=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
{quote}
Also, if there were a mismatch between the ID in myid and the config, ZooKeeper wouldn't even start up. In our case, the node was able to join the quorum and sync data, and was also serving the clients. But it had trouble when it was elected as leader.

Thanks,
Rajkiran
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100922#comment-17100922 ]

Rajkiran Sura commented on ZOOKEEPER-3814:
--

Hi Mate,

Thanks for your reply. Yes, I did change the ID in the 'myid' file.

As per [~eolivelli]'s suggestion [here|http://zookeeper-user.578899.n2.nabble.com/ZooKeeper-config-caching-issues-td7584905.html], we added and removed nodes the traditional way (we have had dynamic reconfig disabled since the beginning), and that did not help: the new node was able to join the cluster, but the cluster became unresponsive when that node became the leader. So we finally had to enable and use dynamic reconfig to fix the problem.

So this definitely looks like a bug: some code path appears to be hard-coded to look only at the dynamic config?

Thanks a lot,
Rajkiran

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.5.6
> Reporter: Rajkiran Sura
> Assignee: Mate Szalay-Beko
> Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6.
> Encountered no issues as such.
> This is how the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with
> *FOO*.bar.com domain name due to kerberos referral issues. And, we used
> different server-identifier, i.e., *23* when we migrated. So, here is how the
> new config looked like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And
> the migrated node joined the quorum successfully and was serving all clients
> directly connected to it, without any issues.
> Recently, when a leader election happened,
> server.*23*=node5.foo.bar.com(migrated node) was chosen as Leader (as it has
> highest ID). But then, ZooKeeper was unable to serve any clients and *all*
> the servers were _somehow still_ trying to establish a channel to 22 (old DNS
> name: node5.bar.com) and were throwing below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
> {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
> {{ at java.base/java.net.Socket.connect(Socket.java:591)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:650)}}
> {{ at
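
For readers following the dynamic-reconfig workaround mentioned in this comment, here is a minimal sketch of how the membership change in this issue (drop server.22, add server.23) could be turned into an incremental `reconfig` command for zkCli. The thread does not say which commands were actually run; the `-remove`/`-add` form with comma-separated arguments is how I understand the ZooKeeper dynamic reconfiguration guide, and the helper name `reconfig_command` is hypothetical.

```python
def reconfig_command(old, new):
    """Build an incremental zkCli 'reconfig' command string.

    old and new map a server id to its spec (the text after 'server.N=').
    Assumption: ids present only in old are removed and ids present only
    in new are added; a changed spec under an unchanged id is not handled.
    """
    removed = sorted(sid for sid in old if sid not in new)
    added = sorted(sid for sid in new if sid not in old)

    parts = ["reconfig"]
    if removed:
        parts += ["-remove", ",".join(str(sid) for sid in removed)]
    if added:
        parts += ["-add", ",".join(f"server.{sid}={new[sid]}" for sid in added)]

    # Nothing to change: return an empty string instead of a bare command.
    return " ".join(parts) if len(parts) > 1 else ""


old_members = {
    21: "node4.foo.bar.com:2888:3888;2181",
    22: "node5.bar.com:2888:3888;2181",
}
new_members = {
    21: "node4.foo.bar.com:2888:3888;2181",
    23: "node5.foo.bar.com:2888:3888;2181",
}
command = reconfig_command(old_members, new_members)
```

This only builds the command text. Actually running it requires dynamic reconfig to be enabled on the ensemble and suitable client permissions, which is outside the scope of the sketch.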
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098727#comment-17098727 ]

Mate Szalay-Beko commented on ZOOKEEPER-3814:
-

Hello Rajkiran,

did you also change the content of the 'myid' file from 22 to 23 when you migrated the node?

Please note that the newer ZooKeeper version you use sends its ID during the initiation of the leader election protocol. So if a server still thinks that its ID is 22, it will send this ID to the others, which will trust the ID and try to find a host address for this server (either from the config or from the last committed view).

Kind regards,
Mate
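
The 'myid' question above can be checked mechanically. The following is a minimal sketch, assuming the `server.N=host:port:port...` line format shown in the issue description; the helper names `parse_servers` and `check_myid` are hypothetical and not part of ZooKeeper. It takes the file contents as strings, so it does not depend on where the files live.

```python
import re

# Matches lines such as: server.23=node5.foo.bar.com:2888:3888;2181
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:]+):(\d+):(\d+)")


def parse_servers(config_text):
    """Return {server_id: hostname} for every server.N line in the text."""
    servers = {}
    for line in config_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            servers[int(match.group(1))] = match.group(2)
    return servers


def check_myid(myid_text, config_text):
    """Report whether the id stored in 'myid' has a server.N entry."""
    my_id = int(myid_text.strip())
    servers = parse_servers(config_text)
    if my_id not in servers:
        return f"myid {my_id} has no server.{my_id} entry in the config"
    return f"myid {my_id} maps to {servers[my_id]}"


new_config = """\
server.21=node4.foo.bar.com:2888:3888;2181
server.23=node5.foo.bar.com:2888:3888;2181
"""
stale_result = check_myid("22\n", new_config)
fresh_result = check_myid("23\n", new_config)
```

A server whose 'myid' still held 22 would fall into the first branch, which is the situation the question above is trying to rule out.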
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097875#comment-17097875 ]

Rajkiran Sura commented on ZOOKEEPER-3814:
--

FTR: We haven't enabled dynamic reconfig at all.
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097874#comment-17097874 ]

Rajkiran Sura commented on ZOOKEEPER-3814:
--

Latest observation: we noticed that ZooKeeper was complaining about the dynamic.next file, even though we HAVE NOT ENABLED dynamic reconfiguration.

{quote}2020-05-02 01:43:05,870 [myid:21] - ERROR [QuorumPeer[myid=21](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1637] - Error writing next dynamic config file to disk:
{quote}

The zookeeper user did not have permissions on that config directory, so we fixed that and restarted ZooKeeper. It then dumped the dynamic.next file below, which still contains the OLD entry of the migrated node as a member :O

{quote}$ sudo cat /opt/zookeeper/conf/zoo.cfg.dynamic.next
server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181
*server.{color:#de350b}22=node5.bar.com{color}*:2888:3888:participant;0.0.0.0:2181
{quote}

So this looks like a bug. Where is it still fetching this from, and how do we fix it? Any lead/help is very much appreciated.

Thanks in advance,
Rajkiran
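
To spot this kind of mismatch on other nodes, the static config and the dumped dynamic.next file can be compared entry by entry. This is a minimal sketch, assuming both files use the `server.N=host:...` line format shown above; the helper names `members` and `stale_members` are hypothetical, and the inputs are plain strings rather than real file paths.

```python
import re

# Captures the id and hostname from lines such as
#   server.22=node5.bar.com:2888:3888:participant;0.0.0.0:2181
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:;]+):")


def members(text):
    """Return {server_id: hostname} for all server.N lines in the text."""
    found = {}
    for line in text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            found[int(match.group(1))] = match.group(2)
    return found


def stale_members(static_text, dynamic_text):
    """Entries of the dynamic file whose id/host pair is not in the static config."""
    static = members(static_text)
    dynamic = members(dynamic_text)
    return {sid: host for sid, host in dynamic.items() if static.get(sid) != host}


static_cfg = """\
server.21=node4.foo.bar.com:2888:3888;2181
server.23=node5.foo.bar.com:2888:3888;2181
"""
dynamic_next = """\
server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.22=node5.bar.com:2888:3888:participant;0.0.0.0:2181
"""
stale = stale_members(static_cfg, dynamic_next)
```

With the two excerpts above, the only entry flagged is server.22 on node5.bar.com, matching the stale member seen in the dumped file. The check only compares ids and hostnames; ports and roles are ignored.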