[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108866#comment-17108866 ] Alexander Shraer commented on ZOOKEEPER-3814:
---------------------------------------------

[~symat] the way this works is that config changes usually happen in two rounds: a proposal sets {{lastSeenQuorumVerifier}}, which writes the .next file, and then a commit calls processReconfig, which calls {{setQuorumVerifier}}. The same happens when a learner syncs with the leader: the leader's proposal is now NEW_LEADER and the leader's commit is UPTODATE. The commit / UPTODATE is what actually changes the config, not {{lastSeenQuorumVerifier}} (though writing out .next files should also be prevented in this case, I think). Another place where the config can change is during the gossip that happens in leader election: servers send their configs around peer-to-peer and adopt a later config if they see one (FastLeaderElection.java, look for processReconfig). There too, you could require that the reconfigEnabled flag is on before calling processReconfig.

> ZooKeeper caching of config
> ---------------------------
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.5.6
> Reporter: Rajkiran Sura
> Assignee: Mate Szalay-Beko
> Priority: Major
>
> Hello,
> We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6. Encountered no issues as such.
> This is how the ZooKeeper config looks:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.enableSasl=true
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the *FOO*.bar.com domain name due to Kerberos referral issues. And we used a different server identifier, i.e. *23*, when we migrated. So here is how the new config looked:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. The migrated node joined the quorum successfully and served all clients directly connected to it without any issues.
> Recently, when a leader election happened, server.*23*=node5.foo.bar.com (the migrated node) was chosen as leader (as it has the highest ID). But then ZooKeeper was unable to serve any clients, and *all* the servers were _somehow still_ trying to establish a channel to 22 (old DNS name: node5.bar.com) and were throwing the below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
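The two-round flow and the gating suggested in Shraer's comment above can be sketched as follows. This is a hedged, simplified model, not the real QuorumPeer or FastLeaderElection code: the class, fields, and version numbers are illustrative assumptions; only the idea — ignore gossiped configs unless the reconfigEnabled flag is on, and only ever adopt a strictly newer config — comes from the thread.

```java
// Hypothetical sketch of gating config adoption on the reconfigEnabled flag.
// QuorumVerifier here is a stand-in: only the config version matters.
public class ReconfigGateSketch {

    static class QuorumVerifier {
        final long version;
        QuorumVerifier(long version) { this.version = version; }
    }

    private final boolean reconfigEnabled; // mirrors the startup flag
    private QuorumVerifier quorumVerifier; // the currently committed config

    ReconfigGateSketch(boolean reconfigEnabled, QuorumVerifier initial) {
        this.reconfigEnabled = reconfigEnabled;
        this.quorumVerifier = initial;
    }

    // Called when a (possibly newer) config is seen, e.g. gossiped during
    // leader election. Require the flag before adopting anything, and only
    // ever move to a strictly newer config version.
    boolean processReconfig(QuorumVerifier candidate) {
        if (!reconfigEnabled) {
            return false; // static config: never adopt gossiped configs
        }
        if (candidate.version > quorumVerifier.version) {
            quorumVerifier = candidate;
            return true;
        }
        return false;
    }

    long currentVersion() {
        return quorumVerifier.version;
    }
}
```

With the flag off, the sketch simply refuses every candidate config, which is the behavior the comment proposes for static-configuration deployments.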
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108856#comment-17108856 ] Rajkiran Sura commented on ZOOKEEPER-3814:
------------------------------------------

{quote}but in the meanwhile I recommend using dynamic reconfig to change the quorum.
{quote}
Yes, we have started to rely on dynamic reconfig. But I would like to note that dynamic reconfig isn't really dynamic when quorum auth is enabled with GSSAPI via SASL: the config is changed, but the new member doesn't join the ensemble until all the members are restarted. So it is no longer dynamic, which looks scarier. FTR: I have raised https://issues.apache.org/jira/browse/ZOOKEEPER-3824 for this issue. Thanks Mate.

Regards,
Rajkiran
[jira] [Commented] (ZOOKEEPER-3831) Add a test that does a minimal validation of Apache Curator
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108540#comment-17108540 ] Jordan Zimmerman commented on ZOOKEEPER-3831:
---------------------------------------------

I'm excluding zookeeper in Maven, and this will only be on the test path, so it shouldn't pollute ZooKeeper's classpath.

> Add a test that does a minimal validation of Apache Curator
> -----------------------------------------------------------
>
> Key: ZOOKEEPER-3831
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3831
> Project: ZooKeeper
> Issue Type: Improvement
> Components: tests
> Affects Versions: 3.6.1
> Reporter: Jordan Zimmerman
> Assignee: Jordan Zimmerman
> Priority: Minor
>
> Given that Apache Curator is one of the most widely used ZooKeeper clients, it would be beneficial for ZooKeeper to have a minimal test to ensure that the codebase doesn't cause incompatibilities with Curator in the future.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
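The Maven exclusion Jordan mentions could look roughly like the following pom.xml fragment. This is a hedged sketch, not taken from the actual PR: the artifactId, version, and scope shown are assumptions; only the idea — pull Curator into the test scope while excluding its transitive zookeeper dependency, so the module under test supplies its own ZooKeeper classes — comes from the comment.

```xml
<!-- Illustrative only: artifactId and version are assumptions, not from the PR. -->
<dependency>
  <groupId>org.apache.curator</groupId>
  <artifactId>curator-framework</artifactId>
  <version>5.0.0</version>
  <scope>test</scope>
  <exclusions>
    <exclusion>
      <!-- Keep Curator's transitive zookeeper off the classpath;
           the build-under-test provides its own ZooKeeper jars. -->
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```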
[jira] [Comment Edited] (ZOOKEEPER-3831) Add a test that does a minimal validation of Apache Curator
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108540#comment-17108540 ] Jordan Zimmerman edited comment on ZOOKEEPER-3831 at 5/15/20, 6:35 PM:
-----------------------------------------------------------------------

I'm excluding zookeeper in Maven, and this will only be on the test path, so it shouldn't pollute ZooKeeper's classpath. But maybe a "compatibility" module is in order?

was (Author: randgalt): I'm excluding zookeeper in Maven and this will only be in the test path so it shouldn't pollute ZooKeeper's classpath.
[jira] [Commented] (ZOOKEEPER-3831) Add a test that does a minimal validation of Apache Curator
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108539#comment-17108539 ] Enrico Olivelli commented on ZOOKEEPER-3831:
--------------------------------------------

Very interesting. I think this should live in a separate module under ZooKeeper, in order not to have a polluted classpath.
[jira] [Commented] (ZOOKEEPER-3831) Add a test that does a minimal validation of Apache Curator
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108525#comment-17108525 ] Jordan Zimmerman commented on ZOOKEEPER-3831:
---------------------------------------------

I have a PR nearly ready. We just need to release a new version of Curator.
[jira] [Created] (ZOOKEEPER-3831) Add a test that does a minimal validation of Apache Curator
Jordan Zimmerman created ZOOKEEPER-3831:
----------------------------------------

Summary: Add a test that does a minimal validation of Apache Curator
Key: ZOOKEEPER-3831
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3831
Project: ZooKeeper
Issue Type: Improvement
Components: tests
Affects Versions: 3.6.1
Reporter: Jordan Zimmerman
Assignee: Jordan Zimmerman

Given that Apache Curator is one of the most widely used ZooKeeper clients, it would be beneficial for ZooKeeper to have a minimal test to ensure that the codebase doesn't cause incompatibilities with Curator in the future.
[jira] [Created] (ZOOKEEPER-3830) After add a new node, zookeeper cluster won't commit any proposal if this new node is leader
Keli Wang created ZOOKEEPER-3830:
---------------------------------

Summary: After add a new node, zookeeper cluster won't commit any proposal if this new node is leader
Key: ZOOKEEPER-3830
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3830
Project: ZooKeeper
Issue Type: Bug
Environment: Zookeeper 3.5.8, JDK 1.8
Reporter: Keli Wang
Attachments: reproduce-zkclusters.tar.gz

I have a ZooKeeper cluster with 3 nodes; node3 is the leader of the cluster.
{code:java}
server.1=node1
server.2=node2
server.3=node3 # current leader
{code}
With dynamic reconfiguration disabled, I scale this cluster to 4 nodes in 2 steps:
# Start node4 with the new config; node4 is now a follower.
# Modify the config and restart node1, node2 and node3 one by one.
The new cluster config is:
{code:java}
server.1=node1
server.2=node2
server.3=node3
server.4=node4 # current leader
{code}
After the restarts, node4 is the leader of this cluster, but I cannot connect to the cluster using zkCli. If I restart node4, node3 becomes the new leader, and I can connect to the cluster using zkCli again.
After some digging, I found that node4's Leader#allowedToCommit field is false, so this cluster won't commit any new proposals.
I have attached a ZooKeeper cluster setup to reproduce this problem; the cluster in the attachment can run on one single machine.
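The symptom described above — a leader that is up and answers monitoring commands but never commits a proposal — can be modeled with a small hedged sketch. This is not the real Leader class: the method names and the trigger for flipping the flag are simplified assumptions (in real ZooKeeper the flag is managed around the leader's quorum-ack handling); only the gating behavior of allowedToCommit comes from the report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the reported symptom: while allowedToCommit stays
// false, every proposal is refused even though the server is otherwise up.
public class AllowedToCommitSketch {

    private boolean allowedToCommit = false;
    private final List<String> committed = new ArrayList<>();

    // Simplified stand-in for the real quorum-ack path that enables commits.
    void quorumAcked() {
        allowedToCommit = true;
    }

    boolean tryCommit(String proposal) {
        if (!allowedToCommit) {
            return false; // cluster "looks" up, but nothing ever commits
        }
        committed.add(proposal);
        return true;
    }

    int committedCount() {
        return committed.size();
    }
}
```

In the reported scenario, node4 apparently never reached the state modeled by `quorumAcked()`, so it refused all commits until it was restarted and another node took over.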
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108373#comment-17108373 ] Mate Szalay-Beko commented on ZOOKEEPER-3814:
---------------------------------------------

Unfortunately I haven't found any trivial fixes yet. I will try more approaches next week, but in the meanwhile I recommend using dynamic reconfig to change the quorum.
[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108046#comment-17108046 ] benwang li edited comment on ZOOKEEPER-3829 at 5/15/20, 2:56 PM:
-----------------------------------------------------------------

We start `CommitProcessor` [here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L455]. We shut down `CommitProcessor` [here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L637]. But when we call the `start` method again, the `workerPool` will not work anymore. I have attached the node D logs as `d.log`, and we can see this happen:
{code:java}
307 2020-05-14 18:04:12,022 [myid:4] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@362] - Shutting down
308 2020-05-14 18:04:12,022 [myid:4] - INFO [FollowerRequestProcessor:4:FollowerRequestProcessor@110] - FollowerRequestProcessor exited loop!
309 2020-05-14 18:04:12,022 [myid:4] - INFO [CommitProcessor:4:CommitProcessor@195] - CommitProcessor exited loop!
310 2020-05-14 18:04:12,023 [myid:4] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FinalRequestProcessor@514] - shutdown of request processor complete
311 2020-05-14 18:04:12,024 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@655] - Created new input stream /data1/zookeeper/logs/version-2/log.2a000b
312 2020-05-14 18:04:12,024 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@658] - Created new input archive /data1/zookeeper/logs/version-2/log.2a000b
313 2020-05-14 18:04:12,024 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@696] - EOF exception java.io.EOFException: Failed to read /data1/zookeeper/logs/version-2/log.2a000b
314 --
315 2020-05-14 18:04:29,000 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274] - Adding session 0x3082f5048fc
316 2020-05-14 18:04:29,000 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274] - Adding session 0x40a33f8f3f40002
317 2020-05-14 18:04:29,000 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274] - Adding session 0x40a33f8f3f4
318 2020-05-14 18:04:29,000 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274] - Adding session 0x40a33f8f3f40001
319 2020-05-14 18:04:29,000 [myid:4] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@256] - Configuring CommitProcessor with 24 worker threads.
320 2020-05-14 18:04:29,002 [myid:4] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):ContainerManager@64] - Using checkIntervalMs=6 maxPerMinute=1
321 2020-05-14 18:04:29,003 [myid:4] - DEBUG [LearnerHandler-/146.196.79.232:38708:LearnerHandler@534] - Sending UPTODATE message to 3
{code}
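The start/shutdown asymmetry described above can be illustrated with a minimal hedged sketch. This is not the real CommitProcessor API — the class and pool size here are illustrative — it just shows the pattern at issue: if shutdown() terminates the worker pool and a later start() does not recreate it, the restarted processor silently stops doing work. The guard in start() mirrors the reporter's proposal of resetting the pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the restart bug and its proposed fix.
public class WorkerPoolRestartSketch {

    private ExecutorService workerPool;

    // Fix: recreate the pool when it is missing or already terminated.
    // Without this guard, a second start() after shutdown() would leave the
    // processor holding a dead pool, matching the reported symptom.
    synchronized void start() {
        if (workerPool == null || workerPool.isShutdown()) {
            workerPool = Executors.newFixedThreadPool(4);
        }
    }

    synchronized void shutdown() {
        if (workerPool != null) {
            workerPool.shutdown();
            workerPool = null; // the reporter's proposal: forget the dead pool
        }
    }

    // Stand-in for "can this processor still run commit tasks?"
    synchronized boolean canProcess() {
        return workerPool != null && !workerPool.isShutdown();
    }
}
```

With the guarded recreation in place, a start/shutdown/start cycle leaves the processor usable again, which is the behavior the reporter says their patched build exhibits.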
[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108370#comment-17108370 ] Mate Szalay-Beko commented on ZOOKEEPER-3829:
---------------------------------------------

{quote}Hi, I reproduced it with your docker-compose scripts
{quote}
Great, thanks for the detailed steps! I will try them locally on Monday and verify your findings. (I used slightly different docker-compose commands; maybe those made the difference.) The config looks OK, except for {{initLimit}}, which should be much smaller: it is given in a number of ticks, not in milliseconds. But I don't think it matters much in this case. Thanks for taking so much time chasing a ZooKeeper error! :)

> Zookeeper refuses request after node expansion
> ----------------------------------------------
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.6
> Reporter: benwang li
> Priority: Major
> Attachments: d.log, screenshot-1.png
>
> It's easy to reproduce this bug.
> {code:java}
> Step 1. Deploy 3 nodes A, B, C with configuration A,B,C.
> Step 2. Deploy node D with configuration A,B,C,D; the cluster state is OK now.
> Step 3. Restart nodes A, B, C with configuration A,B,C,D; the leader will then be D and the cluster hangs. It still accepts the `mntr` command, but other commands like `ls /` will be blocked.
> Step 4. Restart node D; the cluster state is back to normal.
> {code}
> We have looked into the code of the 3.5.6 version, and we found it may be an issue with the `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the `workerPool` object still exists. It will never work anymore, yet the cluster still thinks it's OK.
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If that's OK, please assign this issue to me, and then I'll create a PR.
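Mate's point about {{initLimit}} units can be illustrated with a hedged zoo.cfg fragment (the values are illustrative, not from this thread): initLimit and syncLimit are counted in ticks, so the effective timeout is the tick count multiplied by tickTime.

```
# tickTime is in milliseconds; initLimit and syncLimit are in ticks.
tickTime=2000      # one tick = 2000 ms
initLimit=10       # followers may take up to 10 * 2000 ms = 20 s to connect and sync
syncLimit=5        # followers may be at most 5 * 2000 ms = 10 s out of date
```

So a value like initLimit=30000, if intended as "30 seconds", would actually mean 30000 ticks (over 16 hours with a 2000 ms tick) — which is why a millisecond-style value looks far too large.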
[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108368#comment-17108368 ] benwang li commented on ZOOKEEPER-3829:
---------------------------------------

{quote}Did you try your proposed fix already and saw that it solves your original issue?
{quote}
Sorry, I forgot to answer this. Yes, I applied the fix and tested it; everything works normally after the fix.
Have you checked my reply message? (I reproduced it with your docker repo.) It must be some configuration that makes this issue happen. I will try to find which config is wrong.
[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108353#comment-17108353 ] Mate Szalay-Beko commented on ZOOKEEPER-3829: - I see now :) (you pasted the logs from line 308 in your previous comment, this is why I missed it) But still I don't know if it the two log lines are produced by the same CommitProcessor object instance. As I said, my understanding is that a new ZooKeeperServer with a new clean CommitProcessor is getting created after each leader election. What I see in Line 225 is that this server was (re)started? Then at least a whole leader election is missing from the log. Then I see that the server become a Follower. Then in line 299 it can not follow the current leader anymore. I guess then happens a new leader election, missing from the logs. But we see that the LearnerZooKeeperServer is shutting down (also closing the CommitProcessor). And then the next thing I see is what you are mentioning: "Configuring CommitProcessor with 24 worker threads". But this time the server is already a leader, as it is sending the UPTODATE messages (lines 321, 322). So my assumption would be that this time this CommitProcessor is inside a LederZooKeeperServer, not inside a LearnerZooKeeperServer. So these are actually different CommitProcessors and different workerPools. Anyway, I am not saying you are not right (this is a quite complicated piece of code). All I say is that I am not convinced yet and it is very hard for me to tell what is happening, as I don't see the full logs and also I was not able to reproduce the problem locally. (maybe my mistake, I don't know) I don't think it is related to docker compose vs. plain docker. Based on your description, something must have been stucked, I am just not sure if it is the workerPool in the CommitProcessor. Can I ask again: "Did you try your proposed fix already and saw that it solves your original issue?" 
(you can download the ZooKeeper code, apply your fix, run `mvn clean install -DskipTests`, and swap the zookeeper jar files in the docker image for testing)

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.6
> Reporter: benwang li
> Priority: Major
> Attachments: d.log, screenshot-1.png
>
> It's easy to reproduce this bug.
> {code:java}
> // code placeholder
> Step 1. Deploy 3 nodes A,B,C with configuration A,B,C.
> Step 2. Deploy node `D` with configuration `A,B,C,D`; cluster state is ok now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D; the leader will then be D, and the cluster hangs. It can still accept the `mntr` command, but other commands like `ls /` will be blocked.
> Step 4. Restart node D; cluster state is back to normal now.
> {code}
>
> We have looked into the code of the 3.5.6 version, and we found it may be an issue with the `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the `workerPool` reference still exists. It will never work again, yet the cluster still thinks it's ok.
>
> I think the bug may still exist on the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If it's ok, please assign this issue to me, and then I'll create a PR.
>
-- This message was sent by Atlassian Jira (v8.3.4#803005)
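The shutdown-then-reuse pattern the report describes can be illustrated with a minimal, self-contained Java sketch. This is illustrative only, not the actual ZooKeeper `CommitProcessor` code: the class name and structure are invented, but the core behavior is real Java semantics — an `ExecutorService` that has been shut down rejects all later work, even if the owning object is "started" again while still holding the old pool reference.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical sketch of the suspected bug pattern (not ZooKeeper code).
class SketchProcessor {
    private ExecutorService workerPool;

    void start() {
        // Bug pattern: the pool is only created on the very first start,
        // so a restart after shutdown() reuses the dead pool.
        if (workerPool == null) {
            workerPool = Executors.newFixedThreadPool(2);
        }
    }

    void shutdown() {
        if (workerPool != null) {
            workerPool.shutdown(); // a shut-down pool never accepts work again
        }
    }

    boolean tryProcess(Runnable task) {
        try {
            workerPool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false; // requests are refused from here on
        }
    }
}

public class Main {
    public static void main(String[] args) {
        SketchProcessor p = new SketchProcessor();
        p.start();
        System.out.println("first start accepts work: " + p.tryProcess(() -> {}));
        p.shutdown();
        p.start(); // e.g. after a leader change: same object, same dead pool
        System.out.println("restart accepts work: " + p.tryProcess(() -> {}));
        p.shutdown();
    }
}
```

If the real server ever reused a `CommitProcessor` this way, it would match the reported symptom: the process stays up (so `mntr` still answers), but request processing silently stalls.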
[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108338#comment-17108338 ] benwang li edited comment on ZOOKEEPER-3829 at 5/15/20, 2:30 PM: - [~symat] Hi, I reproduced it with your docker-compose scripts. My zoo.cfg follows the ClickHouse documentation [tips|https://clickhouse.tech/docs/en/operations/tips/]:

{code:java}
libenwang@ck015:~/git/zookeeper-docker-test$ cat conf/zoo.cfg
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=3
syncLimit=10
maxClientCnxns=2000
maxSessionTimeout=6000
autopurge.snapRetainCount=10
autopurge.purgeInterval=1
preAllocSize=131072
snapCount=300
leaderServes=yes
standaloneEnabled=false
clientPort=2181
admin.serverPort=8084
{code}

Scripts:

{code:java}
export ZOOKEEPER_GIT_REPO=~/git/zookeeper
export ZOOKEEPER_DOCKER_TEST_GIT_REPO=~/git/zookeeper-docker-test

# you always need to do a maven install to have the assembly tar.gz file updated!
cd $ZOOKEEPER_GIT_REPO
mvn clean install -DskipTests

cd $ZOOKEEPER_DOCKER_TEST_GIT_REPO
sudo rm -rf data
docker-compose --project-name zookeeper --file 3_nodes_zk_mounted_data_folder.yml up -d
docker exec -it zookeeper_zoo1_1 /bin/bash /zookeeper/bin/zkCli.sh create /clickhouse aaa
docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml create zoo4
docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml start zoo4
docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml stop zoo1
docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml create zoo1
docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml start zoo1
docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml stop zoo2
docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml create zoo2
docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml start zoo2
docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml stop zoo3
docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml create zoo3
docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml start zoo3

# This hangs
docker exec -it zookeeper_zoo4_1 /bin/bash /zookeeper/bin/zkCli.sh ls /

docker-compose --project-name zookeeper --file 4_nodes_zk_mounted_data_folder.yml down
{code}
[jira] [Commented] (ZOOKEEPER-3824) ZooKeeper dynamic reconfig doesn't work with GSSAPI/SASL enabled Quorum authn/z
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108266#comment-17108266 ] Rajkiran Sura commented on ZOOKEEPER-3824: -- Tagging [~symat] [~shralex] [~hanm] [~eolivelli] in case they have any thoughts on this issue. Thanks, Rajkiran

> ZooKeeper dynamic reconfig doesn't work with GSSAPI/SASL enabled Quorum authn/z
> ---
>
> Key: ZOOKEEPER-3824
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3824
> Project: ZooKeeper
> Issue Type: Bug
> Components: kerberos, leaderElection, quorum, server
> Affects Versions: 3.5.6
> Environment: O.S.: RHEL7
> Reporter: Rajkiran Sura
> Priority: Major
>
> With the 'DynamicReconfig' feature in v3.5.6, servers should ideally be added and removed without restarting the ZooKeeper service on any of the nodes. But with Kerberos (GSSAPI via SASL) enabled quorum authentication/authorization, this is not possible: when you try to add a new server, it is unable to connect to any of the members in the ensemble and the data is not synced, because all the members reject it based on authorization. To make it work, we need to run 'reconfig', then restart the leader, the new member, and the rest of the members.
> Is this the expected behavior with Quorum-auth + DynamicReconfig? Or am I missing something here?
> This is our basic quorum-auth config:
> {quote}quorum.auth.serverRequireSasl=true
> quorum.auth.kerberos.servicePrincipal=zookeeper/_HOST
> quorum.auth.enableSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.learnerRequireSasl=true
> quorum.cnxn.threads.size=20
> quorum.auth.server.saslLoginContext=QuorumServer
> {quote}
> FTR: I raised this question in the [ZooKeeper-user forum|http://zookeeper-user.578899.n2.nabble.com/ZooKeeper-dynamic-reconfig-issue-when-Quorum-authn-authz-is-enabled-td7584927.html] and both Mate and Enrico suspect this to be a bug.
> Also, this is easily reproducible in a Kerberos (GSSAPI via SASL) enabled quorum-based ensemble.
>
> Regards,
> Rajkiran
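The workflow described above (reconfig first, then rolling restarts) looks roughly like the following from `zkCli.sh`. This is an illustrative fragment only: the server id (5) and hostname are placeholders, and it assumes a ZooKeeper 3.5+ ensemble with `reconfigEnabled` and the incremental reconfig syntax.

```
# Illustrative only: add a new participant via incremental dynamic reconfig.
# server id 5 and newnode.example.com are placeholders.
reconfig -add server.5=newnode.example.com:2888:3888;2181

# Per the report, with SASL/Kerberos quorum auth this is not sufficient:
# the new member is rejected on authorization until the leader, the new
# member, and the remaining members are restarted, which defeats the
# purpose of dynamic reconfig.
```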
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108259#comment-17108259 ] Rajkiran Sura commented on ZOOKEEPER-3814: -- Many thanks Mate, for looking into this. Glad that you could pin-point the problem. Regards, Rajkiran > ZooKeeper caching of config > --- > > Key: ZOOKEEPER-3814 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection, quorum, server >Affects Versions: 3.5.6 >Reporter: Rajkiran Sura >Assignee: Mate Szalay-Beko >Priority: Major > > Hello, > We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. > Encountered no issues as such. > This is how the ZooKeeper config looks like: > {quote}tickTime=2000 > dataDir=/zookeeper-data/ > initLimit=5 > syncLimit=2 > maxClientCnxns=2048 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > 4lw.commands.whitelist=stat, ruok, conf, isro, mntr > authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider > requireClientAuthScheme=sasl > quorum.cnxn.threads.size=20 > quorum.auth.enableSasl=true > quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST > quorum.auth.learnerRequireSasl=true > quorum.auth.learner.saslLoginContext=QuorumLearner > quorum.auth.serverRequireSasl=true > quorum.auth.server.saslLoginContext=QuorumServer > server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > server.22=node5.bar.com:2888:3888;2181 > {quote} > Post upgrade, we had to migrate server.22 on the same node, but with > *FOO*.bar.com domain name due to kerberos referral issues. And, we used > different server-identifier, i.e., *23* when we migrated. 
So, here is how the > new config looked like: > {quote}server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181* > {quote} > We restarted all the nodes in the ensemble with the above updated config. And > the migrated node joined the quorum successfully and was serving all clients > directly connected to it, without any issues. > Recently, when a leader election happened, > server.*23*=node5.foo.bar.com(migrated node) was chosen as Leader (as it has > highest ID). But then, ZooKeeper was unable to serve any clients and *all* > the servers were _somehow still_ trying to establish a channel to 22 (old DNS > name: node5.bar.com) and were throwing below error in a loop: > {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN > [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve > address: node4.bar.com}} > {{java.net.UnknownHostException: node5.bar.com: Name or service not known}} > {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}} > {{ at > java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}} > {{ at > java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}} > {{ at > java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}} > {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}} > {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}} > {{ at > 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}} > {{ at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}} > {{ at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}} > {{ at java.base/java.lang.Thread.run(Thread.java:834)}} > {{2020-05-02 01:43:03,026 [myid:23] - WARN > [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at > election address node5.bar.com:3888}} > {{java.net.UnknownHostException: node5.bar.com}} > {{ at > java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}} > {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}} > {{ at java.base/java.net.Socket.connect(Socket.java:591)}} > {{ at > o
[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108258#comment-17108258 ] benwang li edited comment on ZOOKEEPER-3829 at 5/15/20, 1:04 PM: - [~symat] The logs are in:

{code:java}
line 307: 2020-05-14 18:04:12,022 [myid:4] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@362] - Shutting down
line 319: 2020-05-14 18:04:29,000 [myid:4] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@256] - Configuring CommitProcessor with 24 worker threads.
{code}

Thanks for your feedback. I reproduced it without docker; I will try to reproduce it with docker-compose.
[jira] [Updated] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benwang li updated ZOOKEEPER-3829: -- Attachment: screenshot-1.png
[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108113#comment-17108113 ] Mate Szalay-Beko commented on ZOOKEEPER-3829: - Did you actually see this printout ["Shutting down"|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L632] in the logs before the {{start}} method was called on the same CommitProcessor? I see this one in your logs:

{code:java}
310 2020-05-14 18:04:12,023 [myid:4] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FinalRequestProcessor@514] - shutdown of request processor complete
{code}

But this is about shutting down the {{FinalRequestProcessor}}, not the {{CommitProcessor}}.
[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108106#comment-17108106 ] Mate Szalay-Beko commented on ZOOKEEPER-3829: - I failed to reproduce your case. I created docker compose files ([https://github.com/symat/zookeeper-docker-test]) and, using 3.5.6, I executed these steps:
* start A,B,C with config (A,B,C)
* start D with config (A,B,C,D)
* stop A
* start A with config (A,B,C,D)
* stop B
* start B with config (A,B,C,D)
* stop C
* start C with config (A,B,C,D)

At the end, everything worked for me just fine: the leader was D, all nodes were up, forming a quorum (A,B,C,D), and zkCli worked ({{ls /}}). There must be some difference between your reproduction and mine. Can you please share your zoo.cfg? Mine looks like:

{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=5
syncLimit=2
autopurge.snapRetainCount=3
autopurge.purgeInterval=0
maxClientCnxns=60
standaloneEnabled=true
admin.enableServer=true
localSessionsEnabled=true
localSessionsUpgradingEnabled=true
4lw.commands.whitelist=stat, ruok, conf, isro, wchc, wchp, srvr, mntr, cons
clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
# server host/port config in my case (when I have 4 nodes)
server.1=zoo1:2888:3888;2181
server.2=zoo2:2888:3888;2181
server.3=zoo3:2888:3888;2181
server.4=zoo4:2888:3888;2181
{code}

I checked the log file you uploaded, but I don't really see why you think the problem is with the CommitProcessor. Maybe I am missing something. Is this the full log file from your D node?

Also, I checked the code. I think the {{CommitProcessor}} class should never be reused after {{shutdown()}} is called. After a new leader election, a new {{LeaderZooKeeperServer}} / {{FollowerZooKeeperServer}} / {{ObserverZooKeeperServer}} object will be created (depending on the role of the given server), with a fresh {{CommitProcessor}} and a new {{workerPool}}. So AFAICT (based only on a high-level look at the code) it shouldn't really matter to set {{workerPool=null}} in the shutdown method. But maybe I just don't follow your reasoning, or I missed something in the code. Feel free to create a PR, then we can see what you suggest. Did you try your proposed fix already, and did you see that it solves your original issue?
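The fix direction benwang li proposes (nulling out the pool in `shutdown()` so a later `start()` rebuilds it) can be sketched as follows. The class and field names are illustrative, not the real ZooKeeper ones; the point is only the lifecycle: dropping the dead pool reference forces every restart to get a fresh, working pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the proposed fix (not ZooKeeper code).
class SketchProcessor {
    private ExecutorService workerPool;

    void start() {
        if (workerPool == null) {
            // A fresh pool is built on every start that follows a shutdown.
            workerPool = Executors.newFixedThreadPool(2);
        }
    }

    void shutdown() {
        if (workerPool != null) {
            workerPool.shutdown();
            workerPool = null; // proposed change: drop the dead pool reference
        }
    }

    boolean accepting() {
        return workerPool != null && !workerPool.isShutdown();
    }
}

public class Main {
    public static void main(String[] args) {
        SketchProcessor p = new SketchProcessor();
        p.start();
        p.shutdown();
        p.start(); // restart: a new pool replaces the shut-down one
        System.out.println("accepting after restart: " + p.accepting());
        p.shutdown();
    }
}
```

Note this only matters if the same processor object is ever restarted after shutdown; as Mate argues above, if a fresh CommitProcessor is always created after leader election, the null-out is harmless but redundant.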
[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108046#comment-17108046 ] benwang li edited comment on ZOOKEEPER-3829 at 5/15/20, 8:04 AM: - We start `CommitProcessor` [here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L455] . We shutdown `CommitProcessor` [here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L637]. But when we call `start` method again, the `workerPool` will not work anymore. I submit the node D logs attachment `d.log`, and we can see that happens. {code:java} 308 2020-05-14 18:04:12,022 [myid:4] - INFO [FollowerRequestProcessor:4:FollowerRequestProcessor@110] - FollowerRequestProcessor exited loop! 309 2020-05-14 18:04:12,022 [myid:4] - INFO [CommitProcessor:4:CommitProcessor@195] - CommitProcessor exited loop! 
310 2020-05-14 18:04:12,023 [myid:4] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FinalRequestProcessor@514] - shutdown of request processor complete 311 2020-05-14 18:04:12,024 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@655] - Created new input stream /data1/zookeeper/logs/version-2/log.2a000b 312 2020-05-14 18:04:12,024 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@658] - Created new input archive /data1/zookeeper/logs/version-2/log.2a000b 313 2020-05-14 18:04:12,024 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@696] - EOF exception java.io.EOFException: Failed to read /data1/zookeeper/logs/version-2/log.2a000b 314 -- 315 2020-05-14 18:04:29,000 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274] - Adding session 0x3082f5048fc 316 2020-05-14 18:04:29,000 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274] - Adding session 0x40a33f8f3f40002 317 2020-05-14 18:04:29,000 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274] - Adding session 0x40a33f8f3f4 318 2020-05-14 18:04:29,000 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274] - Adding session 0x40a33f8f3f40001 319 2020-05-14 18:04:29,000 [myid:4] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@256] - Configuring CommitProcessor with 24 worker threads. 
320 2020-05-14 18:04:29,002 [myid:4] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):ContainerManager@64] - Using checkIntervalMs=6 maxPerMinute=1 321 2020-05-14 18:04:29,003 [myid:4] - DEBUG [LearnerHandler-/146.196.79.232:38708:LearnerHandler@534] - Sending UPTODATE message to 3 {code} was (Author: sundyli): We start `CommitProcessor` [here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L455] . We shutdown `CommitProcessor` [here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L637]. But when we call `start` method again, the `workerPool` will work anymore. I submit the node D logs attachment `d.log`, and we can see that happens. {code:java} 308 2020-05-14 18:04:12,022 [myid:4] - INFO [FollowerRequestProcessor:4:FollowerRequestProcessor@110] - FollowerRequestProcessor exited loop! 309 2020-05-14 18:04:12,022 [myid:4] - INFO [CommitProcessor:4:CommitProcessor@195] - CommitProcessor exited loop! 310 2020-05-14 18:04:12,023 [myid:4] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FinalRequestProcessor@514] - shutdown of request processor complete 311 2020-05-14 18:04:12,024 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@655] - Created new input stream /data1/zookeeper/logs/version-2/log.2a000b 312 2020-05-14 18:04:12,024 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@658] - Created new input archive /data1/zookeeper/logs/version-2/log.2a000b 313 2020-05-14 18:04:12,024 [myid:4] - DEBUG [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@696] - EOF exception java.io.EOFException: Failed to read /data1/zookeep
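The lifecycle problem described in the comment above can be sketched with a minimal stand-alone class (class and method names here are hypothetical, not the actual ZooKeeper `CommitProcessor` code): once a `java.util.concurrent.ExecutorService` has been shut down it permanently rejects new work, so a restart must create a fresh pool rather than reuse the old reference.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-in for the CommitProcessor/workerPool lifecycle;
// illustrative only, not the real ZooKeeper code.
class Processor {
    private ExecutorService workerPool;

    void start() {
        // A shut-down ExecutorService can never run work again, so a
        // restart must create a fresh pool instead of reusing the stale
        // reference. Skipping this check reproduces the reported symptom.
        if (workerPool == null || workerPool.isShutdown()) {
            workerPool = Executors.newFixedThreadPool(4);
        }
    }

    boolean submit(Runnable task) {
        try {
            workerPool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false; // a dead pool silently rejects all new work
        }
    }

    void shutdown() {
        workerPool.shutdown();
    }
}
```

If `start()` merely reused the old `workerPool` after a `shutdown()`/`start()` cycle, every commit submitted afterwards would be rejected while the server itself still looks healthy, matching the hang described in the issue.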
[jira] [Updated] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benwang li updated ZOOKEEPER-3829: -- Attachment: d.log > Zookeeper refuses request after node expansion > -- > > Key: ZOOKEEPER-3829 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.6 >Reporter: benwang li >Priority: Major > Attachments: d.log > > > It's easy to reproduce this bug. > {code:java} > //代码占位符 > > Step 1. Deploy 3 nodes A,B,C with configuration A,B,C . > Step 2. Deploy node ` D` with configuration `A,B,C,D` , cluster state is ok > now. > Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will > be D, cluster hangs, but it can accept `mntr` command, other command like `ls > /` will be blocked. > Step 4. Restart nodes D, cluster state is back to normal now. > > {code} > > We have looked into the code of 3.5.6 version, and we found it may be the > issue of `workerPool` . > The `CommitProcessor` shutdown and make `workerPool` shutdown, but > `workerPool` still exists. It will never work anymore, yet the cluster still > thinks it's ok. > > I think the bug may still exist in master branch. > We have tested it in our machines by reset the `workerPool` to null. If it's > ok, please assign this issue to me, and then I'll create a PR. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108046#comment-17108046 ] benwang li commented on ZOOKEEPER-3829: --- We start `CommitProcessor` [here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L455]. We shut down `CommitProcessor` [here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L637]. But when we call the `start` method again, the `workerPool` will not work anymore. I will submit the node D logs, where we can see this happen. > Zookeeper refuses request after node expansion > -- > > Key: ZOOKEEPER-3829 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.6 >Reporter: benwang li >Priority: Major > > It's easy to reproduce this bug. > {code:java} > // code placeholder > > Step 1. Deploy 3 nodes A,B,C with configuration A,B,C . > Step 2. Deploy node ` D` with configuration `A,B,C,D` , cluster state is ok > now. > Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will > be D, cluster hangs, but it can accept `mntr` command, other command like `ls > /` will be blocked. > Step 4. Restart nodes D, cluster state is back to normal now. > > {code} > > We have looked into the code of 3.5.6 version, and we found it may be the > issue of `workerPool` . > The `CommitProcessor` shutdown and make `workerPool` shutdown, but > `workerPool` still exists. It will never work anymore, yet the cluster still > thinks it's ok. > > I think the bug may still exist in master branch. > We have tested it in our machines by reset the `workerPool` to null. If it's > ok, please assign this issue to me, and then I'll create a PR.
> > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benwang li updated ZOOKEEPER-3829: -- Description: It's easy to reproduce this bug. {code:java} //代码占位符 Step 1. Deploy 3 nodes A,B,C with configuration A,B,C . Step 2. Deploy node ` D` with configuration `A,B,C,D` , cluster state is ok now. Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will be D, cluster hangs, but it can accept `mntr` command, other command like `ls /` will be blocked. Step 4. Restart nodes D, cluster state is back to normal now. {code} We have looked into the code of 3.5.6 version, and we found it may be the issue of `workerPool` . The `CommitProcessor` shutdown and make `workerPool` shutdown, but `workerPool` still exists. It will never work anymore, yet the cluster still thinks it's ok. I think the bug may still exist in master branch. We have tested it in our machines by reset the `workerPool` to null. If it's ok, please assign this issue to me, and then I'll create a PR. was: It's easy to reproduce this bug. {code:java} //代码占位符 Step 1. Deploy 3 nodes A,B,C with configuration A,B,C . Step 2. Deploy node ` D` with configuration `A,B,C,D` , cluster state is ok now. Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will be D, cluster hangs. Step 4. Restart nodes D, cluster state is back to normal now. {code} We have looked into the code of 3.5.6 version, and we found it may be the issue of `workerPool` . The `CommitProcessor` shutdown and make `workerPool` shutdown, but `workerPool` still exists. It will never work anymore, yet the cluster still thinks it's ok. I think the bug may still exist in master branch. We have tested it in our machines by reset the `workerPool` to null. If it's ok, please assign this issue to me, and then I'll create a PR. 
> Zookeeper refuses request after node expansion > -- > > Key: ZOOKEEPER-3829 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.6 >Reporter: benwang li >Priority: Major > > It's easy to reproduce this bug. > {code:java} > //代码占位符 > > Step 1. Deploy 3 nodes A,B,C with configuration A,B,C . > Step 2. Deploy node ` D` with configuration `A,B,C,D` , cluster state is ok > now. > Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will > be D, cluster hangs, but it can accept `mntr` command, other command like `ls > /` will be blocked. > Step 4. Restart nodes D, cluster state is back to normal now. > > {code} > > We have looked into the code of 3.5.6 version, and we found it may be the > issue of `workerPool` . > The `CommitProcessor` shutdown and make `workerPool` shutdown, but > `workerPool` still exists. It will never work anymore, yet the cluster still > thinks it's ok. > > I think the bug may still exist in master branch. > We have tested it in our machines by reset the `workerPool` to null. If it's > ok, please assign this issue to me, and then I'll create a PR. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ZOOKEEPER-3828) zookeeper CLI client gets connection timeout when the leader is restarted
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko reassigned ZOOKEEPER-3828: --- Assignee: (was: Mate Szalay-Beko) > zookeeper CLI client gets connection timeout when the leader is restarted > -- > > Key: ZOOKEEPER-3828 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3828 > Project: ZooKeeper > Issue Type: Bug > Components: java client >Affects Versions: 3.6.1 >Reporter: Aishwarya Soni >Priority: Minor > > I have configured a 5-node zookeeper cluster using version 3.6.1 in a docker > containerized environment. As part of some destructive testing, I restarted > the zookeeper leader. Re-election happened and all 5 nodes (containers) are > back in a good state with a new leader. But when I log in to one of the containers, > go inside zkCli (./zkCli.sh) and run the cmd *ls /* I see the below error: > [zk: localhost:2181(CONNECTING) 1] > [zk: localhost:2181(CONNECTING) 1] ls / > 2020-05-14 23:48:26,556 [myid:localhost:2181] - WARN > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session > timed out, have not heard from server in 30001ms for session id 0x0 > 2020-05-14 23:48:26,556 [myid:localhost:2181] - WARN > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 > for sever localhost/127.0.0.1:2181, Closing socket connection.
Attempting > reconnect except it is a SessionExpiredException. > org.apache.zookeeper.ClientCnxn$SessionTimeoutException: > Client session timed out, have not heard from server in 30001ms for session > id 0x0 > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230) > KeeperErrorCode = ConnectionLoss for / > [zk: localhost:2181(CONNECTING) 2] 2020-05-14 23:48:28,089 > [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1154] - Opening socket > connection to server localhost/127.0.0.1:2181. > 2020-05-14 23:48:28,089 [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1156] - SASL config > status: Will not attempt to authenticate using SASL (unknown error) > 2020-05-14 23:48:28,090 [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@986] - Socket > connection established, initiating session, client: /127.0.0.1:60384, server: > localhost/127.0.0.1:2181 > 2020-05-14 23:48:58,119 [myid:localhost:2181] - WARN > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session > timed out, have not heard from server in 30030ms for session id 0x0 > 2020-05-14 23:48:58,120 [myid:localhost:2181] - WARN > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 > for sever localhost/127.0.0.1:2181, Closing socket connection.
Attempting > reconnect except it is a SessionExpiredException. > org.apache.zookeeper.ClientCnxn$SessionTimeoutException: > Client session timed out, have not heard from server in 30030ms for session > id 0x0 > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230) > 2020-05-14 23:49:00,003 [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1154] - Opening socket > connection to server localhost/127.0.0.1:2181. > 2020-05-14 23:49:00,004 [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1156] - SASL config > status: Will not attempt to authenticate using SASL (unknown error) > 2020-05-14 23:49:00,004 [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@986] - Socket > connection established, initiating session, client: /127.0.0.1:32936, server: > localhost/127.0.0.1:2181 > 2020-05-14 23:49:30,032 [myid:localhost:2181] - WARN > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session > timed out, have not heard from server in 30029ms for session id 0x0 > 2020-05-14 23:49:30,033 [myid:localhost:2181] - WARN > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 > for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting > reconnect except it is a SessionExpiredException. > org.apache.zookeeper.ClientCnxn$SessionTimeoutException: > Client session timed out,
[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108023#comment-17108023 ] benwang li commented on ZOOKEEPER-3829: --- [~eolivelli] I think it's the same even on the latest release, but I haven't tested it there. My workmates and I can reproduce it on version 3.5.6 every time; how should I create a reproducer test? > Zookeeper refuses request after node expansion > -- > > Key: ZOOKEEPER-3829 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.6 >Reporter: benwang li >Priority: Major > > It's easy to reproduce this bug. > {code:java} > // code placeholder > > Step 1. Deploy 3 nodes A,B,C with configuration A,B,C . > Step 2. Deploy node ` D` with configuration `A,B,C,D` , cluster state is ok > now. > Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will > be D, cluster hangs. > Step 4. Restart nodes D, cluster state is back to normal now. > > {code} > > We have looked into the code of 3.5.6 version, and we found it may be the > issue of `workerPool` . > The `CommitProcessor` shutdown and make `workerPool` shutdown, but > `workerPool` still exists. It will never work anymore, yet the cluster still > thinks it's ok. > > I think the bug may still exist in master branch. > We have tested it in our machines by reset the `workerPool` to null. If it's > ok, please assign this issue to me, and then I'll create a PR. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3822) Zookeeper 3.6.1 EndOfStreamException
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108025#comment-17108025 ] Mate Szalay-Beko commented on ZOOKEEPER-3822: - {quote}And from there all servers just report ... and don't recover. {quote} what do you mean by "don't recover"? Were the servers unreachable at this point? The exception you pasted only shows that some clients closed the connection to the ZooKeeper server. > Zookeeper 3.6.1 EndOfStreamException > > > Key: ZOOKEEPER-3822 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3822 > Project: ZooKeeper > Issue Type: Bug >Affects Versions: 3.6.1 >Reporter: Sebastian Schmitz >Priority: Critical > Attachments: kafka.log, kafka_test.log, zookeeper.log, > zookeeper_test.log > > > Hello, > after Zookeeper 3.6.1 solved the issue with leader-election containing the IP > and so causing it to fail in separate networks, like in our docker-setup I > updated from 3.4.14 to 3.6.1 in Dev- and Test-Environments. It all went > smoothly and ran for one day. This night I had a new Update of the > environment as we deploy as a whole package of all containers (Kafka, > Zookeeper, Mirrormaker etc.) we also replace the Zookeeper-Containers with > latest ones. In this case, there was no change, the containers were just > removed and deployed again. As the config and data of zookeeper is not stored > inside the containers that's not a problem but this night it broke the whole > clusters of Zookeeper and so also Kafka was down. 
> * zookeeper_node_1 was stopped and the container removed and created again > * zookeeper_node_1 starts up and the election takes place > * zookeeper_node_2 is elected as leader again > * zookeeper_node_2 is stopped and the container removed and created again > * zookeeper_node_3 is elected as the leader while zookeeper_node_2 is down > * zookeeper_node_2 starts up and zookeeper_node_3 remains leader > And from there all servers just report > 2020-05-07 14:07:57,187 [myid:3] - WARN > [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception2020-05-07 > 14:07:57,187 [myid:3] - WARN [NIOWorkerThread-2:NIOServerCnxn@364] - > Unexpected exceptionEndOfStreamException: Unable to read additional data from > client, it probably closed the socket: address = /z.z.z.z:46060, session = > 0x2014386bbde at > org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163) > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326) at > org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) > at > org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) > at java.base/java.lang.Thread.run(Unknown Source) > and don't recover. > I was able to recover the cluster in Test-Environment by stopping and > starting all the zookeeper-nodes. The cluster in dev is still in that state > and I'm checking the logs to find out more... > The full logs of the deployment of Zookeeper and Kafka that started at 02:00 > are attached. The first time in local NZ-time and the second one is UTC. the > IPs I replaced are x.x.x.x for node_1, y.y.y.y for node_2 and z.z.z.z for > node_3 > The Kafka-Servers are running on the same machine. 
Which means that the > EndOfStreamEceptions could also be connections from Kafka as I don't think > that zookeeper_node_3 establish a session with itself? > > Edit: > I just found some interesting log from Test-Environment: > zookeeper_node_1: 2020-05-07 14:10:29,418 [myid:1] INFO > [NIOWorkerThread-6:ZooKeeperServer@1375] Refusing session request for client > /f.f.f.f:42012 as it has seen zxid 0xc6 our last zxid is 0xc528f8 > client must try another server > zookeeper_node_2: 2020-05-07 14:10:29,680 [myid:2] INFO > [NIOWorkerThread-4:ZooKeeperServer@1375] Refusing session request for client > /f.f.f.f:51506 as it has seen zxid 0xc6 our last zxid is 0xc528f8 > client must try another server > These entried are repeated there before the EndOfStreamException shows up... > I found that was set by zookeeper_node_3: > zookeeper_node_3: 2020-05-07 14:09:44,495 [myid:3] INFO > [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):Leader@1501] Have > quorum of supporters, sids: [[1, 3],[1, 3]]; starting up and setting last > processed zxid: 0xc6 > zookeeper_node_3: 2020-05-07 14:10:12,587 [myid:3] INFO > [LearnerHandler-/z.z.z.z:60156:LearnerHandler@800] Synchronizing with Learner > sid: 2 maxCommittedLog=0xc528f8 minCommittedLog=0xc52704 >
[jira] [Commented] (ZOOKEEPER-3822) Zookeeper 3.6.1 EndOfStreamException
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108017#comment-17108017 ] Mate Szalay-Beko commented on ZOOKEEPER-3822: - I haven't gone deep into the logs, but I see many errors in the zookeeper_test server logs, like: {code:java} May 08 02:11:15 zookeeper_node_2: 2020-05-07 14:11:15,265 [myid:2] - INFO [NIOWorkerThread-2:ZooKeeperServer@1375] - Refusing session request for client /z2.z2.z2.z2:51826 as it has seen zxid 0xc6 our last zxid is 0xc528f8 client must try another server {code} It indicates that the server (myid=2) didn't catch up with the leader (myid=3) yet. I am not sure if it is a bug, or if it is simply caused by traffic on the cluster while the restarts happened too quickly one after another. Is this case reproducible? > Zookeeper 3.6.1 EndOfStreamException > > > Key: ZOOKEEPER-3822 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3822 > Project: ZooKeeper > Issue Type: Bug >Affects Versions: 3.6.1 >Reporter: Sebastian Schmitz >Priority: Critical > Attachments: kafka.log, kafka_test.log, zookeeper.log, > zookeeper_test.log > > > Hello, > after Zookeeper 3.6.1 solved the issue with leader-election containing the IP > and so causing it to fail in separate networks, like in our docker-setup I > updated from 3.4.14 to 3.6.1 in Dev- and Test-Environments. It all went > smoothly and ran for one day. This night I had a new Update of the > environment as we deploy as a whole package of all containers (Kafka, > Zookeeper, Mirrormaker etc.) we also replace the Zookeeper-Containers with > latest ones. In this case, there was no change, the containers were just > removed and deployed again. As the config and data of zookeeper is not stored > inside the containers that's not a problem but this night it broke the whole > clusters of Zookeeper and so also Kafka was down. 
> * zookeeper_node_1 was stopped and the container removed and created again > * zookeeper_node_1 starts up and the election takes place > * zookeeper_node_2 is elected as leader again > * zookeeper_node_2 is stopped and the container removed and created again > * zookeeper_node_3 is elected as the leader while zookeeper_node_2 is down > * zookeeper_node_2 starts up and zookeeper_node_3 remains leader > And from there all servers just report > 2020-05-07 14:07:57,187 [myid:3] - WARN > [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception2020-05-07 > 14:07:57,187 [myid:3] - WARN [NIOWorkerThread-2:NIOServerCnxn@364] - > Unexpected exceptionEndOfStreamException: Unable to read additional data from > client, it probably closed the socket: address = /z.z.z.z:46060, session = > 0x2014386bbde at > org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163) > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326) at > org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) > at > org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) > at java.base/java.lang.Thread.run(Unknown Source) > and don't recover. > I was able to recover the cluster in Test-Environment by stopping and > starting all the zookeeper-nodes. The cluster in dev is still in that state > and I'm checking the logs to find out more... > The full logs of the deployment of Zookeeper and Kafka that started at 02:00 > are attached. The first time in local NZ-time and the second one is UTC. the > IPs I replaced are x.x.x.x for node_1, y.y.y.y for node_2 and z.z.z.z for > node_3 > The Kafka-Servers are running on the same machine. 
Which means that the > EndOfStreamEceptions could also be connections from Kafka as I don't think > that zookeeper_node_3 establish a session with itself? > > Edit: > I just found some interesting log from Test-Environment: > zookeeper_node_1: 2020-05-07 14:10:29,418 [myid:1] INFO > [NIOWorkerThread-6:ZooKeeperServer@1375] Refusing session request for client > /f.f.f.f:42012 as it has seen zxid 0xc6 our last zxid is 0xc528f8 > client must try another server > zookeeper_node_2: 2020-05-07 14:10:29,680 [myid:2] INFO > [NIOWorkerThread-4:ZooKeeperServer@1375] Refusing session request for client > /f.f.f.f:51506 as it has seen zxid 0xc6 our last zxid is 0xc528f8 > client must try another server > These entried are repeated there before the EndOfStreamException shows up... > I found that was set by zookeeper_node_3: > zookeeper_node_3: 2020-05-07 14:09:44,495 [myid:3] INFO
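The "Refusing session request ... client must try another server" lines quoted above reflect a simple safety rule: a server must not accept a client whose last-seen zxid is newer than the server's own last processed zxid, otherwise the client could observe state moving backwards. A minimal sketch of that comparison (hypothetical class and method names, not the actual `ZooKeeperServer` code):

```java
// Sketch of the session-acceptance rule behind the quoted log lines:
// a lagging server rejects clients that have already seen a newer zxid.
class SessionAcceptCheck {
    static boolean accepts(long clientLastSeenZxid, long serverLastZxid) {
        // Accept only if this server is at least as up to date as the client.
        return clientLastSeenZxid <= serverLastZxid;
    }
}
```

In the scenario above, the restarted follower's last zxid lagged behind what the clients had seen, so it kept refusing them until it synced with the leader.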
[jira] [Resolved] (ZOOKEEPER-3690) Improving leader efficiency via not processing learner's requests in commit processor
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Olivelli resolved ZOOKEEPER-3690. Fix Version/s: 3.7.0 Resolution: Fixed Committed to master branch. Thank you [~lvfangmin]! > Improving leader efficiency via not processing learner's requests in commit > processor > - > > Key: ZOOKEEPER-3690 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3690 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Fangmin Lv >Assignee: Fangmin Lv >Priority: Minor > Labels: pull-request-available > Fix For: 3.7.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Currently, all the requests forwarded from learners are processed like > the locally received requests from the leader's clients, which takes non-trivial > effort and is unnecessary: those requests do not need the session queue > create/destroy in CommitProcessor. > To improve efficiency, we can skip processing those requests in the > leader's commit processor. Based on the benchmark, this optimization improved > maximum write throughput by around 30% for a read/write mixed workload. -- This message was sent by Atlassian Jira (v8.3.4#803005)
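A rough model of the optimization resolved here (hypothetical names; the real change lives in the leader's CommitProcessor): requests forwarded by learners bypass the per-session queue bookkeeping that the leader's own client requests need.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical model of the optimization: only locally received requests
// get per-session queues on the leader; learner-forwarded requests are
// passed straight through without queue create/destroy overhead.
class LeaderCommitRouter {
    final Map<Long, Queue<String>> sessionQueues = new HashMap<>();
    int passedThrough = 0;

    void route(long sessionId, String request, boolean fromLearner) {
        if (fromLearner) {
            passedThrough++; // committed and returned without bookkeeping
        } else {
            sessionQueues
                .computeIfAbsent(sessionId, k -> new ArrayDeque<>())
                .add(request); // local clients keep ordered session queues
        }
    }
}
```

The design point is that ordering for a learner's clients is already enforced on the learner itself, so repeating the session bookkeeping on the leader only adds cost.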