[jira] [Commented] (ZOOKEEPER-3828) zookeeper CLI client gets connection timeout when the leader is restarted
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111665#comment-17111665 ] Aishwarya Soni commented on ZOOKEEPER-3828: --- I have one more observation to add. I tested the leader-kill scenario on versions 3.4.12 and 3.5.5 and it all went smoothly: I had no issues opening the zookeeper terminal inside the container and accessing the znodes, either after the quorum re-formed or while the election was still in progress. What changed between 3.5.5 and 3.6.1 in this area, I have no clue yet. I agree with you that it should find a working server and connect to it, but in this case it is unable to do so (see the client sketch after this entry). Can you test in the dockerized env of zookeeper? Steps: # deploy a 5-node zookeeper 3.6.1 cluster # docker stop the container running in leader mode # ssh to any container within the quorum # log in to the terminal *./bin/zkCli.sh* # run *ls /* > zookeeper CLI client gets connection timeout when the leader is restarted > -- > > Key: ZOOKEEPER-3828 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3828 > Project: ZooKeeper > Issue Type: Bug > Components: java client >Affects Versions: 3.6.1 >Reporter: Aishwarya Soni >Priority: Minor > > I have configured 5 nodes zookeeper cluster using 3.6.1 version in a docker > containerized environment. As a part of some destructive testing, I restarted > zookeeper leader. Now, re-election happened and all 5 nodes (containers) are > back in good state with new leader. But when I login to one of the container > and go inside zk Cli (./zkCli.sh) and run the cmd *ls /* I see below error, > {color:#00} {color} > *{color:#00}[zk: localhost:2181(CONNECTING) 1]{color}* > *{color:#00}[zk: localhost:2181(CONNECTING) 1] ls /{color}* > *{color:#00}2020-05-14 23:48:26,556 [myid:localhost:2181] - WARN > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session > timed out, have not heard from server in 30001ms for session id 0x0{color}* > *{color:#00}2020-05-14 23:48:26,556 [myid:localhost:2181] - WARN > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 > for sever localhost/127.0.0.1:2181, Closing socket connection.
Attempting > reconnect except it is a SessionExpiredException.{color}* > *{color:#00}org.apache.zookeeper.ClientCnxn$SessionTimeoutException: > Client session timed out, have not heard from server in 30001ms for session > id 0x0{color}* > *{color:#00}at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230){color}* > *{color:#00}KeeperErrorCode = ConnectionLoss for /{color}* > *{color:#00}[zk: localhost:2181(CONNECTING) 2] 2020-05-14 23:48:28,089 > [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1154] - Opening socket > connection to server localhost/127.0.0.1:2181.{color}* > *{color:#00}2020-05-14 23:48:28,089 [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1156] - SASL config > status: Will not attempt to authenticate using SASL (unknown error){color}* > *{color:#00}2020-05-14 23:48:28,090 [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@986] - Socket > connection established, initiating session, client: /127.0.0.1:60384, server: > localhost/127.0.0.1:2181{color}* > *{color:#00}2020-05-14 23:48:58,119 [myid:localhost:2181] - WARN > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session > timed out, have not heard from server in 30030ms for session id 0x0{color}* > *{color:#00}2020-05-14 23:48:58,120 [myid:localhost:2181] - WARN > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 > for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting > reconnect except it is a SessionExpiredException.{color}* > *{color:#00}org.apache.zookeeper.ClientCnxn$SessionTimeoutException: > Client session timed out, have not heard from server in 30030ms for session > id 0x0{color}* > *{color:#00}at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230){color}* > *{color:#00}2020-05-14 23:49:00,003 [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1154] - Opening socket > connection to server localhost/127.0.0.1:2181.{color}* > *{color:#00}2020-05-14 23:49:00,004 [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@1156] - SASL config > status: Will not attempt to authenticate using SASL (unknown error){color}* > *{color:#00}2020-05-14 23:49:00,004 [myid:localhost:2181] - INFO > [main-SendThread(localhost:2181):ClientCnxn$SendThread@986] - Socket > connection
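The failover expectation discussed in the comment above can be illustrated with a minimal client sketch (not taken from the ticket): when the connect string lists every ensemble member, losing one server should only cost the client a reconnect to another member, not a session timeout against a single address. Host names below are illustrative.
{code:java}
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

public class ConnectStringSketch {
    public static void main(String[] args) throws Exception {
        // All five (hypothetical) members are listed, so the client can fail over
        // when one of them, e.g. the leader, is stopped.
        String connectString = "zoo1:2181,zoo2:2181,zoo3:2181,zoo4:2181,zoo5:2181";
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(connectString, 30000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();                        // wait until some server accepted us
        System.out.println(zk.getChildren("/", false));
        zk.close();
    }
}
{code}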
[jira] [Comment Edited] (ZOOKEEPER-3756) Members failing to rejoin quorum
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111589#comment-17111589 ] Dai Shi edited comment on ZOOKEEPER-3756 at 5/19/20, 10:28 PM: --- Hi Mate, I just wanted to report back after testing 3.5.8. I am happy to say that it seems to work well after a brief bit of testing. I am no longer setting {{-Dzookeeper.cnxTimeout=500}}, and now when I roll the leader the cluster downtime is only 2-3 seconds instead of 30+ seconds. Thanks again for helping me debug and creating this fix! was (Author: dshi): Hi Mate, I just wanted to report back after testing 3.5.8. I am happy to say that it seems to work well after a brief bit of testing. I am no longer setting `-Dzookeeper.cnxTimeout=500`, and now when I roll the leader the cluster downtime is only 2-3 seconds instead of 30+ seconds. Thanks again for helping me debug and creating this fix! > Members failing to rejoin quorum > > > Key: ZOOKEEPER-3756 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection >Affects Versions: 3.5.6, 3.5.7 >Reporter: Dai Shi >Assignee: Mate Szalay-Beko >Priority: Major > Labels: pull-request-available > Fix For: 3.6.1, 3.5.8 > > Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh, > jmx.yaml, zoo-0.log, zoo-1.log, zoo-2.log, zoo-service.yaml, zookeeper.yaml > > Time Spent: 3.5h > Remaining Estimate: 0h > > Not sure if this is the place to ask, please close if it's not. > I am seeing some behavior that I can't explain since upgrading to 3.5: > In a 5 member quorum, when server 3 is the leader and each server has this in > their configuration: > {code:java} > server.1=100.71.255.254:2888:3888:participant;2181 > server.2=100.71.255.253:2888:3888:participant;2181 > server.3=100.71.255.252:2888:3888:participant;2181 > server.4=100.71.255.251:2888:3888:participant;2181 > server.5=100.71.255.250:2888:3888:participant;2181{code} > If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in > the logs: > {code:java} > 2020-03-11 20:23:35,720 [myid:2] - INFO > [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - > LOOKING > 2020-03-11 20:23:35,721 [myid:2] - INFO > [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885] > - New election. 
My id = 2, proposed zxid=0x1b8005f4bba > 2020-03-11 20:23:35,733 [myid:2] - INFO > [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, > so dropping the connection: (3, 2) > 2020-03-11 20:23:35,734 [myid:2] - INFO > [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection > request 100.126.116.201:36140 > 2020-03-11 20:23:35,735 [myid:2] - INFO > [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, > so dropping the connection: (4, 2) > 2020-03-11 20:23:35,740 [myid:2] - INFO > [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, > so dropping the connection: (5, 2) > 2020-03-11 20:23:35,740 [myid:2] - INFO > [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection > request 100.126.116.201:36142 > 2020-03-11 20:23:35,740 [myid:2] - INFO > [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message > format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING > (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config > version) > 2020-03-11 20:23:35,742 [myid:2] - WARN > [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting > for message on queue > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > at > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131) > 2020-03-11 20:23:35,744 [myid:2] - WARN > [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread > id 3 my id = 2 > 2020-03-11 20:23:35,745 [myid:2] - WARN > [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting > SendWorker{code} > The only way I can seem to get them to rejoin the quorum is to restart t
[jira] [Commented] (ZOOKEEPER-3756) Members failing to rejoin quorum
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111589#comment-17111589 ] Dai Shi commented on ZOOKEEPER-3756: Hi Mate, I just wanted to report back after testing 3.5.8. I am happy to say that it seems to work well after a brief bit of testing. I am no longer setting `-Dzookeeper.cnxTimeout=500`, and now when I roll the leader the cluster downtime is only 2-3 seconds instead of 30+ seconds. Thanks again for helping me debug and creating this fix! > Members failing to rejoin quorum > > > Key: ZOOKEEPER-3756 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection >Affects Versions: 3.5.6, 3.5.7 >Reporter: Dai Shi >Assignee: Mate Szalay-Beko >Priority: Major > Labels: pull-request-available > Fix For: 3.6.1, 3.5.8 > > Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh, > jmx.yaml, zoo-0.log, zoo-1.log, zoo-2.log, zoo-service.yaml, zookeeper.yaml > > Time Spent: 3.5h > Remaining Estimate: 0h > > Not sure if this is the place to ask, please close if it's not. > I am seeing some behavior that I can't explain since upgrading to 3.5: > In a 5 member quorum, when server 3 is the leader and each server has this in > their configuration: > {code:java} > server.1=100.71.255.254:2888:3888:participant;2181 > server.2=100.71.255.253:2888:3888:participant;2181 > server.3=100.71.255.252:2888:3888:participant;2181 > server.4=100.71.255.251:2888:3888:participant;2181 > server.5=100.71.255.250:2888:3888:participant;2181{code} > If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in > the logs: > {code:java} > 2020-03-11 20:23:35,720 [myid:2] - INFO > [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - > LOOKING > 2020-03-11 20:23:35,721 [myid:2] - INFO > [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885] > - New election. 
My id = 2, proposed zxid=0x1b8005f4bba > 2020-03-11 20:23:35,733 [myid:2] - INFO > [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, > so dropping the connection: (3, 2) > 2020-03-11 20:23:35,734 [myid:2] - INFO > [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection > request 100.126.116.201:36140 > 2020-03-11 20:23:35,735 [myid:2] - INFO > [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, > so dropping the connection: (4, 2) > 2020-03-11 20:23:35,740 [myid:2] - INFO > [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, > so dropping the connection: (5, 2) > 2020-03-11 20:23:35,740 [myid:2] - INFO > [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection > request 100.126.116.201:36142 > 2020-03-11 20:23:35,740 [myid:2] - INFO > [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message > format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING > (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config > version) > 2020-03-11 20:23:35,742 [myid:2] - WARN > [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting > for message on queue > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > at > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131) > 2020-03-11 20:23:35,744 [myid:2] - WARN > [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread > id 3 my id = 2 > 2020-03-11 20:23:35,745 [myid:2] - WARN > [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting > SendWorker{code} > The only way I can seem to get them to rejoin the quorum is to restart the > leader. > However, if I remove server 4 and 5 from the configuration of server 1 or 2 > (so only servers 1, 2, and 3 remain in the configuration file), then they can > rejoin the quorum fine. Is this expected and am I doing something wrong? Any > help or explanation would be greatly appreciated. Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3840) Use JDK 8 Facilities to Synchronize Access to DataTree Ephemerals
David Mollitor created ZOOKEEPER-3840: - Summary: Use JDK 8 Facilities to Synchronize Access to DataTree Ephemerals Key: ZOOKEEPER-3840 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3840 Project: ZooKeeper Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor The current setup is a bit confusing and hard to verify. Use JDK8 facilities to manage this collection. -- This message was sent by Atlassian Jira (v8.3.4#803005)
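For context, a minimal sketch of the kind of JDK 8 facility being proposed, assuming the ephemerals collection is keyed by session id; the class and field names are illustrative, not the actual DataTree code. ConcurrentHashMap's atomic compute methods replace manual synchronized blocks around the per-session path sets.
{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class EphemeralsSketch {
    private final Map<Long, Set<String>> ephemerals = new ConcurrentHashMap<>();

    public void addEphemeral(long sessionId, String path) {
        // computeIfAbsent creates the per-session set atomically on first use
        ephemerals.computeIfAbsent(sessionId, id -> ConcurrentHashMap.newKeySet()).add(path);
    }

    public void removeEphemeral(long sessionId, String path) {
        // computeIfPresent drops the whole entry once the last path is removed
        ephemerals.computeIfPresent(sessionId, (id, paths) -> {
            paths.remove(path);
            return paths.isEmpty() ? null : paths;
        });
    }
}
{code}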
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111298#comment-17111298 ] Rajkiran Sura commented on ZOOKEEPER-3814: -- > I created a PR ([https://github.com/apache/zookeeper/pull/1356]) That's great! > Could you please share the sequence of steps you were executing when you saw > the original issue? I also used the exact sequence of steps that you have described in the above comment. Just one minor correction in the last step, we just restart the service on server.6 as the it has already been started with new config. Regards, Rajkiran > ZooKeeper caching of config > --- > > Key: ZOOKEEPER-3814 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection, quorum, server >Affects Versions: 3.5.6 >Reporter: Rajkiran Sura >Assignee: Mate Szalay-Beko >Priority: Major > > Hello, > We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. > Encountered no issues as such. > This is how the ZooKeeper config looks like: > {quote}tickTime=2000 > dataDir=/zookeeper-data/ > initLimit=5 > syncLimit=2 > maxClientCnxns=2048 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > 4lw.commands.whitelist=stat, ruok, conf, isro, mntr > authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider > requireClientAuthScheme=sasl > quorum.cnxn.threads.size=20 > quorum.auth.enableSasl=true > quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST > quorum.auth.learnerRequireSasl=true > quorum.auth.learner.saslLoginContext=QuorumLearner > quorum.auth.serverRequireSasl=true > quorum.auth.server.saslLoginContext=QuorumServer > server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > server.22=node5.bar.com:2888:3888;2181 > {quote} > Post upgrade, we had to migrate server.22 on the same node, but with > *FOO*.bar.com domain name due to kerberos referral issues. And, we used > different server-identifier, i.e., *23* when we migrated. So, here is how the > new config looked like: > {quote}server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181* > {quote} > We restarted all the nodes in the ensemble with the above updated config. And > the migrated node joined the quorum successfully and was serving all clients > directly connected to it, without any issues. > Recently, when a leader election happened, > server.*23*=node5.foo.bar.com(migrated node) was chosen as Leader (as it has > highest ID). 
But then, ZooKeeper was unable to serve any clients and *all* > the servers were _somehow still_ trying to establish a channel to 22 (old DNS > name: node5.bar.com) and were throwing below error in a loop: > {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN > [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve > address: node4.bar.com}} > {{java.net.UnknownHostException: node5.bar.com: Name or service not known}} > {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}} > {{ at > java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}} > {{ at > java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}} > {{ at > java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}} > {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}} > {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}} > {{ at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}} > {{ at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}} > {{ at java.base/java.lang.Thread.run(Thread.java:834)}} > {{2020-05-02 01:43:03,026 [myid:23] - WARN > [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at > election addr
[jira] [Created] (ZOOKEEPER-3839) ReconfigBackupTest Remove getFileContent
David Mollitor created ZOOKEEPER-3839: - Summary: ReconfigBackupTest Remove getFileContent Key: ZOOKEEPER-3839 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3839 Project: ZooKeeper Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor bq. // upgrade this once we have Google-Guava or Java 7+ https://github.com/apache/zookeeper/blob/a908001be9641d78040b1954acb0cd3a8e9e42c2/zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/ReconfigBackupTest.java#L53 OK. Done. -- This message was sent by Atlassian Jira (v8.3.4#803005)
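A minimal sketch of the Java 7+ replacement for a hand-rolled getFileContent helper; the method name, path, and charset are illustrative assumptions, not the actual test code.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadFileSketch {
    static String getFileContent(String path) throws IOException {
        // One library call replaces the manual reader/StringBuilder loop
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(getFileContent("zoo.cfg"));
    }
}
{code}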
[jira] [Created] (ZOOKEEPER-3838) Async handling of quorum connection requests, including SSL handshakes
Mate Szalay-Beko created ZOOKEEPER-3838: --- Summary: Async handling of quorum connection requests, including SSL handshakes Key: ZOOKEEPER-3838 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3838 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.5.8, 3.6.1 Reporter: Mate Szalay-Beko We are facing issues when the leader election takes too long, as the connection initiation between quorum members takes too much time when QuorumSSL is used. In the current implementation, we handle the connection requests (and SASL authentications) asynchronously when QuorumSASL is enabled. However, the asynchronous handling will not be enabled if only QuorumSSL is enabled (but QuorumSASL is disabled). And anyway, as far as I can see, the SSL handshake happens before the current asynchronous part. See: https://github.com/apache/zookeeper/blob/a908001be9641d78040b1954acb0cd3a8e9e42c2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L1058 The goal would be to move the SSL handshake to the asynchronous code part, and also to make the connection request handling always asynchronous regardless of the QuorumSASL / QuorumSSL configs. Please note, we already did this for the connection initiation part in ZOOKEEPER-3756. -- This message was sent by Atlassian Jira (v8.3.4#803005)
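A minimal, self-contained sketch of the general idea (not the actual QuorumCnxManager code): keep the listener thread limited to accepting sockets and hand the potentially slow SSL/SASL handshake work to a worker pool, so a single slow peer cannot stall connection handling for everyone else. Class names and the pool size are illustrative.
{code:java}
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncAcceptSketch {

    private final ExecutorService connectionExecutor = Executors.newFixedThreadPool(4);

    public void listen(int port) throws IOException {
        try (ServerSocket server = new ServerSocket(port)) {
            while (!Thread.currentThread().isInterrupted()) {
                Socket peer = server.accept();                           // cheap: just accept
                connectionExecutor.submit(() -> handleConnection(peer)); // slow work off-thread
            }
        }
    }

    private void handleConnection(Socket peer) {
        // In the real server this is where the QuorumSSL handshake and the initial
        // connection/SASL exchange would run, off the listener thread.
        try (Socket s = peer) {
            s.getOutputStream().write("ok\n".getBytes());
        } catch (IOException e) {
            // log and drop the connection
        }
    }
}
{code}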
[jira] [Updated] (ZOOKEEPER-3837) Deprecate StringUtils Join
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated ZOOKEEPER-3837: -- Description: Can do this with JDK 8 now. > Deprecate StringUtils Join > -- > > Key: ZOOKEEPER-3837 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3837 > Project: ZooKeeper > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Minor > > Can do this with JDK 8 now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
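A small sketch of the JDK 8 replacements this refers to; the server list is illustrative. String.join covers the plain delimiter case, and Collectors.joining covers joins that need mapping or filtering first.
{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class JoinSketch {
    public static void main(String[] args) {
        List<String> servers = Arrays.asList("zoo1:2181", "zoo2:2181", "zoo3:2181");

        // Plain delimiter-separated join
        String connectString = String.join(",", servers);

        // Join combined with a mapping step
        String quoted = servers.stream()
                .map(s -> "'" + s + "'")
                .collect(Collectors.joining(", "));

        System.out.println(connectString);
        System.out.println(quoted);
    }
}
{code}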
[jira] [Created] (ZOOKEEPER-3837) Deprecate StringUtils Join
David Mollitor created ZOOKEEPER-3837: - Summary: Deprecate StringUtils Join Key: ZOOKEEPER-3837 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3837 Project: ZooKeeper Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ZOOKEEPER-3836) Use Commons and JDK Functions in ClientBase
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated ZOOKEEPER-3836: -- Summary: Use Commons and JDK Functions in ClientBase (was: Use Common IOUtils and JDK Functions in ClientBase) > Use Commons and JDK Functions in ClientBase > --- > > Key: ZOOKEEPER-3836 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3836 > Project: ZooKeeper > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Minor > > Remove code in {{ClientBase}} that now exists in the JDK and Apache Commons. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3836) Use Common IOUtils and JDK Functions in ClientBase
David Mollitor created ZOOKEEPER-3836: - Summary: Use Common IOUtils and JDK Functions in ClientBase Key: ZOOKEEPER-3836 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3836 Project: ZooKeeper Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor Remove code in {{ClientBase}} that now exists in the JDK and Apache Commons. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Moved] (ZOOKEEPER-3835) Deprecate IOUtils copyBytes
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor moved HIVE-23507 to ZOOKEEPER-3835: -- Key: ZOOKEEPER-3835 (was: HIVE-23507) Project: ZooKeeper (was: Hive) > Deprecate IOUtils copyBytes > --- > > Key: ZOOKEEPER-3835 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3835 > Project: ZooKeeper > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Minor > > Only used in a single unit test and can easily be replaced with o.a.commons -- This message was sent by Atlassian Jira (v8.3.4#803005)
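A hedged sketch of the commons-io replacement being suggested, assuming commons-io is on the test classpath; the streams below are illustrative, not the actual unit test.
{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils;

public class CopySketch {
    public static void main(String[] args) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Replaces a custom copyBytes(in, out, bufferSize) helper with one library call
        IOUtils.copy(in, out);

        System.out.println(out.toString(StandardCharsets.UTF_8.name()));
    }
}
{code}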
[jira] [Comment Edited] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110998#comment-17110998 ] Mate Szalay-Beko edited comment on ZOOKEEPER-3814 at 5/19/20, 12:09 PM: [~rajsura] I created a PR (https://github.com/apache/zookeeper/pull/1356) for ZOOKEEPER-3829 and using this patch, I was able to do the following steps: * have server.1, server.2, server.3, server.4, server.5 up and running * stop server.5 * stop server.1 * start server.1 with the new config (removing server.5, adding server.6 with the new hostname) * stop server.2 * start server.2 with the new config * stop server.3 * start server.3 with the new config * stop server.4 * start server.4 with the new config * start server.6 with the new config (but re-using the data folder of server.5) during these steps, the cluster was up and running, always had at least 3 members. In the end I checked the logfiles of server.6 and I haven't seen any attempt to try to connect to server.5. I also tried a different sequence (although I think it makes less sense): * have server.1, server.2, server.3, server.4, server.5 up and running * stop server.5 * start server.6 with the new config (removing server.5, adding server.6 with the new hostname), re-using the data folder of server.5 * stop server.1 * start server.1 with the new config * stop server.2 * start server.2 with the new config * stop server.3 * start server.3 with the new config * stop server.4 * start server.4 with the new config * stop server.6 * start server.6 with the new config In this case I saw that server.6 was still trying to connect to server.5 after the first restart, but never after the second restart. I don't consider this a big deal, as I don't really think that this is a good sequence anyway. I think it is more logical to restart all the other nodes (to have their config updated) before I would start the new server.6. Could you please share the sequence of steps you were executing when you saw the original issue? was (Author: symat): [~rajsura] I created a PR (https://github.com/apache/zookeeper/pull/1356) for ZOOKEEPER-3829 and using this patch, I was able to do the following steps: * have server.1, server.2, server.3, server.4, server.5 up and running * stop server.5 * stop server.1 * start server.1 with the new config (removing server.5, adding server.6 with the new hostname) * stop server.2 * start server.2 with the new config * stop server.3 * start server.3 with the new config * stop server.4 * start server.4 with the new config * start server.6 with the new config (but re-using the data folder of server.5) during these steps, the cluster was up and running, always had at least 3 members. In the end I checked the logfiles of server.6 and I haven't seen any attempt to try to connect to server.5. Could you please verify that this is the sequence of steps you executed? > ZooKeeper caching of config > --- > > Key: ZOOKEEPER-3814 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection, quorum, server >Affects Versions: 3.5.6 >Reporter: Rajkiran Sura >Assignee: Mate Szalay-Beko >Priority: Major > > Hello, > We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. > Encountered no issues as such. 
> This is how the ZooKeeper config looks like: > {quote}tickTime=2000 > dataDir=/zookeeper-data/ > initLimit=5 > syncLimit=2 > maxClientCnxns=2048 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > 4lw.commands.whitelist=stat, ruok, conf, isro, mntr > authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider > requireClientAuthScheme=sasl > quorum.cnxn.threads.size=20 > quorum.auth.enableSasl=true > quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST > quorum.auth.learnerRequireSasl=true > quorum.auth.learner.saslLoginContext=QuorumLearner > quorum.auth.serverRequireSasl=true > quorum.auth.server.saslLoginContext=QuorumServer > server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > server.22=node5.bar.com:2888:3888;2181 > {quote} > Post upgrade, we had to migrate server.22 on the same node, but with > *FOO*.bar.com domain name due to kerberos referral issues. And, we used > different server-identifier, i.e., *23* when we migrated. So, here is how the > new config looked like: > {quote}server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181* > {quote} > We restart
[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110960#comment-17110960 ] Mate Szalay-Beko edited comment on ZOOKEEPER-3829 at 5/19/20, 12:02 PM: I'll push a PR with some proposed fixes and also some tests that reproduce the steps we discussed before (these tests fail without the changes and pass with them). see: https://github.com/apache/zookeeper/pull/1356 was (Author: symat): I'll push a PR with some proposed fixes and also some tests that reproduce the steps we discussed before (these tests fail without the changes and pass with them). Still, I don't consider this a final solution yet: e.g. for the rolling restart case described in ZOOKEEPER-3814, these changes cause an 'infinite loop' in which quorum members keep sending notifications between a server with the old config and a server with the new config. This leads to a large number of notifications being sent between these servers, and the loop only breaks once all the servers have the new config. > Zookeeper refuses request after node expansion > -- > > Key: ZOOKEEPER-3829 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.6 >Reporter: benwang li >Assignee: Mate Szalay-Beko >Priority: Major > Attachments: d.log, screenshot-1.png > > Time Spent: 40m > Remaining Estimate: 0h > > It's easy to reproduce this bug. > {code:java} > //代码占位符 > > Step 1. Deploy 3 nodes A,B,C with configuration A,B,C . > Step 2. Deploy node ` D` with configuration `A,B,C,D` , cluster state is ok > now. > Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will > be D, cluster hangs, but it can accept `mntr` command, other command like `ls > /` will be blocked. > Step 4. Restart nodes D, cluster state is back to normal now. > > {code} > > We have looked into the code of 3.5.6 version, and we found it may be the > issue of `workerPool` . > The `CommitProcessor` shutdown and make `workerPool` shutdown, but > `workerPool` still exists. It will never work anymore, yet the cluster still > thinks it's ok. > > I think the bug may still exist in master branch. > We have tested it in our machines by reset the `workerPool` to null. If it's > ok, please assign this issue to me, and then I'll create a PR. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1715#comment-1715 ] Mate Szalay-Beko edited comment on ZOOKEEPER-3829 at 5/19/20, 12:00 PM: {quote}we use docker-compose down, we still have this issue {quote} maybe "{{docker-compose down}}" is also removing the virtual network? I always use "{{stop zoo3}}" to stop a service and "{{up -d zoo3}}" to rebuild/start it again. was (Author: symat): {quote}we use docker-compose down, we still have this issue {quote} maybe {{docker-compose down}} is also removing the virtual network? I always use {{stop zoo3}} to stop a service and {{up -d zoo3}} to rebuild/start it again. > Zookeeper refuses request after node expansion > -- > > Key: ZOOKEEPER-3829 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.6 >Reporter: benwang li >Assignee: Mate Szalay-Beko >Priority: Major > Attachments: d.log, screenshot-1.png > > Time Spent: 40m > Remaining Estimate: 0h > > It's easy to reproduce this bug. > {code:java} > //代码占位符 > > Step 1. Deploy 3 nodes A,B,C with configuration A,B,C . > Step 2. Deploy node ` D` with configuration `A,B,C,D` , cluster state is ok > now. > Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will > be D, cluster hangs, but it can accept `mntr` command, other command like `ls > /` will be blocked. > Step 4. Restart nodes D, cluster state is back to normal now. > > {code} > > We have looked into the code of 3.5.6 version, and we found it may be the > issue of `workerPool` . > The `CommitProcessor` shutdown and make `workerPool` shutdown, but > `workerPool` still exists. It will never work anymore, yet the cluster still > thinks it's ok. > > I think the bug may still exist in master branch. > We have tested it in our machines by reset the `workerPool` to null. If it's > ok, please assign this issue to me, and then I'll create a PR. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1715#comment-1715 ] Mate Szalay-Beko commented on ZOOKEEPER-3829: - {quote}we use docker-compose down, we still have this issue {quote} maybe {{docker-compose down}} is also removing the virtual network? I always use {{stop zoo3}} to stop a service and {{up -d zoo3}} to rebuild/start it again. > Zookeeper refuses request after node expansion > -- > > Key: ZOOKEEPER-3829 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.6 >Reporter: benwang li >Assignee: Mate Szalay-Beko >Priority: Major > Attachments: d.log, screenshot-1.png > > Time Spent: 40m > Remaining Estimate: 0h > > It's easy to reproduce this bug. > {code:java} > //代码占位符 > > Step 1. Deploy 3 nodes A,B,C with configuration A,B,C . > Step 2. Deploy node ` D` with configuration `A,B,C,D` , cluster state is ok > now. > Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will > be D, cluster hangs, but it can accept `mntr` command, other command like `ls > /` will be blocked. > Step 4. Restart nodes D, cluster state is back to normal now. > > {code} > > We have looked into the code of 3.5.6 version, and we found it may be the > issue of `workerPool` . > The `CommitProcessor` shutdown and make `workerPool` shutdown, but > `workerPool` still exists. It will never work anymore, yet the cluster still > thinks it's ok. > > I think the bug may still exist in master branch. > We have tested it in our machines by reset the `workerPool` to null. If it's > ok, please assign this issue to me, and then I'll create a PR. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111006#comment-17111006 ] Mate Szalay-Beko commented on ZOOKEEPER-3814: - [~shralex] thanks for your insights! bq. Another change where a config could change is during the gossip happening in leader election - servers send around their configs peer-to-peer, and update their config to a later one if they see one (FastLeaderElection.java, look for processReconfig). There too, you could require that the reconfigEnable flag is on before calling processReconfig. I checked this part and I think reconfigEnable flag is validated already in the QuorumPeer.processReconfig function, so the changes will not propagate this way (as far as I understood). > ZooKeeper caching of config > --- > > Key: ZOOKEEPER-3814 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection, quorum, server >Affects Versions: 3.5.6 >Reporter: Rajkiran Sura >Assignee: Mate Szalay-Beko >Priority: Major > > Hello, > We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. > Encountered no issues as such. > This is how the ZooKeeper config looks like: > {quote}tickTime=2000 > dataDir=/zookeeper-data/ > initLimit=5 > syncLimit=2 > maxClientCnxns=2048 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > 4lw.commands.whitelist=stat, ruok, conf, isro, mntr > authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider > requireClientAuthScheme=sasl > quorum.cnxn.threads.size=20 > quorum.auth.enableSasl=true > quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST > quorum.auth.learnerRequireSasl=true > quorum.auth.learner.saslLoginContext=QuorumLearner > quorum.auth.serverRequireSasl=true > quorum.auth.server.saslLoginContext=QuorumServer > server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > server.22=node5.bar.com:2888:3888;2181 > {quote} > Post upgrade, we had to migrate server.22 on the same node, but with > *FOO*.bar.com domain name due to kerberos referral issues. And, we used > different server-identifier, i.e., *23* when we migrated. So, here is how the > new config looked like: > {quote}server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181* > {quote} > We restarted all the nodes in the ensemble with the above updated config. And > the migrated node joined the quorum successfully and was serving all clients > directly connected to it, without any issues. > Recently, when a leader election happened, > server.*23*=node5.foo.bar.com(migrated node) was chosen as Leader (as it has > highest ID). 
But then, ZooKeeper was unable to serve any clients and *all* > the servers were _somehow still_ trying to establish a channel to 22 (old DNS > name: node5.bar.com) and were throwing below error in a loop: > {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN > [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve > address: node4.bar.com}} > {{java.net.UnknownHostException: node5.bar.com: Name or service not known}} > {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}} > {{ at > java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}} > {{ at > java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}} > {{ at > java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}} > {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}} > {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}} > {{ at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}} > {{ at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}} > {{ at java.base/java.lang.Thread.run(Thr
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110998#comment-17110998 ] Mate Szalay-Beko commented on ZOOKEEPER-3814: - [~rajsura] I created a PR (https://github.com/apache/zookeeper/pull/1356) for ZOOKEEPER-3829 and using this patch, I was able to do the following steps: * have server.1, server.2, server.3, server.4, server.5 up and running * stop server.5 * stop server.1 * start server.1 with the new config (removing server.5, adding server.6 with the new hostname) * stop server.2 * start server.2 with the new config * stop server.3 * start server.3 with the new config * stop server.4 * start server.4 with the new config * start server.6 with the new config (but re-using the data folder of server.5) during these steps, the cluster was up and running, always had at least 3 members. In the end I checked the logfiles of server.6 and I haven't seen any attempt to try to connect to server.5. Could you please verify that this is the sequence of steps you executed? > ZooKeeper caching of config > --- > > Key: ZOOKEEPER-3814 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection, quorum, server >Affects Versions: 3.5.6 >Reporter: Rajkiran Sura >Assignee: Mate Szalay-Beko >Priority: Major > > Hello, > We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. > Encountered no issues as such. > This is how the ZooKeeper config looks like: > {quote}tickTime=2000 > dataDir=/zookeeper-data/ > initLimit=5 > syncLimit=2 > maxClientCnxns=2048 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > 4lw.commands.whitelist=stat, ruok, conf, isro, mntr > authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider > requireClientAuthScheme=sasl > quorum.cnxn.threads.size=20 > quorum.auth.enableSasl=true > quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST > quorum.auth.learnerRequireSasl=true > quorum.auth.learner.saslLoginContext=QuorumLearner > quorum.auth.serverRequireSasl=true > quorum.auth.server.saslLoginContext=QuorumServer > server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > server.22=node5.bar.com:2888:3888;2181 > {quote} > Post upgrade, we had to migrate server.22 on the same node, but with > *FOO*.bar.com domain name due to kerberos referral issues. And, we used > different server-identifier, i.e., *23* when we migrated. So, here is how the > new config looked like: > {quote}server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181* > {quote} > We restarted all the nodes in the ensemble with the above updated config. And > the migrated node joined the quorum successfully and was serving all clients > directly connected to it, without any issues. > Recently, when a leader election happened, > server.*23*=node5.foo.bar.com(migrated node) was chosen as Leader (as it has > highest ID). 
But then, ZooKeeper was unable to serve any clients and *all* > the servers were _somehow still_ trying to establish a channel to 22 (old DNS > name: node5.bar.com) and were throwing below error in a loop: > {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN > [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve > address: node4.bar.com}} > {{java.net.UnknownHostException: node5.bar.com: Name or service not known}} > {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}} > {{ at > java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}} > {{ at > java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}} > {{ at > java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}} > {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}} > {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}} > {{ at > org
[jira] [Assigned] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko reassigned ZOOKEEPER-3829: --- Assignee: Mate Szalay-Beko > Zookeeper refuses request after node expansion > -- > > Key: ZOOKEEPER-3829 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.6 >Reporter: benwang li >Assignee: Mate Szalay-Beko >Priority: Major > Attachments: d.log, screenshot-1.png > > Time Spent: 10m > Remaining Estimate: 0h > > It's easy to reproduce this bug. > {code:java} > //代码占位符 > > Step 1. Deploy 3 nodes A,B,C with configuration A,B,C . > Step 2. Deploy node ` D` with configuration `A,B,C,D` , cluster state is ok > now. > Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will > be D, cluster hangs, but it can accept `mntr` command, other command like `ls > /` will be blocked. > Step 4. Restart nodes D, cluster state is back to normal now. > > {code} > > We have looked into the code of 3.5.6 version, and we found it may be the > issue of `workerPool` . > The `CommitProcessor` shutdown and make `workerPool` shutdown, but > `workerPool` still exists. It will never work anymore, yet the cluster still > thinks it's ok. > > I think the bug may still exist in master branch. > We have tested it in our machines by reset the `workerPool` to null. If it's > ok, please assign this issue to me, and then I'll create a PR. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110960#comment-17110960 ] Mate Szalay-Beko commented on ZOOKEEPER-3829: - I'll push a PR with some proposed fixes and also some tests that reproduce the steps we discussed before (these tests fail without the changes and pass with them). Still, I don't consider this a final solution yet: e.g. for the rolling restart case described in ZOOKEEPER-3814, these changes cause an 'infinite loop' in which quorum members keep sending notifications between a server with the old config and a server with the new config. This leads to a large number of notifications being sent between these servers, and the loop only breaks once all the servers have the new config. > Zookeeper refuses request after node expansion > -- > > Key: ZOOKEEPER-3829 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.6 >Reporter: benwang li >Priority: Major > Attachments: d.log, screenshot-1.png > > Time Spent: 10m > Remaining Estimate: 0h > > It's easy to reproduce this bug. > {code:java} > //代码占位符 > > Step 1. Deploy 3 nodes A,B,C with configuration A,B,C . > Step 2. Deploy node ` D` with configuration `A,B,C,D` , cluster state is ok > now. > Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will > be D, cluster hangs, but it can accept `mntr` command, other command like `ls > /` will be blocked. > Step 4. Restart nodes D, cluster state is back to normal now. > > {code} > > We have looked into the code of 3.5.6 version, and we found it may be the > issue of `workerPool` . > The `CommitProcessor` shutdown and make `workerPool` shutdown, but > `workerPool` still exists. It will never work anymore, yet the cluster still > thinks it's ok. > > I think the bug may still exist in master branch. > We have tested it in our machines by reset the `workerPool` to null. If it's > ok, please assign this issue to me, and then I'll create a PR. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110946#comment-17110946 ] Mate Szalay-Beko commented on ZOOKEEPER-3814: - There are two other rolling-restart backward compatibility issues raised recently; I think we should solve them in a single fix. I'll mark this ticket as a duplicate of ZOOKEEPER-3829 and push a PR there soon. That PR will solve all three issues and also contain some unit tests to verify these rolling-restart scenarios. > ZooKeeper caching of config > --- > > Key: ZOOKEEPER-3814 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection, quorum, server >Affects Versions: 3.5.6 >Reporter: Rajkiran Sura >Assignee: Mate Szalay-Beko >Priority: Major > > Hello, > We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. > Encountered no issues as such. > This is how the ZooKeeper config looks like: > {quote}tickTime=2000 > dataDir=/zookeeper-data/ > initLimit=5 > syncLimit=2 > maxClientCnxns=2048 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > 4lw.commands.whitelist=stat, ruok, conf, isro, mntr > authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider > requireClientAuthScheme=sasl > quorum.cnxn.threads.size=20 > quorum.auth.enableSasl=true > quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST > quorum.auth.learnerRequireSasl=true > quorum.auth.learner.saslLoginContext=QuorumLearner > quorum.auth.serverRequireSasl=true > quorum.auth.server.saslLoginContext=QuorumServer > server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > server.22=node5.bar.com:2888:3888;2181 > {quote} > Post upgrade, we had to migrate server.22 on the same node, but with > *FOO*.bar.com domain name due to kerberos referral issues. And, we used > different server-identifier, i.e., *23* when we migrated. So, here is how the > new config looked like: > {quote}server.17=node1.foo.bar.com:2888:3888;2181 > server.19=node2.foo.bar.com:2888:3888;2181 > server.20=node3.foo.bar.com:2888:3888;2181 > server.21=node4.foo.bar.com:2888:3888;2181 > *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181* > {quote} > We restarted all the nodes in the ensemble with the above updated config. And > the migrated node joined the quorum successfully and was serving all clients > directly connected to it, without any issues. > Recently, when a leader election happened, > server.*23*=node5.foo.bar.com(migrated node) was chosen as Leader (as it has > highest ID). 
But then, ZooKeeper was unable to serve any clients and *all* > the servers were _somehow still_ trying to establish a channel to 22 (old DNS > name: node5.bar.com) and were throwing below error in a loop: > {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN > [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve > address: node4.bar.com}} > {{java.net.UnknownHostException: node5.bar.com: Name or service not known}} > {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}} > {{ at > java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}} > {{ at > java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}} > {{ at > java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}} > {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}} > {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}} > {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}} > {{ at > org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}} > {{ at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}} > {{ at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}} > {{ at java.base/java.lang.Thread.run(Thread.java:834)}} > {{2020-05-02 01:43:03,026 [myid:23] - WARN > [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at > election address node5.bar.com:3888}} > {{java.net.UnknownHostException: node5.bar.com}} > {{ at > java.base/java.n