[jira] [Commented] (ZOOKEEPER-3828) zookeeper CLI client gets connection timeout when the leader is restarted

2020-05-19 Thread Aishwarya Soni (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111665#comment-17111665
 ] 

Aishwarya Soni commented on ZOOKEEPER-3828:
---

I have one more observation to add. I tested the leader-kill scenario on 
versions 3.4.12 and 3.5.5, and it all went smoothly: I had no issues going to 
the zookeeper terminal inside the container and accessing the znodes during or 
after quorum re-formation. What changed between 3.5.5 and 3.6.1 related to 
this, I have no clue yet.

I agree with you that it should find a working server and connect to it. But in 
this case, it is unable to do so. Can you test this in a dockerized zookeeper 
environment? (See the sketch after the steps below.)

Steps:
 # deploy a 5-node zookeeper 3.6.1 cluster
 # docker stop the container running in leader mode
 # ssh to any container within the quorum
 # open the CLI with *./bin/zkCli.sh*
 # run *ls /*
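
For context, a minimal Java sketch of the failover point above (the zoo1..zoo5 
hostnames are placeholders): a client constructed with the full ensemble 
connect string can retry other servers, while zkCli.sh started against only 
localhost:2181 has nothing to fail over to.

{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class FailoverCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder hosts: listing every ensemble member lets the client
        // retry another server when one is down.
        String connectString = "zoo1:2181,zoo2:2181,zoo3:2181,zoo4:2181,zoo5:2181";
        ZooKeeper zk = new ZooKeeper(connectString, 30000, event -> { });
        List<String> children = zk.getChildren("/", false);
        System.out.println(children);
        zk.close();
    }
}
{code}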

> zookeeper CLI client gets connection timeout when the leader is restarted
> --
>
> Key: ZOOKEEPER-3828
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3828
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: java client
>Affects Versions: 3.6.1
>Reporter: Aishwarya Soni
>Priority: Minor
>
> I have configured a 5-node zookeeper cluster using version 3.6.1 in a docker 
> containerized environment. As part of some destructive testing, I restarted 
> the zookeeper leader. Re-election happened and all 5 nodes (containers) are 
> back in a good state with a new leader. But when I log in to one of the 
> containers, go inside the zk CLI (./zkCli.sh), and run the cmd *ls /* I see 
> the error below:
> [zk: localhost:2181(CONNECTING) 1]
> [zk: localhost:2181(CONNECTING) 1] ls /
> 2020-05-14 23:48:26,556 [myid:localhost:2181] - WARN  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session timed out, have not heard from server in 30001ms for session id 0x0
> 2020-05-14 23:48:26,556 [myid:localhost:2181] - WARN  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
> org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 30001ms for session id 0x0
>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230)
> KeeperErrorCode = ConnectionLoss for /
> [zk: localhost:2181(CONNECTING) 2] 2020-05-14 23:48:28,089 [myid:localhost:2181] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1154] - Opening socket connection to server localhost/127.0.0.1:2181.
> 2020-05-14 23:48:28,089 [myid:localhost:2181] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1156] - SASL config status: Will not attempt to authenticate using SASL (unknown error)
> 2020-05-14 23:48:28,090 [myid:localhost:2181] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@986] - Socket connection established, initiating session, client: /127.0.0.1:60384, server: localhost/127.0.0.1:2181
> 2020-05-14 23:48:58,119 [myid:localhost:2181] - WARN  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session timed out, have not heard from server in 30030ms for session id 0x0
> 2020-05-14 23:48:58,120 [myid:localhost:2181] - WARN  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
> org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 30030ms for session id 0x0
>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230)
> 2020-05-14 23:49:00,003 [myid:localhost:2181] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1154] - Opening socket connection to server localhost/127.0.0.1:2181.
> 2020-05-14 23:49:00,004 [myid:localhost:2181] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1156] - SASL config status: Will not attempt to authenticate using SASL (unknown error)
> 2020-05-14 23:49:00,004 [myid:localhost:2181] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@986] - Socket connection

[jira] [Comment Edited] (ZOOKEEPER-3756) Members failing to rejoin quorum

2020-05-19 Thread Dai Shi (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111589#comment-17111589
 ] 

Dai Shi edited comment on ZOOKEEPER-3756 at 5/19/20, 10:28 PM:
---

Hi Mate,

I just wanted to report back after testing 3.5.8. I am happy to say that it 
seems to work well after a brief bit of testing. I am no longer setting 
{{-Dzookeeper.cnxTimeout=500}}, and now when I roll the leader the cluster 
downtime is only 2-3 seconds instead of 30+ seconds.

Thanks again for helping me debug and creating this fix!


was (Author: dshi):
Hi Mate,

I just wanted to report back after testing 3.5.8. I am happy to say that it 
seems to work well after a brief bit of testing. I am no longer setting 
`-Dzookeeper.cnxTimeout=500`, and now when I roll the leader the cluster 
downtime is only 2-3 seconds instead of 30+ seconds.

Thanks again for helping me debug and creating this fix!

> Members failing to rejoin quorum
> 
>
> Key: ZOOKEEPER-3756
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.5.6, 3.5.7
>Reporter: Dai Shi
>Assignee: Mate Szalay-Beko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.6.1, 3.5.8
>
> Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh, 
> jmx.yaml, zoo-0.log, zoo-1.log, zoo-2.log, zoo-service.yaml, zookeeper.yaml
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Not sure if this is the place to ask, please close if it's not.
> I am seeing some behavior that I can't explain since upgrading to 3.5:
> In a 5 member quorum, when server 3 is the leader and each server has this in 
> their configuration: 
> {code:java}
> server.1=100.71.255.254:2888:3888:participant;2181
> server.2=100.71.255.253:2888:3888:participant;2181
> server.3=100.71.255.252:2888:3888:participant;2181
> server.4=100.71.255.251:2888:3888:participant;2181
> server.5=100.71.255.250:2888:3888:participant;2181{code}
> If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in 
> the logs:
> {code:java}
> 2020-03-11 20:23:35,720 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - 
> LOOKING
> 2020-03-11 20:23:35,721 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885]
>  - New election. My id =  2, proposed zxid=0x1b8005f4bba
> 2020-03-11 20:23:35,733 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (3, 2)
> 2020-03-11 20:23:35,734 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36140
> 2020-03-11 20:23:35,735 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (4, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (5, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36142
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message 
> format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2020-03-11 20:23:35,742 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting 
> for message on queue
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> at 
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
> 2020-03-11 20:23:35,744 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread  
> id 3 my id = 2
> 2020-03-11 20:23:35,745 [myid:2] - WARN  
> [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting 
> SendWorker{code}
> The only way I can seem to get them to rejoin the quorum is to restart t

[jira] [Commented] (ZOOKEEPER-3756) Members failing to rejoin quorum

2020-05-19 Thread Dai Shi (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111589#comment-17111589
 ] 

Dai Shi commented on ZOOKEEPER-3756:


Hi Mate,

I just wanted to report back after testing 3.5.8. I am happy to say that it 
seems to work well after a brief bit of testing. I am no longer setting 
`-Dzookeeper.cnxTimeout=500`, and now when I roll the leader the cluster 
downtime is only 2-3 seconds instead of 30+ seconds.

Thanks again for helping me debug and creating this fix!

> Members failing to rejoin quorum
> 
>
> Key: ZOOKEEPER-3756
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.5.6, 3.5.7
>Reporter: Dai Shi
>Assignee: Mate Szalay-Beko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.6.1, 3.5.8
>
> Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh, 
> jmx.yaml, zoo-0.log, zoo-1.log, zoo-2.log, zoo-service.yaml, zookeeper.yaml
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Not sure if this is the place to ask, please close if it's not.
> I am seeing some behavior that I can't explain since upgrading to 3.5:
> In a 5 member quorum, when server 3 is the leader and each server has this in 
> their configuration: 
> {code:java}
> server.1=100.71.255.254:2888:3888:participant;2181
> server.2=100.71.255.253:2888:3888:participant;2181
> server.3=100.71.255.252:2888:3888:participant;2181
> server.4=100.71.255.251:2888:3888:participant;2181
> server.5=100.71.255.250:2888:3888:participant;2181{code}
> If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in 
> the logs:
> {code:java}
> 2020-03-11 20:23:35,720 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - 
> LOOKING
> 2020-03-11 20:23:35,721 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885]
>  - New election. My id =  2, proposed zxid=0x1b8005f4bba
> 2020-03-11 20:23:35,733 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (3, 2)
> 2020-03-11 20:23:35,734 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36140
> 2020-03-11 20:23:35,735 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (4, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (5, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36142
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message 
> format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2020-03-11 20:23:35,742 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting 
> for message on queue
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> at 
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
> 2020-03-11 20:23:35,744 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread  
> id 3 my id = 2
> 2020-03-11 20:23:35,745 [myid:2] - WARN  
> [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting 
> SendWorker{code}
> The only way I can seem to get them to rejoin the quorum is to restart the 
> leader.
> However, if I remove servers 4 and 5 from the configuration of server 1 or 2 
> (so only servers 1, 2, and 3 remain in the configuration file), then they can 
> rejoin the quorum fine. Is this expected and am I doing something wrong? Any 
> help or explanation would be greatly appreciated. Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ZOOKEEPER-3840) Use JDK 8 Facilities to Synchronize Access to DataTree Ephemerals

2020-05-19 Thread David Mollitor (Jira)
David Mollitor created ZOOKEEPER-3840:
-

 Summary: Use JDK 8 Facilities to Synchronize Access to DataTree 
Ephemerals
 Key: ZOOKEEPER-3840
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3840
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


The current setup is a bit confusing and hard to verify. Use JDK 8 facilities 
to manage this collection.
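
As a rough illustration of what such a cleanup could look like (a sketch only, 
not the actual DataTree code), JDK 8's ConcurrentHashMap removes the need for 
explicit synchronized blocks around the session-to-ephemerals map:

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: the real DataTree keeps a map from session id to the set of
// ephemeral paths owned by that session.
public class EphemeralsSketch {
    private final ConcurrentHashMap<Long, Set<String>> ephemerals = new ConcurrentHashMap<>();

    public void addEphemeral(long sessionId, String path) {
        // Atomic get-or-create, no external locking required.
        ephemerals.computeIfAbsent(sessionId, id -> ConcurrentHashMap.newKeySet()).add(path);
    }

    public void removeSession(long sessionId) {
        ephemerals.remove(sessionId);
    }
}
{code}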



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-19 Thread Rajkiran Sura (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111298#comment-17111298
 ] 

Rajkiran Sura commented on ZOOKEEPER-3814:
--

> I created a PR ([https://github.com/apache/zookeeper/pull/1356]) 

That's great!

> Could you please share the sequence of steps you were executing when you saw 
> the original issue?

I also used the exact sequence of steps that you described in the above 
comment. Just one minor correction in the last step: we just restart the 
service on server.6, as it has already been started with the new config.

Regards,

Rajkiran

 

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. 
> Encountered no issues as such.
> This is how the ZooKeeper config looks:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to kerberos referral issues. And we used a 
> different server identifier, i.e., *23*, when we migrated. So here is how the 
> new config looked:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And 
> the migrated node joined the quorum successfully and was serving all clients 
> directly connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as Leader (as it 
> has the highest ID). But then, ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com) and were throwing the error below in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at 
> election addr

[jira] [Created] (ZOOKEEPER-3839) ReconfigBackupTest Remove getFileContent

2020-05-19 Thread David Mollitor (Jira)
David Mollitor created ZOOKEEPER-3839:
-

 Summary: ReconfigBackupTest Remove getFileContent
 Key: ZOOKEEPER-3839
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3839
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


bq. // upgrade this once we have Google-Guava or Java 7+

https://github.com/apache/zookeeper/blob/a908001be9641d78040b1954acb0cd3a8e9e42c2/zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/ReconfigBackupTest.java#L53

OK. Done.
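
For illustration, the JDK replacement the TODO was waiting for is now a 
one-liner (a sketch, assuming the helper just reads a file into a string):

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: java.nio.file makes a hand-rolled getFileContent helper redundant.
public final class FileContentSketch {
    static String getFileContent(String path) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }
}
{code}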



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ZOOKEEPER-3838) Async handling of quorum connection requests, including SSL handshakes

2020-05-19 Thread Mate Szalay-Beko (Jira)
Mate Szalay-Beko created ZOOKEEPER-3838:
---

 Summary: Async handling of quorum connection requests, including 
SSL handshakes
 Key: ZOOKEEPER-3838
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3838
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.5.8, 3.6.1
Reporter: Mate Szalay-Beko


We are facing issues when the leader election takes too long, as the connection 
initiation between quorum members takes too much time when QuorumSSL is used.

In the current implementation, we handle the connection requests (and SASL 
authentications) asynchronously when QuorumSASL is enabled. However, the 
asynchronous handling will not be enabled if only QuorumSSL is enabled (but 
QuorumSASL is disabled). And anyway, as far as I can see, the SSL handshake 
happens before the current asynchronous part.

See: 
https://github.com/apache/zookeeper/blob/a908001be9641d78040b1954acb0cd3a8e9e42c2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L1058

The goal would be to move the SSL handshake to the asynchronous code part, and 
also to make the connection request handling always asynchronous regardless of 
the QuorumSASL / QuorumSSL configs.

Please note, we already did this for the connection initiation part in 
ZOOKEEPER-3756.
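
To make the intent concrete, here is a rough sketch (hypothetical names, not 
the actual QuorumCnxManager code) of handing each accepted socket to a pool so 
the SSL handshake no longer blocks the listener thread:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.net.ssl.SSLServerSocket;
import javax.net.ssl.SSLSocket;

// Sketch only: the accept loop never blocks on the handshake; the pool does
// the SSL handshake and then the usual connection-request handling.
public class AsyncAcceptSketch {
    private final ExecutorService connectionExecutor = Executors.newFixedThreadPool(8);

    void acceptLoop(SSLServerSocket serverSocket) {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                SSLSocket client = (SSLSocket) serverSocket.accept();
                connectionExecutor.submit(() -> {
                    try {
                        client.startHandshake(); // slow part, now off the listener thread
                        // ... receive and process the connection request here ...
                    } catch (Exception e) {
                        try { client.close(); } catch (Exception ignored) { }
                    }
                });
            } catch (Exception e) {
                break; // listener socket closed
            }
        }
    }
}
{code}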



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ZOOKEEPER-3837) Deprecate StringUtils Join

2020-05-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated ZOOKEEPER-3837:
--
Description: Can do this with JDK 8 now.
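
For illustration, a minimal sketch of the JDK 8 replacement (String.join covers 
the delimiter-joining that a custom StringUtils helper does):

{code:java}
import java.util.Arrays;
import java.util.List;

public class JoinSketch {
    public static void main(String[] args) {
        List<String> parts = Arrays.asList("zoo1", "zoo2", "zoo3");
        // JDK 8: no custom helper needed to join with a delimiter.
        System.out.println(String.join(",", parts)); // zoo1,zoo2,zoo3
    }
}
{code}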

> Deprecate StringUtils Join
> --
>
> Key: ZOOKEEPER-3837
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3837
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>
> Can do this with JDK 8 now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ZOOKEEPER-3837) Deprecate StringUtils Join

2020-05-19 Thread David Mollitor (Jira)
David Mollitor created ZOOKEEPER-3837:
-

 Summary: Deprecate StringUtils Join
 Key: ZOOKEEPER-3837
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3837
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ZOOKEEPER-3836) Use Commons and JDK Functions in ClientBase

2020-05-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated ZOOKEEPER-3836:
--
Summary: Use Commons and JDK Functions in ClientBase  (was: Use Common 
IOUtils and JDK Functions in ClientBase)

> Use Commons and JDK Functions in ClientBase
> ---
>
> Key: ZOOKEEPER-3836
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3836
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>
> Remove code in {{ClientBase}} that now exists in the JDK and Apache Commons.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ZOOKEEPER-3836) Use Common IOUtils and JDK Functions in ClientBase

2020-05-19 Thread David Mollitor (Jira)
David Mollitor created ZOOKEEPER-3836:
-

 Summary: Use Common IOUtils and JDK Functions in ClientBase
 Key: ZOOKEEPER-3836
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3836
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


Remove code in {{ClientBase}} that now exists in the JDK and Apache Commons.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Moved] (ZOOKEEPER-3835) Deprecate IOUtils copyBytes

2020-05-19 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor moved HIVE-23507 to ZOOKEEPER-3835:
--

Key: ZOOKEEPER-3835  (was: HIVE-23507)
Project: ZooKeeper  (was: Hive)

> Deprecate IOUtils copyBytes
> ---
>
> Key: ZOOKEEPER-3835
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3835
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>
> Only used in a single unit test and can easily be replaced with o.a.commons
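
For illustration, a minimal sketch of the commons-io replacement suggested 
above (assuming the test only copies one stream to another):

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;

public class CopySketch {
    public static void main(String[] args) throws IOException {
        ByteArrayInputStream in =
                new ByteArrayInputStream("payload".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        IOUtils.copy(in, out); // commons-io handles the buffer-and-loop boilerplate
        System.out.println(out.toString(StandardCharsets.UTF_8.name()));
    }
}
{code}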



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-19 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110998#comment-17110998
 ] 

Mate Szalay-Beko edited comment on ZOOKEEPER-3814 at 5/19/20, 12:09 PM:


[~rajsura] I created a PR (https://github.com/apache/zookeeper/pull/1356) for 
ZOOKEEPER-3829 and using this patch, I was able to do the following steps:

* have server.1, server.2, server.3, server.4, server.5 up and running
* stop server.5
* stop server.1
* start server.1 with the new config (removing server.5, adding server.6 with 
the new hostname)
* stop server.2
* start server.2 with the new config
* stop server.3
* start server.3 with the new config
* stop server.4
* start server.4 with the new config
* start server.6 with the new config (but re-using the data folder of server.5)

During these steps, the cluster was up and running and always had at least 3 
members. In the end I checked the logfiles of server.6 and didn't see any 
attempt to connect to server.5.


I also tried a different sequence (although I think it makes less sense):

* have server.1, server.2, server.3, server.4, server.5 up and running
* stop server.5
* start server.6 with the new config (removing server.5, adding server.6 with 
the new hostname), re-using the data folder of server.5
* stop server.1
* start server.1 with the new config
* stop server.2
* start server.2 with the new config
* stop server.3
* start server.3 with the new config
* stop server.4
* start server.4 with the new config
* stop server.6
* start server.6 with the new config

In this case I saw that server.6 was still trying to connect to server.5 after 
the first restart, but never after the second restart. I don't consider this a 
big deal, as I don't really think this is a good sequence anyway. I think it 
is more logical to restart all the other nodes (to have their config updated) 
before starting the new server.6.

Could you please share the sequence of steps you were executing when you saw 
the original issue?




was (Author: symat):
[~rajsura] I created a PR (https://github.com/apache/zookeeper/pull/1356) for 
ZOOKEEPER-3829 and using this patch, I was able to do the following steps:

* have server.1, server.2, server.3, server.4, server.5 up and running
* stop server.5
* stop server.1
* start server.1 with the new config (removing server.5, adding server.6 with 
the new hostname)
* stop server.2
* start server.2 with the new config
* stop server.3
* start server.3 with the new config
* stop server.4
* start server.4 with the new config
* start server.6 with the new config (but re-using the data folder of server.5)

During these steps, the cluster was up and running and always had at least 3 
members. In the end I checked the logfiles of server.6 and didn't see any 
attempt to connect to server.5.

Could you please verify that this is the sequence of steps you executed?

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. 
> Encountered no issues as such.
> This is how the ZooKeeper config looks:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to kerberos referral issues. And we used a 
> different server identifier, i.e., *23*, when we migrated. So here is how the 
> new config looked:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restart

[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-19 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110960#comment-17110960
 ] 

Mate Szalay-Beko edited comment on ZOOKEEPER-3829 at 5/19/20, 12:02 PM:


I'll push a PR with some proposed fixes and also some tests that reproduce the 
steps we discussed before (these tests fail without the changes but pass with 
them). 

see: https://github.com/apache/zookeeper/pull/1356


was (Author: symat):
I'll push a PR with some proposed fixes and also some tests that reproduce the 
steps we discussed before (these tests fail without the changes but pass with 
them). 

Still, I don't consider this a final solution yet, as e.g. for the rolling 
restart case described in ZOOKEEPER-3814, these changes cause an 'infinite 
loop' in which quorum members keep sending notifications between a server with 
the old config and a server with the new config. This leads to a large number 
of notifications sent between these servers, and the loop only breaks once all 
the servers have the new config.

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Assignee: Mate Szalay-Beko
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> It's easy to reproduce this bug.
> {code:java}
> // code placeholder
>  
> Step 1. Deploy 3 nodes A,B,C with configuration A,B,C.
> Step 2. Deploy node `D` with configuration `A,B,C,D`; cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D; then the leader will 
> be D and the cluster hangs: it can accept the `mntr` command, but other 
> commands like `ls /` will be blocked.
> Step 4. Restart node D; cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of version 3.5.6, and we found it may be an 
> issue with `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work anymore, yet the 
> cluster still thinks it's ok.
>  
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> it's ok, please assign this issue to me, and then I'll create a PR.
>  
>  
>  
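
For context, a self-contained sketch of the restart pattern described above 
(not the actual CommitProcessor code): a pool that is shut down but never 
cleared cannot be recreated by a later start, which matches the workaround of 
resetting the reference to null.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: shutdown leaves a dead pool behind unless the reference is
// reset to null, so a later start() can recreate it.
public class RestartablePool {
    private ExecutorService pool;

    public synchronized void start() {
        if (pool == null) { // only recreated if shutdown cleared the reference
            pool = Executors.newFixedThreadPool(4);
        }
    }

    public synchronized void shutdown() {
        if (pool != null) {
            pool.shutdown();
            pool = null; // the reset the reporter tested: allows a clean restart
        }
    }
}
{code}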



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-19 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1715#comment-1715
 ] 

Mate Szalay-Beko edited comment on ZOOKEEPER-3829 at 5/19/20, 12:00 PM:


{quote}we use docker-compose down, we still have this issue
{quote}

maybe "{{docker-compose down}}" is also removing the virtual network?

I always use "{{stop zoo3}}" to stop a service and "{{up -d zoo3}}" to 
rebuild/start it again.


was (Author: symat):
{quote}we use docker-compose down, we still have this issue
{quote}

maybe {{docker-compose down}} is also removing the virtual network?

I always use {{stop zoo3}} to stop a service and {{up -d zoo3}} to 
rebuild/start it again.

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Assignee: Mate Szalay-Beko
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> It's easy to reproduce this bug.
> {code:java}
> // code placeholder
>  
> Step 1. Deploy 3 nodes A,B,C with configuration A,B,C.
> Step 2. Deploy node `D` with configuration `A,B,C,D`; cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D; then the leader will 
> be D and the cluster hangs: it can accept the `mntr` command, but other 
> commands like `ls /` will be blocked.
> Step 4. Restart node D; cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of version 3.5.6, and we found it may be an 
> issue with `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work anymore, yet the 
> cluster still thinks it's ok.
>  
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> it's ok, please assign this issue to me, and then I'll create a PR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-19 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1715#comment-1715
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3829:
-

{quote}we use docker-compose down, we still have this issue
{quote}

maybe {{docker-compose down}} is also removing the virtual network?

I always use {{stop zoo3}} to stop a service and {{up -d zoo3}} to 
rebuild/start it again.

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Assignee: Mate Szalay-Beko
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> It's easy to reproduce this bug.
> {code:java}
> // code placeholder
>  
> Step 1. Deploy 3 nodes A,B,C with configuration A,B,C.
> Step 2. Deploy node `D` with configuration `A,B,C,D`; cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D; then the leader will 
> be D and the cluster hangs: it can accept the `mntr` command, but other 
> commands like `ls /` will be blocked.
> Step 4. Restart node D; cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of version 3.5.6, and we found it may be an 
> issue with `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work anymore, yet the 
> cluster still thinks it's ok.
>  
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> it's ok, please assign this issue to me, and then I'll create a PR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-19 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111006#comment-17111006
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3814:
-

[~shralex] thanks for your insights! 

bq. Another change where a config could change is during the gossip happening 
in leader election - servers send around their configs peer-to-peer, and update 
their config to a later one if they see one (FastLeaderElection.java, look for 
processReconfig). There too, you could require that the reconfigEnable flag is 
on before calling processReconfig.

I checked this part and I think the reconfigEnable flag is already validated in 
the QuorumPeer.processReconfig function, so the changes will not propagate this 
way (as far as I understood).
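
For reference, a schematic of the guard being discussed (the call site here is 
a sketch, not the actual FastLeaderElection code):

{code:java}
import org.apache.zookeeper.server.quorum.QuorumPeer;
import org.apache.zookeeper.server.quorum.QuorumPeerConfig;
import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier;

// Schematic only: the notification-driven config update is gated on the
// reconfigEnabled flag before processReconfig is invoked.
public class ReconfigGuardSketch {
    static void maybeProcessReconfig(QuorumPeer self, QuorumVerifier seenConfig) {
        if (QuorumPeerConfig.isReconfigEnabled()) {
            self.processReconfig(seenConfig, null, null, false);
        }
    }
}
{code}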

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. 
> Encountered no issues as such.
> This is how the ZooKeeper config looks:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to kerberos referral issues. And we used a 
> different server identifier, i.e., *23*, when we migrated. So here is how the 
> new config looked:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And 
> the migrated node joined the quorum successfully and was serving all clients 
> directly connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as Leader (as it 
> has the highest ID). But then, ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com) and were throwing the error below in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thr

[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-19 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110998#comment-17110998
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3814:
-

[~rajsura] I created a PR (https://github.com/apache/zookeeper/pull/1356) for 
ZOOKEEPER-3829 and using this patch, I was able to do the following steps:

* have server.1, server.2, server.3, server.4, server.5 up and running
* stop server.5
* stop server.1
* start server.1 with the new config (removing server.5, adding server.6 with 
the new hostname)
* stop server.2
* start server.2 with the new config
* stop server.3
* start server.3 with the new config
* stop server.4
* start server.4 with the new config
* start server.6 with the new config (but re-using the data folder of server.5)

During these steps, the cluster was up and running and always had at least 3 
members. In the end I checked the logfiles of server.6 and didn't see any 
attempt to connect to server.5.

Could you please verify that this is the sequence of steps you executed?

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. 
> Encountered no issues as such.
> This is how the ZooKeeper config looks:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to kerberos referral issues. And we used a 
> different server identifier, i.e., *23*, when we migrated. So here is how the 
> new config looked:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And 
> the migrated node joined the quorum successfully and was serving all clients 
> directly connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as Leader (as it 
> has the highest ID). But then, ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com) and were throwing the error below in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at 
> org

[jira] [Assigned] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-19 Thread Mate Szalay-Beko (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko reassigned ZOOKEEPER-3829:
---

Assignee: Mate Szalay-Beko

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Assignee: Mate Szalay-Beko
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It's easy to reproduce this bug.
> {code:java}
> // code placeholder
>  
> Step 1. Deploy 3 nodes A,B,C with configuration A,B,C.
> Step 2. Deploy node `D` with configuration `A,B,C,D`; cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D; then the leader will 
> be D and the cluster hangs: it can accept the `mntr` command, but other 
> commands like `ls /` will be blocked.
> Step 4. Restart node D; cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of version 3.5.6, and we found it may be an 
> issue with `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work anymore, yet the 
> cluster still thinks it's ok.
>  
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> it's ok, please assign this issue to me, and then I'll create a PR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-19 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110960#comment-17110960
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3829:
-

I'll push a PR with some proposed fixes and also some tests that reproduce the 
steps we discussed before (these tests fail without the changes but pass with 
them). 

Still, I don't consider this a final solution yet, as e.g. for the rolling 
restart case described in ZOOKEEPER-3814, these changes cause an 'infinite 
loop' in which quorum members keep sending notifications between a server with 
the old config and a server with the new config. This leads to a large number 
of notifications sent between these servers, and the loop only breaks once all 
the servers have the new config.

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It's easy to reproduce this bug.
> {code:java}
> // code placeholder
>  
> Step 1. Deploy 3 nodes A,B,C with configuration A,B,C.
> Step 2. Deploy node `D` with configuration `A,B,C,D`; cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D; then the leader will 
> be D and the cluster hangs: it can accept the `mntr` command, but other 
> commands like `ls /` will be blocked.
> Step 4. Restart node D; cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of version 3.5.6, and we found it may be an 
> issue with `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work anymore, yet the 
> cluster still thinks it's ok.
>  
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> it's ok, please assign this issue to me, and then I'll create a PR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-19 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110946#comment-17110946
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3814:
-

There are two other rolling-restart backward compatibility issues raised 
recently. I think we should solve them in a single fix. I will mark this ticket 
as a duplicate of ZOOKEEPER-3829 and push a PR there soon. That PR will solve 
all three issues and also contain some unit tests to verify these 
rolling-restart scenarios.

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6. 
> Encountered no issues as such.
> This is how the ZooKeeper config looks:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to kerberos referral issues. And we used a 
> different server identifier, i.e., *23*, when we migrated. So here is how the 
> new config looked:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And 
> the migrated node joined the quorum successfully and was serving all clients 
> directly connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as Leader (as it 
> has the highest ID). But then, ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com) and were throwing the error below in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at 
> election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at 
> java.base/java.n