[jira] [Commented] (ZOOKEEPER-3822) Zookeeper 3.6.1 EndOfStreamException

2020-05-12 Thread Sebastian Schmitz (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105695#comment-17105695
 ] 

Sebastian Schmitz commented on ZOOKEEPER-3822:
--

Interestingly, this didn't happen again, even though I deployed everything around 
six more times in both environments.

> Zookeeper 3.6.1 EndOfStreamException
> 
>
> Key: ZOOKEEPER-3822
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3822
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.6.1
>Reporter: Sebastian Schmitz
>Priority: Critical
> Attachments: kafka.log, kafka_test.log, zookeeper.log, 
> zookeeper_test.log
>
>
> Hello,
> after ZooKeeper 3.6.1 fixed the issue where leader election used the IP address 
> and therefore failed across separate networks (as in our Docker setup), I 
> updated from 3.4.14 to 3.6.1 in the Dev and Test environments. It all went 
> smoothly and ran for one day. Last night there was another update of the 
> environment: we deploy everything as one package of containers (Kafka, 
> ZooKeeper, MirrorMaker etc.), so the ZooKeeper containers are also replaced 
> with the latest ones. In this case there was no change; the containers were 
> just removed and deployed again. Since ZooKeeper's config and data are not 
> stored inside the containers, that is normally not a problem, but this night 
> it broke the whole ZooKeeper cluster, and so Kafka was down as well.
>  * zookeeper_node_1 was stopped and the container removed and created again
>  * zookeeper_node_1 starts up and the election takes place
>  * zookeeper_node_2 is elected as leader again
>  * zookeeper_node_2 is stopped and the container removed and created again
>  * zookeeper_node_3 is elected as the leader while zookeeper_node_2 is down
>  * zookeeper_node_2 starts up and zookeeper_node_3 remains leader
> And from there all servers just report
> 2020-05-07 14:07:57,187 [myid:3] - WARN  [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
> EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /z.z.z.z:46060, session = 0x2014386bbde
>  at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
>  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
>  at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>  at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>  at java.base/java.lang.Thread.run(Unknown Source)
> and don't recover.
> I was able to recover the cluster in the Test environment by stopping and 
> starting all the ZooKeeper nodes. The cluster in Dev is still in that state 
> and I'm checking the logs to find out more...
> The full logs of the deployment of ZooKeeper and Kafka that started at 02:00 
> are attached; the first timestamp is local NZ time and the second one is UTC. 
> The IPs I replaced are x.x.x.x for node_1, y.y.y.y for node_2 and z.z.z.z for 
> node_3.
> The Kafka servers are running on the same machines, which means the 
> EndOfStreamExceptions could also come from Kafka connections, as I don't 
> think zookeeper_node_3 establishes a session with itself.
>  
> Edit:
>  I just found some interesting log entries in the Test environment:
>  zookeeper_node_1: 2020-05-07 14:10:29,418 [myid:1] INFO  
> [NIOWorkerThread-6:ZooKeeperServer@1375] Refusing session request for client 
> /f.f.f.f:42012 as it has seen zxid 0xc6 our last zxid is 0xc528f8 
> client must try another server
>  zookeeper_node_2: 2020-05-07 14:10:29,680 [myid:2] INFO  
> [NIOWorkerThread-4:ZooKeeperServer@1375] Refusing session request for client 
> /f.f.f.f:51506 as it has seen zxid 0xc6 our last zxid is 0xc528f8 
> client must try another server
>  These entries are repeated there before the EndOfStreamException shows up...
>  I found where that zxid was set, by zookeeper_node_3:
>  zookeeper_node_3: 2020-05-07 14:09:44,495 [myid:3] INFO  
> [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):Leader@1501] Have 
> quorum of supporters, sids: [[1, 3],[1, 3]]; starting up and setting last 
> processed zxid: 0xc6
>  zookeeper_node_3: 2020-05-07 14:10:12,587 [myid:3] INFO  
> [LearnerHandler-/z.z.z.z:60156:LearnerHandler@800] Synchronizing with Learner 
> sid: 2 maxCommittedLog=0xc528f8 minCommittedLog=0xc52704 
> lastProcessedZxid=0xc6 peerLastZxid=0xc528f8
>  It looks like this update of the zxid didn't reach nodes 1 and 2, and so they 
> refuse 
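
The "Refusing session request" lines above come from a sanity check the server performs when a client (re)connects: if the zxid the client claims to have already seen is newer than the server's own last processed zxid, the server rejects the session and the client has to try another server. A minimal, self-contained sketch of that check (illustrative only, not the actual ZooKeeperServer code; the class name and the example zxids are made up):

{code:java}
public class ZxidCheckSketch {

    // Mirrors the idea behind the "Refusing session request" log line:
    // a client that is "ahead" of this server must be sent elsewhere.
    static void checkSession(long clientLastZxidSeen, long serverLastProcessedZxid,
                             String clientAddress) {
        if (clientLastZxidSeen > serverLastProcessedZxid) {
            throw new IllegalStateException(
                "Refusing session request for client " + clientAddress
                + " as it has seen zxid 0x" + Long.toHexString(clientLastZxidSeen)
                + " our last zxid is 0x" + Long.toHexString(serverLastProcessedZxid)
                + " client must try another server");
        }
    }

    public static void main(String[] args) {
        // Example values only: the client has seen a zxid from a newer epoch than
        // the server's last processed zxid, so the session request is refused.
        checkSession(0x200000001L, 0x1000000aaL, "/10.0.0.1:42012");
    }
}
{code}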

[jira] [Updated] (ZOOKEEPER-3827) configure explicit sessionInitTimeout for client connection

2020-05-12 Thread Mate Szalay-Beko (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko updated ZOOKEEPER-3827:

Description: 
Currently the connectTimeout (the maximum amount of time to connect to one 
ZooKeeper server) in the Java client is initialized as the session timeout 
divided by the number of servers in the connect string. See 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L440]
 and 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1430].
This means that connecting to a large ZooKeeper cluster can be hard (as we 
will have a shorter connect timeout than we would have by just specifying a 
single server). The idea behind the current approach (I think) is that the 
connection initiation should time out when the session timeout elapses, and 
we want to make sure that we have time to try out all the given servers 
before our sessionTimeout elapses. 

But when we use a Kerberized cluster with SSL, connection initiation might 
take a long time (until all the authentication and handshakes are completed). 
Still, we might want to keep the session timeout short for our application. 
E.g. we were facing connection timeouts with Kafka brokers trying to use 
SASL+SSL to communicate with ZooKeeper.

So it would be nice to be able to set an explicit sessionInitTimeout in the 
ZooKeeper client, independently of the session timeout.

If this configuration is not set, we would fall back to our current approach. 
But if it is set, then we would use it to calculate the connectTimeout.
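
A minimal sketch of the idea (illustrative only, not the real ClientCnxn code; the sessionInitTimeout setting is the proposed, hypothetical option and the numbers are made up):

{code:java}
public class ConnectTimeoutSketch {

    /**
     * Current behaviour: each connect attempt gets sessionTimeout / #servers.
     * Proposed behaviour: if an explicit sessionInitTimeout is configured,
     * derive the per-server connect timeout from it instead.
     */
    static int connectTimeoutMs(int sessionTimeoutMs, Integer sessionInitTimeoutMs,
                                int serverCount) {
        int budget = (sessionInitTimeoutMs != null) ? sessionInitTimeoutMs : sessionTimeoutMs;
        return budget / serverCount;
    }

    public static void main(String[] args) {
        // A 6 s session timeout and 20 servers leaves only 300 ms per connect attempt,
        // which is easily too short for a SASL + SSL handshake.
        System.out.println(connectTimeoutMs(6000, null, 20));
        // With an explicit 60 s sessionInitTimeout the per-server budget becomes 3 s,
        // while the session timeout itself can stay short.
        System.out.println(connectTimeoutMs(6000, 60000, 20));
    }
}
{code}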

  was:
Currently the connectTimeout (the maximum amount of time to connect to one 
ZooKeeper server) in the Java client is initialized as the session timeout 
divided by the number of servers in the connect string. See 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L440]
 and 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1430].
This means that connecting to a large ZooKeeper cluster can be hard (as we 
will have a shorter connect timeout than we would have by just specifying a 
single server). The idea behind the current approach (I think) is that the 
connection initiation should time out when the session timeout elapses, and 
we want to make sure that we have time to try out all the given servers 
before our sessionTimeout elapses. 

But when we use a Kerberized cluster with SSL, connection initiation might 
take a long time (until all the authentication and handshakes are completed). 
Still, we might want to keep the session timeout short for our application. 
E.g. we were facing connection timeouts with Kafka brokers trying to use 
SASL+SSL to communicate with ZooKeeper.

So it would be nice to be able to set an explicit sessionInitTimeout in the 
ZooKeeper client, independently of the session timeout.

If this configuration is not set, we would fall back to our current approach. 
But if it is set, then we would use it to calculate the connectTimeout.


> configure explicit sessionInitTimeout for client connection
> ---
>
> Key: ZOOKEEPER-3827
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3827
> Project: ZooKeeper
>  Issue Type: Improvement
>Affects Versions: 3.6.1, 3.5.8
>Reporter: Mate Szalay-Beko
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Currently the connectTimeout (the maximum amount of time to connect to one 
> ZooKeeper server) in the Java client is initialized as the session timeout 
> divided by the number of servers in the connect string. See 
> [here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L440]
>  and 
> [here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1430].
> This means that connecting to a large ZooKeeper cluster can be hard (as we 
> will have a shorter connect timeout than we would have by just specifying a 
> single server). The idea behind the current approach (I think) is that the 
> connection initiation should time out when the session timeout elapses, and 
> we want to make sure that we have time to try out all the given servers 
> before our sessionTimeout elapses. 
> But when we use a Kerberized cluster with SSL, connection initiation might 
> take a long time (until all the authentication and handshakes are completed). 
> Still, we might want to keep the session timeout short for our application. 
> E.g. we were facing connection timeouts with Kafka broke

[jira] [Updated] (ZOOKEEPER-3827) configure explicit sessionInitTimeout for client connection

2020-05-12 Thread Mate Szalay-Beko (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko updated ZOOKEEPER-3827:

Summary: configure explicit sessionInitTimeout for client connection  (was: 
configure explicit sessionInitiationTimeout for client connection)

> configure explicit sessionInitTimeout for client connection
> ---
>
> Key: ZOOKEEPER-3827
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3827
> Project: ZooKeeper
>  Issue Type: Improvement
>Affects Versions: 3.6.1, 3.5.8
>Reporter: Mate Szalay-Beko
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Currently the connectTimeout (the maximum amount of time to connect to one 
> ZooKeeper server) in the Java client is initialized as the session timeout 
> divided by the number of servers in the connect string. See 
> [here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L440]
>  and 
> [here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1430].
> This means that connecting to a large ZooKeeper cluster can be hard (as we 
> will have a shorter connect timeout than we would have by just specifying a 
> single server). The idea behind the current approach (I think) is that the 
> connection initiation should time out when the session timeout elapses, and 
> we want to make sure that we have time to try out all the given servers 
> before our sessionTimeout elapses. 
> But when we use a Kerberized cluster with SSL, connection initiation might 
> take a long time (until all the authentication and handshakes are completed). 
> Still, we might want to keep the session timeout short for our application. 
> E.g. we were facing connection timeouts with Kafka brokers 
> trying to use SASL+SSL to communicate with ZooKeeper.
> So it would be nice to be able to set an explicit sessionInitiationTimeout in 
> the ZooKeeper client, independently of the session timeout.
> If this configuration is not set, we would fall back to our current approach. 
> But if it is set, then we would use it to calculate the connectTimeout.





[jira] [Updated] (ZOOKEEPER-3827) configure explicit sessionInitTimeout for client connection

2020-05-12 Thread Mate Szalay-Beko (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko updated ZOOKEEPER-3827:

Description: 
Currently the connectTimeout (the maximum amount of time to connect to one 
ZooKeeper server) in the Java client is initialized as the session timeout 
divided by the number of servers in the connect string. See 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L440]
 and 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1430].
This means that connecting to a large ZooKeeper cluster can be hard (as we 
will have a shorter connect timeout than we would have by just specifying a 
single server). The idea behind the current approach (I think) is that the 
connection initiation should time out when the session timeout elapses, and 
we want to make sure that we have time to try out all the given servers 
before our sessionTimeout elapses. 

But when we use a Kerberized cluster with SSL, connection initiation might 
take a long time (until all the authentication and handshakes are completed). 
Still, we might want to keep the session timeout short for our application. 
E.g. we were facing connection timeouts with Kafka brokers trying to use 
SASL+SSL to communicate with ZooKeeper.

So it would be nice to be able to set an explicit sessionInitTimeout in the 
ZooKeeper client, independently of the session timeout.

If this configuration is not set, we would fall back to our current approach. 
But if it is set, then we would use it to calculate the connectTimeout.

  was:
Currently the connectTimeout (the maximum amount of time to connect to one 
ZooKeeper server) in the Java client is initialized as the session timeout 
divided by the number of servers in the connect string. See 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L440]
 and 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1430].
This means that connecting to a large ZooKeeper cluster can be hard (as we 
will have a shorter connect timeout than we would have by just specifying a 
single server). The idea behind the current approach (I think) is that the 
connection initiation should time out when the session timeout elapses, and 
we want to make sure that we have time to try out all the given servers 
before our sessionTimeout elapses. 

But when we use a Kerberized cluster with SSL, connection initiation might 
take a long time (until all the authentication and handshakes are completed). 
Still, we might want to keep the session timeout short for our application. 
E.g. we were facing connection timeouts with Kafka brokers trying to use 
SASL+SSL to communicate with ZooKeeper.

So it would be nice to be able to set an explicit sessionInitiationTimeout in 
the ZooKeeper client, independently of the session timeout.

If this configuration is not set, we would fall back to our current approach. 
But if it is set, then we would use it to calculate the connectTimeout.


> configure explicit sessionInitTimeout for client connection
> ---
>
> Key: ZOOKEEPER-3827
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3827
> Project: ZooKeeper
>  Issue Type: Improvement
>Affects Versions: 3.6.1, 3.5.8
>Reporter: Mate Szalay-Beko
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Currently the connectTimeout (the maximum amount of time to connect to one 
> ZooKeeper server) in the Java client is initialized as the session timeout 
> divided by the number of servers in the connect string. See 
> [here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L440]
>  and 
> [here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1430].
> This means that connecting to a large ZooKeeper cluster can be hard (as we 
> will have a shorter connect timeout than we would have by just specifying a 
> single server). The idea behind the current approach (I think) is that the 
> connection initiation should time out when the session timeout elapses, and 
> we want to make sure that we have time to try out all the given servers 
> before our sessionTimeout elapses. 
> But when we use a Kerberized cluster with SSL, connection initiation might 
> take a long time (until all the authentication and handshakes are completed). 
> Still, we might want to keep the session timeout short for our application. 
> E.g. we were facing connection timeouts with Kaf

[jira] [Updated] (ZOOKEEPER-3827) configure explicit sessionInitiationTimeout for client connection

2020-05-12 Thread Mate Szalay-Beko (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko updated ZOOKEEPER-3827:

Description: 
Currently the connectTimeout (the maximum amount of time to connect to one 
ZooKeeper server) in the Java client is initialized as the session timeout 
divided by the number of servers in the connect string. See 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L440]
 and 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1430].
This means that connecting to a large ZooKeeper cluster can be hard (as we 
will have a shorter connect timeout than we would have by just specifying a 
single server). The idea behind the current approach (I think) is that the 
connection initiation should time out when the session timeout elapses, and 
we want to make sure that we have time to try out all the given servers 
before our sessionTimeout elapses. 

But when we use a Kerberized cluster with SSL, connection initiation might 
take a long time (until all the authentication and handshakes are completed). 
Still, we might want to keep the session timeout short for our application. 
E.g. we were facing connection timeouts with Kafka brokers trying to use 
SASL+SSL to communicate with ZooKeeper.

So it would be nice to be able to set an explicit sessionInitiationTimeout in 
the ZooKeeper client, independently of the session timeout.

If this configuration is not set, we would fall back to our current approach. 
But if it is set, then we would use it to calculate the connectTimeout.

  was:
Currently the connectTimeout (the maximum amount of time to connect to one 
ZooKeeper server) in the Java client is initialized as the session timeout 
divided by the number of servers in the connect string. See 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L440]
 and 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1430].
This means that connecting to a large ZooKeeper cluster can be hard (as we 
will have a shorter connect timeout than we would have by just specifying a 
single server). The idea behind the current approach (I think) is that the 
connection initiation should time out when the session timeout elapses, and 
we want to make sure that we have time to try out all the given servers 
before our sessionTimeout elapses. 

But when we use a Kerberized cluster with SSL, connection initiation might 
take a long time (until all the authentication and handshakes are completed). 
Still, we might want to keep the session timeout short for our application. 
E.g. we were facing connection timeouts with Kafka brokers trying to use 
SASL+SSL to communicate with ZooKeeper.

So it would be nice to be able to set an explicit sessionInitiationTimeout in 
the ZooKeeper client, independently of the session timeout.

If this configuration is not set, we would fall back to our current approach. 
But if it is set, then we would use it to calculate the connectTimeout.


> configure explicit sessionInitiationTimeout for client connection
> -
>
> Key: ZOOKEEPER-3827
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3827
> Project: ZooKeeper
>  Issue Type: Improvement
>Affects Versions: 3.6.1, 3.5.8
>Reporter: Mate Szalay-Beko
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Currently the connectTimeout (the maximum amount of time to connect to one 
> ZooKeeper server) in the Java client is initialized as the session timeout 
> divided by the number of servers in the connect string. See 
> [here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L440]
>  and 
> [here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1430].
> This means that connecting to a large ZooKeeper cluster can be hard (as we 
> will have a shorter connect timeout than we would have by just specifying a 
> single server). The idea behind the current approach (I think) is that the 
> connection initiation should time out when the session timeout elapses, and 
> we want to make sure that we have time to try out all the given servers 
> before our sessionTimeout elapses. 
> But when we use a Kerberized cluster with SSL, connection initiation might 
> take a long time (until all the authentication and handshakes are completed). 
> Still, we might want to keep the session timeout short for our 
> application. E.g. we were facing connection t

[jira] [Created] (ZOOKEEPER-3827) configure explicit sessionInitiationTimeout for client connection

2020-05-12 Thread Mate Szalay-Beko (Jira)
Mate Szalay-Beko created ZOOKEEPER-3827:
---

 Summary: configure explicit sessionInitiationTimeout for client 
connection
 Key: ZOOKEEPER-3827
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3827
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.5.8, 3.6.1
Reporter: Mate Szalay-Beko
Assignee: Mate Szalay-Beko


Currently the connectTimeout (the maximum amount of time to connect to one 
ZooKeeper server) in the Java client is initialized as the session timeout 
divided by the number of servers in the connect string. See 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L440]
 and 
[here|https://github.com/apache/zookeeper/blob/236e3d9183606512f0e03a1f828ad0d392eb6091/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1430].
This means that connecting to a large ZooKeeper cluster can be hard (as we 
will have a shorter connect timeout than we would have by just specifying a 
single server). The idea behind the current approach (I think) is that the 
connection initiation should time out when the session timeout elapses, and 
we want to make sure that we have time to try out all the given servers 
before our sessionTimeout elapses. 

But when we use a Kerberized cluster with SSL, connection initiation might 
take a long time (until all the authentication and handshakes are completed). 
Still, we might want to keep the session timeout short for our application. 
E.g. we were facing connection timeouts with Kafka brokers trying to use 
SASL+SSL to communicate with ZooKeeper.

So it would be nice to be able to set an explicit sessionInitiationTimeout in 
the ZooKeeper client, independently of the session timeout.

If this configuration is not set, we would fall back to our current approach. 
But if it is set, then we would use it to calculate the connectTimeout.





[jira] [Comment Edited] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-12 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105395#comment-17105395
 ] 

Mate Szalay-Beko edited comment on ZOOKEEPER-3814 at 5/12/20, 1:27 PM:
---

update: I was wrong, the order of the rolling restart doesn't seem to matter. I 
got the same error simply by:
 - have server.1, server.2, server.3 up and running
 - stop server.3
 - start server.4 with the new config (but re-using the data and config folder 
of server.3)

I think the problem is that {{server.3}} was somehow committed locally into the 
last valid view of the quorum. And when {{server.4}} comes up, it gets 
{{server.3}} from somewhere. Interestingly, it doesn't get it from 
{{zoo.cfg.dynamic.next}}.

When I do the following test, I still got the same problem:
 - have server.1, server.2, server.3 up and running
 - stop server.3
 - delete {{zoo.cfg.dynamic.next}} from the config folder of server 3/4
 - start server.4 with the new config (but re-using the data and config folder 
of server.3)
 - at this point I still see the same errors in the log + I also notice that 
the freshly generated {{zoo.cfg.dynamic.next}} is still wrong.

I tried to reproduce the same steps with 3.4.14 and didn't get any errors like 
these. So this really seems to be a bug (or at least something that shouldn't 
happen / should have been documented... we should be backward compatible, 
especially when dynamic config is disabled). I need to dig into the code now to 
find out the problem. 


was (Author: symat):
update: I was wrong, the order of the rolling restart doesn't seem to matter. I 
got the same error simply by:

- have server.1, server.2, server.3 up and running
- stop server.3
- start server.4 with the new config (but re-using the data and config folder 
of server.3)

I think the problem is that {{server.3}} was somehow committed locally into the 
last valid view of the quorum. And when {{server.4}} comes up, it gets 
{{server.3}} from somewhere. Interestingly, it doesn't get it from 
{{zoo.cfg.dynamic.next}}. 

When I do the following test, I still got the same problem:

- have server.1, server.2, server.3 up and running
- stop server.3
- delete {{zoo.cfg.dynamic.next}} from the config folder of server 3/4
- start server.4 with the new config (but re-using the data and config folder 
of server.3)
- at this point I still see the same errors in the log + I also notice that the 
freshly generated  {{zoo.cfg.dynamic.next}} is still wrong.

I need to dig into the code now to find out the problem. But this really seems 
to be a bug (or at least something that shouldn't happen when dynamic config is 
disabled).

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6 and 
> encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to Kerberos referral issues. And we used a 
> different server identifier, i.e. *23*, when we migrated. So here is what the 
> new config looks like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And 
> the migrated node joined the quorum successfully and was serving all clients 
> directly connected to it, without any issues.
> Recently, when a leader election happened,

[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-12 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105395#comment-17105395
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3814:
-

update: I was wrong, the order of the rolling restart doesn't seem to matter. I 
got the same error simply by:

- have server.1, server.2, server.3 up and running
- stop server.3
- start server.4 with the new config (but re-using the data and config folder 
of server.3)

I think the problem is that {{server.3}} was somehow committed locally into the 
last valid view of the quorum. And when {{server.4}} comes up, it gets 
{{server.3}} from somewhere. Interestingly, it doesn't get it from 
{{zoo.cfg.dynamic.next}}. 

When I do the following test, I still got the same problem:

- have server.1, server.2, server.3 up and running
- stop server.3
- delete {{zoo.cfg.dynamic.next}} from the config folder of server 3/4
- start server.4 with the new config (but re-using the data and config folder 
of server.3)
- at this point I still see the same errors in the log + I also notice that the 
freshly generated  {{zoo.cfg.dynamic.next}} is still wrong.

I need to dig into the code now to find out the problem. But this really seems 
to be a bug (or at least something that shouldn't happen when dynamic config is 
disabled).

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6 and 
> encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to Kerberos referral issues. And we used a 
> different server identifier, i.e. *23*, when we migrated. So here is what the 
> new config looks like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And 
> the migrated node joined the quorum successfully and was serving all clients 
> directly connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as leader (as it 
> has the highest ID). But then ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com), throwing the below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.Quorum

[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-12 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105384#comment-17105384
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3814:
-

OK, using rolling restarts, I successfully reproduced your case, following 
these steps:
- have server.1, server.2, server.3 up and running
- stop server.1
- start server.1 with the new config (removing server.3, adding server.4 with 
the new hostname)
- stop server.2
- start server.2 with the new config
- stop server.3
- start server.4 with the new config (but re-using the data folder of server.3)

Now I get the same error as you did (in the server.4 logs I see that it tries 
to connect to the old hostname of server.3 and, obviously, fails). When I get 
the {{/zookeeper/config}} object, I can see that there is no mention of 
{{server.3}}. However, the {{zoo.cfg.dynamic.next}} files haven't been updated 
and still contain the old list of servers on all nodes. 
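
One way to double-check what the quorum has actually committed (as opposed to what is sitting in {{zoo.cfg.dynamic.next}} on disk) is to read the {{/zookeeper/config}} znode from a Java client. A minimal sketch, assuming a reachable server; the connect string and session timeout below are placeholders:

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class PrintQuorumConfig {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string and session timeout for this sketch.
        ZooKeeper zk = new ZooKeeper("node1.foo.bar.com:2181", 30000, event -> { });
        try {
            Stat stat = new Stat();
            // getConfig reads the /zookeeper/config znode, i.e. the dynamic
            // configuration the ensemble has committed.
            byte[] data = zk.getConfig(false, stat);
            System.out.println(new String(data, StandardCharsets.UTF_8));
        } finally {
            zk.close();
        }
    }
}
{code}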

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6 and 
> encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to Kerberos referral issues. And we used a 
> different server identifier, i.e. *23*, when we migrated. So here is what the 
> new config looks like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And 
> the migrated node joined the quorum successfully and was serving all clients 
> directly connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as leader (as it 
> has the highest ID). But then ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com), throwing the below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLe

[jira] [Comment Edited] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-12 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105358#comment-17105358
 ] 

Mate Szalay-Beko edited comment on ZOOKEEPER-3814 at 5/12/20, 12:05 PM:


bq. Just checking, if you simulated the removal and addition of server via 
legacy rolling-restarts method? Also, we have quorum authn/authz enabled.

Could you please describe the order of the server restarts you followed? Was 
there a time when the old server (with {{myid=22}}) was still running, while 
other servers were already restarted with the new config containing 
{{server.23}}? This can be important, since in ZooKeeper 3.5+ the leader 
election protocol changed (see ZOOKEEPER-107) in a way that the servers send 
their id/hostname to each other, and this can cause {{server.22}} to remain 
in the config of the other servers. 



was (Author: symat):
> Just checking, if you simulated the removal and addition of server via legacy 
> rolling-restarts method? Also, we have quorum authn/authz enabled.

Could you please describe the order of the server restarts you followed? Was 
there a time when the old server (with {{myid=22}}) was still running, while 
other servers were already restarted with the new config containing 
{{server.23}}? This can be important, since in ZooKeeper 3.5+ the leader 
election protocol changed (see ZOOKEEPER-107) in a way that the servers send 
their id/hostname to each other, and this can cause {{server.22}} to remain 
in the config of the other servers. 


> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6 and 
> encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to Kerberos referral issues. And we used a 
> different server identifier, i.e. *23*, when we migrated. So here is what the 
> new config looks like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And 
> the migrated node joined the quorum successfully and was serving all clients 
> directly connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as leader (as it 
> has the highest ID). But then ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com), throwing the below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAd

[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-12 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105358#comment-17105358
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3814:
-

> Just checking, if you simulated the removal and addition of server via legacy 
> rolling-restarts method? Also, we have quorum authn/authz enabled.

Could you please describe the order of the server restarts you followed? Was 
there a time when the old server (with {{myid=22}}) was still running, while 
other servers were already restarted with the new config containing 
{{server.23}}? This can be important, since in ZooKeeper 3.5+ the leader 
election protocol changed (see ZOOKEEPER-107) in a way that the servers send 
their id/hostname to each other, and this can cause {{server.22}} to remain 
in the config of the other servers. 


> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5 node ZooKeeper ensemble from v3.4.8 to v3.5.6 and 
> encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to Kerberos referral issues. And we used a 
> different server identifier, i.e. *23*, when we migrated. So here is what the 
> new config looks like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the above updated config. And 
> the migrated node joined the quorum successfully and was serving all clients 
> directly connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as leader (as it 
> has the highest ID). But then ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com), throwing the below error in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run

[jira] [Commented] (ZOOKEEPER-3826) upgrade from 3.4.x to 3.5.x

2020-05-12 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105321#comment-17105321
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3826:
-

Can you check the file system to see if there is any snapshot file present on 
the node in question?

If there is no snapshot file, then (with {{snapshot.trust.empty}} disabled) 
this is not a bug; the expected behaviour is to fail to start. 

Please note that in ZooKeeper all the servers take snapshots without 
synchronizing with each other, so it is entirely possible that one server has 
no snapshot while the other servers already have some. Please check our admin 
guide for snapshotting related parameters / details: 
https://zookeeper.apache.org/doc/r3.6.1/zookeeperAdmin.html

You have several ways to avoid this situation:
- wait until all servers have a snapshot file before disabling 
{{snapshot.trust.empty}}
- copy the snapshots and log files from the leader before starting up the 
server which has no snapshot (make sure you copy both snapshots and logs; due 
to the fuzzy snapshotting, both are needed for a consistent view - see 
https://zookeeper.apache.org/doc/r3.6.1/zookeeperAdmin.html#sc_dataFileManagement)
- play with the {{snapCount}} or {{snapSizeLimitInKb}} parameters to instruct 
ZooKeeper to take snapshots more frequently (at least during the period when 
you still have {{snapshot.trust.empty=true}})

Enforcing a snapshot with an admin command might be a good improvement (I am 
not sure if there is any feature like this in ZooKeeper right now). Maybe 
others know more...
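
For context, the failure quoted in this issue comes from the startup restore path: a data directory that contains transaction log entries but no snapshot is only accepted when {{snapshot.trust.empty}} is enabled. A simplified, self-contained sketch of that decision (illustrative only, not the actual FileTxnSnapLog code; the flag name here mirrors the {{snapshot.trust.empty}} setting):

{code:java}
import java.io.IOException;

public class SnapshotRestoreSketch {

    // Illustrative startup check: txn logs without any snapshot are only trusted
    // when snapshot.trust.empty is enabled (the 3.4 -> 3.5/3.6 upgrade case).
    static void checkRestore(boolean snapshotFound, boolean txnLogHasEntries,
                             boolean trustEmptySnapshot) throws IOException {
        if (!snapshotFound && txnLogHasEntries) {
            if (trustEmptySnapshot) {
                System.err.println("No snapshot found: trusting empty snapshot"
                        + " (snapshot.trust.empty=true)");
            } else {
                throw new IOException(
                        "No snapshot found, but there are log entries. Something is broken!");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        checkRestore(false, true, true);   // upgrade scenario: starts with a warning
        checkRestore(false, true, false);  // flag reverted too early: fails as reported above
    }
}
{code}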

> upgrade from 3.4.x to 3.5.x
> ---
>
> Key: ZOOKEEPER-3826
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3826
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.7
> Environment: Kuberenetes 
>Reporter: Aldan Brito
>Priority: Critical
>
> Upgrade of ZooKeeper from 3.4.14 to 3.5.7.
> We faced the snapshot issue which is described in 
> https://issues.apache.org/jira/browse/ZOOKEEPER-3056
> After setting the property "snapshot.trust.empty=true" the upgrade was 
> successful.
> While reverting the flag to "snapshot.trust.empty=false" and restarting the 
> ZooKeeper pods, one of the ZooKeeper servers is failing with a similar stack 
> trace: no snapshot found.
> {code:java}
> {"type":"log", "host":"zk-testzk-0", "level":"ERROR", 
> "neid":"zookeeper-4636c00bfc3849e0be179bc71cef17f8", "system":"zookeeper", 
> "time":"2020-05-12T08:32:17.685Z", "timezone":"UTC", "log":{"message":"main - 
> org.apache.zookeeper.server.quorum.QuorumPeer - Unable to load database on 
> disk"}}
> java.io.IOException: No snapshot found, but there are log entries. Something 
> is broken!
> at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
> at 
> org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:901)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
> {"type":"log", "host":"zk-testzk-0", "level":"ERROR", 
> "neid":"zookeeper-4636c00bfc3849e0be179bc71cef17f8", "system":"zookeeper", 
> "time":"2020-05-12T08:32:17.764Z", "timezone":"UTC", "log":{"message":"main - 
> org.apache.zookeeper.server.quorum.QuorumPeerMain - Unexpected exception, 
> exiting abnormally"}}
> java.lang.RuntimeException: Unable to run quorum server
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:938)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
> Caused by: java.io.IOException: No snapshot found, but there are log entries. 
> Something is broken!
> at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
> at 
> org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:901)
> {code}





[jira] [Resolved] (ZOOKEEPER-3818) fix zkServer.sh status command to support SSL-only server

2020-05-12 Thread Mate Szalay-Beko (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko resolved ZOOKEEPER-3818.
-
Resolution: Fixed

Issue resolved by pull request 1348
[https://github.com/apache/zookeeper/pull/1348]

> fix zkServer.sh status command to support SSL-only server
> -
>
> Key: ZOOKEEPER-3818
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3818
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: security
>Affects Versions: 3.5.5
>Reporter: Aishwarya Soni
>Assignee: Mate Szalay-Beko
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.6.2, 3.5.9, 3.7.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> I am configuring SSL on the ZooKeeper 3.5.5 branch and have removed the 
> clientPort config from zoo.cfg, adding only secureClientPort. Also, I have 
> removed it from my server ensemble connection string in the zoo.cfg.dynamic 
> file, as it results in a port binding issue on port 2181 if we keep it in 
> both files.
> But in zkServer.sh, the *status* command checks whether clientPort is set, 
> and otherwise it exits with 1 and terminates the process. How can we overcome 
> this situation? We cannot set the clientPort in zoo.cfg as it would enable 
> mixed mode, which we do not want when we enable SSL.
> Also, I am using the zkServer.sh status output as a healthcheck for our 
> containerized ZooKeeper, to see whether the quorum is established or not, as 
> in cluster mode ZooKeeper finally runs in either follower or leader state 
> (ignoring intermediate state changes). So as the status output exits with 1, 
> the healthcheck is also failing.





[jira] [Updated] (ZOOKEEPER-3818) fix zkServer.sh status command to support SSL-only server

2020-05-12 Thread Mate Szalay-Beko (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko updated ZOOKEEPER-3818:

Summary: fix zkServer.sh status command to support SSL-only server  (was: 
zkServer.sh status command exits if clientPort is missing even if 
secureClientPort is present for SSL)

> fix zkServer.sh status command to support SSL-only server
> -
>
> Key: ZOOKEEPER-3818
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3818
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: security
>Affects Versions: 3.5.5
>Reporter: Aishwarya Soni
>Assignee: Mate Szalay-Beko
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.7.0, 3.6.2, 3.5.9
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> I am configuring SSL on the ZooKeeper 3.5.5 branch and have removed the 
> clientPort setting from zoo.cfg, adding only secureClientPort. I have also 
> removed the client port from the server ensemble connection string in the 
> zoo.cfg.dynamic file, because keeping it in both files causes a port binding 
> conflict on port 2181.
> However, zkServer.sh checks whether clientPort is set when running the 
> *status* command and otherwise exits with code 1, terminating the process. 
> How can this be overcome? We cannot set clientPort in zoo.cfg, as that would 
> enable mixed mode, which we do not want once SSL is enabled.
> I am also using the zkServer.sh status output as a healthcheck for our 
> containerized ZooKeeper to see whether the quorum is established, since in 
> cluster mode ZooKeeper eventually runs in either follower or leader state 
> (ignoring intermediate state changes). Because the status command exits with 
> code 1, the healthcheck fails as well.
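
A rough sketch of the healthcheck pattern described above follows. The install 
path and the assumption that the status output prints a "Mode:" line are 
illustrative and may need adjusting for the actual container image.

{code}
#!/bin/sh
# healthcheck.sh -- illustrative container healthcheck; paths are assumptions.
# Succeeds only when zkServer.sh status reports leader, follower, or standalone.
OUT="$(/opt/zookeeper/bin/zkServer.sh status 2>&1)" || exit 1
echo "$OUT" | grep -Eq "Mode: (leader|follower|standalone)" || exit 1
exit 0
{code}

Note that on an SSL-only server a check like this can only pass once the fixed 
zkServer.sh from the versions listed above is in place, since the unfixed 
script exits with code 1 before it ever prints a Mode: line.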



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ZOOKEEPER-3761) upgrade JLine jar dependency

2020-05-12 Thread Mate Szalay-Beko (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko resolved ZOOKEEPER-3761.
-
Fix Version/s: 3.7.0
   3.5.9
   3.6.2
   Resolution: Fixed

Issue resolved by pull request 1292
[https://github.com/apache/zookeeper/pull/1292]

> upgrade JLine jar dependency
> 
>
> Key: ZOOKEEPER-3761
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3761
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Reporter: maoling
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.6.2, 3.5.9, 3.7.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently JLine is at 2.11 (May 19, 2013), which is far out of date; we need 
> to upgrade it to the latest release, 3.13.3 or 3.14.0.
>  
> Update: we upgraded JLine to 2.14.6 (the latest 2.14 version).
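
For anyone verifying the change locally, the bundled JLine version can be 
checked against the jars shipped in the binary distribution; the directory 
layout below is an assumption about a typical release tarball.

{code}
# illustrative check against an unpacked binary release (layout is an assumption)
ls apache-zookeeper-*-bin/lib/ | grep -i jline
# expected after the upgrade: jline-2.14.6.jar
{code}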



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ZOOKEEPER-3761) upgrade JLine jar dependency

2020-05-12 Thread Mate Szalay-Beko (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko updated ZOOKEEPER-3761:

Description: 
Currently JLine is at 2.11 (May 19, 2013), which is far out of date; we need 
to upgrade it to the latest release, 3.13.3 or 3.14.0.

 

Update: we upgraded JLine to 2.14.6 (the latest 2.14 version).

  was:Currently JLine is at 2.11 (May 19, 2013), which is far out of date; we 
need to upgrade it to the latest release, 3.13.3 or 3.14.0.


> upgrade JLine jar dependency
> 
>
> Key: ZOOKEEPER-3761
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3761
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Reporter: maoling
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently JLine is at 2.11 (May 19, 2013), which is far out of date; we need 
> to upgrade it to the latest release, 3.13.3 or 3.14.0.
>  
> Update: we upgraded JLine to 2.14.6 (the latest 2.14 version).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ZOOKEEPER-3826) upgrade from 3.4.x to 3.5.x

2020-05-12 Thread Aldan Brito (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aldan Brito updated ZOOKEEPER-3826:
---
Description: 
Upgrade of ZooKeeper from 3.4.14 to 3.5.7.

We faced the snapshot issue described in 
https://issues.apache.org/jira/browse/ZOOKEEPER-3056

After setting the property "snapshot.trust.empty=true" the upgrade was 
successful.

While reverting the flag to "snapshot.trust.empty=false" and restarting the 
ZooKeeper pods, one of the ZooKeeper servers fails with a similar "no snapshot 
found" stack trace:
{code:java}
{"type":"log", "host":"zk-testzk-0", "level":"ERROR", 
"neid":"zookeeper-4636c00bfc3849e0be179bc71cef17f8", "system":"zookeeper", 
"time":"2020-05-12T08:32:17.685Z", "timezone":"UTC", "log":{"message":"main - 
org.apache.zookeeper.server.quorum.QuorumPeer - Unable to load database on 
disk"}}
java.io.IOException: No snapshot found, but there are log entries. Something is 
broken!
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:901)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
{"type":"log", "host":"zk-testzk-0", "level":"ERROR", 
"neid":"zookeeper-4636c00bfc3849e0be179bc71cef17f8", "system":"zookeeper", 
"time":"2020-05-12T08:32:17.764Z", "timezone":"UTC", "log":{"message":"main - 
org.apache.zookeeper.server.quorum.QuorumPeerMain - Unexpected exception, 
exiting abnormally"}}
java.lang.RuntimeException: Unable to run quorum server
at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:938)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
Caused by: java.io.IOException: No snapshot found, but there are log entries. 
Something is broken!
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:901)
{code}

  was:
Upgrade of ZooKeeper from 3.4.14 to 3.5.7.

We faced the snapshot issue described in 
https://issues.apache.org/jira/browse/ZOOKEEPER-3056

After setting the property "snapshot.trust.empty=true" the upgrade was 
successful.

While reverting the flag to "snapshot.trust.empty=false", one of the ZooKeeper 
servers fails with a similar "no snapshot found" stack trace:
{code:java}
{"type":"log", "host":"zk-testzk-0", "level":"ERROR", 
"neid":"zookeeper-4636c00bfc3849e0be179bc71cef17f8", "system":"zookeeper", 
"time":"2020-05-12T08:32:17.685Z", "timezone":"UTC", "log":{"message":"main - 
org.apache.zookeeper.server.quorum.QuorumPeer - Unable to load database on 
disk"}}
java.io.IOException: No snapshot found, but there are log entries. Something is 
broken!
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:901)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
{"type":"log", "host":"zk-testzk-0", "level":"ERROR", 
"neid":"zookeeper-4636c00bfc3849e0be179bc71cef17f8", "system":"zookeeper", 
"time":"2020-05-12T08:32:17.764Z", "timezone":"UTC", "log":{"message":"main - 
org.apache.zookeeper.server.quorum.QuorumPeerMain - Unexpected exception, 
exiting abnormally"}}
java.lang.RuntimeException: Unable to run quorum server
at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:938)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
at 

[jira] [Created] (ZOOKEEPER-3826) upgrade from 3.4.x to 3.5.x

2020-05-12 Thread Aldan Brito (Jira)
Aldan Brito created ZOOKEEPER-3826:
--

 Summary: upgrade from 3.4.x to 3.5.x
 Key: ZOOKEEPER-3826
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3826
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.5.7
 Environment: Kubernetes 
Reporter: Aldan Brito


Upgrade of ZooKeeper from 3.4.14 to 3.5.7.

We faced the snapshot issue described in 
https://issues.apache.org/jira/browse/ZOOKEEPER-3056

After setting the property "snapshot.trust.empty=true" the upgrade was 
successful.

While reverting the flag to "snapshot.trust.empty=false", one of the ZooKeeper 
servers fails with a similar "no snapshot found" stack trace:
{code:java}
{"type":"log", "host":"zk-testzk-0", "level":"ERROR", 
"neid":"zookeeper-4636c00bfc3849e0be179bc71cef17f8", "system":"zookeeper", 
"time":"2020-05-12T08:32:17.685Z", "timezone":"UTC", "log":{"message":"main - 
org.apache.zookeeper.server.quorum.QuorumPeer - Unable to load database on 
disk"}}
java.io.IOException: No snapshot found, but there are log entries. Something is 
broken!
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:901)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
{"type":"log", "host":"zk-testzk-0", "level":"ERROR", 
"neid":"zookeeper-4636c00bfc3849e0be179bc71cef17f8", "system":"zookeeper", 
"time":"2020-05-12T08:32:17.764Z", "timezone":"UTC", "log":{"message":"main - 
org.apache.zookeeper.server.quorum.QuorumPeerMain - Unexpected exception, 
exiting abnormally"}}
java.lang.RuntimeException: Unable to run quorum server
at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:938)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:887)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:205)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:123)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
Caused by: java.io.IOException: No snapshot found, but there are log entries. 
Something is broken!
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:901)
{code}
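
For readers hitting the same upgrade path, the ZOOKEEPER-3056 workaround can be 
sketched as below. Whether the setting belongs in zoo.cfg or is passed as the 
zookeeper.snapshot.trust.empty system property should be confirmed against the 
documentation of the release in use; the file paths are assumptions.

{code}
# zoo.cfg -- illustrative workaround while upgrading from 3.4.x data directories;
# it lets the 3.5.x server start from transaction logs without a snapshot file.
snapshot.trust.empty=true

# Alternative, as a JVM system property (e.g. via SERVER_JVMFLAGS in conf/java.env):
# SERVER_JVMFLAGS="-Dzookeeper.snapshot.trust.empty=true"
{code}

Once every server has started and written at least one fresh snapshot, the 
property can presumably be reverted to false; reverting before a snapshot 
exists on disk would reproduce the "No snapshot found, but there are log 
entries" failure quoted above.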



--
This message was sent by Atlassian Jira
(v8.3.4#803005)