[jira] [Updated] (ZOOKEEPER-3502) improve the server command: zabstate to have a better observation on the process of leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ZOOKEEPER-3502:
--------------------------------------
    Labels: pull-request-available  (was: )

> improve the server command: zabstate to have a better observation on the
> process of leader election
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3502
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3502
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: maoling
>            Assignee: maoling
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.6.0
>

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Updated] (ZOOKEEPER-3502) improve the server command: zabstate to have a better observation on the process of leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

maoling updated ZOOKEEPER-3502:
-------------------------------
    Summary: improve the server command: zabstate to have a better observation on the process of leader election  (was: improve the server commands: zabstate to have a better observation on the process of leader election)

> improve the server command: zabstate to have a better observation on the
> process of leader election
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3502
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3502
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: maoling
>            Assignee: maoling
>            Priority: Minor
>             Fix For: 3.6.0
>
[jira] [Updated] (ZOOKEEPER-3502) improve the server commands: zabstate to have a better observation on the process of leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

maoling updated ZOOKEEPER-3502:
-------------------------------
    Priority: Minor  (was: Major)

> improve the server commands: zabstate to have a better observation on the
> process of leader election
> -------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3502
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3502
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: maoling
>            Assignee: maoling
>            Priority: Minor
>             Fix For: 3.6.0
>
[jira] [Assigned] (ZOOKEEPER-3478) Leader restart shuts down all the followers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karolos Antoniadis reassigned ZOOKEEPER-3478:
---------------------------------------------

    Assignee: Karolos Antoniadis

> Leader restart shuts down all the followers
> -------------------------------------------
>
>                 Key: ZOOKEEPER-3478
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3478
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.10
>            Reporter: Lara Catipovic
>            Assignee: Karolos Antoniadis
>            Priority: Major
>
> Hello ZooKeeper Community,
> Could you please help me with at least clarifying a few doubts related to
> ZooKeeper 3.4.10?
> We have 2 servers in our system, one with 2 Zookeeper servers and the one
> with 3 - meaning that in case of failure of the server with 3 Zookeeper
> servers, the quorum cannot be achieved.
> *Server 11*
> Zookeeper server 10
> Zookeeper server 11
> Zookeeper server 12
> *Server 12*
> Zookeeper server 20
> Zookeeper server 21 -> Leader at the beginning of the procedure
> As we were changing something in the configuration, it was needed to restart
> our servers, and to keep the quorum up, we restarted servers one by one
> (first on the one with 3 servers and then the other with 2 servers).
> During the restart of the one with 3 servers, the quorum was not lost -
> since we restarted one by one.
> Then we tried to restart the servers on the other one where we have 2
> Servers deployed, one by one also.
> The restart was executed in a small amount of time. After we restarted the
> first server 20 (follower) it joined the quorum with no errors, as expected.
> *After we restarted the Leader server (21), all followers started to shut
> down!*
> We had the same log on all the followers, but here is the example from the
> follower 20:
> {panel}
> Jun 27 14:49:31 [myid: 20]: WARN Connection broken for id 21, my id = 20, error =
> Jun 27 14:49:31 java.io.EOFException
> Jun 27 14:49:31 at java.io.DataInputStream.readInt(Unknown Source)
> Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
> Jun 27 14:49:31 [myid: 20]: INFO Accepted socket connection from /192.168.1.116:18532
> Jun 27 14:49:31 [myid: 20]: WARN Exception when following the leader
> Jun 27 14:49:31 java.io.EOFException
> Jun 27 14:49:31 at java.io.DataInputStream.readInt(Unknown Source)
> Jun 27 14:49:31 at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> Jun 27 14:49:31 at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
> Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
> Jun 27 14:49:31 [myid: 20]: WARN Connection request from old client /192.168.1.116:18532; will be dropped if server is in r-o mode
> Jun 27 14:49:31 [myid: 20]: INFO Notification: 1 (message format version), 12 (n.leader), 0x6612c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 12 (n.sid), 0x66 (n.peerEpoch) FOLLOWING (my state)
> Jun 27 14:49:31 [myid: 20]: WARN Interrupting SendWorker
> Jun 27 14:49:31 [myid: 20]: INFO Client attempting to renew session 0xa6b9dc92aa60200 at /192.168.1.116:18532
> Jun 27 14:49:31 [myid: 20]: INFO shutdown called
> Jun 27 14:49:31 java.lang.Exception: shutdown Follower
> Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
> Jun 27 14:49:31 [myid: 20]: INFO Revalidating client: 0xa6b9dc92aa60200
> Jun 27 14:49:31 [myid: 20]: WARN Interrupted while waiting for message on queue
> Jun 27 14:49:31 java.lang.InterruptedException
> Jun 27 14:49:31 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source)
> Jun 27 14:49:31 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
> Jun 27 14:49:31 at java.util.concurrent.ArrayBlockingQueue.poll(Unknown Source)
> Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
> Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
> Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
> {panel}
> *Is it expected that Leader in case of its restart triggers shut down of all
> its followers?*
> This seem
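The placement described in this report (ZooKeeper servers 10, 11, 12 on one host and 20, 21 on the other) can be checked with simple majority arithmetic. The following is only a sketch: it assumes the default majority quorum (no weights or hierarchical groups), and the class and method names are made up for illustration, not part of ZooKeeper.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: which single-host failures a majority-quorum ensemble survives. */
final class QuorumPlacement {

    /** Smallest number of voters that forms a strict majority. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** True if the voters left after losing one host still form a majority. */
    static boolean survivesHostLoss(Map<String, List<Integer>> hosts, String lostHost) {
        int total = hosts.values().stream().mapToInt(List::size).sum();
        int remaining = total - hosts.get(lostHost).size();
        return remaining >= majority(total);
    }

    public static void main(String[] args) {
        // Host names and server ids taken from the report above.
        Map<String, List<Integer>> hosts = new TreeMap<>();
        hosts.put("Server 11", List.of(10, 11, 12));
        hosts.put("Server 12", List.of(20, 21));

        for (String host : hosts.keySet()) {
            System.out.println("lose " + host + " -> quorum kept: "
                    + survivesHostLoss(hosts, host));
        }
    }
}
```

With 5 voters the majority is 3, so only the host carrying 2 voters can be lost without losing the quorum, which matches what the report states.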
[jira] [Commented] (ZOOKEEPER-3478) Leader restart shuts down all the followers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904537#comment-16904537 ]

Karolos Antoniadis commented on ZOOKEEPER-3478:
-----------------------------------------------

Hi Lara,

unless I'm missing something, your ZooKeeper configuration seems unusual. As you said, if server-11 crashes, your ZK cluster becomes unavailable, so you can only tolerate the failure of one specific physical server, namely server-12. Furthermore, you could have used 3 ZK servers in total and potentially gained some performance benefits (e.g., faster writes) due to the smaller quorum, although this probably depends on the workload. Why not use 3 physical servers, each running one ZK server? Then your system remains available if *any* of the 3 servers crashes.

Regarding your first question: it is *normal* behaviour that all the followers shut down during a leader election. Since there is no leader after a leader crash, the servers that used to be followers are not followers anymore. So the followers shut down and go back to the {{LOOKING}} state in order to find the new leader. Have a look at the code [here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1380]. If the leader crashes, {{followLeader}} throws an exception and the follower is subsequently {{shutdown}}.

Later on, you state that 20 becomes the leader, and indeed this seems to be the case. However, note that the notification messages received after leader election seem to suggest that servers 10, 12, 11 think that 21 is the actual leader, since they have {{21 (n.leader)}}. What might be happening here is something akin to a race condition. For example, the following steps might have taken place:
1) Assume server 20 receives enough notifications to become the leader.
2) Before server 20 changes its state to {{LEADING}}, server 21 is back online and starts a leader election by sending notification messages to the other servers.
3) The remaining servers agree that 21 is the new leader.
4) Server 20 changes its state to {{LEADING}} and tries to {{getEpochToPropose}} but fails, since the other servers now consider 21 to be the leader.

This would explain why servers 10, 11, and 12 try to connect to server 21 instead of 20, as you mention. As a matter of fact, I managed to reproduce the aforementioned behaviour in the [3.4.10 release|https://github.com/apache/zookeeper/releases/tag/release-3.4.10]. You mentioned that "*The restart was executed in a small amount of time*". If the time between restarts were longer, I believe the issue would not appear.

However, I'm not sure about this:
"After 3 unsuccessfull retries from servers 10,11,12, since the quorum can not be achieved, connection times out and followers started to shut down again, After they are up, another election is triggered and new LEADER is now located on the first node (Server that becomes a new leader is 12):"
I did not manage to reproduce this behaviour. Are you able to consistently reproduce the issues you mentioned every time you restart the servers?

Cheers,
Karolos

> Leader restart shuts down all the followers
> -------------------------------------------
>
>                 Key: ZOOKEEPER-3478
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3478
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.10
>            Reporter: Lara Catipovic
>            Priority: Major
>
> Hello ZooKeeper Community,
> Could you please help me with at least clarifying a few doubts related to
> ZooKeeper 3.4.10?
> We have 2 servers in our system, one with 2 Zookeeper servers and the one
> with 3 - meaning that in case of failure of the server with 3 Zookeeper
> servers, the quorum cannot be achieved.
[jira] [Updated] (ZOOKEEPER-3495) Broken test in JDK12+: SnapshotDigestTest.testDifferentDigestVersion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Huang updated ZOOKEEPER-3495:
---------------------------------
    Priority: Minor  (was: Blocker)

> Broken test in JDK12+: SnapshotDigestTest.testDifferentDigestVersion
> --------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3495
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3495
>             Project: ZooKeeper
>          Issue Type: Test
>            Reporter: Andor Molnar
>            Assignee: Szalay-Beko Mate
>            Priority: Minor
>
> This test uses reflection to access the "modifiers" field of the Field class,
> which is no longer supported in Java 12+. Please modify the test accordingly.
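The reflection problem described in this issue can be probed with a few lines of Java. This is a minimal sketch, not the ZooKeeper test itself: the class name `ModifiersFieldProbe` is made up, and it only checks whether `java.lang.reflect.Field` still exposes a `modifiers` field to reflection on the running JDK. As far as I know, JDK 12 and later hide that field from `getDeclaredField`.

```java
import java.lang.reflect.Field;

/** Sketch: does this JDK still expose Field.modifiers to reflection? */
final class ModifiersFieldProbe {

    /** Returns true if Field.class.getDeclaredField("modifiers") succeeds. */
    static boolean modifiersFieldVisible() {
        try {
            Field.class.getDeclaredField("modifiers");
            return true;
        } catch (NoSuchFieldException e) {
            // Expected on JDKs that filter the field out of reflection.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.specification.version = "
                + System.getProperty("java.specification.version"));
        System.out.println("Field.modifiers visible    = " + modifiersFieldVisible());
    }
}
```

A test that depends on this trick to change a `final` field would need a different approach on newer JDKs, for example making the value injectable instead of patching it through reflection.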
[jira] [Created] (ZOOKEEPER-3503) Add server-side large request protection
Jie Huang created ZOOKEEPER-3503:
---------------------------------

             Summary: Add server-side large request protection
                 Key: ZOOKEEPER-3503
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3503
             Project: ZooKeeper
          Issue Type: Improvement
          Components: server
    Affects Versions: 3.6.0
            Reporter: Jie Huang

This task adds a new request limiting mechanism to ZooKeeper that aims to protect ZooKeeper from accepting too many large requests and crashing because it runs out of memory. This is designed to augment the connection throttling (ZOOKEEPER-3242) and request throttling (ZOOKEEPER-3243), which focus on limiting the number rather than size of requests.
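To illustrate the idea of limiting by size rather than by count, here is a rough sketch of a byte budget for in-flight large requests. It is not the design proposed in the issue: the class `LargeRequestBudget`, the threshold and the limit are all hypothetical and exist only to show the accounting.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: admit large requests only while their total size fits a byte budget. */
final class LargeRequestBudget {
    private final long largeThresholdBytes; // hypothetical: requests below this are never counted
    private final long maxInFlightBytes;    // hypothetical: cap on bytes held by large requests
    private final AtomicLong inFlightBytes = new AtomicLong();

    LargeRequestBudget(long largeThresholdBytes, long maxInFlightBytes) {
        this.largeThresholdBytes = largeThresholdBytes;
        this.maxInFlightBytes = maxInFlightBytes;
    }

    /** Returns true if the request may be processed now. */
    boolean tryAdmit(long requestBytes) {
        if (requestBytes < largeThresholdBytes) {
            return true; // small requests bypass the budget
        }
        while (true) {
            long current = inFlightBytes.get();
            long next = current + requestBytes;
            if (next > maxInFlightBytes) {
                return false; // admitting it would exceed the budget
            }
            if (inFlightBytes.compareAndSet(current, next)) {
                return true;
            }
        }
    }

    /** Must be called once for every admitted request when it completes. */
    void release(long requestBytes) {
        if (requestBytes >= largeThresholdBytes) {
            inFlightBytes.addAndGet(-requestBytes);
        }
    }

    long currentInFlightBytes() {
        return inFlightBytes.get();
    }

    public static void main(String[] args) {
        LargeRequestBudget budget = new LargeRequestBudget(1024, 4096);
        System.out.println("admit 3000 bytes: " + budget.tryAdmit(3000));
        System.out.println("admit 2000 bytes: " + budget.tryAdmit(2000));
    }
}
```

The compare-and-set loop keeps the check and the reservation atomic, so two concurrent large requests cannot both slip under the limit.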
[jira] [Updated] (ZOOKEEPER-3429) Flaky test test:org.apache.zookeeper.test.DisconnectedWatcherTest.testManyChildWatchersAutoReset
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andor Molnar updated ZOOKEEPER-3429:
------------------------------------
    Issue Type: Sub-task  (was: Test)
        Parent: ZOOKEEPER-3170

> Flaky test
> test:org.apache.zookeeper.test.DisconnectedWatcherTest.testManyChildWatchersAutoReset
> -------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3429
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3429
>             Project: ZooKeeper
>          Issue Type: Sub-task
>          Components: tests
>            Reporter: maoling
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.6.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper-trunk-java9/lastFailedBuild/testReport/junit/org.apache.zookeeper.test/DisconnectedWatcherTest/testManyChildWatchersAutoReset/]
>
> {code:java}
> Error Message
> test timed out after 84 milliseconds
> Stacktrace
> org.junit.runners.model.TestTimedOutException: test timed out after 84 milliseconds
> at java.base@9.0.1/java.lang.Object.wait(Native Method)
> at java.base@9.0.1/java.lang.Object.wait(Object.java:516)
> at app//org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1556)
> at app//org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1539)
> at app//org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1537)
> at app//org.apache.zookeeper.test.DisconnectedWatcherTest.testManyChildWatchersAutoReset(DisconnectedWatcherTest.java:247)
> at java.base@9.0.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at java.base@9.0.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at java.base@9.0.1/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at app//org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:80)
> at java.base@9.0.1/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at java.base@9.0.1/java.lang.Thread.run(Thread.java:844)
> {code}
>
[jira] [Updated] (ZOOKEEPER-3502) improve the server commands: zabstate to have a better observation on the process of leader election
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

maoling updated ZOOKEEPER-3502:
-------------------------------
    Summary: improve the server commands: zabstate to have a better observation on the process of leader election  (was: improve the server commands: zabstate to have a better observation on the process of leader elction)

> improve the server commands: zabstate to have a better observation on the
> process of leader election
> -------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3502
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3502
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: maoling
>            Assignee: maoling
>            Priority: Major
>             Fix For: 3.6.0
>
[jira] [Created] (ZOOKEEPER-3502) improve the server commands: zabstate to have a better observation on the process of leader elction
maoling created ZOOKEEPER-3502:
-------------------------------

             Summary: improve the server commands: zabstate to have a better observation on the process of leader elction
                 Key: ZOOKEEPER-3502
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3502
             Project: ZooKeeper
          Issue Type: Improvement
          Components: server
            Reporter: maoling
            Assignee: maoling
             Fix For: 3.6.0
[jira] [Updated] (ZOOKEEPER-3501) unify the method:op2String()
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ZOOKEEPER-3501:
--------------------------------------
    Labels: pull-request-available  (was: )

> unify the method:op2String()
> ----------------------------
>
>                 Key: ZOOKEEPER-3501
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3501
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: maoling
>            Assignee: maoling
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.6.0
>
>
> There are two duplicate methods
> *public static String op2String(int op)*
> in the code base:
>
> {code:java}
> org.apache.zookeeper.server.TraceFormatter#op2String
> org.apache.zookeeper.server.Request#op2String
> {code}
>
> They are inconsistent; we should unify them and keep only one.
>
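One way to picture the unification is to keep a single implementation and let the other entry point delegate to it. The sketch below uses stand-in classes (`RequestSketch`, `TraceFormatterSketch`) and an illustrative subset of operation codes; it is not the change made in the pull request, and which class should own the method is an open choice.

```java
/** Sketch: single source of truth for mapping an op code to its name. */
final class RequestSketch {

    /** The only place where op codes are translated (illustrative subset). */
    static String op2String(int op) {
        switch (op) {
            case 1: return "create";
            case 2: return "delete";
            case 3: return "exists";
            case 4: return "getData";
            case 5: return "setData";
            default: return "unknown " + op;
        }
    }

    public static void main(String[] args) {
        System.out.println(TraceFormatterSketch.op2String(4));
    }
}

/** Stand-in for the second class: it no longer keeps its own copy of the table. */
final class TraceFormatterSketch {

    static String op2String(int op) {
        return RequestSketch.op2String(op);
    }
}
```

With delegation the two call sites cannot drift apart again, and the duplicate can later be removed once callers have been moved over.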
[jira] [Updated] (ZOOKEEPER-3475) Enable BookKeeper checkstyle configuration on zookeeper-server
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ZOOKEEPER-3475:
--------------------------------------
    Labels: pull-request-available  (was: )

> Enable BookKeeper checkstyle configuration on zookeeper-server
> --------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3475
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3475
>             Project: ZooKeeper
>          Issue Type: Sub-task
>          Components: build
>    Affects Versions: 3.6.0
>            Reporter: TisonKun
>            Assignee: TisonKun
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.6.0
>
>
> Enable BookKeeper checkstyle configuration on zookeeper-server
[jira] [Assigned] (ZOOKEEPER-3501) unify the method:op2String()
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

maoling reassigned ZOOKEEPER-3501:
----------------------------------

    Assignee: maoling

> unify the method:op2String()
> ----------------------------
>
>                 Key: ZOOKEEPER-3501
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3501
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: maoling
>            Assignee: maoling
>            Priority: Minor
>             Fix For: 3.6.0
>
>
> There are two duplicate methods
> *public static String op2String(int op)*
> in the code base:
>
> {code:java}
> org.apache.zookeeper.server.TraceFormatter#op2String
> org.apache.zookeeper.server.Request#op2String
> {code}
>
> They are inconsistent; we should unify them and keep only one.
>