[jira] [Updated] (ZOOKEEPER-3502) improve the server command: zabstate to have a better observation on the process of leader election

2019-08-10 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-3502:
--
Labels: pull-request-available  (was: )

> improve the server command: zabstate to have a better observation on the 
> process of leader election
> ---
>
> Key: ZOOKEEPER-3502
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3502
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Reporter: maoling
>Assignee: maoling
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.6.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ZOOKEEPER-3502) improve the server command: zabstate to have a better observation on the process of leader election

2019-08-10 Thread maoling (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

maoling updated ZOOKEEPER-3502:
---
Summary: improve the server command: zabstate to have a better observation 
on the process of leader election  (was: improve the server commands: zabstate 
to have a better observation on the process of leader election)

> improve the server command: zabstate to have a better observation on the 
> process of leader election
> ---
>
> Key: ZOOKEEPER-3502
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3502
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Reporter: maoling
>Assignee: maoling
>Priority: Minor
> Fix For: 3.6.0
>
>






[jira] [Updated] (ZOOKEEPER-3502) improve the server commands: zabstate to have a better observation on the process of leader election

2019-08-10 Thread maoling (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

maoling updated ZOOKEEPER-3502:
---
Priority: Minor  (was: Major)

> improve the server commands: zabstate to have a better observation on the 
> process of leader election
> 
>
> Key: ZOOKEEPER-3502
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3502
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Reporter: maoling
>Assignee: maoling
>Priority: Minor
> Fix For: 3.6.0
>
>






[jira] [Assigned] (ZOOKEEPER-3478) Leader restart shuts down all the followers

2019-08-10 Thread Karolos Antoniadis (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karolos Antoniadis reassigned ZOOKEEPER-3478:
-

Assignee: Karolos Antoniadis

> Leader restart shuts down all the followers
> ---
>
> Key: ZOOKEEPER-3478
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3478
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.10
>Reporter: Lara Catipovic
>Assignee: Karolos Antoniadis
>Priority: Major
>
> Hello ZooKeeper Community,
> Could you please help me with at least clarifying a few doubts related to 
> ZooKeeper 3.4.10?
>  We have 2 servers in our system, one with 2 ZooKeeper servers and one 
> with 3, meaning that if the server with 3 ZooKeeper servers fails, 
> the quorum cannot be achieved.
> *Server 11*
>  Zookeeper server 10
>  Zookeeper server 11
>  Zookeeper server 12
> *Server 12*
>  Zookeeper server 20
>  Zookeeper server 21 -> Leader at the beginning of the procedure
> As we were changing something in the configuration, we needed to restart 
> our servers, and to keep the quorum up, we restarted servers one by one 
> (first on the one with 3 servers and then the other with 2 servers).
>  During the restart of the one with 3 servers, the quorum was not lost - 
> since we restarted one by one.
>  Then we tried to restart the servers on the other one where we have 2 
> Servers deployed, one by one also. 
>  The restart was executed in a small amount of time. After we restarted the 
> first server 20 (follower) it joined the quorum with no errors, as expected. 
>  *After we restarted the Leader server (21), all followers started to shut 
> down!*
> We had the same log on all the followers, but here is the example from the 
> follower 20:
> {panel}
> Jun 27 14:49:31 [myid: 20]: WARN Connection broken for id 21, my id = 20, 
> error =
>  Jun 27 14:49:31 java.io.EOFException
>  Jun 27 14:49:31 at java.io.DataInputStream.readInt(Unknown Source)
>  Jun 27 14:49:31 at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
>  Jun 27 14:49:31 [myid: 20]: INFO Accepted socket connection from 
> /192.168.1.116:18532
>  Jun 27 14:49:31 [myid: 20]: WARN Exception when following the leader
>  Jun 27 14:49:31 java.io.EOFException
>  Jun 27 14:49:31 at java.io.DataInputStream.readInt(Unknown Source)
>  Jun 27 14:49:31 at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>  Jun 27 14:49:31 at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>  Jun 27 14:49:31 at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>  Jun 27 14:49:31 at 
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>  Jun 27 14:49:31 at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>  Jun 27 14:49:31 at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
>  Jun 27 14:49:31 [myid: 20]: WARN Connection request from old client 
> /192.168.1.116:18532; will be dropped if server is in r-o mode
>  Jun 27 14:49:31 [myid: 20]: INFO Notification: 1 (message format version), 
> 12 (n.leader), 0x6612c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 12 
> (n.sid), 0x66 (n.peerEpoch) FOLLOWING (my state)
>  Jun 27 14:49:31 [myid: 20]: WARN Interrupting SendWorker
>  Jun 27 14:49:31 [myid: 20]: INFO Client attempting to renew session 
> 0xa6b9dc92aa60200 at /192.168.1.116:18532
>  Jun 27 14:49:31 [myid: 20]: INFO shutdown called
>  Jun 27 14:49:31 java.lang.Exception: shutdown Follower
>  Jun 27 14:49:31 at 
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>  Jun 27 14:49:31 at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
>  Jun 27 14:49:31 [myid: 20]: INFO Revalidating client: 0xa6b9dc92aa60200
>  Jun 27 14:49:31 [myid: 20]: WARN Interrupted while waiting for message on 
> queue
>  Jun 27 14:49:31 java.lang.InterruptedException
>  Jun 27 14:49:31 at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown
>  Source)
>  Jun 27 14:49:31 at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown
>  Source)
>  Jun 27 14:49:31 at java.util.concurrent.ArrayBlockingQueue.poll(Unknown 
> Source)
>  Jun 27 14:49:31 at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
>  Jun 27 14:49:31 at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
>  Jun 27 14:49:31 at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
> {panel}
> *Is it expected that Leader in case of its restart triggers shut down of all 
> its followers?* 
>  This seem

[jira] [Commented] (ZOOKEEPER-3478) Leader restart shuts down all the followers

2019-08-10 Thread Karolos Antoniadis (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904537#comment-16904537
 ] 

Karolos Antoniadis commented on ZOOKEEPER-3478:
---

Hi Lara,

Unless I'm missing something, your ZooKeeper configuration seems unusual. As 
you said, if server-11 crashes, your ZK cluster becomes unavailable, so you can 
only tolerate the failure of one specific physical server, namely server-12. 
Furthermore, you could have used 3 ZK servers in total and potentially gained 
some performance benefits (e.g., faster writes) due to the smaller quorum, 
although this probably depends on the workload. Why not use 3 physical servers, 
each running one ZK server? Then your system remains 
available if *any* one of the 3 servers crashes.
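
The availability argument above is just majority arithmetic, which the following minimal sketch works through. The class and method names are hypothetical and the host labels mirror the ones used in the report; nothing here is ZooKeeper code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative majority-quorum arithmetic; names are hypothetical. */
class QuorumLayoutSketch {

    /** Smallest number of voters that forms a strict majority. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** True if the ensemble still has a majority after one physical host is lost. */
    static boolean survivesLossOf(Map<String, Integer> zkServersPerHost, String host) {
        int total = zkServersPerHost.values().stream().mapToInt(Integer::intValue).sum();
        int remaining = total - zkServersPerHost.getOrDefault(host, 0);
        return remaining >= majority(total);
    }

    public static void main(String[] args) {
        // Layout from the report: 3 ZK servers on one host, 2 on the other.
        Map<String, Integer> reported = new LinkedHashMap<>();
        reported.put("server-11", 3);
        reported.put("server-12", 2);

        // Suggested layout: one ZK server on each of three hosts.
        Map<String, Integer> suggested = new LinkedHashMap<>();
        suggested.put("host-a", 1);
        suggested.put("host-b", 1);
        suggested.put("host-c", 1);

        for (String host : reported.keySet()) {
            System.out.println("reported, lose " + host + ": "
                    + survivesLossOf(reported, host));
        }
        for (String host : suggested.keySet()) {
            System.out.println("suggested, lose " + host + ": "
                    + survivesLossOf(suggested, host));
        }
    }
}
```

With 5 voters the majority is 3, so losing the host that carries 3 of them leaves only 2; with 3 voters on 3 hosts the majority is 2 and any single host can fail.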

Regarding your first question: It is *normal* behaviour that all the followers 
shut down during a leader election. Since there is no leader after a leader 
crash, the servers that used to be followers are not followers anymore. So the 
followers shut down and go back to the {{LOOKING}} state in order to find the 
new leader. Have a look at the code 
[here|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1380].
 If the leader crashes, {{followLeader}} throws an exception and the follower 
is subsequently {{shutdown}}.
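
As a rough illustration of that control flow, here is a toy model, not the actual {{QuorumPeer}} code: the class, the {{LeaderLink}} interface and the event strings are made up for the sketch. It only shows the shape of "follow until the link breaks, then shut down and go back to LOOKING".

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Toy model of a follower's loop; NOT ZooKeeper's actual implementation. */
class PeerLoopSketch {
    enum State { LOOKING, FOLLOWING }

    /** Hypothetical stand-in for the socket connection to the leader. */
    interface LeaderLink {
        void readPacket() throws IOException;
    }

    final List<String> events = new ArrayList<>();
    State state = State.FOLLOWING;

    /** Follows the leader until the link breaks, then falls back to LOOKING. */
    void followOnce(LeaderLink link) {
        try {
            while (true) {
                link.readPacket(); // throws once the leader's socket is closed
            }
        } catch (IOException e) {
            events.add("WARN Exception when following the leader: " + e);
        } finally {
            // The follower role is always torn down before a new election starts.
            events.add("INFO shutdown called");
            state = State.LOOKING;
        }
    }

    public static void main(String[] args) {
        PeerLoopSketch peer = new PeerLoopSketch();
        int[] packetsLeft = {3};
        // Simulated leader: delivers three packets, then the connection drops.
        peer.followOnce(() -> {
            if (packetsLeft[0]-- == 0) {
                throw new EOFException();
            }
        });
        peer.events.forEach(System.out::println);
        System.out.println("state = " + peer.state);
    }
}
```

The WARN followed by "shutdown called" in the follower log is the same sequence this sketch records, which is why it shows up on every follower at once when the leader goes away.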


Later on, you state that 20 becomes the leader and indeed this seems to be the 
case. However, note that the notification messages received after leader 
election seem to suggest that servers 10, 12, 11 think that 21 is the actual 
leader since they have {{21 (n.leader)}}. What might be happening here is 
something akin to a race condition. For example, the following steps might have 
taken place:
1) Assume server 20 receives enough notifications to become the leader.
2) Before server 20 changes its state to {{LEADING}}, server 21 comes back 
online and starts a leader election by sending notification messages to the 
other servers.
3) The remaining servers agree that 21 is the new leader.
4) Server 20 changes its state to {{LEADING}} and tries to 
{{getEpochToPropose}}, but fails since the other servers now consider 21 to be 
the leader.
This would explain why servers 10, 11, and 12 try to connect to server 21 
instead of 20 as you mention. As a matter of fact, I managed to reproduce the 
aforementioned behaviour in the [3.4.10 
release|https://github.com/apache/zookeeper/releases/tag/release-3.4.10].

You mentioned that "*The restart was executed in a small amount of time*". If 
the time between restarts had been longer, I believe the issue would not have 
appeared.


However, I'm not sure about this: "After 3 unsuccessfull retries from servers 
10,11,12, since the quorum can not be achieved, connection times out and 
followers started to shut down again, After they are up, another election is 
triggered and new LEADER is now located on the first node (Server that becomes 
a new leader is 12):" I did not manage to reproduce this behaviour.


Are you able to consistently reproduce the issues you mentioned every time you 
restart the servers?

 

Cheers,
Karolos

> Leader restart shuts down all the followers
> ---
>
> Key: ZOOKEEPER-3478
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3478
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.10
>Reporter: Lara Catipovic
>Priority: Major

[jira] [Updated] (ZOOKEEPER-3495) Broken test in JDK12+: SnapshotDigestTest.testDifferentDigestVersion

2019-08-10 Thread Jie Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Huang updated ZOOKEEPER-3495:
-
Priority: Minor  (was: Blocker)

> Broken test in JDK12+: SnapshotDigestTest.testDifferentDigestVersion
> 
>
> Key: ZOOKEEPER-3495
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3495
> Project: ZooKeeper
>  Issue Type: Test
>Reporter: Andor Molnar
>Assignee: Szalay-Beko Mate
>Priority: Minor
>
> This test uses reflection to get access to "modifiers" field in Field class 
> which is not supported any longer in Java 12+ versions. Please modify the 
> test accordingly.
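
For context, a small probe along the following lines shows the behaviour the description refers to. It is only a sketch; the class name and the version-parsing helper are made up for illustration and are not part of the test in question.

```java
import java.lang.reflect.Field;

/** Probe for reflective access to java.lang.reflect.Field's "modifiers" field. */
class ModifiersFieldProbe {

    /** Returns true if Field.class still exposes a "modifiers" field via reflection. */
    static boolean modifiersFieldVisible() {
        try {
            Field.class.getDeclaredField("modifiers");
            return true;
        } catch (NoSuchFieldException e) {
            // On JDK 12 and later the fields of Field are filtered from reflection.
            return false;
        }
    }

    /** Parses the Java feature version, e.g. "1.8" -> 8 and "17" -> 17. */
    static int featureVersion() {
        String spec = System.getProperty("java.specification.version");
        return spec.startsWith("1.")
                ? Integer.parseInt(spec.substring(2))
                : Integer.parseInt(spec);
    }

    public static void main(String[] args) {
        System.out.println("Java feature version: " + featureVersion());
        System.out.println("'modifiers' visible:   " + modifiersFieldVisible());
    }
}
```

A test that strips {{final}} from a field through this route will therefore need a different approach on newer JDKs, for example making the value configurable through a test-visible setter.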





[jira] [Created] (ZOOKEEPER-3503) Add server-side large request protection

2019-08-10 Thread Jie Huang (JIRA)
Jie Huang created ZOOKEEPER-3503:


 Summary: Add server-side large request protection
 Key: ZOOKEEPER-3503
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3503
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.6.0
Reporter: Jie Huang


This task adds a new request limiting mechanism to ZooKeeper that aims to 
protect ZooKeeper from accepting too many large requests and crashing because 
it runs out of memory. This is designed to augment the connection throttling 
(ZOOKEEPER-3242) and request throttling (ZOOKEEPER-3243), which focus on 
limiting the number rather than size of requests.
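
To make the distinction between limiting the number and limiting the size of requests concrete, here is one possible shape such a limiter could take. This is purely an illustrative sketch and not the design proposed in this issue; the class name, the threshold and the byte budget are all assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical byte-budget limiter for large requests; not ZooKeeper's design. */
class LargeRequestBudgetSketch {
    private final long largeThresholdBytes; // requests below this size are not counted
    private final long maxInFlightBytes;    // total bytes of large requests allowed at once
    private final AtomicLong inFlightBytes = new AtomicLong();

    LargeRequestBudgetSketch(long largeThresholdBytes, long maxInFlightBytes) {
        this.largeThresholdBytes = largeThresholdBytes;
        this.maxInFlightBytes = maxInFlightBytes;
    }

    /** Returns true if the request may be processed now. */
    boolean tryAdmit(int requestBytes) {
        if (requestBytes < largeThresholdBytes) {
            return true; // small requests are left to the count-based throttles
        }
        while (true) {
            long current = inFlightBytes.get();
            long next = current + requestBytes;
            if (next > maxInFlightBytes) {
                return false; // admitting this request would exceed the byte budget
            }
            if (inFlightBytes.compareAndSet(current, next)) {
                return true;
            }
        }
    }

    /** Must be called once an admitted request has finished processing. */
    void release(int requestBytes) {
        if (requestBytes >= largeThresholdBytes) {
            inFlightBytes.addAndGet(-requestBytes);
        }
    }

    public static void main(String[] args) {
        LargeRequestBudgetSketch limiter = new LargeRequestBudgetSketch(100, 250);
        System.out.println(limiter.tryAdmit(50));  // small request, not counted
        System.out.println(limiter.tryAdmit(200)); // fits in the budget
        System.out.println(limiter.tryAdmit(100)); // 200 + 100 exceeds 250
        limiter.release(200);
        System.out.println(limiter.tryAdmit(100)); // budget is free again
    }
}
```

The point of the sketch is that a count-based throttle would treat the 50-byte and the 200-byte request identically, whereas a byte budget bounds the memory held by requests in flight.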





[jira] [Updated] (ZOOKEEPER-3429) Flaky test test:org.apache.zookeeper.test.DisconnectedWatcherTest.testManyChildWatchersAutoReset

2019-08-10 Thread Andor Molnar (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andor Molnar updated ZOOKEEPER-3429:

Issue Type: Sub-task  (was: Test)
Parent: ZOOKEEPER-3170

> Flaky test 
> test:org.apache.zookeeper.test.DisconnectedWatcherTest.testManyChildWatchersAutoReset
> 
>
> Key: ZOOKEEPER-3429
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3429
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: tests
>Reporter: maoling
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper-trunk-java9/lastFailedBuild/testReport/junit/org.apache.zookeeper.test/DisconnectedWatcherTest/testManyChildWatchersAutoReset/]
>  
> {code:java}
> Error Message
> test timed out after 84 milliseconds
> Stacktrace
> org.junit.runners.model.TestTimedOutException: test timed out after 84 
> milliseconds
>   at java.base@9.0.1/java.lang.Object.wait(Native Method)
>   at java.base@9.0.1/java.lang.Object.wait(Object.java:516)
>   at 
> app//org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1556)
>   at 
> app//org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1539)
>   at app//org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1537)
>   at 
> app//org.apache.zookeeper.test.DisconnectedWatcherTest.testManyChildWatchersAutoReset(DisconnectedWatcherTest.java:247)
>   at 
> java.base@9.0.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>   at 
> java.base@9.0.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base@9.0.1/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 
> app//org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:80)
>   at 
> java.base@9.0.1/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at java.base@9.0.1/java.lang.Thread.run(Thread.java:844)
> {code}
>  





[jira] [Updated] (ZOOKEEPER-3502) improve the server commands: zabstate to have a better observation on the process of leader election

2019-08-10 Thread maoling (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

maoling updated ZOOKEEPER-3502:
---
Summary: improve the server commands: zabstate to have a better observation 
on the process of leader election  (was: improve the server commands: zabstate 
to have a better observation on the process of leader elction)

> improve the server commands: zabstate to have a better observation on the 
> process of leader election
> 
>
> Key: ZOOKEEPER-3502
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3502
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Reporter: maoling
>Assignee: maoling
>Priority: Major
> Fix For: 3.6.0
>
>






[jira] [Created] (ZOOKEEPER-3502) improve the server commands: zabstate to have a better observation on the process of leader elction

2019-08-10 Thread maoling (JIRA)
maoling created ZOOKEEPER-3502:
--

 Summary: improve the server commands: zabstate to have a better 
observation on the process of leader elction
 Key: ZOOKEEPER-3502
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3502
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Reporter: maoling
Assignee: maoling
 Fix For: 3.6.0








[jira] [Updated] (ZOOKEEPER-3501) unify the method:op2String()

2019-08-10 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-3501:
--
Labels: pull-request-available  (was: )

> unify the method:op2String()
> 
>
> Key: ZOOKEEPER-3501
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3501
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Reporter: maoling
>Assignee: maoling
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.6.0
>
>
> There are two duplicate methods
> *public static String op2String(int op)*
> in the code base:
>  
> {code:java}
> org.apache.zookeeper.server.TraceFormatter#op2String
> org.apache.zookeeper.server.Request#op2String
> {code}
>  
> and they are inconsistent; we should unify them and keep only one
>  
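
One way to read "keep only one" is a single shared implementation that every caller uses. The sketch below illustrates that shape only; the class names are hypothetical, the op codes and strings are illustrative, and it does not reproduce either of the existing methods.

```java
/** Hypothetical single home for the op-code-to-name mapping. */
class OpNamesSketch {

    /** The only implementation; the codes and names here are illustrative. */
    static String op2String(int op) {
        switch (op) {
            case 1:
                return "create";
            case 2:
                return "delete";
            case 3:
                return "exists";
            case 4:
                return "getData";
            case 5:
                return "setData";
            default:
                return "unknown " + op;
        }
    }
}

/** Stand-in for a caller that formats trace output. */
class TraceCallerSketch {
    static String describe(int op) {
        return "trace op=" + OpNamesSketch.op2String(op);
    }
}

/** Stand-in for a caller that formats a request for logging. */
class RequestCallerSketch {
    static String describe(int op) {
        return "request type:" + OpNamesSketch.op2String(op);
    }
}
```

Because both callers go through the same method, a new op code only needs to be added in one place and the two outputs cannot drift apart.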





[jira] [Updated] (ZOOKEEPER-3475) Enable BookKeeper checkstyle configuration on zookeeper-server

2019-08-10 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-3475:
--
Labels: pull-request-available  (was: )

> Enable BookKeeper checkstyle configuration on zookeeper-server
> --
>
> Key: ZOOKEEPER-3475
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3475
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: 3.6.0
>Reporter: TisonKun
>Assignee: TisonKun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.6.0
>
>
> Enable BookKeeper checkstyle configuration on zookeeper-server





[jira] [Assigned] (ZOOKEEPER-3501) unify the method:op2String()

2019-08-10 Thread maoling (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

maoling reassigned ZOOKEEPER-3501:
--

Assignee: maoling

> unify the method:op2String()
> 
>
> Key: ZOOKEEPER-3501
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3501
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Reporter: maoling
>Assignee: maoling
>Priority: Minor
> Fix For: 3.6.0
>
>


