[jira] [Commented] (ZOOKEEPER-3036) Unexpected exception in zookeeper

2019-04-14 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817358#comment-16817358
 ] 

Michael K. Edwards commented on ZOOKEEPER-3036:
---

We encountered a similar error on the Leader of our production ZooKeeper 
cluster (running stock 3.4.13, 7 voting members, 2 observers).
{noformat}
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:16,065 
[myid:6] - ERROR [LearnerHandler-/10.3.50.66:39854:LearnerHandler@648] - 
Unexpected exception causing shutdown while sock still open
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: 
java.net.SocketTimeoutException: Read timed out
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at 
java.net.SocketInputStream.socketRead0(Native Method)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at 
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at 
java.net.SocketInputStream.read(SocketInputStream.java:171)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at 
java.net.SocketInputStream.read(SocketInputStream.java:141)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at 
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at 
java.io.BufferedInputStream.read(BufferedInputStream.java:265)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at 
java.io.DataInputStream.readInt(DataInputStream.java:387)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at 
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at 
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at 
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:16,065 
[myid:6] - WARN  [LearnerHandler-/10.3.50.66:39854:LearnerHandler@661] - 
*** GOODBYE /10.3.50.66:39854 
Apr 12 16:18:17 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:17,079 
[myid:6] - INFO  [SessionTracker:ZooKeeperServer@355] - Expiring session 
0x9180f040009, timeout of 6000ms exceeded
Apr 12 16:18:17 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:17,080 
[myid:6] - INFO  [ProcessThread(sid:6 cport:-1)::PrepRequestProcessor@487] - 
Processed session termination for sessionid: 0x9180f040009
Apr 12 16:18:17 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:17,084 
[myid:6] - INFO  [ProcessThread(sid:6 cport:-1)::PrepRequestProcessor@653] - 
Got user-level KeeperException when processing sessionid:0x9180f040007 
type:create cxid:0x1da9 zxid:0x9000126d8 txntype:-1 reqpath:n/a Error 
Path:/kafka/controller Error:KeeperErrorCode = NodeExists for /kafka/controller
Apr 12 16:18:19 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:19,897 
[myid:6] - INFO  [ProcessThread(sid:6 cport:-1)::PrepRequestProcessor@653] - 
Got user-level KeeperException when processing sessionid:0x81d10880004 
type:delete cxid:0x566b zxid:0x900012dbd txntype:-1 reqpath:n/a Error 
Path:/kafka/admin/preferred_replica_election Error:KeeperErrorCode = NoNode for 
/kafka/admin/preferred_replica_election
{noformat}

It appears that this was the first anomalous behavior logged by the cluster, 
and that it triggered this failure at the observer on the other end of the 
connection:
{noformat}
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,066 
[myid:9] - WARN  [QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:Observer@77] - 
Exception when observing the leader
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: java.io.EOFException
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at 
java.io.DataInputStream.readInt(DataInputStream.java:392)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at 
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at 
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at 
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at 
org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:73)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:968)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,066 
[myid:9] - 

[jira] [Commented] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-27 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700689#comment-16700689
 ] 

Michael K. Edwards commented on ZOOKEEPER-2778:
---

The current versions of this patch (#719 for master, #707 for branch-3.5) build 
green and don't have extraneous content.

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: Michael K. Edwards
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> It's possible to have a deadlock during the recovery phase.
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest
> [1]. Here is a sample thread dump that illustrates the state of the execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread that runs the follower's
> sync-with-leader work and the listener thread of that peer's QuorumCnxManager
> (qcm), which handles incoming connections. To finish syncing with the leader,
> the follower must synchronize on both QV_LOCK and the qcm object it owns; to
> finish setting up an incoming connection, the receiver thread must synchronize
> on both the qcm object owned by the quorum peer and the same QV_LOCK. The
> problem is that the two threads acquire the locks in opposite orders, so,
> depending on timing and actual execution order, each can end up holding one
> lock while waiting for the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig
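
To make the lock-ordering inversion concrete, here is a minimal, self-contained 
sketch (illustrative only; qvLock and qcm below stand in for QV_LOCK and the 
QuorumCnxManager monitor, and this is not ZooKeeper code) of two threads taking 
the same two locks in opposite orders:

{code:java}
// Illustrative only: "qvLock" and "qcm" stand in for QV_LOCK and the
// QuorumCnxManager monitor described above; this is not ZooKeeper code.
public class LockOrderDeadlockDemo {
    private static final Object qvLock = new Object();
    private static final Object qcm = new Object();

    public static void main(String[] args) {
        // Follower sync path: takes QV_LOCK first, then the qcm monitor.
        Thread syncWithLeader = new Thread(() -> {
            synchronized (qvLock) {
                pause(100); // widen the race window
                synchronized (qcm) {
                    System.out.println("sync path acquired both locks");
                }
            }
        });
        // Listener path: takes the qcm monitor first, then QV_LOCK.
        Thread receiveConnection = new Thread(() -> {
            synchronized (qcm) {
                pause(100);
                synchronized (qvLock) {
                    System.out.println("receive path acquired both locks");
                }
            }
        });
        syncWithLeader.start();
        receiveConnection.start();
        // With the pauses in place, each thread ends up holding one lock and
        // blocking forever on the other: a classic ABBA deadlock.
    }

    private static void pause(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }
}
{code}

The usual remedy is to have both paths acquire the locks in a single consistent 
order, or to avoid holding one lock while taking the other.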



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2488) Unsynchronized access to shuttingDownLE in QuorumPeer

2018-11-25 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698585#comment-16698585
 ] 

Michael K. Edwards commented on ZOOKEEPER-2488:
---

I pulled that fix out as a separate PR (#724).

> Unsynchronized access to shuttingDownLE in QuorumPeer
> -
>
> Key: ZOOKEEPER-2488
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2488
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.2
>Reporter: Michael Han
>Assignee: gaoshu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Access to shuttingDownLE in QuorumPeer is not synchronized here:
> https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1066
> https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/
> The access should be synchronized, since the same variable may be accessed
> in QuorumPeer::restartLeaderElection, which is synchronized.
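
As a rough illustration of the kind of fix being discussed (a simplified 
stand-in, not the actual QuorumPeer code; the field and method names are only 
analogous), every read and write of the shared flag goes through the same 
monitor that the synchronized method already holds:

{code:java}
// Simplified stand-in for the pattern discussed above, not ZooKeeper source.
public class PeerLike {
    // Shared flag analogous to shuttingDownLE; guarded by "this".
    private boolean shuttingDownLE = false;

    public synchronized void restartLeaderElection() {
        shuttingDownLE = true;   // written under the monitor, as in the synchronized method
        // ... shut down and restart the election algorithm ...
        shuttingDownLE = false;
    }

    public void mainLoop() {
        // An unsynchronized "if (shuttingDownLE) ..." may observe a stale value;
        // reading under the same monitor makes visibility explicit.
        boolean restarting;
        synchronized (this) {
            restarting = shuttingDownLE;
        }
        if (restarting) {
            // ... react to the restart ...
        }
    }
}
{code}

Marking the field volatile (or using AtomicBoolean) is an alternative when only 
visibility, rather than a compound check-then-act, is needed.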



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3201) Flaky test: org.apache.zookeeper.server.ZxidRolloverTest.testRolloverThenLeaderRestart

2018-11-25 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698526#comment-16698526
 ] 

Michael K. Edwards commented on ZOOKEEPER-3201:
---

What seems to be happening here is that this "second chance" logic isn't 
sufficient to ensure that we don't hit {{ConnectionLossException}} again on the 
second {{zk.exists()}} call.

{noformat}
/**
 * Ensure the client is able to talk to the server.
 * 
 * @param idx the idx of the server the client is talking to
 */
private void checkClientConnected(int idx) throws Exception {
    ZooKeeper zk = getClient(idx);
    if (zk == null) {
        return;
    }
    try {
        Assert.assertNull(zk.exists("/foofoofoo-connected", false));
    } catch (ConnectionLossException e) {
        // second chance...
        // in some cases, leader change in particular, the timing is
        // very tricky to get right in order to assure that the client has
        // disconnected and reconnected. In some cases the client will
        // disconnect, then attempt to reconnect before the server is
        // back, in which case we'll see another connloss on the operation
        // in the try, this catches that case and waits for the server
        // to come back
        PeerStruct peer = qu.getPeer(idx);
        Assert.assertTrue("Waiting for server down",
                ClientBase.waitForServerUp(
                        "127.0.0.1:" + peer.clientPort,
                        ClientBase.CONNECTION_TIMEOUT));

        Assert.assertNull(zk.exists("/foofoofoo-connected", false));
    }
}
{noformat}
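
One way to harden this check is to retry the probe a bounded number of times 
rather than allowing exactly one {{ConnectionLossException}}. The sketch below 
is hypothetical, not existing ZooKeeper test code; it reuses the helpers visible 
in the snippet above (getClient, qu.getPeer, ClientBase.waitForServerUp) and 
assumes they behave as shown there.

{code:java}
// Hypothetical variant of checkClientConnected: keep retrying the probe for a
// bounded number of attempts, waiting for the server between attempts, instead
// of allowing exactly one ConnectionLossException.
private void checkClientConnectedRetrying(int idx) throws Exception {
    ZooKeeper zk = getClient(idx);
    if (zk == null) {
        return;
    }
    final int maxAttempts = 5;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            Assert.assertNull(zk.exists("/foofoofoo-connected", false));
            return; // probe succeeded; the client is connected
        } catch (ConnectionLossException e) {
            if (attempt == maxAttempts) {
                throw e; // give up after the last attempt
            }
            // Wait for the server to come back before probing again; the
            // client may still be mid-reconnect when the server returns.
            PeerStruct peer = qu.getPeer(idx);
            Assert.assertTrue("Waiting for server up",
                    ClientBase.waitForServerUp(
                            "127.0.0.1:" + peer.clientPort,
                            ClientBase.CONNECTION_TIMEOUT));
        }
    }
}
{code}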

> Flaky test: 
> org.apache.zookeeper.server.ZxidRolloverTest.testRolloverThenLeaderRestart
> --
>
> Key: ZOOKEEPER-3201
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3201
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Michael K. Edwards
>Priority: Major
>
> Encountered when running tests locally:
> {noformat}
> 64429     [junit] 2018-11-25 22:28:12,729 [myid:127.0.0.1:27389] - INFO  [main-SendThread(127.0.0.1:27389):ClientCnxn$SendThread@1108] - Opening socket connection to server localhost/127.0.0.1:27389. Will not attempt to authenticate using SASL (unknown error)
> 64430     [junit] 2018-11-25 22:28:12,730 [myid:127.0.0.1:27389] - INFO  [main-SendThread(127.0.0.1:27389):ClientCnxn$SendThread@955] - Socket connection established, initiating session, client: /127.0.0.1:47668, server: localhost/127.0.0.1:27389
> 64431     [junit] 2018-11-25 22:28:12,734 [myid:] - INFO  [NIOWorkerThread-1:Learner@117] - Revalidating client: 0x1a9cccf
> 64432     [junit] 2018-11-25 22:28:12,743 [myid:127.0.0.1:27389] - INFO  [main-SendThread(127.0.0.1:27389):ClientCnxn$SendThread@1390] - Session establishment complete on server localhost/127.0.0.1:27389, sessionid = 0x1a9cccf, negotiated timeout = 3
> 64433     [junit] 2018-11-25 22:28:13,009 [myid:127.0.0.1:27392] - INFO  [main-SendThread(127.0.0.1:27392):ClientCnxn$SendThread@1108] - Opening socket connection to server localhost/127.0.0.1:27392. Will not attempt to authenticate using SASL (unknown error)
> 64434     [junit] 2018-11-25 22:28:13,009 [myid:127.0.0.1:27392] - INFO  [main-SendThread(127.0.0.1:27392):ClientCnxn$SendThread@955] - Socket connection established, initiating session, client: /127.0.0.1:52160, server: localhost/127.0.0.1:27392
> 64435     [junit] 2018-11-25 22:28:13,016 [myid:127.0.0.1:27395] - INFO  [main-SendThread(127.0.0.1:27395):ClientCnxn$SendThread@1108] - Opening socket connection to server localhost/127.0.0.1:27395. Will not attempt to authenticate using SASL (unknown error)
> 64436     [junit] 2018-11-25 22:28:13,016 [myid:127.0.0.1:27395] - INFO  [main-SendThread(127.0.0.1:27395):ClientCnxn$SendThread@955] - Socket connection established, initiating session, client: /127.0.0.1:47256, server: localhost/127.0.0.1:27395
> 64437     [junit] 2018-11-25 22:28:13,017 [myid:] - INFO  [NIOWorkerThread-4:ZooKeeperServer@1030] - Refusing session request for client /127.0.0.1:47256 as it has seen zxid 0x3 our last zxid is 0x2fffe client must try another server
> 64438     [junit] 2018-11-25 22:28:13,018 [myid:127.0.0.1:27395] - INFO  [main-SendThread(127.0.0.1:27395):ClientCnxn$SendThread@1236] - Unable to read additional data from server sessionid 0x3a9ccd2, likely server has closed socket, closing socket connection and attempting reconnect
> 64439     [junit] 2018-11-25 22:28:13,023 [myid:127.0.0.1:27392] - INFO  [main-SendThread(127.0.0.1:27392):ClientCnxn$SendThread@

[jira] [Created] (ZOOKEEPER-3202) Flaky test: org.apache.zookeeper.test.ClientSSLTest.testClientServerSSL

2018-11-25 Thread Michael K. Edwards (JIRA)
Michael K. Edwards created ZOOKEEPER-3202:
-

 Summary: Flaky test: 
org.apache.zookeeper.test.ClientSSLTest.testClientServerSSL
 Key: ZOOKEEPER-3202
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3202
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Michael K. Edwards


Encountered while running tests locally:
{noformat}
283208     [junit] 2018-11-25 22:35:31,581 [myid:2] - INFO  [QuorumPeer[myid=2](plain=localhost/127.0.0.1:11230)(secure=0.0.0.0/0.0.0.0:11231):ZooKeeperServer@164] - Created server with tickTime 4000 minSessionTimeout 8000 maxSessionTimeout 8 datadir /usr/src/zookeeper/build/test/tmp/test6909783885989201471.junit.dir/data/version-2 snapdir /usr/src/zookeeper/build/test/tmp/test6909783885989201471.junit.dir/data/version-2
283209     [junit] 2018-11-25 22:35:31,582 [myid:1] - INFO  [QuorumPeer[myid=1](plain=localhost/127.0.0.1:11226)(secure=0.0.0.0/0.0.0.0:11227):ZooKeeperServer@164] - Created server with tickTime 4000 minSessionTimeout 8000 maxSessionTimeout 8 datadir /usr/src/zookeeper/build/test/tmp/test9169467659375976724.junit.dir/data/version-2 snapdir /usr/src/zookeeper/build/test/tmp/test9169467659375976724.junit.dir/data/version-2
283210     [junit] 2018-11-25 22:35:31,581 [myid:0] - INFO  [QuorumPeer[myid=0](plain=localhost/127.0.0.1:11222)(secure=0.0.0.0/0.0.0.0:11223):ZooKeeperServer@164] - Created server with tickTime 4000 minSessionTimeout 8000 maxSessionTimeout 8 datadir /usr/src/zookeeper/build/test/tmp/test8933570428019756122.junit.dir/data/version-2 snapdir /usr/src/zookeeper/build/test/tmp/test8933570428019756122.junit.dir/data/version-2
283211     [junit] 2018-11-25 22:35:31,585 [myid:0] - INFO  [QuorumPeer[myid=0](plain=localhost/127.0.0.1:11222)(secure=0.0.0.0/0.0.0.0:11223):Follower@69] - FOLLOWING - LEADER ELECTION TOOK - 275 MS
283212     [junit] 2018-11-25 22:35:31,588 [myid:2] - INFO  [QuorumPeer[myid=2](plain=localhost/127.0.0.1:11230)(secure=0.0.0.0/0.0.0.0:11231):Leader@457] - LEADING - LEADER ELECTION TOOK - 160 MS
283213     [junit] 2018-11-25 22:35:31,582 [myid:1] - INFO  [QuorumPeer[myid=1](plain=localhost/127.0.0.1:11226)(secure=0.0.0.0/0.0.0.0:11227):Follower@69] - FOLLOWING - LEADER ELECTION TOOK - 155 MS
283214     [junit] 2018-11-25 22:35:31,633 [myid:2] - INFO  [QuorumPeer[myid=2](plain=localhost/127.0.0.1:11230)(secure=0.0.0.0/0.0.0.0:11231):FileTxnSnapLog@372] - Snapshotting: 0x0 to /usr/src/zookeeper/build/test/tmp/test6909783885989201471.junit.dir/data/version-2/snapshot.0
283215     [junit] 2018-11-25 22:35:31,694 [myid:] - INFO  [main:FourLetterWordMain@87] - connecting to 127.0.0.1 11222
283216     [junit] 2018-11-25 22:35:31,695 [myid:0] - INFO  [New I/O worker #11:NettyServerCnxn@288] - Processing stat command from /127.0.0.1:60484
283217     [junit] 2018-11-25 22:35:31,699 [myid:] - INFO  [main:JUnit4ZKTestRunner$LoggedInvokeMethod@98] - TEST METHOD FAILED testClientServerSSL
283218     [junit] java.lang.AssertionError: waiting for server 0 being up
283219     [junit]     at org.junit.Assert.fail(Assert.java:88)
283220     [junit]     at org.junit.Assert.assertTrue(Assert.java:41)
283221     [junit]     at org.apache.zookeeper.test.ClientSSLTest.testClientServerSSL(ClientSSLTest.java:98){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ZOOKEEPER-3201) Flaky test: org.apache.zookeeper.server.ZxidRolloverTest.testRolloverThenLeaderRestart

2018-11-25 Thread Michael K. Edwards (JIRA)
Michael K. Edwards created ZOOKEEPER-3201:
-

 Summary: Flaky test: 
org.apache.zookeeper.server.ZxidRolloverTest.testRolloverThenLeaderRestart
 Key: ZOOKEEPER-3201
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3201
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Michael K. Edwards


Encountered when running tests locally:
{noformat}
64429     [junit] 2018-11-25 22:28:12,729 [myid:127.0.0.1:27389] - INFO  [main-SendThread(127.0.0.1:27389):ClientCnxn$SendThread@1108] - Opening socket connection to server localhost/127.0.0.1:27389. Will not attempt to authenticate using SASL (unknown error)
64430     [junit] 2018-11-25 22:28:12,730 [myid:127.0.0.1:27389] - INFO  [main-SendThread(127.0.0.1:27389):ClientCnxn$SendThread@955] - Socket connection established, initiating session, client: /127.0.0.1:47668, server: localhost/127.0.0.1:27389
64431     [junit] 2018-11-25 22:28:12,734 [myid:] - INFO  [NIOWorkerThread-1:Learner@117] - Revalidating client: 0x1a9cccf
64432     [junit] 2018-11-25 22:28:12,743 [myid:127.0.0.1:27389] - INFO  [main-SendThread(127.0.0.1:27389):ClientCnxn$SendThread@1390] - Session establishment complete on server localhost/127.0.0.1:27389, sessionid = 0x1a9cccf, negotiated timeout = 3
64433     [junit] 2018-11-25 22:28:13,009 [myid:127.0.0.1:27392] - INFO  [main-SendThread(127.0.0.1:27392):ClientCnxn$SendThread@1108] - Opening socket connection to server localhost/127.0.0.1:27392. Will not attempt to authenticate using SASL (unknown error)
64434     [junit] 2018-11-25 22:28:13,009 [myid:127.0.0.1:27392] - INFO  [main-SendThread(127.0.0.1:27392):ClientCnxn$SendThread@955] - Socket connection established, initiating session, client: /127.0.0.1:52160, server: localhost/127.0.0.1:27392
64435     [junit] 2018-11-25 22:28:13,016 [myid:127.0.0.1:27395] - INFO  [main-SendThread(127.0.0.1:27395):ClientCnxn$SendThread@1108] - Opening socket connection to server localhost/127.0.0.1:27395. Will not attempt to authenticate using SASL (unknown error)
64436     [junit] 2018-11-25 22:28:13,016 [myid:127.0.0.1:27395] - INFO  [main-SendThread(127.0.0.1:27395):ClientCnxn$SendThread@955] - Socket connection established, initiating session, client: /127.0.0.1:47256, server: localhost/127.0.0.1:27395
64437     [junit] 2018-11-25 22:28:13,017 [myid:] - INFO  [NIOWorkerThread-4:ZooKeeperServer@1030] - Refusing session request for client /127.0.0.1:47256 as it has seen zxid 0x3 our last zxid is 0x2fffe client must try another server
64438     [junit] 2018-11-25 22:28:13,018 [myid:127.0.0.1:27395] - INFO  [main-SendThread(127.0.0.1:27395):ClientCnxn$SendThread@1236] - Unable to read additional data from server sessionid 0x3a9ccd2, likely server has closed socket, closing socket connection and attempting reconnect
64439     [junit] 2018-11-25 22:28:13,023 [myid:127.0.0.1:27392] - INFO  [main-SendThread(127.0.0.1:27392):ClientCnxn$SendThread@1390] - Session establishment complete on server localhost/127.0.0.1:27392, sessionid = 0x2a9d094, negotiated timeout = 3
64440     [junit] 2018-11-25 22:28:13,119 [myid:] - INFO  [main:FourLetterWordMain@87] - connecting to 127.0.0.1 27395
64441     [junit] 2018-11-25 22:28:13,120 [myid:] - INFO  [NIOWorkerThread-1:NIOServerCnxn@518] - Processing stat command from /127.0.0.1:47258
64442     [junit] 2018-11-25 22:28:13,121 [myid:] - INFO  [NIOWorkerThread-1:StatCommand@53] - Stat command output
64443     [junit] 2018-11-25 22:28:14,134 [myid:127.0.0.1:27395] - INFO  [main-SendThread(127.0.0.1:27395):ClientCnxn$SendThread@1108] - Opening socket connection to server localhost/127.0.0.1:27395. Will not attempt to authenticate using SASL (unknown error)
64444     [junit] 2018-11-25 22:28:14,135 [myid:127.0.0.1:27395] - INFO  [main-SendThread(127.0.0.1:27395):ClientCnxn$SendThread@955] - Socket connection established, initiating session, client: /127.0.0.1:47312, server: localhost/127.0.0.1:27395
64445     [junit] 2018-11-25 22:28:14,135 [myid:] - INFO  [NIOWorkerThread-2:ZooKeeperServer@1030] - Refusing session request for client /127.0.0.1:47312 as it has seen zxid 0x3 our last zxid is 0x2fffe client must try another server
64446     [junit] 2018-11-25 22:28:14,137 [myid:127.0.0.1:27395] - INFO  [main-SendThread(127.0.0.1:27395):ClientCnxn$SendThread@1236] - Unable to read additional data from server sessionid 0x3a9ccd2, likely server has closed socket, closing socket connection and attempting reconnect
64447     [junit] 2018-11-25 22:28:14,240 [myid:] - INFO  [main:JUnit4ZKTestRunner$LoggedInvokeMethod@98] - TEST METHOD FAILED testRolloverThenLeaderRestart
64448     [junit] org.apache.zookeeper.KeeperException$ConnectionLossException: 

[jira] [Comment Edited] (ZOOKEEPER-3046) testManyChildWatchersAutoReset is flaky

2018-11-25 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698341#comment-16698341
 ] 

Michael K. Edwards edited comment on ZOOKEEPER-3046 at 11/25/18 10:21 PM:
--

Still seeing test failures; basically a variant of ZOOKEEPER-2508.  (After 
stopping/starting the server, we have to wait for all clients to reconnect 
before continuing the test.)

{noformat}
422005 [junit] 2018-11-25 21:25:50,228 [myid:127.0.0.1:16611] - INFO  [Time-limited test-SendThread(127.0.0.1:16611):ClientCnxn$SendThread@1390] - Session establishment complete on server localhost/127.0.0.1:16611, sessionid = 0x17077c50001, negotiated timeout = 3
422006 [junit] 2018-11-25 21:25:50,286 [myid:] - INFO  [Time-limited test:JUnit4ZKTestRunner$LoggedInvokeMethod@98] - TEST METHOD FAILED testManyChildWatchersAutoReset
422007 [junit] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /long-path-0-1-2-3-4-5-6-7-8-9/ch-00/ch
422008 [junit] at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
422009 [junit] at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
422010 [junit] at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1459)
422011 [junit] at org.apache.zookeeper.test.DisconnectedWatcherTest.testManyChildWatchersAutoReset(DisconnectedWatcherTest.java:229)
{noformat}
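
A minimal sketch of the "wait for all clients to reconnect" step described 
above, assuming the test keeps its client handles in a list; 
waitForClientsReconnected and zkClients are hypothetical names, not part of 
DisconnectedWatcherTest:

{code:java}
// Hypothetical helper (assumes: import java.util.List; import org.apache.zookeeper.ZooKeeper):
// after restarting the server, block until every client handle has re-established
// its session, or fail once the deadline passes, before the test calls create() again.
private void waitForClientsReconnected(List<ZooKeeper> zkClients, long timeoutMillis)
        throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    for (ZooKeeper zk : zkClients) {
        while (!zk.getState().isConnected()) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError("client did not reconnect in time: " + zk);
            }
            Thread.sleep(100); // simple polling; a watcher-based latch would also work
        }
    }
}
{code}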


was (Author: mkedwards):
Still seeing test failures; basically a variant of ZOOKEEPER-2508.  (After 
stopping/starting the server, we have to wait for all clients to reconnect 
before continuing the test.)

{{
422005 [junit] 2018-11-25 21:25:50,228 [myid:127.0.0.1:16611] - INFO  [Time-limited test-SendThread(127.0.0.1:16611):ClientCnxn$SendThread@1390] - Session establishment complete on server localhost/127.0.0.1:16611, sessionid = 0x17077c50001, negotiated timeout = 3
422006 [junit] 2018-11-25 21:25:50,286 [myid:] - INFO  [Time-limited test:JUnit4ZKTestRunner$LoggedInvokeMethod@98] - TEST METHOD FAILED testManyChildWatchersAutoReset
422007 [junit] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /long-path-0-1-2-3-4-5-6-7-8-9/ch-00/ch
422008 [junit] at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
422009 [junit] at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
422010 [junit] at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1459)
422011 [junit] at org.apache.zookeeper.test.DisconnectedWatcherTest.testManyChildWatchersAutoReset(DisconnectedWatcherTest.java:229)
}}

> testManyChildWatchersAutoReset is flaky
> ---
>
> Key: ZOOKEEPER-3046
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3046
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: 3.5.3, 3.4.12
>Reporter: Bogdan Kanivets
>Assignee: Bogdan Kanivets
>Priority: Minor
>  Labels: flaky, pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> According to the 
> [dashboard|https://builds.apache.org/job/ZooKeeper-Find-Flaky-Tests/lastSuccessfulBuild/artifact/report.html]
>  testManyChildWatchersAutoReset is flaky in 3.4 and 3.5
> [ZooKeeper_branch34_java10|https://builds.apache.org/job/ZooKeeper_branch34_java10//13]
> [ZooKeeper_branch35_java9|https://builds.apache.org/job/ZooKeeper_branch35_java9/253]
> Test times out and because of that ant doesn't capture any output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3046) testManyChildWatchersAutoReset is flaky

2018-11-25 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698341#comment-16698341
 ] 

Michael K. Edwards commented on ZOOKEEPER-3046:
---

Still seeing test failures; basically a variant of ZOOKEEPER-2508.  (After 
stopping/starting the server, we have to wait for all clients to reconnect 
before continuing the test.)

{{
422005 [junit] 2018-11-25 21:25:50,228 [myid:127.0.0.1:16611] - INFO  [Time-limited test-SendThread(127.0.0.1:16611):ClientCnxn$SendThread@1390] - Session establishment complete on server localhost/127.0.0.1:16611, sessionid = 0x17077c50001, negotiated timeout = 3
422006 [junit] 2018-11-25 21:25:50,286 [myid:] - INFO  [Time-limited test:JUnit4ZKTestRunner$LoggedInvokeMethod@98] - TEST METHOD FAILED testManyChildWatchersAutoReset
422007 [junit] org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /long-path-0-1-2-3-4-5-6-7-8-9/ch-00/ch
422008 [junit] at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
422009 [junit] at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
422010 [junit] at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1459)
422011 [junit] at org.apache.zookeeper.test.DisconnectedWatcherTest.testManyChildWatchersAutoReset(DisconnectedWatcherTest.java:229)
}}

> testManyChildWatchersAutoReset is flaky
> ---
>
> Key: ZOOKEEPER-3046
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3046
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: 3.5.3, 3.4.12
>Reporter: Bogdan Kanivets
>Assignee: Bogdan Kanivets
>Priority: Minor
>  Labels: flaky, pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> According to the 
> [dashboard|https://builds.apache.org/job/ZooKeeper-Find-Flaky-Tests/lastSuccessfulBuild/artifact/report.html]
>  testManyChildWatchersAutoReset is flaky in 3.4 and 3.5
> [ZooKeeper_branch34_java10|https://builds.apache.org/job/ZooKeeper_branch34_java10//13]
> [ZooKeeper_branch35_java9|https://builds.apache.org/job/ZooKeeper_branch35_java9/253]
> Test times out and because of that ant doesn't capture any output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ZOOKEEPER-3200) Flaky test: org.apache.zookeeper.server.quorum.QuorumPeerMainTest.testInconsistentDueToNewLeaderOrder

2018-11-25 Thread Michael K. Edwards (JIRA)
Michael K. Edwards created ZOOKEEPER-3200:
-

 Summary: Flaky test: 
org.apache.zookeeper.server.quorum.QuorumPeerMainTest.testInconsistentDueToNewLeaderOrder
 Key: ZOOKEEPER-3200
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3200
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Michael K. Edwards


https://builds.apache.org/job/ZooKeeper_branch35_jdk8/1206/

I've seen this locally as well, in a branch where ZOOKEEPER-2778, 
ZOOKEEPER-1818, and ZOOKEEPER-2488 have all been addressed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-24 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698085#comment-16698085
 ] 

Michael K. Edwards commented on ZOOKEEPER-2778:
---

Note that the current version of this patch also addresses ZOOKEEPER-2488.

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: Michael K. Edwards
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> It's possible to have a deadlock during the recovery phase.
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest
> [1]. Here is a sample thread dump that illustrates the state of the execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread that runs the follower's
> sync-with-leader work and the listener thread of that peer's QuorumCnxManager
> (qcm), which handles incoming connections. To finish syncing with the leader,
> the follower must synchronize on both QV_LOCK and the qcm object it owns; to
> finish setting up an incoming connection, the receiver thread must synchronize
> on both the qcm object owned by the quorum peer and the same QV_LOCK. The
> problem is that the two threads acquire the locks in opposite orders, so,
> depending on timing and actual execution order, each can end up holding one
> lock while waiting for the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3198) Handle port-binding failures in a systematic and documented fashion

2018-11-24 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698065#comment-16698065
 ] 

Michael K. Edwards commented on ZOOKEEPER-3198:
---

An attempt (as yet, not very successful) to plumb BindExceptions up the stack 
is in https://github.com/mkedwards/zookeeper/tree/broken-bind-3.5 .  I'm 
currently foundering on test cases that call 
ReconfigTest.testPortChangeToBlockedPort().
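
For the test-side half of that idea, a "retry on a fresh port when the bind 
fails" helper might look like the sketch below; PortRetry and bindWithRetry are 
hypothetical names, not code from the branch linked above:

{code:java}
// Hypothetical test-side helper, not code from the branch above: bind a
// ServerSocket, retrying on a fresh port when java.net.BindException is thrown,
// instead of swallowing the failure at a low level.
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public final class PortRetry {
    public static ServerSocket bindWithRetry(int firstPort, int attempts) throws IOException {
        int port = firstPort;
        for (int i = 0; i < attempts; i++) {
            ServerSocket ss = new ServerSocket();
            try {
                ss.bind(new InetSocketAddress("127.0.0.1", port));
                return ss; // bound successfully
            } catch (BindException e) {
                // Port already in use: surface the condition and move to another
                // port rather than logging and continuing against an unbound port.
                ss.close();
                port = firstPort + i + 1;
            }
        }
        throw new BindException("no free port found after " + attempts + " attempts");
    }
}
{code}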

> Handle port-binding failures in a systematic and documented fashion
> ---
>
> Key: ZOOKEEPER-3198
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3198
> Project: ZooKeeper
>  Issue Type: Improvement
>Affects Versions: 3.5.3, 3.6.0, 3.4.13
>Reporter: Michael K. Edwards
>Priority: Major
> Fix For: 3.6.0, 3.5.5, 3.4.14
>
>
> Many test failures appear to result from bind failures due to port conflicts. 
>  This can arise in normal use as well.  Presently the code swallows the 
> exception (with an error log) at a low level.  It would probably be useful to 
> throw the exception far enough up the stack to trigger retry with a new port 
> (in tests) or a high-level (perhaps even fatal) error message (in normal use).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1636) c-client crash when zoo_amulti failed

2018-11-23 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697290#comment-16697290
 ] 

Michael K. Edwards commented on ZOOKEEPER-1636:
---

Assigning this to me to help get this patch reviewed and landed.

> c-client crash when zoo_amulti failed 
> --
>
> Key: ZOOKEEPER-1636
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1636
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.4.3
>Reporter: Thawan Kooburat
>Assignee: Michael K. Edwards
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, 
> ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> deserialize_response for a multi operation doesn't handle the case where the
> server fails to send back a response (e.g. when the multi packet is too
> large). The c-client then processes the completions of all sub-requests as if
> the operation had succeeded, which eventually causes SIGSEGV.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ZOOKEEPER-1636) c-client crash when zoo_amulti failed

2018-11-23 Thread Michael K. Edwards (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael K. Edwards reassigned ZOOKEEPER-1636:
-

Assignee: Michael K. Edwards  (was: Thawan Kooburat)

> c-client crash when zoo_amulti failed 
> --
>
> Key: ZOOKEEPER-1636
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1636
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.4.3
>Reporter: Thawan Kooburat
>Assignee: Michael K. Edwards
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, 
> ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> deserialize_response for a multi operation doesn't handle the case where the
> server fails to send back a response (e.g. when the multi packet is too
> large). The c-client then processes the completions of all sub-requests as if
> the operation had succeeded, which eventually causes SIGSEGV.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-23 Thread Michael K. Edwards (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael K. Edwards reassigned ZOOKEEPER-2778:
-

Assignee: Michael K. Edwards

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: Michael K. Edwards
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> It's possible to have a deadlock during the recovery phase.
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest
> [1]. Here is a sample thread dump that illustrates the state of the execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread that runs the follower's
> sync-with-leader work and the listener thread of that peer's QuorumCnxManager
> (qcm), which handles incoming connections. To finish syncing with the leader,
> the follower must synchronize on both QV_LOCK and the qcm object it owns; to
> finish setting up an incoming connection, the receiver thread must synchronize
> on both the qcm object owned by the quorum peer and the same QV_LOCK. The
> problem is that the two threads acquire the locks in opposite orders, so,
> depending on timing and actual execution order, each can end up holding one
> lock while waiting for the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ZOOKEEPER-3198) Handle port-binding failures in a systematic and documented fashion

2018-11-22 Thread Michael K. Edwards (JIRA)
Michael K. Edwards created ZOOKEEPER-3198:
-

 Summary: Handle port-binding failures in a systematic and 
documented fashion
 Key: ZOOKEEPER-3198
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3198
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.4.13, 3.5.3, 3.6.0
Reporter: Michael K. Edwards
 Fix For: 3.6.0, 3.5.5, 3.4.14


Many test failures appear to result from bind failures due to port conflicts.  
This can arise in normal use as well.  Presently the code swallows the 
exception (with an error log) at a low level.  It would probably be useful to 
throw the exception far enough up the stack to trigger retry with a new port 
(in tests) or a high-level (perhaps even fatal) error message (in normal use).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1818) Fix don't care for trunk

2018-11-22 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696267#comment-16696267
 ] 

Michael K. Edwards commented on ZOOKEEPER-1818:
---

#718 is just Fangmin's patch against current master.

> Fix don't care for trunk
> 
>
> Key: ZOOKEEPER-1818
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1818
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.5.1
>Reporter: Flavio Junqueira
>Assignee: Fangmin Lv
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1818.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See umbrella jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1818) Fix don't care for trunk

2018-11-22 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696264#comment-16696264
 ] 

Michael K. Edwards commented on ZOOKEEPER-1818:
---

#714 now has just Fangmin's patch, ported, without the previous extraneous 
changes.  It may not build green until #707 (or an alternate fix for 
ZOOKEEPER-2778) lands on branch-3.5.

> Fix don't care for trunk
> 
>
> Key: ZOOKEEPER-1818
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1818
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.5.1
>Reporter: Flavio Junqueira
>Assignee: Fangmin Lv
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1818.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See umbrella jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1636) c-client crash when zoo_amulti failed

2018-11-22 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696263#comment-16696263
 ] 

Michael K. Edwards commented on ZOOKEEPER-1636:
---

#717 is Thawan's patch as a pull request against master.  #713 is the same 
patch against branch-3.5.

> c-client crash when zoo_amulti failed 
> --
>
> Key: ZOOKEEPER-1636
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1636
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.4.3
>Reporter: Thawan Kooburat
>Assignee: Thawan Kooburat
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, 
> ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> deserialize_response for a multi operation doesn't handle the case where the
> server fails to send back a response (e.g. when the multi packet is too
> large). The c-client then processes the completions of all sub-requests as if
> the operation had succeeded, which eventually causes SIGSEGV.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2916) startSingleServerTest may be flaky

2018-11-22 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695827#comment-16695827
 ] 

Michael K. Edwards commented on ZOOKEEPER-2916:
---

The root cause is hidden inside {{...[truncated 395348 chars]...}}.  But it 
looks to me like the server failed to bind the port, which seems to be a common 
cause of spurious test failures in CI.  See 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2708/consoleText
 for an example (search for {{BindException}}).

> startSingleServerTest may be flaky
> --
>
> Key: ZOOKEEPER-2916
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2916
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: 3.5.3, 3.6.0
>Reporter: Patrick Hunt
>Assignee: Bogdan Kanivets
>Priority: Major
>  Labels: flaky, newbie
>
> startSingleServerTest seems to be failing intermittently. 10 times in the 
> first few days of this month. Can someone take a look?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1677) Misuse of INET_ADDRSTRLEN

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695506#comment-16695506
 ] 

Michael K. Edwards commented on ZOOKEEPER-1677:
---

This appears serious.  Fix possible in 3.5.5?

> Misuse of INET_ADDRSTRLEN
> -
>
> Key: ZOOKEEPER-1677
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1677
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.5.0
>Reporter: Shevek
>Assignee: Marshall McMullen
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1677.patch, ZOOKEEPER-1677.patch, 
> ZOOKEEPER-1677.patch, ZOOKEEPER-1677.patch, ZOOKEEPER-1677.patch, 
> ZOOKEEPER-1677.patch, ZOOKEEPER-1677.patch, ZOOKEEPER-1677.patch, 
> ZOOKEEPER-1677.patch
>
>
> ZOOKEEPER-1355. Add zk.updateServerList(newServerList) (Alex Shraer, 
> Marshall McMullen via fpj)
> 
> 
> 
> git-svn-id: https://svn.apache.org/repos/asf/zookeeper/trunk@1410731 
> 13f79535-47bb-0310-9956-ffa450edef68
> +int addrvec_contains(const addrvec_t *avec, const struct sockaddr_storage 
> *addr)
> +{
> +if (!avec || !addr)
> +{ 
> +return 0;
> +}
> +
> +int i = 0;
> +for (i = 0; i < avec->count; i++)
> +{
> +if(memcmp(&avec->data[i], addr, INET_ADDRSTRLEN) == 0)
> +return 1;
> +}
> +
> +return 0;
> +}
> Pretty sure that should be sizeof(struct sockaddr_storage). INET_ADDRSTRLEN is
> the size of the character buffer that needs to be allocated for the return
> value of inet_ntop, so using it as the comparison length here seems totally
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2504) Enforce that server ids are unique in a cluster

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695499#comment-16695499
 ] 

Michael K. Edwards commented on ZOOKEEPER-2504:
---

Is this something we can address for 3.5.5?

> Enforce that server ids are unique in a cluster
> ---
>
> Key: ZOOKEEPER-2504
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2504
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: Dan Benediktson
>Assignee: Michael Han
>Priority: Major
> Attachments: ZOOKEEPER-2504.patch
>
>
> The leader will happily accept connections from learners that have the same
> server id (e.g., due to misconfiguration). This can lead to various issues,
> including non-unique session_ids being generated by these servers.
> The leader can enforce that all learners come in with unique server IDs; if a 
> learner attempts to connect with an id that is already in use, it should be 
> denied.
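
A rough sketch of that enforcement idea follows; LearnerSidRegistry is an 
illustrative stand-in, not the actual Leader/LearnerHandler code:

{code:java}
// Illustrative only: a leader-side registry that refuses a second learner
// presenting an already-connected server id (sid).
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LearnerSidRegistry {
    private final Set<Long> connectedSids = ConcurrentHashMap.newKeySet();

    /** Called when a learner completes its handshake; a false return means "deny". */
    public boolean register(long sid) {
        // add() returns false if the sid is already present, i.e. a duplicate learner.
        return connectedSids.add(sid);
    }

    /** Called when a learner's connection is torn down. */
    public void unregister(long sid) {
        connectedSids.remove(sid);
    }
}
{code}

On a failed register() the leader would close the new connection (or, depending 
on policy, the stale one), so two peers configured with the same id cannot both 
participate.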



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3124) Add the correct comment to show why we need the special logic to handle cversion and pzxid

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695497#comment-16695497
 ] 

Michael K. Edwards commented on ZOOKEEPER-3124:
---

Can we clear up this confusion before releasing 3.5.5?

> Add the correct comment to show why we need the special logic to handle 
> cversion and pzxid
> --
>
> Key: ZOOKEEPER-3124
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3124
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Reporter: Fangmin Lv
>Assignee: Fangmin Lv
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.6.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The old comment about setCversionPzxid is not valid; the scenario it mentions
> won't trigger the issue. Update it to show the exact reason.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2916) startSingleServerTest may be flaky

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695492#comment-16695492
 ] 

Michael K. Edwards commented on ZOOKEEPER-2916:
---

Reproducible in current branch-3.5?

> startSingleServerTest may be flaky
> --
>
> Key: ZOOKEEPER-2916
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2916
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: 3.5.3, 3.6.0
>Reporter: Patrick Hunt
>Assignee: Bogdan Kanivets
>Priority: Major
>  Labels: flaky, newbie
>
> startSingleServerTest seems to be failing intermittently. 10 times in the 
> first few days of this month. Can someone take a look?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2877) Flaky Test: org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalRun

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695490#comment-16695490
 ] 

Michael K. Edwards commented on ZOOKEEPER-2877:
---

Reproducible in current branch-3.5?

> Flaky Test: org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalRun
> ---
>
> Key: ZOOKEEPER-2877
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2877
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: tests
>Reporter: Michael Han
>Priority: Major
>
> {noformat}
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> junit.framework.AssertionFailedError: expected:<1> but was:<0>
>   at 
> org.apache.zookeeper.server.quorum.Zab1_0Test$6.converseWithLeader(Zab1_0Test.java:939)
>   at 
> org.apache.zookeeper.server.quorum.Zab1_0Test.testLeaderConversation(Zab1_0Test.java:398)
>   at 
> org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalRun(Zab1_0Test.java:906)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:79)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3023) Flaky test: org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalFollowerRunWithDiff

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695494#comment-16695494
 ] 

Michael K. Edwards commented on ZOOKEEPER-3023:
---

Reproducible in current branch-3.5?

> Flaky test: 
> org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalFollowerRunWithDiff
> ---
>
> Key: ZOOKEEPER-3023
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3023
> Project: ZooKeeper
>  Issue Type: Sub-task
>Affects Versions: 3.6.0
>Reporter: Pravin Dsilva
>Assignee: maoling
>Priority: Major
>
> Getting the following error on master branch:
> Error Message
> {code:java}
> expected:<4294967298> but was:<0>{code}
> Stacktrace
> {code:java}
> junit.framework.AssertionFailedError: expected:<4294967298> but was:<0> at 
> org.apache.zookeeper.server.quorum.Zab1_0Test$5.converseWithFollower(Zab1_0Test.java:876)
>  at 
> org.apache.zookeeper.server.quorum.Zab1_0Test.testFollowerConversation(Zab1_0Test.java:523)
>  at 
> org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalFollowerRunWithDiff(Zab1_0Test.java:791)
>  at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:79){code}
> Flaky 
> test:https://builds.apache.org/job/ZooKeeper-trunk-java10/141/testReport/junit/org.apache.zookeeper.server.quorum/Zab1_0Test/testNormalFollowerRunWithDiff/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2320) C-client crashes when removing watcher asynchronously in "local" mode

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695495#comment-16695495
 ] 

Michael K. Edwards commented on ZOOKEEPER-2320:
---

Reproducible in current branch-3.5?

> C-client crashes when removing watcher asynchronously in "local" mode
> -
>
> Key: ZOOKEEPER-2320
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2320
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.1
>Reporter: Hadriel Kaplan
>Assignee: Abraham Fine
>Priority: Major
>  Labels: pull-request-available
> Attachments: ZOOKEEPER-2320.patch, ZOOKEEPER-2320.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The C-client library will crash when invoking the asynchronous 
> {{zoo_aremove_watchers()}} API function with the '{{local}}' argument set to 
> 1.
> The reason is: if the local argument is 1/true, then the code does
> '{{notify_sync_completion((struct sync_completion *)data);}}'. But casting the
> '{{data}}' variable to a {{sync_completion}} struct pointer is bogus/invalid,
> and when it's later handled as that struct pointer, it accesses invalid
> memory.
> As a side note: it will work OK when called _synchronously_ through
> {{zoo_remove_watchers()}}, because that function creates a {{sync_completion}}
> struct and passes it to the asynch {{zoo_aremove_watchers()}}, but it will not
> work OK when the asynch function is used directly, for the reason stated
> previously.
> Another side note: the docs state that setting the 'local' flag makes the 
> C-client remove the watcher "even if there is no server connection" - but 
> really it makes the C-client remove the watcher without notifying the server 
> at *all*, even if the connection to a server is up. (well... that's what it 
> would do if it didn't just crash instead ;)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3047) flaky test LearnerSnapshotThrottlerTest

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695489#comment-16695489
 ] 

Michael K. Edwards commented on ZOOKEEPER-3047:
---

Reproducible in current branch-3.5?

> flaky test LearnerSnapshotThrottlerTest
> ---
>
> Key: ZOOKEEPER-3047
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3047
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: 3.5.4, 3.6.0, 3.4.12
>Reporter: Patrick Hunt
>Priority: Major
>  Labels: flaky, newbie
>
> * LearnerSnapshotThrottlerTest is flakey - failed during a clover run
> {noformat}
> 2018-05-19 13:39:24,510 [myid:] - INFO  
> [main:JUnit4ZKTestRunner$LoggedInvokeMethod@98] - TEST METHOD FAILED 
> testHighContentionWithTimeout
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.zookeeper.server.quorum.LearnerSnapshotThrottlerTest.__CLR4_2_1a5fyaprev(LearnerSnapshotThrottlerTest.java:216)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2925) ZooKeeper server fails to start on first-startup due to race to create dataDir & snapDir

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695487#comment-16695487
 ] 

Michael K. Edwards commented on ZOOKEEPER-2925:
---

Reproducible in branch-3.5?

> ZooKeeper server fails to start on first-startup due to race to create 
> dataDir & snapDir
> 
>
> Key: ZOOKEEPER-2925
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2925
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: other
>Affects Versions: 3.4.6
>Reporter: Robert P. Thille
>Priority: Major
>  Labels: easyfix, newbie, patch
> Fix For: 3.4.10
>
> Attachments: ZOOKEEPER-2925.patch
>
>
> Due to two threads trying to create the dataDir and snapDir, and the 
> java.io.File.mkdirs() call returning false both for errors and for the 
> directory already existing, sometimes ZooKeeper will fail to start with the 
> following stack trace:
> {noformat}
> 2017-10-25 22:30:40,069 [myid:] - INFO  [main:ZooKeeperServerMain@95] - 
> Starting server
> 2017-10-25 22:30:40,075 [myid:] - INFO  [main:Environment@100] - Server 
> environment:zookeeper.version=3.4.6-mdavis8efb625--1, built on 10/25/2017 
> 01:12 GMT
> [ More 'Server environment:blah blah blah' messages trimmed]
> 2017-10-25 22:30:40,077 [myid:] - INFO  [main:Environment@100] - Server 
> environment:user.dir=/
> 2017-10-25 22:30:40,081 [myid:] - ERROR [main:ZooKeeperServerMain@63] - 
> Unexpected exception, exiting abnormally
> java.io.IOException: Unable to create data directory /bp2/data/version-2
> at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.(FileTxnSnapLog.java:85)
> at 
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:104)
> at 
> org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
> at 
> org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
> 2017-10-25 22:30:40,085 [myid:] - INFO  
> [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
> {noformat}
> This is caused by the QuorumPeerMain thread and the PurgeTask thread both 
> competing to create the directories.
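
For what it's worth, a race-tolerant directory creation helper could look like the sketch below (hypothetical class and method names, not the attached patch): a false return from mkdirs() is treated as fatal only if the directory still does not exist afterwards.

{code:java}
import java.io.File;
import java.io.IOException;

public final class DirUtil {
    /**
     * Create a directory, tolerating concurrent creation by another thread.
     * File.mkdirs() returns false both on error and when the directory
     * already exists, so only fail if the directory is still missing.
     */
    public static void ensureDirectory(File dir) throws IOException {
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Unable to create data directory " + dir);
        }
    }
}
{code}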



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1985) Memory leak in C client

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695484#comment-16695484
 ] 

Michael K. Edwards commented on ZOOKEEPER-1985:
---

Fixable for 3.5.5?

> Memory leak in C client
> ---
>
> Key: ZOOKEEPER-1985
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1985
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.4.6
>Reporter: desmondhe
>Assignee: desmondhe
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1985.patch
>
>
> In the file zookeeper.c, most calls of "close_buffer_oarchive(&oa, 0)" 
> should instead be 
> close_buffer_oarchive(&oa, rc < 0 ? 1 : 0); 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1998) C library calls getaddrinfo unconditionally from zookeeper_interest

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695483#comment-16695483
 ] 

Michael K. Edwards commented on ZOOKEEPER-1998:
---

This seems like it could be pretty serious in some environments.  Addressable 
for 3.5.5?

> C library calls getaddrinfo unconditionally from zookeeper_interest
> ---
>
> Key: ZOOKEEPER-1998
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1998
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.0
>Reporter: Raul Gutierrez Segales
>Assignee: Raul Gutierrez Segales
>Priority: Major
> Fix For: 3.6.0
>
>
> (commented this on ZOOKEEPER-338)
> I've just noticed that we call getaddrinfo from zookeeper_interest... on 
> every call. So from zookeeper_interest we always call update_addrs:
> https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L2082
> which in turns unconditionally calls resolve_hosts:
> https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L787
> which does the unconditional calls to getaddrinfo:
> https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L648
> We should fix this since it'll make 3.5.0 slower for people relying on DNS. I 
> think this happened as part of ZOOKEEPER-107, in which the list of servers 
> can be updated. 
> cc: [~shralex], [~phunt], [~fpj]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2466) Client skips servers when trying to connect

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695476#comment-16695476
 ] 

Michael K. Edwards commented on ZOOKEEPER-2466:
---

Fix needed for 3.5.5?

> Client skips servers when trying to connect
> ---
>
> Key: ZOOKEEPER-2466
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2466
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Reporter: Flavio Junqueira
>Assignee: Michael Han
>Priority: Major
> Fix For: 3.6.0
>
> Attachments: ZOOKEEPER-2466.patch, ZOOKEEPER-2466.patch
>
>
> I've been looking at {{Zookeeper_simpleSystem::testFirstServerDown}} and I 
> observed the following behavior. The list of servers to connect contains two 
> servers, let's call them S1 and S2. The client never connects, but the odd 
> bit is the sequence of servers that the client tries to connect to:
> {noformat}
> S1
> S2
> S1
> S1
> S1
> 
> {noformat}
> It intrigued me that S2 is only tried once and never again. Checking the 
> code, here is what happens. Initially, {{zh->reconfig}} is 1, so in 
> {{zoo_cycle_next_server}} we return an address from 
> {{get_next_server_in_reconfig}}, which is taken from {{zh->addrs_new}} in 
> this test case. The attempt to connect fails, and {{handle_error}} is invoked 
> in the error handling path. {{handle_error}} actually invokes 
> {{addrvec_next}} which changes the address pointer to the next server on the 
> list.
> After two attempts, it decides that it has tried all servers in 
> {{zoo_cycle_next_server}} and sets {{zh->reconfig}} to zero. Once 
> {{zh->reconfig == 0}}, we have that each call to {{zoo_cycle_next_server}} 
> moves the address pointer to the next server in {{zh->addrs}}. But, given 
> that {{handle_error}} also moves the pointer to the next server, we end up 
> moving the pointer ahead twice upon every failed attempt to connect, which is 
> wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2219) ZooKeeper server should better handle SessionMovedException.

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695475#comment-16695475
 ] 

Michael K. Edwards commented on ZOOKEEPER-2219:
---

Does the fix for ZOOKEEPER-2886 need to be backported to branch-3.5?

> ZooKeeper server should better handle SessionMovedException.
> 
>
> Key: ZOOKEEPER-2219
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2219
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.5
>Reporter: zhihai xu
>Priority: Major
>
> ZooKeeper server should better handle SessionMovedException.
> We hit the SessionMovedException. The following is the reason we found for 
> the SessionMovedException:
> 1. The ZK client tried to connect to Leader L. The network was very slow, so 
> before the leader processed the request, the client disconnected.
> 2. The client then re-connected to Follower F reusing the same session ID. It 
> was successful.
> 3. The request in step 1 reached the leader. The leader processed it and 
> invalidated the connection created in step 2. But the client didn't know that 
> the connection it was using had been invalidated.
> 4. The client got SessionMovedException when it used the connection 
> invalidated by the leader for any ZooKeeper operation.
> The following are logs: c045dkh is the Leader, c470udy is the Follower and 
> the sessionID is 0x14be28f50f4419d.
> 1. The ZK client tried to initiate a session to the Leader at 2015-03-16 
> 10:59:40,735 and timed out after 10/3 seconds.
> logs from ZK client 
> {code}
> 2015-03-16 10:59:40,078 INFO org.apache.zookeeper.ClientCnxn: Client session 
> timed out, have not heard from server in 6670ms for sessionid 
> 0x14be28f50f4419d, closing socket connection and attempting reconnect
> 015-03-16 10:59:40,735 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server c045dkh/?.?.?.67:2181. Will not attempt to authenticate 
> using SASL (unknown error)
> 2015-03-16 10:59:40,735 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to c045dkh/?.?.?.67:2181, initiating session
> 2015-03-16 10:59:44,071 INFO org.apache.zookeeper.ClientCnxn: Client session 
> timed out, have not heard from server in 3336ms for sessionid 
> 0x14be28f50f4419d, closing socket connection and attempting reconnect
> {code}
> 2. ZK client initiated session to Follower successfully at 2015-03-16 
> 10:59:44,688
> logs from ZK client
> {code}
> 2015-03-16 10:59:44,673 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server c470udy/?.?.?.65:2181. Will not attempt to authenticate 
> using SASL (unknown error)
> 2015-03-16 10:59:44,673 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to c470udy/?.?.?.65:2181, initiating session
> 2015-03-16 10:59:44,688 INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server c470udy/?.?.?.65:2181, sessionid = 
> 0x14be28f50f4419d, negotiated timeout = 1
> {code}
> logs from ZK Follower server
> {code}
> 2015-03-16 10:59:44,673 INFO 
> org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection 
> from /?.?.?.65:42777
> 2015-03-16 10:59:44,674 INFO org.apache.zookeeper.server.ZooKeeperServer: 
> Client attempting to renew session 0x14be28f50f4419d at /?.?.?.65:42777
> 2015-03-16 10:59:44,674 INFO org.apache.zookeeper.server.quorum.Learner: 
> Revalidating client: 0x14be28f50f4419d
> 2015-03-16 10:59:44,675 INFO org.apache.zookeeper.server.ZooKeeperServer: 
> Established session 0x14be28f50f4419d with negotiated timeout 1 for 
> client /?.?.?.65:42777
> {code}
> 3. At 2015-03-16 10:59:45,668, the Leader processed the delayed request that 
> was sent by the Client at 2015-03-16 10:59:40,735. Right after the session was 
> established, it closed the socket connection/session because the client had 
> already disconnected due to the timeout.
> logs from ZK Leader server
> {code}
> 2015-03-16 10:59:45,668 INFO org.apache.zookeeper.server.ZooKeeperServer: 
> Client attempting to renew session 0x14be28f50f4419d at /?.?.?.65:50271
> 2015-03-16 10:59:45,668 INFO org.apache.zookeeper.server.ZooKeeperServer: 
> Established session 0x14be28f50f4419d with negotiated timeout 1 for 
> client /?.?.?.65:50271
> 2015-03-16 10:59:45,670 WARN org.apache.zookeeper.server.NIOServerCnxn: 
> Exception causing close of session 0x14be28f50f4419d due to 
> java.io.IOException: Broken pipe
> 2015-03-16 10:59:45,671 INFO org.apache.zookeeper.server.NIOServerCnxn: 
> Closed socket connection for client /?.?.?.65:50271 which had sessionid 
> 0x14be28f50f4419d
> {code}
> 4. Client got SessionMovedException at 2015-03-16 10:59:45,693
> logs from ZK Leader server
> {code}
> 2015-03-16 10:59:45,693 INFO 
> org.apache.zookeeper.server.PrepRequestProcessor: Got user-level 
> KeeperException when processing sessionid:0x14be28f50f4419d t

[jira] [Commented] (ZOOKEEPER-3193) Flaky: org.apache.zookeeper.test.SaslAuthFailNotifyTest

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695466#comment-16695466
 ] 

Michael K. Edwards commented on ZOOKEEPER-3193:
---

Appropriate for the 3.5 branch, I think.

> Flaky: org.apache.zookeeper.test.SaslAuthFailNotifyTest
> ---
>
> Key: ZOOKEEPER-3193
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3193
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: tests
>Affects Versions: 3.5.4, 3.6.0, 3.4.13
>Reporter: Andor Molnar
>Assignee: Andor Molnar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5, 3.4.14
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This test doesn't fail often on Apache Jenkins, but seems quite flaky in 
> our in-house testing environment. It has a race in waiting for the 
> AuthFailed event, which can happen before client creation succeeds, causing 
> the wait operation to hang indefinitely (the notify occurred before the wait() 
> call). Using a CountDownLatch would be better for the same purpose.
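
For illustration, a latch-based wait (a minimal sketch with made-up names, not the test's actual code) does not care whether the event fires before or after the waiter arrives, which removes the notify-before-wait race described above.

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

class AuthFailedWatcher implements Watcher {
    // A latch is safe even if the event arrives before await() is called,
    // unlike Object.wait()/notify(), where an early notify is simply lost.
    private final CountDownLatch authFailed = new CountDownLatch(1);

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.AuthFailed) {
            authFailed.countDown();
        }
    }

    boolean awaitAuthFailed(long timeoutMs) throws InterruptedException {
        return authFailed.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
{code}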



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3186) bug in barrier example code

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695474#comment-16695474
 ] 

Michael K. Edwards commented on ZOOKEEPER-3186:
---

Appropriate for 3.5.5?

> bug in barrier example code
> ---
>
> Key: ZOOKEEPER-3186
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3186
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: documentation
>Reporter: cheng pan
>Priority: Major
>
> the code given in the documentation
> {code:java}
> while (true) {
> synchronized (mutex) {
> List list = zk.getChildren(root, true);
> if (list.size() < size) {
> mutex.wait();
> } else {
> return true;
> }
> }
> }
> {code}
> When some nodes are not ready, the code calls mutex.wait() and waits for the 
> watcher event to call mutex.notify() to wake it up. The problem is, we can't 
> guarantee that mutex.notify() will happen after mutex.wait(), which can leave 
> the client stuck.
> The solution might be a CountDownLatch?
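
A latch-per-check variant of the wait, along the lines the report suggests, could look like the sketch below (not committed ZooKeeper code; the class and parameters are made up for illustration). Counting down a latch cannot be "lost" the way a notify() issued before wait() can.

{code:java}
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

class BarrierWait {
    boolean enter(ZooKeeper zk, String root, int size)
            throws KeeperException, InterruptedException {
        while (true) {
            // Fresh latch for each check; the watcher below counts it down.
            CountDownLatch changed = new CountDownLatch(1);
            // getChildren registers a one-shot watcher that fires on any
            // child change of the barrier node.
            List<String> children =
                    zk.getChildren(root, event -> changed.countDown());
            if (children.size() >= size) {
                return true;
            }
            changed.await();
        }
    }
}
{code}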



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1011) fix Java Barrier Documentation example's race condition issue and polish up the Barrier Documentation

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695473#comment-16695473
 ] 

Michael K. Edwards commented on ZOOKEEPER-1011:
---

Appropriate for 3.5.5?

> fix Java Barrier Documentation example's race condition issue and polish up 
> the Barrier Documentation
> -
>
> Key: ZOOKEEPER-1011
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1011
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: documentation
>Reporter: Semih Salihoglu
>Assignee: maoling
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There is a race condition in the Barrier example of the java doc: 
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperTutorial.html. It's 
> in the enter() method. Here's the original example:
> boolean enter() throws KeeperException, InterruptedException{
> zk.create(root + "/" + name, new byte[0], Ids.OPEN_ACL_UNSAFE,
> CreateMode.EPHEMERAL_SEQUENTIAL);
> while (true) {
> synchronized (mutex) {
> List list = zk.getChildren(root, true);
> if (list.size() < size) {
> mutex.wait();
> } else {
> return true;
> }
> }
> }
> }
> Here's the race condition scenario:
> Let's say there are two machines/nodes: node1 and node2 that will use this 
> code to synchronize over ZK. Let's say the following steps take place:
> node1 calls the zk.create method and then reads the number of children, and 
> sees that it's 1 and starts waiting. 
> node2 calls the zk.create method (doesn't call the zk.getChildren method yet, 
> let's say it's very slow) 
> node1 is notified that the number of children on the znode changed; it checks 
> that the size is 2, so it enters the barrier, does its work, and then leaves 
> the barrier, deleting its node.
> node2 calls zk.getChildren and, because node1 has already left, it sees that 
> the number of children is equal to 1. Since node1 will never enter the 
> barrier again, node2 will keep waiting.
> --- End of scenario ---
> Here are Flavio's fix suggestions (copying from the email thread):
> ...
> I see two possible action points out of this discussion:
>   
> 1- State clearly in the beginning that the example discussed is not correct 
> under the assumption that a process may finish the computation before another 
> has started, and the example is there for illustration purposes;
> 2- Have another example following the current one that discusses the problem 
> and shows how to fix it. This is an interesting option that illustrates how 
> one could reason about a solution when developing with zookeeper.
> ...
> We'll go with the 2nd option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3174) Quorum TLS - support reloading trust/key store

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695472#comment-16695472
 ] 

Michael K. Edwards commented on ZOOKEEPER-3174:
---

Appropriate for 3.5.5?

> Quorum TLS - support reloading trust/key store
> --
>
> Key: ZOOKEEPER-3174
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3174
> Project: ZooKeeper
>  Issue Type: Improvement
>Affects Versions: 3.6.0, 3.5.5
>Reporter: Ilya Maykov
>Assignee: Ilya Maykov
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> The Quorum TLS feature recently added in ZOOKEEPER-236 doesn't support 
> reloading a trust/key store from disk when it changes. In an environment 
> where short-lived certificates are used and are refreshed by some background 
> daemon / cron job, this is a problem. Let's support reloading a trust/key 
> store from disk when the file on disk changes.
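
The general idea, sketched below with hypothetical names (this is not the ticket's implementation), is to watch the store file and invoke a reload callback that rebuilds the key material whenever the file changes on disk.

{code:java}
import java.nio.file.*;

public class KeyStoreWatcher implements Runnable {
    private final Path storeFile;
    private final Runnable reloadCallback; // e.g. rebuilds the SSLContext

    public KeyStoreWatcher(Path storeFile, Runnable reloadCallback) {
        this.storeFile = storeFile.toAbsolutePath();
        this.reloadCallback = reloadCallback;
    }

    @Override
    public void run() {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            // Watch the parent directory; events carry the changed file name.
            storeFile.getParent().register(watcher,
                    StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_MODIFY);
            while (!Thread.currentThread().isInterrupted()) {
                WatchKey key = watcher.take();
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (storeFile.getFileName().equals(event.context())) {
                        reloadCallback.run();
                    }
                }
                key.reset();
            }
        } catch (Exception e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}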



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3176) Quorum TLS - add SSL config options

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695471#comment-16695471
 ] 

Michael K. Edwards commented on ZOOKEEPER-3176:
---

Appropriate for 3.5.5?

> Quorum TLS - add SSL config options
> ---
>
> Key: ZOOKEEPER-3176
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3176
> Project: ZooKeeper
>  Issue Type: Improvement
>Affects Versions: 3.6.0, 3.5.5
>Reporter: Ilya Maykov
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Some parameters of Quorum TLS connections are not currently configurable. 
> Let's add configuration properties for them with reasonable defaults. In 
> particular, these are:
>  * enabled protocols
>  * client auth behavior (want / need / none)
>  * a timeout for TLS handshake detection in a UnifiedServerSocket
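
For reference, the first two knobs map onto standard JSSE calls; the sketch below is illustrative only (the actual ZooKeeper property names and defaults are defined by the pull request, and the UnifiedServerSocket handshake-detection timeout is ZooKeeper-specific and not shown).

{code:java}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class SslEngineConfig {
    // Illustrative values; real values would come from the ZK configuration.
    static SSLEngine configure(SSLContext ctx, String clientAuth) {
        SSLEngine engine = ctx.createSSLEngine();
        engine.setUseClientMode(false);
        // Enabled protocols.
        engine.setEnabledProtocols(new String[] {"TLSv1.2"});
        // Client auth behavior: need / want / none.
        if ("need".equals(clientAuth)) {
            engine.setNeedClientAuth(true);
        } else if ("want".equals(clientAuth)) {
            engine.setWantClientAuth(true);
        } // "none": leave both flags false.
        return engine;
    }
}
{code}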



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3172) Quorum TLS - fix port unification to allow rolling upgrades

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695467#comment-16695467
 ] 

Michael K. Edwards commented on ZOOKEEPER-3172:
---

Appropriate for 3.5.5, I think.

> Quorum TLS - fix port unification to allow rolling upgrades
> ---
>
> Key: ZOOKEEPER-3172
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3172
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: security, server
>Affects Versions: 3.6.0, 3.5.5
>Reporter: Ilya Maykov
>Assignee: Ilya Maykov
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> ZOOKEEPER-236 was committed with port unification support disabled, because 
> of various issues with the implementation. These issues should be fixed so 
> port unification can be enabled again. Port unification is necessary to 
> upgrade an ensemble from plaintext to TLS quorum connections without downtime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3195) TLS - disable client-initiated renegotiation

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695465#comment-16695465
 ] 

Michael K. Edwards commented on ZOOKEEPER-3195:
---

This looks serious.  Target fix for 3.5.5?

> TLS - disable client-initiated renegotiation
> 
>
> Key: ZOOKEEPER-3195
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3195
> Project: ZooKeeper
>  Issue Type: Improvement
>Affects Versions: 3.6.0, 3.5.5
>Reporter: Ilya Maykov
>Assignee: Ilya Maykov
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Client-initiated TLS renegotiation is not secure and exposes the connection 
> to MITM attacks. Unfortunately, Java's TLS implementation allows it by 
> default. Thankfully, it is easy to disable.
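
For context, recent JDK releases can reject client-initiated renegotiation process-wide with a system property; the sketch below illustrates that JSSE knob and is not necessarily how the ZooKeeper patch does it.

{code:java}
public class DisableRenegotiation {
    public static void main(String[] args) {
        // Must be set before the JSSE provider is initialized.
        System.setProperty("jdk.tls.rejectClientInitiatedRenegotiation", "true");
        // ... start the TLS server afterwards ...
    }
}
{code}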



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2332) Zookeeper failed to start for empty txn log

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695462#comment-16695462
 ] 

Michael K. Edwards commented on ZOOKEEPER-2332:
---

Still an issue in branch-3.5?  Serious?

> Zookeeper failed to start for empty txn log
> ---
>
> Key: ZOOKEEPER-2332
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2332
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.6
>Reporter: Liu Shaohui
>Assignee: Liu Shaohui
>Priority: Critical
> Fix For: 3.6.0
>
> Attachments: ZOOKEEPER-2332-v001.diff
>
>
> We found that the zookeeper server with version 3.4.6 failed to start because 
> there is an empty txn log in the log dir.  
> I think we should skip the empty log file when restoring the datatree. 
> Any suggestions?
> {code}
> 2015-11-27 19:16:16,887 [myid:] - ERROR [main:ZooKeeperServerMain@63] - 
> Unexpected exception, exiting abnormally
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at 
> org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
> at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:576)
> at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:595)
> at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:561)
> at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:643)
> at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:158)
> at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
> at 
> org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:272)
> at 
> org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:399)
> at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:122)
> at 
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:113)
> at 
> org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:86)
> at 
> org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:52)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
> {code}
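
The suggested skip could look roughly like the sketch below (a hypothetical illustration, not the attached patch): zero-length log files are simply ignored when collecting candidates for replay, since they cannot even contain a valid FileHeader.

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class TxnLogFilter {
    /** Return txn log files worth replaying, skipping empty ones. */
    static List<File> nonEmptyLogs(File logDir) {
        List<File> result = new ArrayList<>();
        File[] files = logDir.listFiles();
        if (files == null) {
            return result;
        }
        for (File f : files) {
            // An empty file would fail header deserialization with EOFException.
            if (f.getName().startsWith("log.") && f.length() > 0) {
                result.add(f);
            }
        }
        return result;
    }
}
{code}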



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2745) Node loses data after disk-full event, but successfully joins Quorum

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695459#comment-16695459
 ] 

Michael K. Edwards commented on ZOOKEEPER-2745:
---

Is this still potentially an issue in 3.5.5?  Or can it be closed?

> Node loses data after disk-full event, but successfully joins Quorum
> 
>
> Key: ZOOKEEPER-2745
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2745
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.6
> Environment: Ubuntu 12.04
>Reporter: Abhay Bothra
>Priority: Critical
> Attachments: ZOOKEEPER-2745.patch
>
>
> If disk is full on 1 zookeeper node in a 3 node ensemble, it is able to join 
> the quorum with partial data.
> Setup:
> 
> - Running a 3 node zookeeper ensemble on Ubuntu 12.04 as upstart services. 
> Let's call the nodes: A, B and C.
> Observation:
> -
> - Connecting to 2 (Node A and B) of the 3 nodes and doing an `ls` in 
> zookeeper data directory was giving:
> /foo
> /bar
> /baz
> But an `ls` on node C was giving:
> /baz
> - On node C, the zookeeper data directory had the following files:
> log.1001
> log.1600
> snapshot.1000 -> size 200
> snapshot.1200 -> size 269
> snapshot.1300 -> size 300
> - Snapshot sizes on node A and B were in the vicinity of 500KB
> RCA
> ---
> - Disk was full on node C prior to the creation time of the small snapshot
>   files.
> - Looking at zookeeper server logs, we observed that zookeeper had crashed 
> and restarted a few times after the first instance of disk full. Every time 
> zookeeper starts, it does 3 things:
>   1. Run the purge task to cleanup old snapshot and txn logs. Our
>   autopurge.snapRetainCount is set to 3.
>   2. Restore from the most recent valid snapshot and the txn logs that follow.
>   3. Take part in a leader election - realize it has missed something - 
> become the follower - get diff of missed txns from the current leader - 
> create a new snapshot of its current state.
> - We confirmed that a valid snapshot of the system had existed prior to, and
>   immediately after the crash. Let's call this snapshot snapshot.800.
> - Over the next 3 restarts, zookeeper did the following:
>   - Purged older snapshots
>   - Restored from snapshot.800 + txn logs
>   - Synced up with master, tried to write its updated state to a new 
> snapshot. Crashed due to disk full. The snapshot file, even though invalid, 
> had been created.
> - *Note*: This is the first source of the bug. It might be more appropriate 
> to first write the snapshot to a temporary file, and then rename it
> snapshot.. That would give us more confidence in the validity of 
> snapshots in the data dir. 
> - Let's say the snapshot files created above were snapshot.850, snapshot.920 
> and snapshot.950
> - On the 4th restart, the purge task retained the 3 recent snapshots - 
> snapshot.850, snapshot.920, and snapshot.950, and proceeded to purge 
> snapshot.800 and associated txn logs assuming that they were no longer needed.
> - *Note*: This is the second source of the bug. Instead of retaining the 3 
> most recent *valid* snapshots, the server just retains 3 most recent 
> snapshots, regardless of their validity.
> - When restoring, zookeeper doesn't find any valid snapshot logs to restore 
> from. So it tries to reload its state from txn logs starting at zxid 0. 
> However, those transaction logs would have long ago been garbage collected. 
> It reloads from whatever txn logs are present. Let's say the only txn log 
> file present (log.951) contains logs for zxid 951 to 998.  It reloads from 
> that log file, syncs with master - gets txns 999 and 1000, and writes the 
> snapshot log snapshot.1000 to disk. Now that we have deleted snapshot.800, we 
> have enough free disk space to write snapshot.1000. From this state onwards, 
> zookeeper will always assume it has the state till txn id 1000, even though 
> it only has state from txn id 951 to 1000.
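
The temp-file-then-rename idea from the first note might look like the sketch below (a minimal illustration with hypothetical names, not the actual FileSnap code): if the disk fills up mid-write, no half-written snapshot file is left behind to confuse later restores or the purge task.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicSnapshotWriter {
    static void writeSnapshot(Path snapshotFile, byte[] serializedDataTree)
            throws IOException {
        Path tmp = snapshotFile.resolveSibling(snapshotFile.getFileName() + ".tmp");
        // May fail on a full disk, leaving only the .tmp file behind.
        Files.write(tmp, serializedDataTree);
        // All-or-nothing rename into place; the final name only ever appears
        // once the contents are complete.
        Files.move(tmp, snapshotFile, StandardCopyOption.ATOMIC_MOVE);
    }
}
{code}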



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2362) ZooKeeper multi / transaction allows partial read

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695442#comment-16695442
 ] 

Michael K. Edwards commented on ZOOKEEPER-2362:
---

Reproducible in current branch-3.5?

> ZooKeeper multi / transaction allows partial read
> -
>
> Key: ZOOKEEPER-2362
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2362
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.6
>Reporter: Whitney Sorenson
>Assignee: Atri Sharma
>Priority: Critical
>
> In this thread 
> http://mail-archives.apache.org/mod_mbox/zookeeper-user/201602.mbox/%3CCAPbqGzicBkLLyVDm7RFM20z0y3X1v1P-C9-1%3D%3D1DDqRDTzdOmQ%40mail.gmail.com%3E
>  , I discussed an issue I've now seen in multiple environments:
> In a multi (using Curator), I write 2 new nodes. At some point, I issue 2 
> reads for these new nodes. In one read, I see one of the new nodes. In a 
> subsequent read, I fail to see the other new node:
> 1. Starting state : { /foo = , /bar =  }
> 2. In a multi, write: { /foo = A, /bar = B}
> 3. Read /foo as A
> 4. Read /bar as  
> #3 and #4 are issued 100% sequentially.
> It is not known at what point during #2, #3 starts.
> Note: the reads are getChildren() calls.
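
In ZooKeeper's native API terms, the scenario is roughly the sketch below (a hedged reconstruction with made-up paths, since the report used Curator and elides the exact operations): both writes go through a single multi, followed by two strictly sequential getChildren() reads.

{code:java}
import java.util.Arrays;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MultiThenRead {
    static void reproduce(ZooKeeper zk) throws KeeperException, InterruptedException {
        // Step 2: both znodes written in a single multi (atomic on the server).
        zk.multi(Arrays.asList(
                Op.create("/foo/A", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                        CreateMode.PERSISTENT),
                Op.create("/bar/B", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                        CreateMode.PERSISTENT)));
        // Steps 3-4: two strictly sequential reads; the report is that the
        // first read can observe its new child while the second does not.
        System.out.println(zk.getChildren("/foo", false));
        System.out.println(zk.getChildren("/bar", false));
    }
}
{code}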



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2836) QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695441#comment-16695441
 ] 

Michael K. Edwards commented on ZOOKEEPER-2836:
---

Fix needed for 3.5.5?

> QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException
> --
>
> Key: ZOOKEEPER-2836
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2836
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum
>Affects Versions: 3.4.6
> Environment: Machine: Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 
> x86_64 GNU/Linux
> Java Version: jdk64/jdk1.8.0_40
> zookeeper version:  3.4.6.2.3.2.0-2950 
>Reporter: Amarjeet Singh
>Assignee: gaoshu
>Priority: Critical
>
> The QuorumCnxManager Listener thread blocks the ServerSocket on accept, but we 
> are getting SocketTimeoutException on our boxes after 49 days 17 hours. As per 
> the current code there are 3 retries, and after that it says "_As I'm leaving 
> the listener thread, I won't be able to participate in leader election any 
> longer: $/$:3888__". Once server nodes reach this state and 
> we restart or add a new node, it fails to join the cluster and logs 'WARN  
> QuorumPeer/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@383 - Cannot open 
> channel to 3 at election address $/$:3888'.
> As there is no timeout specified for the ServerSocket it should never 
> time out, but there are some already-discussed issues where people have seen 
> this problem and added checks for SocketTimeoutException explicitly, like 
> https://issues.apache.org/jira/browse/KARAF-3325 . 
> I think we need to handle SocketTimeoutException along similar lines for 
> zookeeper as well 
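
The suggested handling amounts to not giving up the listener thread on a transient accept() failure; the sketch below is a minimal illustration with hypothetical names, not the actual QuorumCnxManager code.

{code:java}
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ResilientListener implements Runnable {
    private final ServerSocket serverSocket;
    private volatile boolean shutdown = false;

    public ResilientListener(ServerSocket serverSocket) {
        this.serverSocket = serverSocket;
    }

    @Override
    public void run() {
        while (!shutdown) {
            try {
                Socket client = serverSocket.accept();
                handle(client);
            } catch (SocketTimeoutException e) {
                // Transient: log and keep listening instead of counting it
                // against a bounded retry budget and exiting the thread.
                continue;
            } catch (Exception e) {
                // Other errors could still go through the existing retry logic.
            }
        }
    }

    private void handle(Socket client) { /* hand off to a worker */ }
}
{code}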



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2592) Zookeeper is not recoverable once running system( machine on which zookeeper is running) is out of space

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695439#comment-16695439
 ] 

Michael K. Edwards commented on ZOOKEEPER-2592:
---

Fix needed for 3.5.5?  Or is this obsolete and closable?

> Zookeeper is not recoverable once running system( machine on which zookeeper 
> is running) is out of space
> 
>
> Key: ZOOKEEPER-2592
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2592
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.1, 3.5.2
>Reporter: Rakesh Kumar Singh
>Priority: Critical
>
> Zookeeper is not recoverable once the running system (the machine on which 
> zookeeper is running) is out of space 
> Steps to reproduce:
> 1. Install zookeeper in standalone mode and start zookeeper
> 2. Fill up the machine's disk completely
> 3. Connect to zookeeper through a client and try to create some znodes with 
> some data.
> 4. After some time, creating further znodes no longer works, as all the space 
> is occupied
> 5. Now start freeing up space on that machine
> 6. Connect through a client again. The connection is fine. Now try to execute 
> any command like "ls /"; it fails even though more than 11gb is now free
> Client log:-
> BLR107042:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin 
> # df -h
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/xvda2   36G   24G   11G  70% /
> udev1.9G  116K  1.9G   1% /dev
> tmpfs   1.9G 0  1.9G   0% /dev/shm
> BLR107042:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin 
> # ./zkCli.sh
> Connecting to localhost:2181
> 2016-09-19 22:50:20,227 [myid:] - INFO  [main:Environment@109] - Client 
> environment:zookeeper.version=3.5.1-alpha--1, built on 08/18/2016 08:20 GMT
> 2016-09-19 22:50:20,231 [myid:] - INFO  [main:Environment@109] - Client 
> environment:host.name=BLR107042
> 2016-09-19 22:50:20,231 [myid:] - INFO  [main:Environment@109] - Client 
> environment:java.version=1.7.0_79
> 2016-09-19 22:50:20,234 [myid:] - INFO  [main:Environment@109] - Client 
> environment:java.vendor=Oracle Corporation
> 2016-09-19 22:50:20,234 [myid:] - INFO  [main:Environment@109] - Client 
> environment:java.home=/usr/java/jdk1.7.0_79/jre
> 2016-09-19 22:50:20,234 [myid:] - INFO  [main:Environment@109] - Client 
> environment:java.class.path=/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../build/classes:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../build/lib/*.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/slf4j-log4j12-1.7.5.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/slf4j-api-1.7.5.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/servlet-api-2.5-20081211.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/netty-3.7.0.Final.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/log4j-1.2.16.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/jline-2.11.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/jetty-util-6.1.26.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/jetty-6.1.26.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/javacc.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/jackson-mapper-asl-1.9.11.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/jackson-core-asl-1.9.11.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/commons-cli-1.2.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../lib/ant-eclipse-1.0-jvm1.2.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../zookeeper-3.5.1-alpha.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../src/java/lib/ant-eclipse-1.0-jvm1.2.jar:/home/Rakesh/Zookeeper/18_Aug/cluster/zookeeper-3.5.1-alpha/bin/../conf:/usr/java/jdk1.7.0_79/lib
> 2016-09-19 22:50:20,234 [myid:] - INFO  [main:Environment@109] - Client 
> environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> 2016-09-19 22:50:20,234 [myid:] - INFO  [main:Environment@109] - Client 
> environment:java.io.tmpdir=/tmp
> 2016-09-19 22:50:20,234 [myid:] - INFO  [main:Environment@109] - Client 
> environment:java.compiler=
> 2016-09-19 22:50:20,235 [myid:] - INFO  [main:Environment@109] - Client 
> environment:os.name=Linux
> 2016-09-19 22:50:20,235 [myid:] - INFO  [main:Environment@109] - Client 
> environment:os.arch=amd64
> 2016-09-19 22:50:20,235 [myid:] - INFO  [main:Environment@109] - Client 
> environment:os.version=3.0.7

[jira] [Commented] (ZOOKEEPER-3056) Fails to load database with missing snapshot file but valid transaction log file

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695438#comment-16695438
 ] 

Michael K. Edwards commented on ZOOKEEPER-3056:
---

Fix needed for 3.5.5?

> Fails to load database with missing snapshot file but valid transaction log 
> file
> 
>
> Key: ZOOKEEPER-3056
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3056
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.3, 3.5.4
>Reporter: Michael Han
>Priority: Critical
> Attachments: snapshot.0
>
>
> [An 
> issue|https://lists.apache.org/thread.html/cc17af6ef05d42318f74148f1a704f16934d1253f14721a93b4b@%3Cdev.zookeeper.apache.org%3E]
>  was reported when a user failed to upgrade from 3.4.10 to 3.5.4 with missing 
> snapshot file.
> The code that complains about the missing snapshot file is 
> [here|https://github.com/apache/zookeeper/blob/release-3.5.4/src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java#L206]
>  which was introduced as part of ZOOKEEPER-2325.
> With this check, ZK will not load the db without a snapshot file, even if the 
> transaction log files are present and valid. This could be a problem for 
> restoring a ZK instance which does not have a snapshot file but has a sound 
> state (e.g. it crashed before being able to take the first snapshot, with a 
> large snapCount parameter configured).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3145) Potential watch missing issue due to stale pzxid when replaying CloseSession txn with fuzzy snapshot

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695436#comment-16695436
 ] 

Michael K. Edwards commented on ZOOKEEPER-3145:
---

Fix needed for 3.5.5?

> Potential watch missing issue due to stale pzxid when replaying CloseSession 
> txn with fuzzy snapshot
> 
>
> Key: ZOOKEEPER-3145
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3145
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.4, 3.6.0, 3.4.13
>Reporter: Fangmin Lv
>Assignee: Fangmin Lv
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.6.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> This is another issue I found recently; we haven't seen this problem on prod 
> (or maybe we just haven't noticed).
>  
> Currently, CloseSession is not idempotent: executing the CloseSession 
> twice won't produce the same result.
>  
> The problem is that closeSession will only check which ephemeral nodes are 
> associated with that session based on the current state. Nodes deleted while 
> taking a fuzzy snapshot won't be deleted again when replaying the txn.
>  
> This looks fine, since the node is already gone, but there is a problem with 
> the pzxid of the parent node. The snapshot is taken fuzzily, so it's possible 
> that the parent had already been serialized while the nodes were being deleted 
> when executing the closeSession txn. The pzxid will not be updated in the 
> snapshot when replaying the closeSession txn, because it doesn't know which 
> paths were deleted, so it won't patch the pzxid as we did for deleteNode in 
> ZOOKEEPER-3125.
>  
> The inconsistent pzxid can lead to missed watch notifications when a client 
> reconnects with setWatches, because of the staleness. 
>  
> This JIRA is going to fix those issues by adding a CloseSessionTxn: it will 
> record all the nodes being deleted in that CloseSession txn, so that we 
> know which nodes to update when replaying the txn.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3112) fd leak due to UnresolvedAddressException on connect.

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695437#comment-16695437
 ] 

Michael K. Edwards commented on ZOOKEEPER-3112:
---

Fix needed for 3.5.5?

> fd leak due to UnresolvedAddressException on connect.
> -
>
> Key: ZOOKEEPER-3112
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3112
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: java client
>Affects Versions: 3.5.4, 3.4.13
>Reporter: Tianzhou Wang
>Priority: Critical
>  Labels: pull-request-available
> Attachments: patch.diff
>
>   Original Estimate: 10m
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> If the connecting domain fails to resolve and leads to an 
> UnresolvedAddressException, the fd is leaked.
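
The fix pattern (a hedged sketch with hypothetical names, not the attached patch) is to make sure the already-opened channel is closed when connect() throws.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;

public class ConnectHelper {
    static SocketChannel openAndConnect(InetSocketAddress addr) throws IOException {
        SocketChannel sock = SocketChannel.open();
        try {
            sock.configureBlocking(false);
            // Throws UnresolvedAddressException (a RuntimeException) when the
            // hostname cannot be resolved.
            sock.connect(addr);
            return sock;
        } catch (IOException | RuntimeException e) {
            // Without this close(), the channel (and its fd) is leaked every
            // time resolution or the connect attempt fails.
            sock.close();
            throw e;
        }
    }
}
{code}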



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some members.

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695435#comment-16695435
 ] 

Michael K. Edwards commented on ZOOKEEPER-2930:
---

Fix needed for 3.5.5?

> Leader cannot be elected due to network timeout of some members.
> 
>
> Key: ZOOKEEPER-2930
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2930
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.4.10, 3.5.3, 3.4.11, 3.5.4, 3.4.12
> Environment: Java 8
> ZooKeeper 3.4.11(from github)
> Centos6.5
>Reporter: Jiafu Jiang
>Priority: Critical
>  Labels: pull-request-available
> Attachments: zoo.cfg, zookeeper1.log, zookeeper2.log
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> I deploy a cluster of ZooKeeper with three nodes:
> ofs_zk1:20.10.11.101, 30.10.11.101
> ofs_zk2:20.10.11.102, 30.10.11.102
> ofs_zk3:20.10.11.103, 30.10.11.103
> I shut down the network interfaces of ofs_zk2 using the "ifdown eth0 eth1" 
> command.
> A new Leader is supposed to be elected within a few seconds, but in fact 
> ofs_zk1 and ofs_zk3 just keep holding elections again and again, and neither 
> of them can become the new Leader.
> I changed the log level to DEBUG (the default is INFO) and restarted the 
> zookeeper servers on ofs_zk1 and ofs_zk2 again, but that did not fix the 
> problem.
> I read the log and the ZooKeeper source code, and I think I found the reason.
> When the potential leader (say ofs_zk3) begins the 
> election (FastLeaderElection.lookForLeader()), it will send notifications to 
> all the servers. 
> When it fails to receive any notification within a timeout, it will resend 
> the notifications and double the timeout. This process repeats until a 
> notification is received or the timeout reaches a max value.
> FastLeaderElection.sendNotifications() just puts the notification message 
> into a queue and returns. The WorkerSender is responsible for sending the 
> notifications.
> The WorkerSender just processes the notifications one by one by passing them 
> to QuorumCnxManager. Here comes the problem: 
> QuorumCnxManager.toSend() blocks for a long time when the notification is 
> sent to ofs_zk2 (whose network is down), and some notifications (which belong 
> to ofs_zk1) will thus be blocked for a long time. The repeated notifications 
> from FastLeaderElection.sendNotifications() just make things worse.
> Here is the related source code:
> {code:java}
> public void toSend(Long sid, ByteBuffer b) {
> /*
>  * If sending message to myself, then simply enqueue it (loopback).
>  */
> if (this.mySid == sid) {
>  b.position(0);
>  addToRecvQueue(new Message(b.duplicate(), sid));
> /*
>  * Otherwise send to the corresponding thread to send.
>  */
> } else {
>  /*
>   * Start a new connection if doesn't have one already.
>   */
>  ArrayBlockingQueue<ByteBuffer> bq = new 
> ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
>  ArrayBlockingQueue<ByteBuffer> bqExisting = 
> queueSendMap.putIfAbsent(sid, bq);
>  if (bqExisting != null) {
>  addToSendQueue(bqExisting, b);
>  } else {
>  addToSendQueue(bq, b);
>  }
>  
>  // This may block!!!
>  connectOne(sid);
> 
> }
> }
> {code}
> Therefore, when ofs_zk3 believes that it is the leader, it begins to wait for 
> the epoch ack, but in fact ofs_zk1 does not receive the notification (which 
> says the leader is ofs_zk3) because ofs_zk3 has not yet sent the 
> notification (it may still be sitting in the send queue of WorkerSender). In 
> the end, the potential leader ofs_zk3 fails to receive the epoch ack within 
> the timeout, so it quits being leader and begins a new election. 
> The log files of ofs_zk1 and ofs_zk3 are attached.
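
One way to bound how long toSend()/connectOne() can stall the sending path is a connect with an explicit timeout; the sketch below shows the general technique with hypothetical names and is not necessarily the project's eventual fix.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class BoundedConnect {
    // Hypothetical timeout; a real value would come from configuration.
    private static final int CONNECT_TIMEOUT_MS = 5000;

    static Socket connectTo(InetSocketAddress electionAddr) throws IOException {
        Socket sock = new Socket();
        try {
            // A plain new Socket(host, port) can block for the OS-level TCP
            // timeout (minutes) when the peer's network is down; the overload
            // below caps the stall so other notifications aren't starved.
            sock.connect(electionAddr, CONNECT_TIMEOUT_MS);
            return sock;
        } catch (IOException e) {
            sock.close();
            throw e;
        }
    }
}
{code}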



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3036) Unexpected exception in zookeeper

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695430#comment-16695430
 ] 

Michael K. Edwards commented on ZOOKEEPER-3036:
---

Fix needed for 3.5.5?

> Unexpected exception in zookeeper
> -
>
> Key: ZOOKEEPER-3036
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3036
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Affects Versions: 3.4.10
> Environment: 3 Zookeepers, 5 kafka servers
>Reporter: Oded
>Priority: Critical
>
> We got an issue with one of the zookeepers (the Leader), causing the entire 
> kafka cluster to fail:
> 2018-05-09 02:29:01,730 [myid:3] - ERROR 
> [LearnerHandler-/192.168.0.91:42490:LearnerHandler@648] - Unexpected 
> exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
>     at java.net.SocketInputStream.socketRead0(Native Method)
>     at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>     at java.net.SocketInputStream.read(SocketInputStream.java:171)
>     at java.net.SocketInputStream.read(SocketInputStream.java:141)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>     at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>     at 
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
> 2018-05-09 02:29:01,730 [myid:3] - WARN  
> [LearnerHandler-/192.168.0.91:42490:LearnerHandler@661] - *** GOODBYE 
> /192.168.0.91:42490 
>  
> We would expect that zookeeper will choose another Leader and the Kafka 
> cluster will continue to work as expected, but that was not the case.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2966) Flaky NullPointerException while closing client connection

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695428#comment-16695428
 ] 

Michael K. Edwards commented on ZOOKEEPER-2966:
---

Fix needed for 3.5.5?

> Flaky NullPointerException while closing client connection
> --
>
> Key: ZOOKEEPER-2966
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2966
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: java client
>Affects Versions: 3.5.3
>Reporter: Enrico Olivelli
>Priority: Critical
>
> It is not always reproducible; I get this in system tests of client 
> applications.
> ZK client 3.5.3; the stacktrace is self-explanatory:
> {code:java}
> java.lang.NullPointerException
>     at 
> org.apache.zookeeper.ClientCnxnSocketNetty.onClosing(ClientCnxnSocketNetty.java:206)
>     at org.apache.zookeeper.ClientCnxn$SendThread.close(ClientCnxn.java:1395)
>     at org.apache.zookeeper.ClientCnxn.disconnect(ClientCnxn.java:1440)
>     at org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1467)
>     at org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:1319){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2711) Deadlock between concurrent 4LW commands that iterate over connections with Netty server

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695426#comment-16695426
 ] 

Michael K. Edwards commented on ZOOKEEPER-2711:
---

Fix needed for 3.5.5?

> Deadlock between concurrent 4LW commands that iterate over connections with 
> Netty server
> 
>
> Key: ZOOKEEPER-2711
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2711
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
>
> Observed the following issue in some $dayjob testing environments. Line 
> numbers are a little off compared to master/branch-3.5, but I did confirm the 
> same issue exists there.
> With the NettyServerCnxnFactory, before a request is dispatched, the code 
> synchronizes on the {{NettyServerCnxn}} object. However, with some 4LW 
> commands (like {{stat}}), each {{ServerCnxn}} object is also synchronized on 
> to (safely) iterate over the internal contents of the object and generate the 
> necessary debug message. As such, two concurrent {{stat}} commands can 
> each lock their own {{NettyServerCnxn}} objects and then block waiting 
> to lock each other's {{ServerCnxn}} in the {{StatCommand}}: a deadlock.
> {noformat}
> "New I/O worker #55":
>   at 
> org.apache.zookeeper.server.ServerCnxn.dumpConnectionInfo(ServerCnxn.java:407)
>   - waiting to lock <0xfabc01b8> (a 
> org.apache.zookeeper.server.NettyServerCnxn)
>   at 
> org.apache.zookeeper.server.NettyServerCnxn$StatCommand.commandRun(NettyServerCnxn.java:478)
>   at 
> org.apache.zookeeper.server.NettyServerCnxn$CommandThread.run(NettyServerCnxn.java:311)
>   at 
> org.apache.zookeeper.server.NettyServerCnxn$CommandThread.start(NettyServerCnxn.java:306)
>   at 
> org.apache.zookeeper.server.NettyServerCnxn.checkFourLetterWord(NettyServerCnxn.java:677)
>   at 
> org.apache.zookeeper.server.NettyServerCnxn.receiveMessage(NettyServerCnxn.java:790)
>   at 
> org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.processMessage(NettyServerCnxnFactory.java:211)
>   at 
> org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.messageReceived(NettyServerCnxnFactory.java:135)
>   - locked <0xfab68178> (a 
> org.apache.zookeeper.server.NettyServerCnxn)
>   at 
> org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>   at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
>   at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
>   at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
>   at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>   at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>   at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> "New I/O worker #51":
>   at 
> org.apache.zookeeper.server.ServerCnxn.dumpConnectionInfo(ServerCnxn.java:407)
>   - waiting to lock <0xfab68178> (a 
> org.apache.zookeeper.server.NettyServerCnxn)
>   at 
> org.apache.zookeeper.server.NettyServerCnxn$StatCommand.commandRun(NettyServerCnxn.java:478)
>   at 
> org.apache.zookeeper.server.NettyServerCnxn$CommandThread.run(NettyServerCnxn.java:311)
>   at 
> org.apache.zookeeper.server.NettyServerCnxn$CommandThread.start(NettyServerCnxn.java:306)
>   at 
> org.apache.zookeeper.server.NettyServerCnxn.checkFourLetterWord(NettyServerCnxn.java:677)
>   at 
> org.apache.zookeeper.server.NettyServerCnxn.receiveMessage(NettyServerCnxn.java:790)
>   at 
> org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.processMessage(NettyServerCnxnFactory.java:211)
>   at 
> org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.messageReceived(NettyServerCnxnFactory.java:135)
>   - locked <0xfabc0

[jira] [Commented] (ZOOKEEPER-2639) Port Quorum Peer mutual authentication SASL feature to branch-3.5 and trunk

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695424#comment-16695424
 ] 

Michael K. Edwards commented on ZOOKEEPER-2639:
---

Is this a 3.5.5 thing, or later?

> Port Quorum Peer mutual authentication SASL feature to branch-3.5 and trunk
> ---
>
> Key: ZOOKEEPER-2639
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2639
> Project: ZooKeeper
>  Issue Type: Task
>  Components: quorum, security
>Reporter: Rakesh R
>Assignee: Rakesh R
>Priority: Critical
> Fix For: 3.6.0
>
>
> ZooKeeper server-server mutual authentication is implemented in 
> {{branch-3.4}} using ZOOKEEPER-1045 jira. The feature code is not directly 
> portable to other branches due to code difference. This jira can be used to 
> "forward" port the code changes to {{branch-3.5}} and {{trunk}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3182) Race condition when follower syncing with leader and starting to serve requests

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695423#comment-16695423
 ] 

Michael K. Edwards commented on ZOOKEEPER-3182:
---

Is this fix needed for 3.5.5?

> Race condition when follower syncing with leader and starting to serve 
> requests
> ---
>
> Key: ZOOKEEPER-3182
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3182
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.6.0
>Reporter: Andor Molnar
>Priority: Critical
>
> This issue was probably introduced by ZOOKEEPER-2024, where 2 separate queues 
> were implemented in CommitProcessor to improve performance. 
> [~abrahamfine] 's analysis on GitHub is accurate: 
> https://github.com/apache/zookeeper/pull/300
> He was trying to introduce synchronization between Learner.syncWithLeader() 
> and CommitProcessor to wait for in-flight requests to be committed before 
> accepting client requests.
> In the affected unit test ({{testNodeDataChanged}}) there's a race between 
> the reconnecting client's setWatches request and updates coming from the 
> leader, according to the following logs:
> {noformat}
> 2018-10-25 13:59:58,556 [myid:] - DEBUG 
> [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request:: 
> sessionid:0x10005d8fc4d type:setWatches cxid:0x3 zxid:0xfffe 
> txntype:unknown reqpath:n/a
> 2018-10-25 13:59:58,556 [myid:] - DEBUG 
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: 
> sessionid:0x10005d8fc4d type:setWatches cxid:0x3 zxid:0xfffe 
> txntype:unknown reqpath:n/a
> ...
> 2018-10-25 13:59:58,557 [myid:] - DEBUG 
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: 
> sessionid:0x20005d8f8a4 type:delete cxid:0x1 zxid:0x10004 txntype:2 
> reqpath:n/a
> ...
> 2018-10-25 13:59:58,561 [myid:] - DEBUG 
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: 
> sessionid:0x20005d8f8a4 type:create cxid:0x2 zxid:0x10005 txntype:1 
> reqpath:n/a
> 2018-10-25 13:59:58,561 [myid:127.0.0.1:11231] - DEBUG 
> [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got 
> WatchedEvent state:SyncConnected type:NodeDeleted path:/test-changed for 
> sessionid 0x10005d8fc4d
> {noformat}
> The {{setWatches}} request is processed before {{delete}} and {{create}}, so 
> the client receives a NodeDeleted event.
> In the working scenario it looks like this:
> {noformat}
> 2018-10-25 14:04:55,247 [myid:] - DEBUG 
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: 
> sessionid:0x20005dd8811 type:delete cxid:
> 0x1 zxid:0x10004 txntype:2 reqpath:n/a
> 2018-10-25 14:04:55,249 [myid:] - DEBUG 
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: 
> sessionid:0x20005dd8811 type:create cxid:
> 0x2 zxid:0x10005 txntype:1 reqpath:n/a
> ...
> 2018-10-25 14:04:56,314 [myid:] - DEBUG 
> [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request:: 
> sessionid:0x10005dd8811 type:setWatches cxid:0x3 zxid:0xfffe 
> txntype:unknown reqpath:n/a
> 2018-10-25 14:04:56,315 [myid:] - DEBUG 
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: 
> sessionid:0x10005dd8811 type:setWatches cxid:0x3 zxid:0xfffe 
> txntype:unknown reqpath:n/a
> ...
> 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG 
> [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@842] - Got 
> notification sessionid:0x10005dd8811
> 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG 
> [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got 
> WatchedEvent state:SyncConnected type:NodeDataChanged path:/test-changed for 
> sessionid 0x10005dd8811
> {noformat}
> The {{delete}} and {{create}} requests happen well before {{setWatches}} comes 
> in (even before the client connection is established), and the client receives 
> only a NodeDataChanged event.
> Abe's approach unfortunately raises the following concerns:
> - it modifies CommitProcessor's code, which might affect performance and 
> correctness ([~shralex] raised this on ZOOKEEPER-2807),
> - we experienced deadlocks while testing the patch: 
> https://github.com/apache/zookeeper/pull/300
> As a consequence I raised this Jira to capture the findings and to put the 
> unit test on the Ignore list, because I'm currently not sure whether this is a 
> real issue or a non-backward-compatible change in 3.6 made in exchange for a 
> large performance improvement.
> Either way I don't want this flaky test to get in the way of contributions, so 
> I'll mark it as Ignored on trunk until the issue is resolved.
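
As an aside, the general shape of the synchronization Abe was going for (count 
requests entering the commit pipeline and wait for them to drain before 
accepting client requests) can be sketched roughly as below. The class and 
method names are purely illustrative; this is not the actual patch.
{code}
import java.util.concurrent.atomic.AtomicInteger;

class InFlightTracker {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final Object drained = new Object();

    void onSubmitted() {                       // request handed to the commit pipeline
        inFlight.incrementAndGet();
    }

    void onCommitted() {                       // request finished by the final processor
        if (inFlight.decrementAndGet() == 0) {
            synchronized (drained) { drained.notifyAll(); }
        }
    }

    // Called before the follower starts accepting client requests.
    void awaitDrained(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (drained) {
            while (inFlight.get() > 0) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    break;                     // give up rather than block forever
                }
                drained.wait(remaining);
            }
        }
    }
}
{code}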



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly lead to data inconsistency

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695420#comment-16695420
 ] 

Michael K. Edwards commented on ZOOKEEPER-2846:
---

Does this need to be addressed (or release noted) for 3.5.5?

> Leader follower sync with on disk txns can possibly lead to data 
> inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Priority: Critical
>
> On-disk txn sync could cause data inconsistency if the current leader had a 
> snap sync just before it became leader; a subsequent diff sync with its 
> followers may then sync the txn gap on disk. Here is the scenario: 
> Let's say S0 - S3 are followers, and S4 is the leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum, so that S3 has a snap sync 
> with S4 when it starts up
> 3. Stop S4; S3 becomes the new leader
> 4. Start S2, which has a diff sync with S3; now there are gaps in S2
> A test case to verify the issue is attached. Currently there is no efficient 
> way to check whether a gap in the txn files is a real gap or due to an epoch 
> change. We need to add that support, but until then it would be safer to 
> disable the on-disk txn leader-follower sync.
> Two more scenarios that could cause the same issue:
> (Scenario 1) Servers A, B, C; A is the leader, the others are followers:
>   1). A syncs to disk, but the other 2 restart before receiving the 
> proposal
>   2). B and C form a quorum, B is the leader, and commits some requests
>   3). A goes back into looking, and syncs with B; B won't be able to trunc A, 
> so it sends a snap instead, leaving the extra txn in A's txn file
>   4). A becomes the new leader, and anyone who does a diff sync with A will 
> get the extra txn 
> (Scenario 2) A diff sync with committed txns only applies to the data tree, 
> not to the on-disk txn file, which also leaves a hole in it and leads to data 
> inconsistency when syncing with learners.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2851) [QP MutualAuth]: add QuorumCnxManager tests that cover quorum auth logic.

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695400#comment-16695400
 ] 

Michael K. Edwards commented on ZOOKEEPER-2851:
---

Doable for 3.5.5?

> [QP MutualAuth]: add QuorumCnxManager tests that cover quorum auth logic.
> --
>
> Key: ZOOKEEPER-2851
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2851
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: quorum, server, tests
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: Michael Han
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
>
> Some of the ZOOKEEPER-1045 unit tests were implemented as part of 
> {{QuorumCnxManagerTest}}; however, this class is only available in branch-3.4: 
> it was introduced in ZOOKEEPER-1633 to cover upgrade-path testing from 3.4 to 
> 3.5, a feature not available in branch-3.5.
> This task is to migrate the ZOOKEEPER-1045 related tests in 
> {{QuorumCnxManagerTest}} from branch-3.4 to branch-3.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2850) [QP MutualAuth]: Port ZOOKEEPER-2650 and ZOOKEEPER-2759 from branch-3.4 to branch-3.5.

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695399#comment-16695399
 ] 

Michael K. Edwards commented on ZOOKEEPER-2850:
---

Doable for 3.5.5?

> [QP MutualAuth]: Port ZOOKEEPER-2650 and ZOOKEEPER-2759 from branch-3.4 to 
> branch-3.5.
> --
>
> Key: ZOOKEEPER-2850
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2850
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: quorum, security
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: Michael Han
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
>
> These patches are improvements to test cases and small bug fixes made after 
> ZOOKEEPER-1045 was committed to branch-3.4. We need to port them to branch-3.5 
> to close the loop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2840) Should use `System.nanoTime() ^ this.hashCode()` for StaticHostProvider

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695398#comment-16695398
 ] 

Michael K. Edwards commented on ZOOKEEPER-2840:
---

Is this something that could/should land in time for 3.5.5?

> Should use `System.nanoTime() ^ this.hashCode()` for StaticHostProvider
> -
>
> Key: ZOOKEEPER-2840
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2840
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: java client
>Affects Versions: 3.5.3
>Reporter: Benedict Jin
>Assignee: Benedict Jin
>Priority: Major
> Fix For: 3.5.5
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> StaticHostProvider should use `System.nanoTime() ^ this.hashCode()` as the 
> shuffle seed instead of `System.currentTimeMillis()`. If we have three 
> ZooKeeper server nodes and set `zookeeper.leaderServes` to `no`, then client 
> connections will always go to the first ZooKeeper server node, as the 
> following test output shows:
> ```java
> @Test
> public void testShuffle() throws Exception {
> LinkedList<InetSocketAddress> inetSocketAddressesList = new 
> LinkedList<>();
> inetSocketAddressesList.add(new InetSocketAddress(0));
> inetSocketAddressesList.add(new InetSocketAddress(1));
> inetSocketAddressesList.add(new InetSocketAddress(2));
> /*
> 1442045361
> currentTime: 1499253530044, currentTime ^ hashCode: 1500143845389, 
> Result: 1 2 0
> currentTime: 1499253530044, currentTime ^ hashCode: 1500143845389, 
> Result: 2 0 1
> currentTime: 1499253530045, currentTime ^ hashCode: 1500143845388, 
> Result: 0 1 2
> currentTime: 1499253530045, currentTime ^ hashCode: 1500143845388, 
> Result: 1 2 0
> currentTime: 1499253530046, currentTime ^ hashCode: 1500143845391, 
> Result: 1 2 0
> currentTime: 1499253530046, currentTime ^ hashCode: 1500143845391, 
> Result: 1 2 0
> currentTime: 1499253530046, currentTime ^ hashCode: 1500143845391, 
> Result: 1 2 0
> currentTime: 1499253530046, currentTime ^ hashCode: 1500143845391, 
> Result: 1 2 0
> currentTime: 1499253530047, currentTime ^ hashCode: 1500143845390, 
> Result: 1 2 0
> currentTime: 1499253530047, currentTime ^ hashCode: 1500143845390, 
> Result: 1 2 0
>  */
> internalShuffleMillis(inetSocketAddressesList);
> /*
> 146611050
> currentTime: 22618159623770, currentTime ^ hashCode: 22618302559536, 
> Result: 2 1 0
> currentTime: 22618159800738, currentTime ^ hashCode: 22618302085832, 
> Result: 0 1 2
> currentTime: 22618159967442, currentTime ^ hashCode: 2261830224, 
> Result: 1 0 2
> currentTime: 22618160135080, currentTime ^ hashCode: 22618302013634, 
> Result: 2 1 0
> currentTime: 22618160302095, currentTime ^ hashCode: 22618301535077, 
> Result: 2 1 0
> currentTime: 22618160490260, currentTime ^ hashCode: 22618301725822, 
> Result: 1 0 2
> currentTime: 22618161566373, currentTime ^ hashCode: 22618300303823, 
> Result: 1 0 2
> currentTime: 22618161745518, currentTime ^ hashCode: 22618300355844, 
> Result: 2 1 0
> currentTime: 22618161910357, currentTime ^ hashCode: 22618291603775, 
> Result: 2 1 0
> currentTime: 22618162079549, currentTime ^ hashCode: 22618291387479, 
> Result: 0 1 2
>  */
> internalShuffleNano(inetSocketAddressesList);
> inetSocketAddressesList.clear();
> inetSocketAddressesList.add(new InetSocketAddress(0));
> inetSocketAddressesList.add(new InetSocketAddress(1));
> /*
> 415138788
> currentTime: 1499253530050, currentTime ^ hashCode: 1499124456998, 
> Result: 0 1
> currentTime: 1499253530050, currentTime ^ hashCode: 1499124456998, 
> Result: 0 1
> currentTime: 1499253530050, currentTime ^ hashCode: 1499124456998, 
> Result: 0 1
> currentTime: 1499253530050, currentTime ^ hashCode: 1499124456998, 
> Result: 0 1
> currentTime: 1499253530050, currentTime ^ hashCode: 1499124456998, 
> Result: 0 1
> currentTime: 1499253530050, currentTime ^ hashCode: 1499124456998, 
> Result: 0 1
> currentTime: 1499253530053, currentTime ^ hashCode: 1499124456993, 
> Result: 0 1
> currentTime: 1499253530055, currentTime ^ hashCode: 1499124456995, 
> Result: 0 1
> currentTime: 1499253530055, currentTime ^ hashCode: 1499124456995, 
> Result: 0 1
> currentTime: 1499253530055, currentTime ^ hashCode: 1499124456995, 
> Result: 0 1
>  */
> internalShuffleMillis(inetSocketAddressesList);
> /*
> 13326370
> currentTime: 22618168292396, currentTime ^ hashCode: 22618156149774,

[jira] [Commented] (ZOOKEEPER-2694) sync CLI command does not wait for result from server

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695395#comment-16695395
 ] 

Michael K. Edwards commented on ZOOKEEPER-2694:
---

Fixable for 3.5.5?

> sync CLI command does not wait for result from server
> -
>
> Key: ZOOKEEPER-2694
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2694
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: java client
>Affects Versions: 3.5.0
>Reporter: Mohammad Arshad
>Assignee: Mohammad Arshad
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-2694-01.patch
>
>
> The sync CLI command does not wait for the result from the server. It returns 
> immediately after invoking sync's asynchronous API.
> Executing the command below does not give the expected result:
>  {{/bin/zkCli.sh -server host:port sync /}}
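
For anyone who needs the blocking behaviour from their own code in the 
meantime, the usual workaround is to bridge the asynchronous API with a latch. 
A rough sketch, assuming the standard sync(path, VoidCallback, ctx) entry 
point; the helper name and timeout handling are just illustrative:
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.AsyncCallback.VoidCallback;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

class SyncAndWait {
    // Blocks until the server answers the sync, or the timeout expires.
    static int syncAndWait(ZooKeeper zk, String path, long timeoutMs)
            throws InterruptedException {
        final CountDownLatch done = new CountDownLatch(1);
        final int[] resultCode = new int[1];
        zk.sync(path, new VoidCallback() {
            @Override
            public void processResult(int rc, String p, Object ctx) {
                resultCode[0] = rc;
                done.countDown();
            }
        }, null);
        if (!done.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            return KeeperException.Code.OPERATIONTIMEOUT.intValue();
        }
        return resultCode[0];
    }
}
{code}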



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2440) permanent SESSIONMOVED error after client app reconnects to zookeeper cluster

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695394#comment-16695394
 ] 

Michael K. Edwards commented on ZOOKEEPER-2440:
---

Is this patch something that could/should be revived for 3.5.5?

> permanent SESSIONMOVED error after client app reconnects to zookeeper cluster
> -
>
> Key: ZOOKEEPER-2440
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2440
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.0
>Reporter: Ryan Zhang
>Assignee: Ryan Zhang
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-2440.patch
>
>
> ZOOKEEPER-710 fixed the issue when the request is not a multi request. 
> However, the multi request is handled a little bit differently as the code 
> didn't throw the SESSIONMOVED exception. In addition, the exception is set in 
> the request by the leader so it will be lost in the commit process and by the 
> time the final processor sees it, it will be gone. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2407) EventThread in ClientCnxn can't be closed when SendThread exits because of auth failed during reconnection

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695391#comment-16695391
 ] 

Michael K. Edwards commented on ZOOKEEPER-2407:
---

Is this really a duplicate?  Is it something that should be resolved before 
3.5.5 releases?

> EventThread in ClientCnxn can't be closed when SendThread exits because of 
> auth failed during reconnection
> --
>
> Key: ZOOKEEPER-2407
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2407
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.5.1
>Reporter: sunhaitao
>Assignee: sunhaitao
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: zookeeper-2407.patch
>
>
> The EventThread in ClientCnxn can't be closed when the SendThread exits 
> because auth failed during reconnection.
> If the send thread is in the AUTH_FAILED state, the send thread exits, but the 
> event thread keeps running.
> Observation:
> using jstack to check the running threads shows that the send thread no longer 
> exists but the event thread is still there;
> even when we call zookeeper.close(), the event thread is still there.
> Stack trace: 
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:514)
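
The stack above is just the event thread parked on its queue. Whatever the 
final fix looks like, the usual pattern for making such a consumer thread 
closeable is to post a sentinel onto its queue from close(). A generic sketch 
of that pattern, not the actual ClientCnxn code:
{code}
import java.util.concurrent.LinkedBlockingQueue;

class EventLoop implements Runnable {
    private static final Object SHUTDOWN = new Object();   // "event of death" sentinel
    private final LinkedBlockingQueue<Object> events = new LinkedBlockingQueue<>();

    public void submit(Object event) {
        events.add(event);
    }

    public void shutdown() {
        events.add(SHUTDOWN);                               // wakes the take() below
    }

    @Override
    public void run() {
        try {
            while (true) {
                Object event = events.take();
                if (event == SHUTDOWN) {
                    return;                                 // thread exits cleanly
                }
                // ... dispatch the event ...
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}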



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2260) Paginated getChildren call

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695381#comment-16695381
 ] 

Michael K. Edwards commented on ZOOKEEPER-2260:
---

This is a high-value feature.  Can it be revived for the 3.5.5 release?

> Paginated getChildren call
> --
>
> Key: ZOOKEEPER-2260
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2260
> Project: ZooKeeper
>  Issue Type: New Feature
>Affects Versions: 3.4.6, 3.5.0
>Reporter: Marco P.
>Assignee: Marco P.
>Priority: Major
>  Labels: api, features
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-2260.patch, ZOOKEEPER-2260.patch
>
>
> Add pagination support to the getChildren() call, allowing clients to iterate 
> over children N at a time.
> Motivations for this include:
>   - Getting out of a situation where so many children were created that 
> listing them exceeded the network buffer sizes (making it impossible to 
> recover by deleting) [1]
>  - More efficient traversal of nodes with a large number of children [2]
> I do have a patch (for 3.4.6) we've been using successfully for a while, but 
> I suspect much more work is needed for this to be accepted. 
> [1] https://issues.apache.org/jira/browse/ZOOKEEPER-272
> [2] https://issues.apache.org/jira/browse/ZOOKEEPER-282



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2170) Zookeeper is not logging as per the configuration in log4j.properties

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695380#comment-16695380
 ] 

Michael K. Edwards commented on ZOOKEEPER-2170:
---

Is this something we should address prior to the 3.5.5 release?

> Zookeeper is not logging as per the configuration in log4j.properties
> -
>
> Key: ZOOKEEPER-2170
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2170
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: Mohammad Arshad
>Assignee: Mohammad Arshad
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-2170-002.patch, ZOOKEEPER-2170-003.patch, 
> ZOOKEEPER-2170-004.patch, ZOOKEEPER-2170-005.patch, ZOOKEEPER-2170.001.patch
>
>
> In conf/log4j.properties the default root logger is 
> {code}
> zookeeper.root.logger=INFO, CONSOLE
> {code}
> Changing the root logger to the value below, or to any other value, has no 
> effect on logging:
> {code}
> zookeeper.root.logger=DEBUG, ROLLINGFILE
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2128) zoo_aremove_watchers API is incorrect

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695378#comment-16695378
 ] 

Michael K. Edwards commented on ZOOKEEPER-2128:
---

Is this still a thing?  Should it be fixed for 3.5.5?

> zoo_aremove_watchers API is incorrect
> -
>
> Key: ZOOKEEPER-2128
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2128
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.6.0
>Reporter: Dave Gosselin
>Assignee: Dave Gosselin
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
>
> The C API for zoo_aremove_watchers incorrectly specifies the seventh argument 
> as a pointer to a function pointer.  It should simply be a function pointer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2154) NPE in KeeperException

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695362#comment-16695362
 ] 

Michael K. Edwards commented on ZOOKEEPER-2154:
---

Is this reproducible with current branch-3.5?

> NPE in KeeperException
> --
>
> Key: ZOOKEEPER-2154
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2154
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: java client
>Affects Versions: 3.4.6
>Reporter: Surendra Singh Lilhore
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-2154.patch
>
>
> KeeperException should handle the case where code is null...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1801) TestReconfig failure

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695370#comment-16695370
 ] 

Michael K. Edwards commented on ZOOKEEPER-1801:
---

Is this applicable to the current branch-3.5?

> TestReconfig failure
> 
>
> Key: ZOOKEEPER-1801
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1801
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Reporter: Flavio Junqueira
>Assignee: Marshall McMullen
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
>
> This is the message:
> {noformat}
> /home/jenkins/jenkins-slave/workspace/ZooKeeper-trunk/trunk/src/c/tests/TestReconfig.cc:183:
>  Assertion: equality assertion failed [Expected: 1, Actual  : 0]
> {noformat}
>  https://builds.apache.org/job/ZooKeeper-trunk/2100/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1814) Reduction of waiting time during Fast Leader Election

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695369#comment-16695369
 ] 

Michael K. Edwards commented on ZOOKEEPER-1814:
---

Is this applicable to the current branch-3.5?

> Reduction of waiting time during Fast Leader Election
> -
>
> Key: ZOOKEEPER-1814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.4.5, 3.5.0
>Reporter: Daniel Peon
>Assignee: Daniel Peon
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1814.patch, ZOOKEEPER-1814.patch, 
> ZOOKEEPER-1814.patch, ZOOKEEPER-1814.patch, ZOOKEEPER-1814.patch, 
> ZOOKEEPER-1814.patch, ZOOKEEPER-1814.patch, ZOOKEEPER-1814.patch, 
> ZOOKEEPER-1814.patch, ZOOKEEPER-1814_release_3_5_0.patch, 
> ZOOKEEPER-1814_trunk.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Fast leader election can take a long time because of the exponential backoff; 
> currently the cap is 60 seconds.
> It would be useful to make this parameter configurable, for example for a 
> server shutdown.
> Otherwise it sometimes takes so long that a test failure has been observed 
> when executing org.apache.zookeeper.server.quorum.QuorumPeerMainTest.
> That test case waits up to 30 seconds, which is less than the 60 seconds that 
> leader election may still be waiting at the moment of shutdown.
> Given the failure in the test case, this issue was considered a possible bug.
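
For illustration only, the configurable cap being asked for amounts to 
something like the sketch below; the system property name and default are 
assumptions, not the actual implementation:
{code}
class NotificationBackoff {
    // Assumed property name; defaults to the current 60-second cap.
    private final long maxNotificationIntervalMs =
            Long.getLong("zookeeper.fastleader.maxNotificationInterval", 60_000L);

    long nextInterval(long currentIntervalMs) {
        // Double the wait each round, but never past the configured cap, so a
        // shutdown (or a late-arriving vote) is noticed within bounded time.
        return Math.min(currentIntervalMs * 2, maxNotificationIntervalMs);
    }
}
{code}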



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695365#comment-16695365
 ] 

Michael K. Edwards commented on ZOOKEEPER-1865:
---

Is this reproducible in current 3.5?

> Fix retry logic in Learner.connectToLeader() 
> -
>
> Key: ZOOKEEPER-1865
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Thawan Kooburat
>Assignee: Edward Carter
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1865-nanoTime.patch, 
> ZOOKEEPER-1865-testfix.patch, ZOOKEEPER-1865.patch
>
>
> We discovered a long leader election time today in one of our prod ensembles.
> Here is a description of the event. 
> Before the old leader went down, it was able to announce a notification 
> message, so 3 out of 5 servers (including the old leader) elected the old 
> leader as the new leader for the next epoch. While the old leader was being 
> rebooted, 2 other machines kept trying to connect to it, so the quorum couldn't 
> form until those 2 machines gave up and moved to the next round of leader 
> election.
> This is because Learner.connectToLeader() uses simple retry logic. The 
> contract for this method is that it should never spend longer than initLimit 
> trying to connect to the leader.  In our outage, each sock.connect() was 
> probably blocked for initLimit, and it was called 5 times.
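
The contract described above (never spend more than initLimit in total) is 
essentially a deadline-bounded retry loop. A generic sketch, with illustrative 
names and an assumed 5-attempt split of the budget:
{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class BoundedConnector {
    // The whole loop is capped by one budget (initLimit), instead of each
    // attempt getting the full budget.
    static Socket connectWithin(InetSocketAddress leaderAddr, long initLimitMs)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + initLimitMs * 1_000_000L;
        IOException lastFailure = null;
        for (int attempt = 0; attempt < 5; attempt++) {
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                break;                                   // budget exhausted, give up
            }
            Socket sock = new Socket();
            try {
                // Per-attempt timeout never exceeds what is left of the budget.
                int timeoutMs = (int) Math.max(1, Math.min(remainingMs, initLimitMs / 5));
                sock.connect(leaderAddr, timeoutMs);
                return sock;
            } catch (IOException e) {
                lastFailure = e;
                sock.close();
                Thread.sleep(Math.min(1000, Math.max(1, remainingMs / 10)));
            }
        }
        throw lastFailure != null ? lastFailure
                : new IOException("could not connect within " + initLimitMs + "ms");
    }
}
{code}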



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1896) Reconfig error messages when upgrading from 3.4.6 to 3.5.0

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695364#comment-16695364
 ] 

Michael K. Edwards commented on ZOOKEEPER-1896:
---

Is this reproducible in current 3.5?

> Reconfig error messages when upgrading from 3.4.6 to 3.5.0
> --
>
> Key: ZOOKEEPER-1896
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1896
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0
>Reporter: Raul Gutierrez Segales
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
>
> When upgrading from 3.4.6 (rc0 actually) to 3.5.0 (trunk as of two weeks ago 
> actually) I got this error message:
> {noformat}
> 2014-02-26 22:12:15,446 - ERROR [WorkerReceiver[myid=4]] - Something went 
> wrong while processing config received from 3
> {noformat}
> According to [~fpj]:
> bq. I think you’re right that the reconfig error is harmless, but we 
> shouldn’t be getting it. The problem is that it is not detecting that we are 
> in backward compatibility mode. We need to fix it for 3.5.0 and perhaps 
> ZOOKEEPER-1805 is the right place for doing it.
> cc: [~shralex]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2019) Unhandled exception when setting invalid limits data in /zookeeper/quota/some/path/zookeeper_limits

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695363#comment-16695363
 ] 

Michael K. Edwards commented on ZOOKEEPER-2019:
---

Is this reproducible in current branch-3.5?

> Unhandled exception when setting invalid limits data in 
> /zookeeper/quota/some/path/zookeeper_limits 
> 
>
> Key: ZOOKEEPER-2019
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2019
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Raul Gutierrez Segales
>Assignee: Raul Gutierrez Segales
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-2019-v2.patch, ZOOKEEPER-2019-v3.patch, 
> ZOOKEEPER-2019-ver1.patch, ZOOKEEPER-2019.patch, ZOOKEEPER-2019.patch
>
>
> If you have quotas properly set for a given path, i.e.:
> {noformat}
> create /zookeeper/quota/test/zookeeper_limits 'count=1,bytes=100'
> create /zookeeper/quota/test/zookeeper_stats 'count=1,bytes=100'
> {noformat}
> and then you update the limits znode with bogus data, i.e.:
> {noformat}
> set /zookeeper/quota/test/zookeeper_limits ''
> {noformat}
> you'll crash the cluster because IllegalArgumentException isn't handled when 
> dealing with quota znodes:
> https://github.com/apache/zookeeper/blob/ZOOKEEPER-823/src/java/main/org/apache/zookeeper/server/DataTree.java#L379
> https://github.com/apache/zookeeper/blob/ZOOKEEPER-823/src/java/main/org/apache/zookeeper/server/DataTree.java#L425
> We should handle IllegalArgumentException. Optionally, we should also throw 
> BadArgumentsException from PrepRequestProcessor. 
> Review Board: https://reviews.apache.org/r/25968/
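
For illustration, the defensive handling being proposed boils down to catching 
IllegalArgumentException (which NumberFormatException extends) around the 
parse and rejecting the request instead of letting the exception escape. A 
hypothetical parser sketch, not the actual DataTree code:
{code}
import java.util.HashMap;
import java.util.Map;

class QuotaLimits {
    final long count;
    final long bytes;

    private QuotaLimits(long count, long bytes) {
        this.count = count;
        this.bytes = bytes;
    }

    // Parses strings of the form "count=1,bytes=100"; returns null on bad input.
    static QuotaLimits parseOrNull(String data) {
        try {
            Map<String, Long> fields = new HashMap<>();
            for (String kv : data.split(",")) {
                String[] parts = kv.split("=");
                if (parts.length != 2) {
                    throw new IllegalArgumentException("bad quota token: " + kv);
                }
                fields.put(parts[0].trim(), Long.parseLong(parts[1].trim()));
            }
            return new QuotaLimits(fields.getOrDefault("count", -1L),
                                   fields.getOrDefault("bytes", -1L));
        } catch (IllegalArgumentException e) {   // NumberFormatException included
            return null;                         // caller can reject with BadArguments
        }
    }
}
{code}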



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695360#comment-16695360
 ] 

Michael K. Edwards commented on ZOOKEEPER-2164:
---

Is this reproducible with the current branch-3.5 code?

> fast leader election keeps failing
> --
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.4.5
>Reporter: Michi Mutsuzaki
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
>
> I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader. 
> When I shut down 2, 1 and 3 keep going back to leader election. Here is what 
> seems to be happening.
> - Both 1 and 3 elect 3 as the leader.
> - 1 receives votes from 3 and itself, and starts trying to connect to 3 as a 
> follower.
> - 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't 
> time out for 5 seconds: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
> - By the time 3 receives votes, 1 has given up trying to connect to 3: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
> I'm using 3.4.5, but it looks like this part of the code hasn't changed for a 
> while, so I'm guessing later versions have the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2202) Cluster crashes when reconfig adds an unreachable observer

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695358#comment-16695358
 ] 

Michael K. Edwards commented on ZOOKEEPER-2202:
---

Should this be a release blocker for 3.5.5?

> Cluster crashes when reconfig adds an unreachable observer
> --
>
> Key: ZOOKEEPER-2202
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2202
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.5.0, 3.6.0
>Reporter: Raul Gutierrez Segales
>Assignee: Raul Gutierrez Segales
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-2202.patch
>
>
> While adding support for reconfig() in Kazoo 
> (https://github.com/python-zk/kazoo/pull/333) I found that the cluster can be 
> crashed if you add an observer whose election port isn't reachable (i.e.: 
> packets for that destination are dropped, not rejected). This will raise a 
> SocketTimeoutException which will bring down the PrepRequestProcessor:
> {code}
> 2015-06-02 14:37:16,473 [myid:3] - WARN  [ProcessThread(sid:3 
> cport:-1)::QuorumCnxManager@384] - Cannot open channel to 100 at election 
> address /8.8.8.8:38703
> java.net.SocketTimeoutException: connect timed out
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
> at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:369)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1288)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1315)
> at org.apache.zookeeper.server.quorum.Leader.propose(Leader.java:1056)
> at 
> org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:78)
> at 
> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:877)
> at 
> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:143)
> {code}
> A simple repro can be obtained by using the code in the referenced pull 
> request above and using 8.8.8.8:3888 (for example) instead of a free (but 
> closed) port in the loopback. 
> I think that adding an Observer (or a Participant) that isn't currently 
> reachable is a valid use case (i.e.: you are provisioning the machine and 
> it's not currently needed) so I think we could handle this with lower connect 
> timeouts, not sure. 
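
A bounded connect timeout plus catching SocketTimeoutException is roughly what 
the last paragraph suggests. A minimal sketch, with an assumed 500ms timeout 
and illustrative names, not the actual QuorumCnxManager code:
{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ObserverDialer {
    private static final int CONNECT_TIMEOUT_MS = 500;   // assumed value

    // Reachability probe: a silently-dropped packet comes back as a bounded
    // timeout instead of an exception escaping into the request pipeline.
    static boolean isReachable(InetSocketAddress electionAddr) {
        try (Socket sock = new Socket()) {
            sock.connect(electionAddr, CONNECT_TIMEOUT_MS);
            return true;
        } catch (SocketTimeoutException e) {
            return false;   // packets dropped: treat as unreachable, don't crash
        } catch (IOException e) {
            return false;   // refused or otherwise failed
        }
    }
}
{code}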



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2307) ZooKeeper not starting because acceptedEpoch is less than the currentEpoch

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695356#comment-16695356
 ] 

Michael K. Edwards commented on ZOOKEEPER-2307:
---

Should this be a release blocker for 3.5.5?

> ZooKeeper not starting because acceptedEpoch is less than the currentEpoch
> --
>
> Key: ZOOKEEPER-2307
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2307
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Mohammad Arshad
>Assignee: Mohammad Arshad
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-2307-01.patch, ZOOKEEPER-2307-02.patch, 
> ZOOKEEPER-2307-03.patch, ZOOKEEPER-2307-04.patch
>
>
> This issue occurred in one of our test environments, where the disk was 
> frequently switching to read-only.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster; let's say the nodes are A, B and C
> # Start A and B. Both A and B start successfully, and the quorum is running.
> # Start C. Because of an IO error, C fails to update the acceptedEpoch file, 
> but C still starts successfully and joins the quorum as a follower
> # Stop C
> # Start C; the exception below, with the message "The accepted epoch, 0 is 
> less than the current epoch, 1", is thrown
> {code}
> 2015-10-29 16:52:32,942 [myid:3] - ERROR [main:QuorumPeer@784] - Unable to 
> load database on disk
> java.io.IOException: The accepted epoch, 0 is less than the current epoch, 1
>   at 
> org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:781)
>   at 
> org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:720)
>   at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:202)
>   at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:139)
>   at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:88)
> 2015-10-29 16:52:32,946 [myid:3] - ERROR [main:QuorumPeerMain@111] - 
> Unexpected exception, exiting abnormally
> java.lang.RuntimeException: Unable to run quorum server 
>   at 
> org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:785)
>   at 
> org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:720)
>   at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:202)
>   at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:139)
>   at 
> org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:88)
> Caused by: java.io.IOException: The accepted epoch, 0 is less than the 
> current epoch, 1
>   at 
> org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:781)
> {code}
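
As a general hardening note (not the actual QuorumPeer code), the usual way to 
keep an epoch file from going stale or half-written under IO errors is to 
write a temp file, fsync it, and atomically rename it into place, e.g.:
{code}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

class EpochFile {
    static void writeAtomically(Path epochFile, long epoch) throws IOException {
        Path tmp = epochFile.resolveSibling(epochFile.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(epoch).getBytes(StandardCharsets.UTF_8));
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ch.force(true);                              // flush the temp file to disk
        }
        // Either the old file or the complete new file is visible, never a
        // partially written one.
        Files.move(tmp, epochFile,
                StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
    }
}
{code}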



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2354) ZOOKEEPER-1653 not merged in master and 3.5 branch

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695354#comment-16695354
 ] 

Michael K. Edwards commented on ZOOKEEPER-2354:
---

Should this be a release blocker for 3.5.5?

> ZOOKEEPER-1653 not merged in master and 3.5 branch
> --
>
> Key: ZOOKEEPER-2354
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2354
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: Mohammad Arshad
>Assignee: Mohammad Arshad
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-2354-01.patch
>
>
> ZOOKEEPER-1653 was merged only to the 3.4 branch. 
> It should be merged to the 3.5 and master branches as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2488) Unsynchronized access to shuttingDownLE in QuorumPeer

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695348#comment-16695348
 ] 

Michael K. Edwards commented on ZOOKEEPER-2488:
---

Looks to me like all cases of the test-and-set/test-and-clear idiom for 
{{shuttingDownLE}} should be wrapped in {{synchronized}} blocks, to ensure 
atomicity and visibility of the change.  I think 
https://github.com/apache/zookeeper/pull/707/commits/3dfd49f6bfea357c838e21d5a2e4f1486ed753e9
 is a sufficient fix.
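
For reference, the idiom in question looks like the sketch below; the field and 
method names mirror the discussion but this is not the actual QuorumPeer code.
{code}
class PeerShutdownFlag {
    private boolean shuttingDownLE = false;

    // Both the check and the update happen under the same lock, so another
    // thread (e.g. restartLeaderElection) sees a consistent value.
    synchronized boolean clearIfSet() {
        if (shuttingDownLE) {
            shuttingDownLE = false;
            return true;       // caller should restart leader election
        }
        return false;
    }

    synchronized void markShuttingDown() {
        shuttingDownLE = true;
    }
}
{code}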

> Unsynchronized access to shuttingDownLE in QuorumPeer
> -
>
> Key: ZOOKEEPER-2488
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2488
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.2
>Reporter: Michael Han
>Assignee: gaoshu
>Priority: Major
> Fix For: 3.6.0, 3.5.5
>
>
> Access to shuttingDownLE in QuorumPeer is not synchronized here:
> https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1066
> https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/
> The access should be synchronized, as the same variable might be accessed 
> in QuorumPeer::restartLeaderElection, which is synchronized.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1818) Fix don't care for trunk

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695269#comment-16695269
 ] 

Michael K. Edwards commented on ZOOKEEPER-1818:
---

I've ported Fangmin's patch to the 3.5 branch, in the interest of getting to 
something testable that has fixes for the outstanding 3.5 release blockers in 
it.  Tests are running in https://github.com/apache/zookeeper/pull/714.

> Fix don't care for trunk
> 
>
> Key: ZOOKEEPER-1818
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1818
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.5.1
>Reporter: Flavio Junqueira
>Assignee: Fangmin Lv
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1818.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See umbrella jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1636) c-client crash when zoo_amulti failed

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695240#comment-16695240
 ] 

Michael K. Edwards commented on ZOOKEEPER-1636:
---

Rebased patch on top of the candidate fix for ZOOKEEPER-2778, to get a green 
build.  See https://github.com/apache/zookeeper/pull/713

> c-client crash when zoo_amulti failed 
> --
>
> Key: ZOOKEEPER-1636
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1636
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.4.3
>Reporter: Thawan Kooburat
>Assignee: Thawan Kooburat
>Priority: Critical
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, 
> ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch
>
>
> deserialize_response for the multi operation doesn't handle the case where 
> the server fails to send back a response (e.g. when the multi packet is too 
> large). The c-client will try to process the completion of all sub-requests as 
> if the operation had succeeded, which will eventually cause a SIGSEGV



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695219#comment-16695219
 ] 

Michael K. Edwards commented on ZOOKEEPER-2778:
---

Legit green build!

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> It's possible to have a deadlock during the recovery phase. 
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest 
> [1]. Here is a sample thread dump that illustrates the state of the 
> execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread that runs the follower 
> doing the sync-with-leader work, and the listener of the qcm of the same 
> quorum peer doing the receive-connection work. Basically, to finish syncing 
> with the leader, the follower needs to synchronize on both QV_LOCK and the 
> qcm object it owns; while in the receiver thread, to finish setting up an 
> incoming connection, the thread needs to synchronize on both the qcm object 
> the quorum peer owns and the same QV_LOCK. It's easy to see that the problem 
> here is that the two locks are acquired in different orders, so depending on 
> timing / actual execution order, the two threads might each end up acquiring 
> one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694994#comment-16694994
 ] 

Michael K. Edwards commented on ZOOKEEPER-2778:
---

Thanks to [~maoling] and [~andorm] for quick review!  [~castuardo], @afine, any 
chance you might be able to take a look at the alternate patch?

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> It's possible to have a deadlock during the recovery phase. 
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest 
> [1]. Here is a sample thread dump that illustrates the state of the 
> execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread that runs the follower 
> doing the sync-with-leader work, and the listener of the qcm of the same 
> quorum peer doing the receive-connection work. Basically, to finish syncing 
> with the leader, the follower needs to synchronize on both QV_LOCK and the 
> qcm object it owns; while in the receiver thread, to finish setting up an 
> incoming connection, the thread needs to synchronize on both the qcm object 
> the quorum peer owns and the same QV_LOCK. It's easy to see that the problem 
> here is that the two locks are acquired in different orders, so depending on 
> timing / actual execution order, the two threads might each end up acquiring 
> one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694401#comment-16694401
 ] 

Michael K. Edwards commented on ZOOKEEPER-2778:
---

I have implemented a more complete fix which (I think) solves all the lock 
ordering and cross-thread visibility issues associated with the QV_LOCK, qcm, 
and address fields.  Testing now; assuming no surprises, I'll push it as an 
update to https://github.com/apache/zookeeper/pull/707.

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> It's possible to have a deadlock during the recovery phase. 
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest 
> [1]. Here is a sample thread dump that illustrates the state of the 
> execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread that runs the follower 
> doing the sync-with-leader work, and the listener of the qcm of the same 
> quorum peer doing the receive-connection work. Basically, to finish syncing 
> with the leader, the follower needs to synchronize on both QV_LOCK and the 
> qcm object it owns; while in the receiver thread, to finish setting up an 
> incoming connection, the thread needs to synchronize on both the qcm object 
> the quorum peer owns and the same QV_LOCK. It's easy to see that the problem 
> here is that the two locks are acquired in different orders, so depending on 
> timing / actual execution order, the two threads might each end up acquiring 
> one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-20 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693898#comment-16693898
 ] 

Michael K. Edwards commented on ZOOKEEPER-2778:
---

From what I'm seeing, it would be a crashing bug for `getQuorumAddress()` 
(which cannot be marked `protected`, because it's called by has-a holders of a 
QuorumPeer reference rather than by is-a subclasses of QuorumPeer, but can and 
should be package-private) to be called before the addresses are set.  The 
only call to `getClientAddress()` (which should be `private`) is in 
`processReconfig()`, and it's appropriate for it to return `null` if called 
early.  This leaves `getElectionAddress()`, which again is pseudo-protected 
and would produce a crash if called before the addresses are set.

So the actual problem here is that, if the election address is not yet known, 
there's no safe return value from `getElectionAddress()` in the race scenario 
cited in the bug description.  This "fix" – hanm's or mine – will turn it into 
an NPE instead of a deadlock.

This might be addressable by ensuring that code that needs the `QV_LOCK` for a 
`QuorumPeer` associated with a `QuorumCnxManager` (to protect the macroscopic 
critical sections in `QuorumCnxManager.connectOne()`, 
`QuorumPeer.setLastSeenQuorumVerifier()`, and `QuorumPeer.setQuorumVerifier()`) 
always takes the lock on the `QuorumCnxManager` instance first.  Looking into 
that.
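
To make the idea concrete, the ordering rule would look something like the toy 
sketch below: every path that needs both locks takes the connection-manager 
lock first, then the quorum-verifier lock. Names are illustrative only, not the 
actual QuorumPeer/QuorumCnxManager code.
{code}
class PeerLocks {
    private final Object cnxManagerLock = new Object();   // stands in for the qcm monitor
    private final Object qvLock = new Object();           // stands in for QV_LOCK

    void receiveConnectionPath(Runnable reconnect) {
        synchronized (cnxManagerLock) {                    // outer lock, always first
            synchronized (qvLock) {                        // inner lock, always second
                reconnect.run();                           // e.g. read the election address
            }
        }
    }

    void syncWithLeaderPath(Runnable updateVerifier) {
        synchronized (cnxManagerLock) {                    // same order on this path too
            synchronized (qvLock) {
                updateVerifier.run();                      // e.g. setLastSeenQuorumVerifier
            }
        }
    }
}
{code}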

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> It's possible to have a deadlock during the recovery phase. 
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest 
> [1]. Here is a sample thread dump that illustrates the state of the 
> execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread that runs the follower 
> doing the sync-with-leader work, and the listener of the qcm of the same 
> quorum peer doing the receive-connection work. Basically, to finish syncing 
> with the leader, the follower needs to synchronize on both QV_LOCK and the 
> qcm object it owns; while in the receiver thread, to finish setting up an 
> incoming connection, the thread needs to synchronize on both the qcm object 
> the quorum peer owns and the same QV_LOCK. It's easy to see that the problem 
> here is that the two locks are acquired in different orders, so depending on 
> timing / actual execution order, the two threads might each end up acquiring 
> one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-20 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693898#comment-16693898
 ] 

Michael K. Edwards edited comment on ZOOKEEPER-2778 at 11/20/18 11:44 PM:
--

From what I'm seeing, it would be a crashing bug for {{getQuorumAddress()}} 
(which cannot be marked {{protected}}, because it's called by has-a holders of 
a {{QuorumPeer}} reference rather than by is-a subclasses of {{QuorumPeer}}, 
but can and should be package-private) to be called before the addresses are 
set.  The only call to {{getClientAddress()}} (which should be {{private}}) is 
in {{processReconfig()}}, and it's appropriate for it to return {{null}} if 
called early.  This leaves {{getElectionAddress()}}, which again is 
pseudo-protected and would produce a crash if called before the addresses are 
set.

So the actual problem here is that, if the election address is not yet known, 
there's no safe return value from {{getElectionAddress()}} in the race scenario 
cited in the bug description.  This "fix" – hanm's or mine – will turn it into 
an NPE instead of a deadlock.

This might be addressable by ensuring that code that needs the {{QV_LOCK}} for 
a {{QuorumPeer}} associated with a {{QuorumCnxManager}} (to protect the 
macroscopic critical sections in {{QuorumCnxManager.connectOne()}}, 
{{QuorumPeer.setLastSeenQuorumVerifier()}}, and 
{{QuorumPeer.setQuorumVerifier()}}) always takes the lock on the 
{{QuorumCnxManager}} instance first.  Looking into that.
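
Roughly the ordering discipline I have in mind, as a self-contained sketch (the 
classes below are illustrative stand-ins, not the real {{QuorumPeer}} / 
{{QuorumCnxManager}} code): both paths take the connection-manager monitor 
first and {{QV_LOCK}} second, so neither thread can hold one lock while waiting 
for the other.
{noformat}
// Illustrative only: stand-in classes showing a consistent lock order,
// not the actual ZooKeeper implementation.
public class LockOrderingSketch {

    static final class Peer {
        final Object qvLock = new Object();   // stand-in for QuorumPeer.QV_LOCK
        final ConnectionManager cnxManager = new ConnectionManager(this);

        // e.g. the follower thread updating the last-seen verifier during sync
        void setLastSeenQuorumVerifier() {
            synchronized (cnxManager) {       // manager monitor FIRST
                synchronized (qvLock) {       // QV_LOCK second
                    // ... record the verifier, connect to new peers ...
                }
            }
        }
    }

    static final class ConnectionManager {
        private final Peer peer;
        ConnectionManager(Peer peer) { this.peer = peer; }

        // e.g. the listener thread handling an incoming connection
        synchronized void connectOne(long sid) {  // manager monitor held here
            synchronized (peer.qvLock) {          // QV_LOCK always second
                // ... look up the peer's election address and connect ...
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Peer peer = new Peer();
        Thread follower = new Thread(peer::setLastSeenQuorumVerifier);
        Thread listener = new Thread(() -> peer.cnxManager.connectOne(1L));
        follower.start();
        listener.start();
        follower.join();
        listener.join();
        System.out.println("both paths completed without deadlock");
    }
}
{noformat}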


was (Author: mkedwards):
From what I'm seeing, it would be a crashing bug for `getQuorumAddress()` 
(which cannot be marked `protected`, because it's called by has-a holders of a 
QuorumPeer reference rather than by is-a subclasses of QuorumPeer, but can and 
should be package-private) to be called before the addresses are set.  The 
only call to `getClientAddress()` (which should be `private`) is in 
`processReconfig()`, and it's appropriate for it to return `null` if called 
early.  This leaves `getElectionAddress()`, which again is pseudo-protected 
and would produce a crash if called before the addresses are set.

So the actual problem here is that, if the election address is not yet known, 
there's no safe return value from `getElectionAddress()` in the race scenario 
cited in the bug description.  This "fix" – hanm's or mine – will turn it into 
an NPE instead of a deadlock.

This might be addressable by ensuring that code that needs the `QV_LOCK` for a 
`QuorumPeer` associated with a `QuorumCnxManager` (to protect the macroscopic 
critical sections in `QuorumCnxManager.connectOne()`, 
`QuorumPeer.setLastSeenQuorumVerifier()`, and `QuorumPeer.setQuorumVerifier()`) 
always takes the lock on the `QuorumCnxManager` instance first.  Looking into 
that.

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> It's possible to have a deadlock during the recovery phase. 
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest 
> [1]. Here is a sample thread dump that illustrates the state of the 
> execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:5

[jira] [Comment Edited] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-20 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693898#comment-16693898
 ] 

Michael K. Edwards edited comment on ZOOKEEPER-2778 at 11/20/18 11:41 PM:
--

From what I'm seeing, it would be a crashing bug for `getQuorumAddress()` 
(which cannot be marked `protected`, because it's called by has-a holders of a 
QuorumPeer reference rather than by is-a subclasses of QuorumPeer, but can and 
should be package-private) to be called before the addresses are set.  The 
only call to `getClientAddress()` (which should be `private`) is in 
`processReconfig()`, and it's appropriate for it to return `null` if called 
early.  This leaves `getElectionAddress()`, which again is pseudo-protected 
and would produce a crash if called before the addresses are set.

So the actual problem here is that, if the election address is not yet known, 
there's no safe return value from `getElectionAddress()` in the race scenario 
cited in the bug description.  This "fix" – hanm's or mine – will turn it into 
an NPE instead of a deadlock.

This might be addressable by ensuring that code that needs the `QV_LOCK` for a 
`QuorumPeer` associated with a `QuorumCnxManager` (to protect the macroscopic 
critical sections in `QuorumCnxManager.connectOne()`, 
`QuorumPeer.setLastSeenQuorumVerifier()`, and `QuorumPeer.setQuorumVerifier()`) 
always takes the lock on the `QuorumCnxManager` instance first.  Looking into 
that.


was (Author: mkedwards):
From what I'm seeing, it would be a crashing bug for `getQuorumAddress()` 
(which cannot be marked `protected`, because it's called by has-a holders of a 
QuorumPeer reference rather than by is-a subclasses of QuorumPeer, but can and 
should be package-private) to be called before the addresses are set.  The 
only call to `getClientAddress()` (which should be `private`) is in 
`processReconfig()`, and it's appropriate for it to return `null` if called 
early.  This leaves `getElectionAddress()`, which again is pseudo-protected 
and would produce a crash if called before the addresses are set.

So the actual problem here is that, if the election address is not yet known, 
there's no safe return value from `getElectionAddress()` in the race scenario 
cited in the bug description.  This "fix" – hanm's or mine – will turn it into 
an NPE instead of a deadlock.

This might be addressable by ensuring that code that needs the `QV_LOCK` for a 
`QuorumPeer` associated with a `QuorumCnxManager` (to protect the macroscopic 
critical sections in `QuorumCnxManager.connectOne()`, 
`QuorumPeer.setLastSeenQuorumVerifier()`, and `QuorumPeer.setQuorumVerifier()`) 
always takes the lock on the `QuorumCnxManager` instance first.  Looking into 
that.

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> It's possible to have a deadlock during the recovery phase. 
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest 
> [1]. Here is a sample thread dump that illustrates the state of the 
> execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apa

[jira] [Commented] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-20 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693530#comment-16693530
 ] 

Michael K. Edwards commented on ZOOKEEPER-2778:
---

Is there a need to add a reader/writer lock that prevents read access to these 
addresses until they have been written for the first time?  I haven't yet 
looked closely enough at the code to see whether that's a possible scenario.
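
To make the question concrete, the kind of gate I mean would look something 
like the sketch below (names are made up, and it uses a {{CountDownLatch}} 
rather than a literal reader/writer lock, but the effect is what the question 
describes: reads block until the first write has happened).
{noformat}
import java.net.InetSocketAddress;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only; not proposing this exact class for QuorumPeer.
public final class FirstWriteGate {
    private final AtomicReference<InetSocketAddress> addr = new AtomicReference<>();
    private final CountDownLatch published = new CountDownLatch(1);

    // Publish (or re-publish) the address; the first call unblocks any waiting readers.
    public void set(InetSocketAddress value) {
        addr.set(value);
        published.countDown();
    }

    // Block until the address has been written at least once, then return the latest value.
    public InetSocketAddress get() throws InterruptedException {
        published.await();
        return addr.get();
    }
}
{noformat}
That would trade the current race for readers parking until the addresses 
exist, which may or may not be acceptable in the listener thread.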

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> It's possible to have a deadlock during the recovery phase. 
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest 
> [1]. Here is a sample thread dump that illustrates the state of the 
> execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread that runs the follower 
> (doing the sync-with-leader work) and the listener thread of that peer's 
> QuorumCnxManager (qcm), which handles incoming connections. To finish syncing 
> with the leader, the follower must synchronize on both QV_LOCK and the qcm 
> object it owns; meanwhile, to finish setting up an incoming connection, the 
> receiver thread must synchronize on both the qcm object the quorum peer owns 
> and the same QV_LOCK. The problem is that the two locks are acquired in 
> different orders, so depending on timing and the actual execution order, the 
> two threads can each end up acquiring one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-20 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693231#comment-16693231
 ] 

Michael K. Edwards edited comment on ZOOKEEPER-2778 at 11/20/18 1:35 PM:
-

May I suggest a different approach?  There are three fragments of data here 
(myQuorumAddr, myClientAddr, and myElectionAddr) that should be 1) updated 
atomically as a group, and 2) aggressively made visible to concurrent threads 
on other CPUs.  There isn't really a need to lock out access to them while 
other code that holds QV_LOCK runs.  Seems like an ideal candidate for an 
AtomicReference to an immutable POJO that holds the three addresses.  Suggested 
patch in https://github.com/apache/zookeeper/pull/707
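
Stripped down to the pattern, it's something like the sketch below 
({{AddressTuple}} and the holder class are illustrative stand-ins; the actual 
change to {{QuorumPeer}} is in the pull request above).
{noformat}
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative stand-in for the QuorumPeer fields, not the real class.
public final class AddressHolder {

    // Immutable snapshot of the three addresses.
    public static final class AddressTuple {
        public final InetSocketAddress quorumAddr;
        public final InetSocketAddress electionAddr;
        public final InetSocketAddress clientAddr;

        public AddressTuple(InetSocketAddress quorumAddr,
                            InetSocketAddress electionAddr,
                            InetSocketAddress clientAddr) {
            this.quorumAddr = quorumAddr;
            this.electionAddr = electionAddr;
            this.clientAddr = clientAddr;
        }
    }

    private final AtomicReference<AddressTuple> myAddrs = new AtomicReference<>();

    // All three addresses become visible to other threads as one atomic unit;
    // neither writers nor readers hold a lock.
    public void setAddrs(InetSocketAddress quorumAddr,
                         InetSocketAddress electionAddr,
                         InetSocketAddress clientAddr) {
        myAddrs.set(new AddressTuple(quorumAddr, electionAddr, clientAddr));
    }

    // Note: a getter invoked before the first setAddrs() sees a null tuple and
    // throws a NullPointerException.
    public InetSocketAddress getQuorumAddress()   { return myAddrs.get().quorumAddr; }
    public InetSocketAddress getElectionAddress() { return myAddrs.get().electionAddr; }
    public InetSocketAddress getClientAddress()   { return myAddrs.get().clientAddr; }
}
{noformat}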


was (Author: mkedwards):
May I suggest a different approach?  There are three fragments of data here 
(myQuorumAddr, myClientAddr, and myElectionAddr) that should be 1) updated 
atomically as a group, and 2) aggressively made visible to concurrent threads 
on other CPUs.  There isn't really a need to lock out access to them while 
other code that holds QV_LOCK runs.  Seems like an ideal candidate for an 
AtomicReference to an immutable POJO that holds the three addresses.  Suggested 
patch attached.


{noformat}
diff --git a/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java b/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java
index 0d8a012..7bc8ea6 100644
--- a/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java
+++ b/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java
@@ -42,6 +42,7 @@
 import java.util.Properties;
 import java.util.Set;
 import java.util.concurrent.atomic.AtomicInteger;
+import java.util.concurrent.atomic.AtomicReference;

 import javax.security.sasl.SaslException;

@@ -121,6 +122,18 @@
      */
     private ZKDatabase zkDb;

+    public static class AddressTuple {
+        public final InetSocketAddress quorumAddr;
+        public final InetSocketAddress electionAddr;
+        public final InetSocketAddress clientAddr;
+
+        public AddressTuple(InetSocketAddress quorumAddr, InetSocketAddress electionAddr, InetSocketAddress clientAddr) {
+            this.quorumAddr = quorumAddr;
+            this.electionAddr = electionAddr;
+            this.clientAddr = clientAddr;
+        }
+    }
+
     public static class QuorumServer {
         public InetSocketAddress addr = null;

@@ -723,16 +736,14 @@ public synchronized ServerState getPeerState(){
     DatagramSocket udpSocket;

-    private InetSocketAddress myQuorumAddr;
-    private InetSocketAddress myElectionAddr = null;
-    private InetSocketAddress myClientAddr = null;
+    private final AtomicReference<AddressTuple> myAddrs = new AtomicReference<>();

     /**
      * Resolves hostname for a given server ID.
      *
      * This method resolves hostname for a given server ID in both quorumVerifer
      * and lastSeenQuorumVerifier. If the server ID matches the local server ID,
-     * it also updates myQuorumAddr and myElectionAddr.
+     * it also updates myAddrs.
      */
     public void recreateSocketAddresses(long id) {
         QuorumVerifier qv = getQuorumVerifier();
@@ -741,8 +752,7 @@ public void recreateSocketAddresses(long id) {
         if (qs != null) {
             qs.recreateSocketAddresses();
             if (id == getId()) {
-                setQuorumAddress(qs.addr);
-                setElectionAddress(qs.electionAddr);
+                setAddrs(qs.addr, qs.electionAddr, qs.clientAddr);
             }
         }
@@ -756,39 +766,19 @@ public void recreateSocketAddresses(long id) {
     }

     public InetSocketAddress getQuorumAddress(){
-        synchronized (QV_LOCK) {
-            return myQuorumAddr;
-        }
+        return myAddrs.get().quorumAddr;
     }

-    public void setQuorumAddress(InetSocketAddress addr){
-        synchronized (QV_LOCK) {
-            myQuorumAddr = addr;
-        }
-    }
-
     public InetSocketAddress getElectionAddress(){
-        synchronized (QV_LOCK) {
-            return myElectionAddr;
-        }
+        return myAddrs.get().electionAddr;
     }

-    public void setElectionAddress(InetSocketAddress addr){
-        synchronized (QV_LOCK) {
-            myElectionAddr = addr;
-        }
-    }
-
     public InetSocketAddress getClientAddress(){
-        synchronized (QV_LOCK) {
-            return myClientAddr;
-        }
+        return myAddrs.get().clientAddr;
     }

-    public void setClientAddress(InetSocketAddress addr){
-        synchronized (QV_LOCK) {
-            myClientAddr = addr;
-        }
+    public void setAddrs(InetSocketAddress quorumAddr, InetSocketAddress electionAddr, InetSocketAddress clientAddr) {
+        myAddrs.set(new AddressTuple(quorumAddr, electionAddr, clientAddr));
     }

     private int electionType;
@@ -953,7 +943,7 @@ synchronized public void startLeaderElection() {
         //
         if (electionType == 0) {
             try {
-                udpSocket = new DatagramSocket(myQuorumAddr.getP
{noformat}

[jira] [Comment Edited] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-20 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693231#comment-16693231
 ] 

Michael K. Edwards edited comment on ZOOKEEPER-2778 at 11/20/18 1:30 PM:
-

May I suggest a different approach?  There are three fragments of data here 
(myQuorumAddr, myClientAddr, and myElectionAddr) that should be 1) updated 
atomically as a group, and 2) aggressively made visible to concurrent threads 
on other CPUs.  There isn't really a need to lock out access to them while 
other code that holds QV_LOCK runs.  Seems like an ideal candidate for an 
AtomicReference to an immutable POJO that holds the three addresses.  Suggested 
patch attached.


{noformat}
diff --git a/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java b/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java
index 0d8a012..7bc8ea6 100644
--- a/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java
+++ b/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java
@@ -42,6 +42,7 @@
 import java.util.Properties;
 import java.util.Set;
 import java.util.concurrent.atomic.AtomicInteger;
+import java.util.concurrent.atomic.AtomicReference;

 import javax.security.sasl.SaslException;

@@ -121,6 +122,18 @@
      */
     private ZKDatabase zkDb;

+    public static class AddressTuple {
+        public final InetSocketAddress quorumAddr;
+        public final InetSocketAddress electionAddr;
+        public final InetSocketAddress clientAddr;
+
+        public AddressTuple(InetSocketAddress quorumAddr, InetSocketAddress electionAddr, InetSocketAddress clientAddr) {
+            this.quorumAddr = quorumAddr;
+            this.electionAddr = electionAddr;
+            this.clientAddr = clientAddr;
+        }
+    }
+
     public static class QuorumServer {
         public InetSocketAddress addr = null;

@@ -723,16 +736,14 @@ public synchronized ServerState getPeerState(){
     DatagramSocket udpSocket;

-    private InetSocketAddress myQuorumAddr;
-    private InetSocketAddress myElectionAddr = null;
-    private InetSocketAddress myClientAddr = null;
+    private final AtomicReference<AddressTuple> myAddrs = new AtomicReference<>();

     /**
      * Resolves hostname for a given server ID.
      *
      * This method resolves hostname for a given server ID in both quorumVerifer
      * and lastSeenQuorumVerifier. If the server ID matches the local server ID,
-     * it also updates myQuorumAddr and myElectionAddr.
+     * it also updates myAddrs.
      */
     public void recreateSocketAddresses(long id) {
         QuorumVerifier qv = getQuorumVerifier();
@@ -741,8 +752,7 @@ public void recreateSocketAddresses(long id) {
         if (qs != null) {
             qs.recreateSocketAddresses();
             if (id == getId()) {
-                setQuorumAddress(qs.addr);
-                setElectionAddress(qs.electionAddr);
+                setAddrs(qs.addr, qs.electionAddr, qs.clientAddr);
             }
         }
@@ -756,39 +766,19 @@ public void recreateSocketAddresses(long id) {
     }

     public InetSocketAddress getQuorumAddress(){
-        synchronized (QV_LOCK) {
-            return myQuorumAddr;
-        }
+        return myAddrs.get().quorumAddr;
     }

-    public void setQuorumAddress(InetSocketAddress addr){
-        synchronized (QV_LOCK) {
-            myQuorumAddr = addr;
-        }
-    }
-
     public InetSocketAddress getElectionAddress(){
-        synchronized (QV_LOCK) {
-            return myElectionAddr;
-        }
+        return myAddrs.get().electionAddr;
     }

-    public void setElectionAddress(InetSocketAddress addr){
-        synchronized (QV_LOCK) {
-            myElectionAddr = addr;
-        }
-    }
-
     public InetSocketAddress getClientAddress(){
-        synchronized (QV_LOCK) {
-            return myClientAddr;
-        }
+        return myAddrs.get().clientAddr;
     }

-    public void setClientAddress(InetSocketAddress addr){
-        synchronized (QV_LOCK) {
-            myClientAddr = addr;
-        }
+    public void setAddrs(InetSocketAddress quorumAddr, InetSocketAddress electionAddr, InetSocketAddress clientAddr) {
+        myAddrs.set(new AddressTuple(quorumAddr, electionAddr, clientAddr));
     }

     private int electionType;
@@ -953,7 +943,7 @@ synchronized public void startLeaderElection() {
         //
         if (electionType == 0) {
             try {
-                udpSocket = new DatagramSocket(myQuorumAddr.getPort());
+                udpSocket = new DatagramSocket(getQuorumAddress().getPort());
                 responder = new ResponderThread();
                 responder.start();
             } catch (SocketException e) {
@@ -1631,9 +1621,7 @@ public QuorumVerifier setQuorumVerifier(QuorumVerifier qv, boolean writeToDisk){
         }
         QuorumServer qs = qv.getAllMembers().get(getId());
         if (qs != null) {
-            setQuorumAddress(qs.addr);
-            setElectionAddress(qs.electionAddr);
-            setClientAddress(qs.clientAddr);
+            setAddrs(qs.addr, qs.electionAddr, qs.clientAddr);
         }
         return prevQV;
     }
{noformat}

[jira] [Commented] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2018-11-20 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693231#comment-16693231
 ] 

Michael K. Edwards commented on ZOOKEEPER-2778:
---

May I suggest a different approach?  There are three fragments of data here 
(myQuorumAddr, myClientAddr, and myElectionAddr) that should be 1) updated 
atomically as a group, and 2) aggressively made visible to concurrent threads 
on other CPUs.  There isn't really a need to lock out access to them while 
other code that holds QV_LOCK runs.  Seems like an ideal candidate for an 
AtomicReference to an immutable POJO that holds the three addresses.  Suggested 
patch attached.

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: maoling
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.6.0, 3.5.5
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> It's possible to have a deadlock during the recovery phase. 
> Found this issue by analyzing thread dumps of the "flaky" ReconfigRecoveryTest 
> [1]. Here is a sample thread dump that illustrates the state of the 
> execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread that runs the follower 
> (doing the sync-with-leader work) and the listener thread of that peer's 
> QuorumCnxManager (qcm), which handles incoming connections. To finish syncing 
> with the leader, the follower must synchronize on both QV_LOCK and the qcm 
> object it owns; meanwhile, to finish setting up an incoming connection, the 
> receiver thread must synchronize on both the qcm object the quorum peer owns 
> and the same QV_LOCK. The problem is that the two locks are acquired in 
> different orders, so depending on timing and the actual execution order, the 
> two threads can each end up acquiring one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)