[jira] [Resolved] (ZOOKEEPER-1907) Improve Thread handling

2015-12-17 Thread Flavio Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Junqueira resolved ZOOKEEPER-1907.
-
Resolution: Fixed

[~rakeshr] you're right, they were committed at different times. I think you 
preferred to fix this issue in the other jira, and I'm fine with that, so 
let's close this one.

> Improve Thread handling
> ---
>
> Key: ZOOKEEPER-1907
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1907
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 3.5.0
>Reporter: Rakesh R
>Assignee: Rakesh R
> Fix For: 3.6.0, 3.5.1, 3.4.7
>
> Attachments: ZOOKEEPER-1907-br-3-4.patch, ZOOKEEPER-1907.patch, 
> ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, 
> ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, 
> ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, 
> ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch
>
>
> The server has many critical threads running and coordinating with each other, 
> such as the RequestProcessor chains. Going through these threads, most of them 
> have a similar structure:
> {code}
> public void run() {
>     try {
>         while (running) {
>             // processing logic
>         }
>     } catch (InterruptedException e) {
>         LOG.error("Unexpected interruption", e);
>     } catch (Exception e) {
>         LOG.error("Unexpected exception", e);
>     }
>     LOG.info("...exited loop!");
> }
> {code}
> From this design, there is a chance of the thread exiting silently after 
> swallowing an exception. If this happens in production, the server would hang 
> forever and would not be able to fulfill its role, and it is hard for a 
> management tool to detect this.
> The idea of this JIRA is to discuss and improve the thread handling.
> Reference: [Community discussion 
> thread|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201403.mbox/%3cc2496325850aa74c92aaf83aa9662d26458a1...@szxeml561-mbx.china.huawei.com%3E]
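
For illustration, a minimal sketch of the direction such an improvement can take (ExitListener and CriticalThread are illustrative names, not the classes actually added by the patch): instead of merely logging and falling off the end of run(), the thread reports any unexpected exit so the server, or a management tool, can react.

{code}
// Hedged sketch: "ExitListener" and "CriticalThread" are illustrative names.
interface ExitListener {
    void threadExited(String threadName, Throwable cause);
}

class CriticalThread extends Thread {
    private final ExitListener listener;
    private volatile boolean running = true;

    CriticalThread(String name, ExitListener listener) {
        super(name);
        this.listener = listener;
    }

    @Override
    public void run() {
        try {
            while (running) {
                Thread.sleep(10);   // placeholder for the real processing logic
            }
        } catch (Throwable t) {
            // Do not swallow the failure silently: report the unexpected exit
            // so the server (or a monitoring tool) can shut down or alert.
            listener.threadExited(getName(), t);
        }
    }

    void shutdown() {
        running = false;
    }
}
{code}

A caller could, for example, register a listener that shuts the server down or raises an alert when a critical thread dies, instead of leaving a half-alive process behind.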



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2347) Deadlock shutting down zookeeper

2015-12-17 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061765#comment-15061765
 ] 

Flavio Junqueira commented on ZOOKEEPER-2347:
-

To be consistent, I'm reposting the comment I made in the other jira here. 

bq. We have made requestsInProcess an AtomicInteger in ZOOKEEPER-1504, removing 
the synchronization of the decInProcess method. We should just make the same 
change here for the 3.4 branch.
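
As a rough sketch of the change described above (illustrative only, not the actual ZooKeeperServer code; only the counter is shown), switching the counter to an AtomicInteger means decInProcess() no longer needs to acquire the server's monitor:

{code}
import java.util.concurrent.atomic.AtomicInteger;

class RequestCounter {
    // Before: a plain int guarded by synchronized methods, so decInProcess()
    // had to take the same monitor that shutdown() holds.
    private final AtomicInteger requestsInProcess = new AtomicInteger(0);

    void incInProcess() {
        requestsInProcess.incrementAndGet();
    }

    void decInProcess() {
        // Lock-free decrement: no longer blocks behind a synchronized
        // shutdown() that holds the monitor while joining other threads.
        requestsInProcess.decrementAndGet();
    }

    int getInProcess() {
        return requestsInProcess.get();
    }
}
{code}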


> Deadlock shutting down zookeeper
> 
>
> Key: ZOOKEEPER-2347
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2347
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.7
>Reporter: Ted Yu
>Assignee: Rakesh R
>Priority: Blocker
> Fix For: 3.4.8
>
> Attachments: ZOOKEEPER-2347-br-3.4.patch, testSplitLogManager.stack
>
>
> HBase recently upgraded to ZooKeeper 3.4.7.
> In one of the tests, TestSplitLogManager, there is a reproducible hang at the 
> end of the test.
> Below is a snippet from the stack trace related to ZooKeeper:
> {code}
> "main-EventThread" daemon prio=5 tid=0x7fd27488a800 nid=0x6f1f waiting on 
> condition [0x00011834b000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c5b8d3a0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main-SendThread(localhost:59510)" daemon prio=5 tid=0x7fd274eb4000 
> nid=0x9513 waiting on condition [0x000118042000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
>   at 
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> "SyncThread:0" prio=5 tid=0x7fd274d02000 nid=0x730f waiting for monitor 
> entry [0x0001170ac000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
>   - waiting to lock <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> "main-EventThread" daemon prio=5 tid=0x7fd2753a3800 nid=0x711b waiting on 
> condition [0x000117a3]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c9b106b8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main" prio=5 tid=0x7fd27600 nid=0x1903 in Object.wait() 
> [0x000108aa1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1281)
>   - locked <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1355)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:213)
>   at 
> org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:770)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:478)
>   - locked <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.shutdown(NIOServerCnxnFactory.java:266)
>   at 
> org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.shutdown(MiniZooKeeperCluster.java:301)
> {code}
> Note the address (0x0007c5b66400) in the last hunk which seems to 
> indicate some form of deadlock.
> According to Camille Fournier:
> We made shutdown synchronized. But decrementing the requests is
> also synchronized and called from a different thread. So yeah, deadlock.
> This came in with ZOOKEEPER-1907
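
For illustration, a minimal standalone sketch of that deadlock pattern (class and method names are illustrative, not the actual ZooKeeper code): one thread holds the object's monitor inside a synchronized shutdown() while joining a worker, and the worker is blocked trying to enter a synchronized method on the same object.

{code}
class Server {
    private int requestsInProcess = 1;

    private final Thread worker = new Thread(() -> {
        try {
            Thread.sleep(100);        // simulate finishing a request
        } catch (InterruptedException ignored) {
        }
        decInProcess();               // blocks: monitor is held by shutdown()
    });

    void start() {
        worker.start();
    }

    synchronized void decInProcess() {
        requestsInProcess--;
    }

    synchronized void shutdown() throws InterruptedException {
        worker.join();                // waits forever while holding the monitor
    }

    public static void main(String[] args) throws InterruptedException {
        Server s = new Server();
        s.start();
        s.shutdown();                 // hangs: join() never returns
    }
}
{code}

Thread.join() only releases the worker Thread's own monitor, not the Server monitor held by shutdown(), so neither thread can make progress.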

[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster

2015-12-17 Thread David Lao (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15062554#comment-15062554
 ] 

David Lao commented on ZOOKEEPER-2104:
--

Unfortunately I've lost a member server and its snapshots, and upon restarting 
the issue is no longer reproducible. I'll keep an eye on this and provide an 
update as appropriate. Thanks for taking a look. 

> Sudden crash of all nodes in the cluster
> 
>
> Key: ZOOKEEPER-2104
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.6
>Reporter: Benjamin Jaton
> Attachments: zookeeper-errors.txt, zookeeper-warns.txt
>
>
> In a 3-node ensemble, suddenly all the nodes seem to fail, displaying 
> "ZooKeeper is not running" messages.
> No retry seems to happen after that.
> This is a request to understand what happened and probably to improve the logs 
> when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN  [SyncThread:1:FileTxnLog@321] - 
> fsync-ing the write ahead log in SyncThread:1 took 11024ms which will 
> adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN  
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when 
> following the leader
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
> at 
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:23,492 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:24,060 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> NODE2:
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN  
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when 
> following the leader
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
> at 
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:22,801 [myid:3] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:22,886 [myid:3] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> NODE3 (leader):
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN  
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing 
> connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN  
> [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - *** GOODBYE 
> /204.53.107.249:43402 
> 2015-01-04 16:18:21,905 [myid:2] - WARN  
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing 
> connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,907 [myid:2] - WARN  
> [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - *** GOODBYE 
> /204.5

[jira] [Commented] (ZOOKEEPER-2347) Deadlock shutting down zookeeper

2015-12-17 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15062994#comment-15062994
 ] 

Ted Yu commented on ZOOKEEPER-2347:
---

I'm not sure how I can test this with hbase unit test(s).

As far as I know, zookeeper still uses ant to build, while the hbase dependency is 
expressed through maven.

> Deadlock shutting down zookeeper
> 
>
> Key: ZOOKEEPER-2347
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2347
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.7
>Reporter: Ted Yu
>Assignee: Rakesh R
>Priority: Blocker
> Fix For: 3.4.8
>
> Attachments: ZOOKEEPER-2347-br-3.4.patch, testSplitLogManager.stack
>
>
> HBase recently upgraded to ZooKeeper 3.4.7.
> In one of the tests, TestSplitLogManager, there is a reproducible hang at the 
> end of the test.
> Below is a snippet from the stack trace related to ZooKeeper:
> {code}
> "main-EventThread" daemon prio=5 tid=0x7fd27488a800 nid=0x6f1f waiting on 
> condition [0x00011834b000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c5b8d3a0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main-SendThread(localhost:59510)" daemon prio=5 tid=0x7fd274eb4000 
> nid=0x9513 waiting on condition [0x000118042000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
>   at 
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> "SyncThread:0" prio=5 tid=0x7fd274d02000 nid=0x730f waiting for monitor 
> entry [0x0001170ac000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
>   - waiting to lock <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> "main-EventThread" daemon prio=5 tid=0x7fd2753a3800 nid=0x711b waiting on 
> condition [0x000117a3]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c9b106b8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main" prio=5 tid=0x7fd27600 nid=0x1903 in Object.wait() 
> [0x000108aa1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1281)
>   - locked <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1355)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:213)
>   at 
> org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:770)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:478)
>   - locked <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.shutdown(NIOServerCnxnFactory.java:266)
>   at 
> org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.shutdown(MiniZooKeeperCluster.java:301)
> {code}
> Note the address (0x0007c5b66400) in the last hunk which seems to 
> indicate some form of deadlock.
> According to Camille Fournier:
> We made shutdown synchronized. But decrementing the requests is
> also synchronized and called from a different thread. So yeah, deadlock.
> This came in with ZOOKEEPER-1907



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (ZOOKEEPER-2347) Deadlock shutting down zookeeper

2015-12-17 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063059#comment-15063059
 ] 

Chris Nauroth commented on ZOOKEEPER-2347:
--

bq. As far as I know, zookeeper still uses ant to build while hbase dependency 
is expressed through maven.

Hi Ted.  The Ant build has a {{mvn-install}} target.  If you're interested in 
testing with HBase, then I think you could get the current branch-3.4 ZooKeeper 
code, apply the patch, run {{ant mvn-install}} to install a 3.4.8-SNAPSHOT 
build to your local repository, and then set up your HBase build to link 
against ZooKeeper 3.4.8-SNAPSHOT.

> Deadlock shutting down zookeeper
> 
>
> Key: ZOOKEEPER-2347
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2347
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.7
>Reporter: Ted Yu
>Assignee: Rakesh R
>Priority: Blocker
> Fix For: 3.4.8
>
> Attachments: ZOOKEEPER-2347-br-3.4.patch, testSplitLogManager.stack
>
>
> HBase recently upgraded to ZooKeeper 3.4.7.
> In one of the tests, TestSplitLogManager, there is a reproducible hang at the 
> end of the test.
> Below is a snippet from the stack trace related to ZooKeeper:
> {code}
> "main-EventThread" daemon prio=5 tid=0x7fd27488a800 nid=0x6f1f waiting on 
> condition [0x00011834b000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c5b8d3a0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main-SendThread(localhost:59510)" daemon prio=5 tid=0x7fd274eb4000 
> nid=0x9513 waiting on condition [0x000118042000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
>   at 
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> "SyncThread:0" prio=5 tid=0x7fd274d02000 nid=0x730f waiting for monitor 
> entry [0x0001170ac000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
>   - waiting to lock <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> "main-EventThread" daemon prio=5 tid=0x7fd2753a3800 nid=0x711b waiting on 
> condition [0x000117a3]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c9b106b8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main" prio=5 tid=0x7fd27600 nid=0x1903 in Object.wait() 
> [0x000108aa1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1281)
>   - locked <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1355)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:213)
>   at 
> org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:770)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:478)
>   - locked <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.shutdown(NIOServerCnxnFactory.java:266)
>   at 
> org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.shutdown(MiniZooKeeperCluster.java:301)
> {code}
> Note the address (0x0007c5b66400) in the last hunk which seems to 
> indicate some form of deadlock.
> According to Camille Fournier:
> We made shutdown synchronized. But decrementing the requests is
> also synchronized and called from a different thread. So yeah, deadlock.
> This came in with ZOOKEEPER-1907

[jira] [Commented] (ZOOKEEPER-2347) Deadlock shutting down zookeeper

2015-12-17 Thread Jason Rosenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063076#comment-15063076
 ] 

Jason Rosenberg commented on ZOOKEEPER-2347:


What are the conditions that trigger this issue?  We've been running with 3.4.7 
and so far have not seen any dead-locks with routine server shutdowns, or with 
tests.  Trying to judge whether we should revert or not.

> Deadlock shutting down zookeeper
> 
>
> Key: ZOOKEEPER-2347
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2347
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.7
>Reporter: Ted Yu
>Assignee: Rakesh R
>Priority: Blocker
> Fix For: 3.4.8
>
> Attachments: ZOOKEEPER-2347-br-3.4.patch, testSplitLogManager.stack
>
>
> HBase recently upgraded to ZooKeeper 3.4.7.
> In one of the tests, TestSplitLogManager, there is a reproducible hang at the 
> end of the test.
> Below is a snippet from the stack trace related to ZooKeeper:
> {code}
> "main-EventThread" daemon prio=5 tid=0x7fd27488a800 nid=0x6f1f waiting on 
> condition [0x00011834b000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c5b8d3a0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main-SendThread(localhost:59510)" daemon prio=5 tid=0x7fd274eb4000 
> nid=0x9513 waiting on condition [0x000118042000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
>   at 
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> "SyncThread:0" prio=5 tid=0x7fd274d02000 nid=0x730f waiting for monitor 
> entry [0x0001170ac000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
>   - waiting to lock <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> "main-EventThread" daemon prio=5 tid=0x7fd2753a3800 nid=0x711b waiting on 
> condition [0x000117a3]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c9b106b8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main" prio=5 tid=0x7fd27600 nid=0x1903 in Object.wait() 
> [0x000108aa1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1281)
>   - locked <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1355)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:213)
>   at 
> org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:770)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:478)
>   - locked <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.shutdown(NIOServerCnxnFactory.java:266)
>   at 
> org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.shutdown(MiniZooKeeperCluster.java:301)
> {code}
> Note the address (0x0007c5b66400) in the last hunk which seems to 
> indicate some form of deadlock.
> According to Camille Fournier:
> We made shutdown synchronized. But decrementing the requests is
> also synchronized and called from a different thread. So yeah, deadlock.
> This came in with ZOOKEEPER-1907

[jira] [Commented] (ZOOKEEPER-2347) Deadlock shutting down zookeeper

2015-12-17 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063086#comment-15063086
 ] 

Ted Yu commented on ZOOKEEPER-2347:
---

Thanks for the pointer, Chris.

I ran TestSplitLogManager twice after modifying pom.xml, and it passed both times. 
Previously the test hung quite reliably on Mac.



> Deadlock shutting down zookeeper
> 
>
> Key: ZOOKEEPER-2347
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2347
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.7
>Reporter: Ted Yu
>Assignee: Rakesh R
>Priority: Blocker
> Fix For: 3.4.8
>
> Attachments: ZOOKEEPER-2347-br-3.4.patch, testSplitLogManager.stack
>
>
> HBase recently upgraded to ZooKeeper 3.4.7.
> In one of the tests, TestSplitLogManager, there is a reproducible hang at the 
> end of the test.
> Below is a snippet from the stack trace related to ZooKeeper:
> {code}
> "main-EventThread" daemon prio=5 tid=0x7fd27488a800 nid=0x6f1f waiting on 
> condition [0x00011834b000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c5b8d3a0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main-SendThread(localhost:59510)" daemon prio=5 tid=0x7fd274eb4000 
> nid=0x9513 waiting on condition [0x000118042000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
>   at 
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> "SyncThread:0" prio=5 tid=0x7fd274d02000 nid=0x730f waiting for monitor 
> entry [0x0001170ac000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
>   - waiting to lock <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> "main-EventThread" daemon prio=5 tid=0x7fd2753a3800 nid=0x711b waiting on 
> condition [0x000117a3]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c9b106b8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main" prio=5 tid=0x7fd27600 nid=0x1903 in Object.wait() 
> [0x000108aa1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1281)
>   - locked <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1355)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:213)
>   at 
> org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:770)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:478)
>   - locked <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.shutdown(NIOServerCnxnFactory.java:266)
>   at 
> org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.shutdown(MiniZooKeeperCluster.java:301)
> {code}
> Note the address (0x0007c5b66400) in the last hunk which seems to 
> indicate some form of deadlock.
> According to Camille Fournier:
> We made shutdown synchronized. But decrementing the requests is
> also synchronized and called from a different thread. So yeah, deadlock.
> This came in with ZOOKEEPER-1907



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: ZooKeeperServer#shutdown hangs

2015-12-17 Thread Ted Yu
Jason:
See the following test which revealed the deadlock scenario:

https://github.com/apache/hbase/blob/master/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java

On Jenkins, the hbase build has been flaky: sometimes the above test hung
and sometimes it passed.

I tend to think that this bug should be fixed for production systems.

Cheers

On Thu, Dec 17, 2015 at 3:33 PM, Jason Rosenberg  wrote:

> Curious if there are specific scenarios which trigger this issue.  So far
> we have not seen it where we've upgraded.  We have many tests in continuous
> integration that embed zookeeper servers, and so far haven't seen any
> issues.
>
> Jason
>
> On Wed, Dec 16, 2015 at 6:01 PM, Ted Yu  wrote:
>
> > Thanks, Flavio.
> >
> > When 3.4.8 RC comes out, I will give it a spin.
> >
> > Cheers
> >
> > On Wed, Dec 16, 2015 at 2:59 PM, Flavio Junqueira 
> wrote:
> >
> > > This is bad, we should fix it and release 3.4.8 soon. With the holidays
> > > and such, we won't be able to produce an RC and vote, so I suggest we
> > > target early Jan. In the meanwhile, I'd suggest users to not move to
> > 3.4.7.
> > >
> > > I've reopened ZK-1907 and suggested a fix to this problem.
> > >
> > > -Flavio
> > >
> > >
> > > > On 16 Dec 2015, at 21:01, Ted Yu  wrote:
> > > >
> > > > Logged ZOOKEEPER-2347
> > > >
> > > > Thanks
> > > >
> > > > On Wed, Dec 16, 2015 at 12:36 PM, Camille Fournier <
> cami...@apache.org
> > >
> > > > wrote:
> > > >
> > > >> Blergh. We made shutdown synchronized. But decrementing the requests
> > is
> > > >> also synchronized and called from a different thread. So yeah,
> > deadlock.
> > > >>
> > > >> Can you open a ticket for this? This came in with ZOOKEEPER-1907
> > > >>
> > > >> C
> > > >>
> > > >> On Wed, Dec 16, 2015 at 2:46 PM, Ted Yu 
> wrote:
> > > >>
> > > >>> Hi,
> > > >>> HBase recently upgraded to zookeeper 3.4.7
> > > >>>
> > > >>> In one of the tests, TestSplitLogManager, there is reproducible
> hang
> > at
> > > >> the
> > > >>> end of the test.
> > > >>> Below is snippet from stack trace related to zookeeper:
> > > >>>
> > > >>> "main-EventThread" daemon prio=5 tid=0x7fd27488a800 nid=0x6f1f
> > > >> waiting
> > > >>> on condition [0x00011834b000]
> > > >>>   java.lang.Thread.State: WAITING (parking)
> > > >>>  at sun.misc.Unsafe.park(Native Method)
> > > >>>  - parking to wait for  <0x0007c5b8d3a0> (a
> > > >>>
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > > >>>  at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > > >>>  at
> > > >>>
> > > >>>
> > > >>
> > >
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> > > >>>  at
> > > >>>
> > > >>
> > >
> >
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> > > >>>  at
> > > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> > > >>>
> > > >>> "main-SendThread(localhost:59510)" daemon prio=5
> > tid=0x7fd274eb4000
> > > >>> nid=0x9513 waiting on condition [0x000118042000]
> > > >>>   java.lang.Thread.State: TIMED_WAITING (sleeping)
> > > >>>  at java.lang.Thread.sleep(Native Method)
> > > >>>  at
> > > >>>
> > > >>>
> > > >>
> > >
> >
> org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
> > > >>>  at
> > > >>>
> > > >>>
> > > >>
> > >
> >
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
> > > >>>  at
> > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> > > >>>
> > > >>> "SyncThread:0" prio=5 tid=0x7fd274d02000 nid=0x730f waiting for
> > > >> monitor
> > > >>> entry [0x0001170ac000]
> > > >>>   java.lang.Thread.State: BLOCKED (on object monitor)
> > > >>>  at
> > > >>>
> > > >>>
> > > >>
> > >
> >
> org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
> > > >>>  - waiting to lock <0x0007c5b62128> (a
> > > >>> org.apache.zookeeper.server.ZooKeeperServer)
> > > >>>  at
> > > >>>
> > > >>>
> > > >>
> > >
> >
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
> > > >>>  at
> > > >>>
> > > >>>
> > > >>
> > >
> >
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
> > > >>>  at
> > > >>>
> > > >>>
> > > >>
> > >
> >
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> > > >>>
> > > >>> "main-EventThread" daemon prio=5 tid=0x7fd2753a3800 nid=0x711b
> > > >> waiting
> > > >>> on condition [0x000117a3]
> > > >>>   java.lang.Thread.State: WAITING (parking)
> > > >>>  at sun.misc.Unsafe.park(Native Method)
> > > >>>  - parking to wait for  <0x0007c9b106b8> (a
> > > >>>
> > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > > >>>  at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > > >>>  at
> > > >>>
> > > >>>
> > > >>
> > >
> >
> java.ut

[jira] [Commented] (ZOOKEEPER-2347) Deadlock shutting down zookeeper

2015-12-17 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063101#comment-15063101
 ] 

Chris Nauroth commented on ZOOKEEPER-2347:
--

[~yuzhih...@gmail.com], thank you for the help with testing!

> Deadlock shutting down zookeeper
> 
>
> Key: ZOOKEEPER-2347
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2347
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.7
>Reporter: Ted Yu
>Assignee: Rakesh R
>Priority: Blocker
> Fix For: 3.4.8
>
> Attachments: ZOOKEEPER-2347-br-3.4.patch, testSplitLogManager.stack
>
>
> HBase recently upgraded to ZooKeeper 3.4.7.
> In one of the tests, TestSplitLogManager, there is a reproducible hang at the 
> end of the test.
> Below is a snippet from the stack trace related to ZooKeeper:
> {code}
> "main-EventThread" daemon prio=5 tid=0x7fd27488a800 nid=0x6f1f waiting on 
> condition [0x00011834b000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c5b8d3a0> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main-SendThread(localhost:59510)" daemon prio=5 tid=0x7fd274eb4000 
> nid=0x9513 waiting on condition [0x000118042000]
>java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
>   at 
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> "SyncThread:0" prio=5 tid=0x7fd274d02000 nid=0x730f waiting for monitor 
> entry [0x0001170ac000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
>   - waiting to lock <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> "main-EventThread" daemon prio=5 tid=0x7fd2753a3800 nid=0x711b waiting on 
> condition [0x000117a3]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007c9b106b8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>   at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> "main" prio=5 tid=0x7fd27600 nid=0x1903 in Object.wait() 
> [0x000108aa1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1281)
>   - locked <0x0007c5b66400> (a 
> org.apache.zookeeper.server.SyncRequestProcessor)
>   at java.lang.Thread.join(Thread.java:1355)
>   at 
> org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:213)
>   at 
> org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:770)
>   at 
> org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:478)
>   - locked <0x0007c5b62128> (a 
> org.apache.zookeeper.server.ZooKeeperServer)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.shutdown(NIOServerCnxnFactory.java:266)
>   at 
> org.apache.hadoop.hbase.zookeeper.MiniZooKeeperCluster.shutdown(MiniZooKeeperCluster.java:301)
> {code}
> Note the address (0x0007c5b66400) in the last hunk which seems to 
> indicate some form of deadlock.
> According to Camille Fournier:
> We made shutdown synchronized. But decrementing the requests is
> also synchronized and called from a different thread. So yeah, deadlock.
> This came in with ZOOKEEPER-1907



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: ZooKeeperServer#shutdown hangs

2015-12-17 Thread Jason Rosenberg
Curious if there are specific scenarios which trigger this issue.  So far
we have not seen it where we've upgraded.  We have many tests in continuous
integration that embed zookeeper servers, and so far haven't seen any
issues.

Jason

On Wed, Dec 16, 2015 at 6:01 PM, Ted Yu  wrote:

> Thanks, Flavio.
>
> When 3.4.8 RC comes out, I will give it a spin.
>
> Cheers
>
> On Wed, Dec 16, 2015 at 2:59 PM, Flavio Junqueira  wrote:
>
> > This is bad, we should fix it and release 3.4.8 soon. With the holidays
> > and such, we won't be able to produce an RC and vote, so I suggest we
> > target early Jan. In the meanwhile, I'd suggest users to not move to
> 3.4.7.
> >
> > I've reopened ZK-1907 and suggested a fix to this problem.
> >
> > -Flavio
> >
> >
> > > On 16 Dec 2015, at 21:01, Ted Yu  wrote:
> > >
> > > Logged ZOOKEEPER-2347
> > >
> > > Thanks
> > >
> > > On Wed, Dec 16, 2015 at 12:36 PM, Camille Fournier  >
> > > wrote:
> > >
> > >> Blergh. We made shutdown synchronized. But decrementing the requests
> is
> > >> also synchronized and called from a different thread. So yeah,
> deadlock.
> > >>
> > >> Can you open a ticket for this? This came in with ZOOKEEPER-1907
> > >>
> > >> C
> > >>
> > >> On Wed, Dec 16, 2015 at 2:46 PM, Ted Yu  wrote:
> > >>
> > >>> Hi,
> > >>> HBase recently upgraded to zookeeper 3.4.7
> > >>>
> > >>> In one of the tests, TestSplitLogManager, there is reproducible hang
> at
> > >> the
> > >>> end of the test.
> > >>> Below is snippet from stack trace related to zookeeper:
> > >>>
> > >>> "main-EventThread" daemon prio=5 tid=0x7fd27488a800 nid=0x6f1f
> > >> waiting
> > >>> on condition [0x00011834b000]
> > >>>   java.lang.Thread.State: WAITING (parking)
> > >>>  at sun.misc.Unsafe.park(Native Method)
> > >>>  - parking to wait for  <0x0007c5b8d3a0> (a
> > >>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > >>>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > >>>  at
> > >>>
> > >>>
> > >>
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> > >>>  at
> > >>>
> > >>
> >
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> > >>>  at
> > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> > >>>
> > >>> "main-SendThread(localhost:59510)" daemon prio=5
> tid=0x7fd274eb4000
> > >>> nid=0x9513 waiting on condition [0x000118042000]
> > >>>   java.lang.Thread.State: TIMED_WAITING (sleeping)
> > >>>  at java.lang.Thread.sleep(Native Method)
> > >>>  at
> > >>>
> > >>>
> > >>
> >
> org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
> > >>>  at
> > >>>
> > >>>
> > >>
> >
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
> > >>>  at
> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> > >>>
> > >>> "SyncThread:0" prio=5 tid=0x7fd274d02000 nid=0x730f waiting for
> > >> monitor
> > >>> entry [0x0001170ac000]
> > >>>   java.lang.Thread.State: BLOCKED (on object monitor)
> > >>>  at
> > >>>
> > >>>
> > >>
> >
> org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
> > >>>  - waiting to lock <0x0007c5b62128> (a
> > >>> org.apache.zookeeper.server.ZooKeeperServer)
> > >>>  at
> > >>>
> > >>>
> > >>
> >
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
> > >>>  at
> > >>>
> > >>>
> > >>
> >
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
> > >>>  at
> > >>>
> > >>>
> > >>
> >
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> > >>>
> > >>> "main-EventThread" daemon prio=5 tid=0x7fd2753a3800 nid=0x711b
> > >> waiting
> > >>> on condition [0x000117a3]
> > >>>   java.lang.Thread.State: WAITING (parking)
> > >>>  at sun.misc.Unsafe.park(Native Method)
> > >>>  - parking to wait for  <0x0007c9b106b8> (a
> > >>>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > >>>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > >>>  at
> > >>>
> > >>>
> > >>
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> > >>>  at
> > >>>
> > >>
> >
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> > >>>  at
> > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> > >>>
> > >>> "main" prio=5 tid=0x7fd27600 nid=0x1903 in Object.wait()
> > >>> [0x000108aa1000]
> > >>>   java.lang.Thread.State: WAITING (on object monitor)
> > >>>  at java.lang.Object.wait(Native Method)
> > >>>  - waiting on <*0x0007c5b66400*> (a
> > >>> org.apache.zookeeper.server.SyncRequestProcessor)
> > >>>  at java.lang.Thread.join(Thread.java:1281)
> > >>>  - locked <*0x0007c5b66400*> (a
> > >>> org.apache.zookeeper.server.SyncRequestProc

Re: ZooKeeperServer#shutdown hangs

2015-12-17 Thread Jason Rosenberg
Yep,

I'm able to reproduce it now intermittently (but not a high percentage of the
time) in some of our tests. I'm reverting.

Thanks,

Jason

On Thu, Dec 17, 2015 at 6:39 PM, Ted Yu  wrote:

> Jason:
> See the following test which revealed the deadlock scenario:
>
>
> https://github.com/apache/hbase/blob/master/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java
>
> On Jenkins, hbase build has been flaky where sometimes the above test hung
> but sometimes it passed.
>
> I tend to think that this bug should be fixed for production system.
>
> Cheers
>
> On Thu, Dec 17, 2015 at 3:33 PM, Jason Rosenberg  wrote:
>
> > Curious if there are specific scenarios which trigger this issue.  So far
> > we have not seen it where we've upgraded.  We have many tests in
> continuous
> > integration that embed zookeeper servers, and so far haven't seen any
> > issues.
> >
> > Jason
> >
> > On Wed, Dec 16, 2015 at 6:01 PM, Ted Yu  wrote:
> >
> > > Thanks, Flavio.
> > >
> > > When 3.4.8 RC comes out, I will give it a spin.
> > >
> > > Cheers
> > >
> > > On Wed, Dec 16, 2015 at 2:59 PM, Flavio Junqueira 
> > wrote:
> > >
> > > > This is bad, we should fix it and release 3.4.8 soon. With the
> holidays
> > > > and such, we won't be able to produce an RC and vote, so I suggest we
> > > > target early Jan. In the meanwhile, I'd suggest users to not move to
> > > 3.4.7.
> > > >
> > > > I've reopened ZK-1907 and suggested a fix to this problem.
> > > >
> > > > -Flavio
> > > >
> > > >
> > > > > On 16 Dec 2015, at 21:01, Ted Yu  wrote:
> > > > >
> > > > > Logged ZOOKEEPER-2347
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Wed, Dec 16, 2015 at 12:36 PM, Camille Fournier <
> > cami...@apache.org
> > > >
> > > > > wrote:
> > > > >
> > > > >> Blergh. We made shutdown synchronized. But decrementing the
> requests
> > > is
> > > > >> also synchronized and called from a different thread. So yeah,
> > > deadlock.
> > > > >>
> > > > >> Can you open a ticket for this? This came in with ZOOKEEPER-1907
> > > > >>
> > > > >> C
> > > > >>
> > > > >> On Wed, Dec 16, 2015 at 2:46 PM, Ted Yu 
> > wrote:
> > > > >>
> > > > >>> Hi,
> > > > >>> HBase recently upgraded to zookeeper 3.4.7
> > > > >>>
> > > > >>> In one of the tests, TestSplitLogManager, there is reproducible
> > hang
> > > at
> > > > >> the
> > > > >>> end of the test.
> > > > >>> Below is snippet from stack trace related to zookeeper:
> > > > >>>
> > > > >>> "main-EventThread" daemon prio=5 tid=0x7fd27488a800
> nid=0x6f1f
> > > > >> waiting
> > > > >>> on condition [0x00011834b000]
> > > > >>>   java.lang.Thread.State: WAITING (parking)
> > > > >>>  at sun.misc.Unsafe.park(Native Method)
> > > > >>>  - parking to wait for  <0x0007c5b8d3a0> (a
> > > > >>>
> > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > > > >>>  at
> > java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > > > >>>  at
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> > > > >>>  at
> > > > >>>
> > > > >>
> > > >
> > >
> >
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> > > > >>>  at
> > > > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:501)
> > > > >>>
> > > > >>> "main-SendThread(localhost:59510)" daemon prio=5
> > > tid=0x7fd274eb4000
> > > > >>> nid=0x9513 waiting on condition [0x000118042000]
> > > > >>>   java.lang.Thread.State: TIMED_WAITING (sleeping)
> > > > >>>  at java.lang.Thread.sleep(Native Method)
> > > > >>>  at
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:101)
> > > > >>>  at
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:997)
> > > > >>>  at
> > > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> > > > >>>
> > > > >>> "SyncThread:0" prio=5 tid=0x7fd274d02000 nid=0x730f waiting
> for
> > > > >> monitor
> > > > >>> entry [0x0001170ac000]
> > > > >>>   java.lang.Thread.State: BLOCKED (on object monitor)
> > > > >>>  at
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> org.apache.zookeeper.server.ZooKeeperServer.decInProcess(ZooKeeperServer.java:512)
> > > > >>>  - waiting to lock <0x0007c5b62128> (a
> > > > >>> org.apache.zookeeper.server.ZooKeeperServer)
> > > > >>>  at
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:144)
> > > > >>>  at
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:200)
> > > > >>>  at
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:131)
> > > > >

[jira] [Commented] (ZOOKEEPER-2251) Add Client side packet response timeout to avoid infinite wait.

2015-12-17 Thread nijel (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063503#comment-15063503
 ] 

nijel commented on ZOOKEEPER-2251:
--

hi [~marshad] and [~suda]

I observed this while doing reliability testing for a banking customer.
Here we test for any network abnormality and packet drops.

In this scenario a packet is sent and the client waits forever. Even if the 
server is not responding for any reason, this issue can happen.

So my opinion is to have this timeout, since many services' high-availability 
solutions depend on ZooKeeper.



> Add Client side packet response timeout to avoid infinite wait.
> ---
>
> Key: ZOOKEEPER-2251
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2251
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: nijel
>Assignee: Arshad Mohammad
> Attachments: ZOOKEEPER-2251-01.patch, ZOOKEEPER-2251-02.patch, 
> ZOOKEEPER-2251-03.patch
>
>
> I came across one issue related to a client-side packet response timeout. In my 
> cluster many packet drops happened for some time.
> One observation is that the zookeeper client hung. As per the thread dump it 
> is waiting for the response/ACK for the operation performed (a synchronous API 
> is used here).
> I am using 
> zookeeper.serverCnxnFactory=org.apache.zookeeper.server.NIOServerCnxnFactory
> Since only a few packets were missed, no DISCONNECTED event occurred.
> We need to add a "response timeout" for the operations or packets.
> *Comments from [~rakeshr]*
> My observation about the problem:
> * Tools like 'Wireshark' can be used to simulate artificial packet loss.
> * Assume there is only one packet in the 'outgoingQueue' and unfortunately 
> the server response packet is lost. Now the client will enter an infinite 
> wait. 
> https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java#L1515
> * Probably we can discuss more about this problem and possible solutions (add a 
> packet ACK timeout or another better approach) in the jira.
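
As a rough sketch of the kind of timeout being proposed (a stand-in Packet class and a hypothetical waitForResponse helper, not the actual ClientCnxn code), the client waits with a deadline instead of waiting indefinitely for a response that may never arrive:

{code}
import java.util.concurrent.TimeoutException;

class ResponseWait {
    // Illustrative stand-in for the pending request, not ClientCnxn.Packet.
    static final class Packet {
        boolean finished;
    }

    static void waitForResponse(Packet packet, long responseTimeoutMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + responseTimeoutMs;
        synchronized (packet) {
            while (!packet.finished) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    // Surface the lost response instead of blocking forever,
                    // so callers of the synchronous API can retry or fail over.
                    throw new TimeoutException(
                        "no response within " + responseTimeoutMs + " ms");
                }
                packet.wait(remaining);
            }
        }
    }
}
{code}

On timeout the synchronous caller gets an exception it can handle (retry, fail over, or surface the error) rather than hanging forever when a single response packet is dropped.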



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Failed: ZOOKEEPER-2251 PreCommit Build #2999

2015-12-17 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-2251
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2999/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 1531 lines...]
 [exec] Skipping patch.
 [exec] 2 out of 2 hunks ignored
 [exec] PATCH APPLICATION FAILED
 [exec] 
 [exec] 
 [exec] 
 [exec] 
 [exec] -1 overall.  Here are the results of testing the latest attachment 
 [exec]   
http://issues.apache.org/jira/secure/attachment/12765803/ZOOKEEPER-2251-03.patch
 [exec]   against trunk revision 1720227.
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] -1 patch.  The patch command could not apply the patch.
 [exec] 
 [exec] Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2999//console
 [exec] 
 [exec] This message is automatically generated.
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Adding comment to Jira.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] Comment added.
 [exec] 3d9d7fe2e9aaed64e384c9f56adc7f7ec703232a logged out
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Finished build.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 

BUILD FAILED
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/build.xml:1816:
 exec returned: 1

Total time: 1 minute 17 seconds
Build step 'Execute shell' marked build as failure
Archiving artifacts
Setting 
LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
Recording test results
Setting 
LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
ERROR: Publisher 'Publish JUnit test result report' failed: No test report 
files were found. Configuration error?
Setting 
LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
[description-setter] Description set: ZOOKEEPER-2251
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any
Setting 
LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
Setting 
LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
Setting 
LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7
Setting 
LATEST1_7_HOME=/home/jenkins/jenkins-slave/tools/hudson.model.JDK/latest1.7



###
## FAILED TESTS (if any) 
##
No tests ran.

[jira] [Commented] (ZOOKEEPER-2251) Add Client side packet response timeout to avoid infinite wait.

2015-12-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063517#comment-15063517
 ] 

Hadoop QA commented on ZOOKEEPER-2251:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12765803/ZOOKEEPER-2251-03.patch
  against trunk revision 1720227.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2999//console

This message is automatically generated.

> Add Client side packet response timeout to avoid infinite wait.
> ---
>
> Key: ZOOKEEPER-2251
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2251
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: nijel
>Assignee: Arshad Mohammad
> Attachments: ZOOKEEPER-2251-01.patch, ZOOKEEPER-2251-02.patch, 
> ZOOKEEPER-2251-03.patch
>
>
> I came across one issue related to a client-side packet response timeout. In my 
> cluster many packet drops happened for some time.
> One observation is that the zookeeper client hung. As per the thread dump it 
> is waiting for the response/ACK for the operation performed (a synchronous API 
> is used here).
> I am using 
> zookeeper.serverCnxnFactory=org.apache.zookeeper.server.NIOServerCnxnFactory
> Since only a few packets were missed, no DISCONNECTED event occurred.
> We need to add a "response timeout" for the operations or packets.
> *Comments from [~rakeshr]*
> My observation about the problem:
> * Tools like 'Wireshark' can be used to simulate artificial packet loss.
> * Assume there is only one packet in the 'outgoingQueue' and unfortunately 
> the server response packet is lost. Now the client will enter an infinite 
> wait. 
> https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java#L1515
> * Probably we can discuss more about this problem and possible solutions (add a 
> packet ACK timeout or another better approach) in the jira.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2251) Add Client side packet response timeout to avoid infinite wait.

2015-12-17 Thread Akihiro Suda (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063565#comment-15063565
 ] 

Akihiro Suda commented on ZOOKEEPER-2251:
-

Hi [~arshad.mohammad] and [~nijel],
Thank you for the comments,
and sorry that I'm still not able to identify the root cause of the bug.

I'm OK with having this timeout.
I would like to respect the committers' decision.

Cc: [~rgs] (zktraffic author), could you please take a look at this?


> Add Client side packet response timeout to avoid infinite wait.
> ---
>
> Key: ZOOKEEPER-2251
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2251
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: nijel
>Assignee: Arshad Mohammad
> Attachments: ZOOKEEPER-2251-01.patch, ZOOKEEPER-2251-02.patch, 
> ZOOKEEPER-2251-03.patch
>
>
> I came across one issue related to a client-side packet response timeout. In my 
> cluster many packet drops happened for some time.
> One observation is that the zookeeper client hung. As per the thread dump it 
> is waiting for the response/ACK for the operation performed (a synchronous API 
> is used here).
> I am using 
> zookeeper.serverCnxnFactory=org.apache.zookeeper.server.NIOServerCnxnFactory
> Since only a few packets were missed, no DISCONNECTED event occurred.
> We need to add a "response timeout" for the operations or packets.
> *Comments from [~rakeshr]*
> My observation about the problem:
> * Tools like 'Wireshark' can be used to simulate artificial packet loss.
> * Assume there is only one packet in the 'outgoingQueue' and unfortunately 
> the server response packet is lost. Now the client will enter an infinite 
> wait. 
> https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java#L1515
> * Probably we can discuss more about this problem and possible solutions (add a 
> packet ACK timeout or another better approach) in the jira.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)