[jira] [Commented] (ZOOKEEPER-1731) Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock

2013-08-02 Thread Camille Fournier (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13727807#comment-13727807 ]

Camille Fournier commented on ZOOKEEPER-1731:
---------------------------------------------

Looks good, checking in.

> Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock
> --------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-1731
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1731
> Project: ZooKeeper
>  Issue Type: Bug
>Reporter: Dave Latham
>Priority: Critical
> Fix For: 3.4.6
>
> Attachments: ZOOKEEPER-1731.patch
>
>
> We had a cluster of 3 peers (running 3.4.3) fail after we took down 1 peer 
> briefly for maintenance.  A second peer became unresponsive and the leader 
> lost quorum.  Thread dumps on the second peer showed two threads consistently 
> stuck in these states:
> {noformat}
> "QuorumPeer[myid=0]/0.0.0.0:2181" prio=10 tid=0x2aaab8d20800 nid=0x598a 
> runnable [0x4335d000]
>java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.put(HashMap.java:405)
> at 
> org.apache.zookeeper.server.ServerCnxnFactory.registerConnection(ServerCnxnFactory.java:131)
> at 
> org.apache.zookeeper.server.ZooKeeperServer.finishSessionInit(ZooKeeperServer.java:572)
> at 
> org.apache.zookeeper.server.quorum.Learner.revalidate(Learner.java:444)
> at 
> org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:133)
> at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:86)
> at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> "NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181" daemon prio=10 
> tid=0x2aaab84b0800 nid=0x5986 runnable [0x40878000]
>java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.removeEntryForKey(HashMap.java:614)
> at java.util.HashMap.remove(HashMap.java:581)
> at 
> org.apache.zookeeper.server.ServerCnxnFactory.unregisterConnection(ServerCnxnFactory.java:120)
> at 
> org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:971)
> - locked <0x00078d8a51f0> (a java.util.HashSet)
> at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.closeSessionWithoutWakeup(NIOServerCnxnFactory.java:307)
> at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.closeSession(NIOServerCnxnFactory.java:294)
> - locked <0x00078d82c750> (a 
> org.apache.zookeeper.server.NIOServerCnxnFactory)
> at 
> org.apache.zookeeper.server.ZooKeeperServer.processConnectRequest(ZooKeeperServer.java:834)
> at 
> org.apache.zookeeper.server.NIOServerCnxn.readConnectRequest(NIOServerCnxn.java:410)
> at 
> org.apache.zookeeper.server.NIOServerCnxn.readPayload(NIOServerCnxn.java:200)
> at 
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:236)
> at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
> at java.lang.Thread.run(Thread.java:662)
> {noformat}
> It shows both threads concurrently modifying 
> ServerCnxnFactory.connectionBeans, which is a java.util.HashMap.
> This cluster was serving thousands of clients, which seems to make this 
> condition more likely, as it appears to occur when one client connects and 
> another disconnects at about the same time.
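
Both stack traces show threads that are RUNNABLE inside java.util.HashMap rather than blocked on a monitor, which is the usual signature of a HashMap whose bucket structure has been corrupted by unsynchronized concurrent put/remove. A minimal sketch of the kind of change that closes this race is below; the class and type names are simplified stand-ins for ServerCnxnFactory and its bean classes, and the sketch is illustrative rather than the contents of the attached ZOOKEEPER-1731.patch.

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative stand-ins for ZooKeeper's ServerCnxn / ConnectionBean.
class Connection {}
class ConnectionBean {
    ConnectionBean(Connection cnxn) {}
}

public class ConnectionRegistry {
    // A ConcurrentHashMap tolerates concurrent put/remove from the
    // QuorumPeer and NIOServerCnxnFactory threads; a plain HashMap can
    // corrupt its bucket chains under that pattern and leave both
    // callers spinning forever in RUNNABLE, as in the dump above.
    private final ConcurrentMap<Connection, ConnectionBean> connectionBeans =
            new ConcurrentHashMap<Connection, ConnectionBean>();

    public void registerConnection(Connection cnxn) {
        connectionBeans.put(cnxn, new ConnectionBean(cnxn));
        // MBean registration omitted for brevity.
    }

    public void unregisterConnection(Connection cnxn) {
        connectionBeans.remove(cnxn);
        // MBean unregistration omitted for brevity.
    }
}
{code}

Synchronizing registerConnection and unregisterConnection on a common lock would close the race as well; a concurrent map simply avoids adding contention on the connection setup and teardown paths.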

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1731) Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock

2013-07-29 Thread Dave Latham (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722640#comment-13722640 ]

Dave Latham commented on ZOOKEEPER-1731:
----------------------------------------

Note this patch is for branch-3.4 and so doesn't apply to trunk.



[jira] [Commented] (ZOOKEEPER-1731) Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock

2013-07-29 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722611#comment-13722611 ]

Hadoop QA commented on ZOOKEEPER-1731:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12594725/ZOOKEEPER-1731.patch
  against trunk revision 1503101.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1525//console

This message is automatically generated.
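
On the "tests included" point, a rough stress harness for this kind of race could look like the sketch below. It is illustrative only (plain java.util types and arbitrary iteration counts stand in for the real connection and bean classes) and is not taken from the attached patch; with the unsafe map, a failure typically shows up as a timeout rather than an exception.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Rough stress sketch: two threads hammer a shared map with put/remove,
// mimicking concurrent registerConnection/unregisterConnection.
// With new java.util.HashMap<Integer, Object>() substituted below, the
// run may never finish; with ConcurrentHashMap it completes quickly.
public class ConnectionMapStress {
    public static void main(String[] args) throws InterruptedException {
        final Map<Integer, Object> beans = new ConcurrentHashMap<Integer, Object>();
        final CountDownLatch done = new CountDownLatch(2);

        Runnable churn = new Runnable() {
            public void run() {
                for (int i = 0; i < 1000000; i++) {
                    beans.put(i % 1024, new Object());   // "register"
                    beans.remove((i + 512) % 1024);      // "unregister"
                }
                done.countDown();
            }
        };

        new Thread(churn, "register-side").start();
        new Thread(churn, "unregister-side").start();

        // A corrupted HashMap shows up here as a timeout, not a crash.
        if (!done.await(60, TimeUnit.SECONDS)) {
            System.err.println("Timed out: map operations appear stuck");
        } else {
            System.out.println("Completed without hanging");
        }
    }
}
{code}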
