[jira] [Commented] (ZOOKEEPER-1731) Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13727807#comment-13727807 ]

Camille Fournier commented on ZOOKEEPER-1731:
---------------------------------------------

Looks good, checking in.

> Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock
> ------------------------------------------------------------------------------
>
>          Key: ZOOKEEPER-1731
>          URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1731
>      Project: ZooKeeper
>   Issue Type: Bug
>     Reporter: Dave Latham
>     Priority: Critical
>      Fix For: 3.4.6
>
>  Attachments: ZOOKEEPER-1731.patch
>
>
> We had a cluster of 3 peers (running 3.4.3) fail after we took down 1 peer
> briefly for maintenance. A second peer became unresponsive and the leader
> lost quorum. Thread dumps on the second peer showed two threads consistently
> stuck in these states:
> {noformat}
> "QuorumPeer[myid=0]/0.0.0.0:2181" prio=10 tid=0x2aaab8d20800 nid=0x598a runnable [0x4335d000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.HashMap.put(HashMap.java:405)
>         at org.apache.zookeeper.server.ServerCnxnFactory.registerConnection(ServerCnxnFactory.java:131)
>         at org.apache.zookeeper.server.ZooKeeperServer.finishSessionInit(ZooKeeperServer.java:572)
>         at org.apache.zookeeper.server.quorum.Learner.revalidate(Learner.java:444)
>         at org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:133)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:86)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
>
> "NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181" daemon prio=10 tid=0x2aaab84b0800 nid=0x5986 runnable [0x40878000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.HashMap.removeEntryForKey(HashMap.java:614)
>         at java.util.HashMap.remove(HashMap.java:581)
>         at org.apache.zookeeper.server.ServerCnxnFactory.unregisterConnection(ServerCnxnFactory.java:120)
>         at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:971)
>         - locked <0x00078d8a51f0> (a java.util.HashSet)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory.closeSessionWithoutWakeup(NIOServerCnxnFactory.java:307)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory.closeSession(NIOServerCnxnFactory.java:294)
>         - locked <0x00078d82c750> (a org.apache.zookeeper.server.NIOServerCnxnFactory)
>         at org.apache.zookeeper.server.ZooKeeperServer.processConnectRequest(ZooKeeperServer.java:834)
>         at org.apache.zookeeper.server.NIOServerCnxn.readConnectRequest(NIOServerCnxn.java:410)
>         at org.apache.zookeeper.server.NIOServerCnxn.readPayload(NIOServerCnxn.java:200)
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:236)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:224)
>         at java.lang.Thread.run(Thread.java:662)
> {noformat}
> It shows both threads concurrently modifying ServerCnxnFactory.connectionBeans,
> which is a java.util.HashMap.
> This cluster was serving thousands of clients, which seems to make this
> condition more likely, as it appears to occur when one client connects and
> another disconnects at about the same time.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
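The attached ZOOKEEPER-1731.patch itself is not reproduced in this thread, but the failure mode in the dump above is the well-known one for a plain java.util.HashMap mutated by two threads at once: both stacks are RUNNABLE inside HashMap.put/remove, spinning on a corrupted bucket chain rather than blocking on a lock. A minimal sketch of one common style of fix, using a ConcurrentHashMap so registerConnection and unregisterConnection are safe from the QuorumPeer and NIOServerCnxnFactory threads concurrently (class and field names here are illustrative, not the actual patch):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the connectionBeans bookkeeping in ServerCnxnFactory.
// The real map is keyed by the ServerCnxn and holds JMX ObjectNames; String/Object
// stand in here purely for illustration.
public class ConnectionBeansSketch {
    // ConcurrentHashMap allows concurrent put/remove without external locking,
    // eliminating the unsynchronized-HashMap corruption seen in the thread dump.
    private final Map<String, Object> connectionBeans = new ConcurrentHashMap<>();

    // Called from the QuorumPeer thread via finishSessionInit.
    public void registerConnection(String cnxn, Object bean) {
        connectionBeans.put(cnxn, bean);
    }

    // Called from the NIOServerCnxnFactory thread via NIOServerCnxn.close.
    public void unregisterConnection(String cnxn) {
        connectionBeans.remove(cnxn);
    }

    public int size() {
        return connectionBeans.size();
    }
}
```

An equivalent alternative is to keep the HashMap and synchronize both methods on a common monitor; the concurrent map just avoids serializing readers as well.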
[jira] [Commented] (ZOOKEEPER-1731) Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722640#comment-13722640 ]

Dave Latham commented on ZOOKEEPER-1731:
----------------------------------------

Note this patch is for branch-3.4 and so doesn't apply to trunk.
[jira] [Commented] (ZOOKEEPER-1731) Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722611#comment-13722611 ]

Hadoop QA commented on ZOOKEEPER-1731:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12594725/ZOOKEEPER-1731.patch
against trunk revision 1503101.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
    Please justify why no new tests are needed for this patch.
    Also please list what manual steps were performed to verify this patch.

    -1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1525//console

This message is automatically generated.