[jira] [Updated] (ZOOKEEPER-2201) Network issues can cause cluster to hang due to near-deadlock

Patrick Hunt (JIRA) Tue, 02 Jun 2015 22:21:13 -0700

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Patrick Hunt updated ZOOKEEPER-2201:
------------------------------------
    Assignee: Donny Nadolny

> Network issues can cause cluster to hang due to near-deadlock
> -------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2201
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2201
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.6
>            Reporter: Donny Nadolny
>            Assignee: Donny Nadolny
>            Priority: Critical
>         Attachments: ZOOKEEPER-2201.patch
>
>
> {{DataTree.serializeNode}} synchronizes on the {{DataNode}} it is about to 
> serialize, then writes it out via {{OutputArchive.writeRecord}}, potentially 
> to a network connection. Under default Linux TCP settings, a write to a 
> connection whose peer has completely disappeared will block (in the 
> {{java.net.SocketOutputStream.socketWrite0}} call) for over 15 minutes. 
> During this time, any attempt to create, delete, or modify that {{DataNode}} 
> will cause the leader to hang at the beginning of the request processor chain:
> {noformat}
> "ProcessThread(sid:5 cport:-1):" prio=10 tid=0x00000000026f1800 nid=0x379c waiting for monitor entry [0x00007fe6c2a8c000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:163)
>         - waiting to lock <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
>         - locked <0x00000000d2ef81d0> (a java.util.ArrayList)
>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:345)
>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:534)
>         at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:131)
> {noformat}
> Additionally, any attempt to send a snapshot to a follower or to write one 
> to disk will hang.
> Because ping packets are sent by a separate thread that is unaffected, 
> followers never time out and elect a new leader, even though the cluster 
> makes no progress until either the leader is killed or the TCP connection 
> times out. This isn't strictly a deadlock, since it eventually resolves 
> itself, but as mentioned above that takes more than 15 minutes with the 
> default TCP retry settings in Linux.
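
An aside on where the "more than 15 minutes" figure plausibly comes from; this is not part of the report. Assuming the stock Linux values (net.ipv4.tcp_retries2 = 15, a minimum retransmission timeout of 200 ms and a maximum of 120 s), the retransmission timer doubles on every unanswered retry until it hits the cap, and the kernel documentation quotes a hypothetical total of 924.6 seconds for that default. The class and method names below are made up for illustration, and a real connection starts from an RTT-derived timeout, so the actual delay will vary.

```java
public class TcpRetryTimeout {
    // Assumed kernel constants (not taken from the report):
    // TCP_RTO_MIN = 200 ms, TCP_RTO_MAX = 120 s.
    static final long RTO_MIN_MS = 200;
    static final long RTO_MAX_MS = 120_000;

    /**
     * Hypothetical time until the kernel gives up on a connection whose peer
     * never acknowledges, given `retries` retransmissions with exponential
     * backoff capped at RTO_MAX_MS.
     */
    static long hypotheticalTimeoutMs(int retries) {
        long total = 0;
        long rto = RTO_MIN_MS;
        // One wait before each of the `retries` retransmissions,
        // plus one final wait after the last retransmission.
        for (int i = 0; i <= retries; i++) {
            total += rto;
            rto = Math.min(rto * 2, RTO_MAX_MS);
        }
        return total;
    }

    public static void main(String[] args) {
        // 15 is assumed to be the default value of net.ipv4.tcp_retries2.
        long ms = hypotheticalTimeoutMs(15);
        System.out.println(ms + " ms, i.e. " + (ms / 60_000) + " whole minutes");
    }
}
```

Under those assumptions the sum is 924,600 ms, a little over 15 minutes, which matches the hang duration described above.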
> A simple solution: in {{DataTree.serializeNode}}, copy the contents of the 
> {{DataNode}} inside the synchronized block (as is already done with its 
> children), then call {{writeRecord}} on the copy after leaving the 
> synchronized block, so the lock is never held across the write.
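
A minimal sketch of the copy-then-write idea described above, not the attached patch. Node, NodeCopy, and this serializeNode are hypothetical stand-ins written for illustration; they are not ZooKeeper's DataNode or DataTree classes, and the wire format here is invented.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class SerializeOutsideLock {
    /** Hypothetical stand-in for a DataNode: mutable state guarded by its own monitor. */
    static final class Node {
        byte[] data;
        long version;
        Node(byte[] data, long version) { this.data = data; this.version = version; }
    }

    /** Private snapshot of a Node's fields, taken while holding the Node's monitor. */
    static final class NodeCopy {
        final byte[] data;
        final long version;
        NodeCopy(byte[] data, long version) { this.data = data; this.version = version; }
    }

    static void serializeNode(Node node, OutputStream out) throws IOException {
        NodeCopy copy;
        synchronized (node) {
            // Hold the monitor only long enough to copy the fields.
            copy = new NodeCopy(node.data.clone(), node.version);
        }
        // The potentially slow write (for example to a dead socket) happens
        // without the monitor, so writers to `node` are not blocked by it.
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeLong(copy.version);
        dos.writeInt(copy.data.length);
        dos.write(copy.data);
        dos.flush();
    }

    public static void main(String[] args) throws IOException {
        Node node = new Node("hello".getBytes(StandardCharsets.UTF_8), 7L);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        serializeNode(node, sink);
        System.out.println(sink.size() + " bytes written");
    }
}
```

The trade-off is one extra array copy per node during serialization. In exchange, a stalled write can delay only the serializing thread, not a thread that needs the node's monitor to process a request.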



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
