[jira] [Updated] (ZOOKEEPER-2201) Network issues can cause cluster to hang due to near-deadlock

Raul Gutierrez Segales (JIRA) Sat, 06 Jun 2015 09:55:59 -0700

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Raul Gutierrez Segales updated ZOOKEEPER-2201:
----------------------------------------------
    Release Note:   (was: Merged:

http://svn.apache.org/viewvc?view=revision&revision=1683878
http://svn.apache.org/viewvc?view=revision&revision=1683930
http://svn.apache.org/viewvc?view=revision&revision=1683931

(pasting SVN URLs since the GitHub mirror seems to be lagging). 

Thanks [~dnadolny]! )

> Network issues can cause cluster to hang due to near-deadlock
> -------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2201
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2201
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.6, 3.5.0
>            Reporter: Donny Nadolny
>            Assignee: Donny Nadolny
>            Priority: Critical
>             Fix For: 3.4.7, 3.5.2, 3.6.0
>
>         Attachments: ZOOKEEPER-2201-branch-34.patch, ZOOKEEPER-2201.patch, 
> ZOOKEEPER-2201.patch, ZOOKEEPER-2201.patch, ZOOKEEPER-2201.patch, 
> ZOOKEEPER-2201.patch
>
>
> {{DataTree.serializeNode}} synchronizes on the {{DataNode}} it is about to 
> serialize, then writes it out via {{OutputArchive.writeRecord}}, potentially 
> to a network connection. Under default Linux TCP settings, a write to a 
> connection whose other side has completely disappeared will hang (blocking 
> on the {{java.net.SocketOutputStream.socketWrite0}} call) for over 15 
> minutes. During this time the {{DataNode}} monitor stays held, so any 
> attempt to create/delete/modify the {{DataNode}} will cause the leader to 
> hang at the beginning of the request processor chain:
> {noformat}
> "ProcessThread(sid:5 cport:-1):" prio=10 tid=0x00000000026f1800 nid=0x379c waiting for monitor entry [0x00007fe6c2a8c000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:163)
>         - waiting to lock <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
>         - locked <0x00000000d2ef81d0> (a java.util.ArrayList)
>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:345)
>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:534)
>         at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:131)
> {noformat}
> Additionally, any attempt to send a snapshot to a follower or to disk will 
> hang.
> Because the ping packets are sent by another thread, which is unaffected, 
> followers never time out and start a new leader election, even though the 
> cluster will make no progress until either the leader is killed or the TCP 
> connection times out. This isn't exactly a deadlock since it will resolve 
> itself eventually, but as mentioned above this will take more than 15 
> minutes with the default TCP retry settings in Linux.
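
The "over 15 minutes" figure can be reproduced with a little arithmetic. The sketch below assumes the Linux defaults of tcp_retries2 = 15, a minimum retransmission timeout of 200 ms, and a maximum of 120 s, and it assumes the timeout starts at the minimum and doubles after each unacknowledged transmission. Real connections derive the timeout from the measured round-trip time, so the actual wait is usually somewhat longer. The class and method names are made up for illustration.

```java
public class TcpRetryTimeout {
    // Sums the waits for the initial transmission plus `retries`
    // retransmissions, with the timeout doubling from rtoMinMs
    // and capped at rtoMaxMs.
    static long totalTimeoutMs(int retries, long rtoMinMs, long rtoMaxMs) {
        long total = 0;
        long rto = rtoMinMs;
        for (int i = 0; i <= retries; i++) {
            total += rto;
            rto = Math.min(rto * 2, rtoMaxMs);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed defaults: 15 retries, 200 ms minimum, 120 s maximum.
        long ms = totalTimeoutMs(15, 200, 120_000);
        System.out.println(ms / 1000.0 + " seconds"); // prints "924.6 seconds"
    }
}
```

Under these assumptions the connection is given up after roughly 924.6 seconds, a little over 15 minutes, which is how long the {{DataNode}} monitor can stay held.
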
> A simple solution: in {{DataTree.serializeNode}}, take a copy of the 
> contents of the {{DataNode}} (as is done with its children) inside the 
> synchronized block, then call {{writeRecord}} with that copy after leaving 
> the synchronized block, so the monitor is never held during the write.
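
The proposed fix can be illustrated with a minimal, self-contained sketch. This is not the ZooKeeper code or the attached patch: Node, Sink, and serializeNode below are hypothetical stand-ins for {{DataNode}}, {{OutputArchive}}, and {{DataTree.serializeNode}}, and the stalled sink only simulates a peer that has disappeared. The point is the pattern: copy under the lock, do the blocking write after releasing it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class CopyThenWrite {
    // Hypothetical stand-in for a DataNode: state guarded by its own monitor.
    static final class Node {
        byte[] data;
        Node(byte[] data) { this.data = data; }
    }

    // Hypothetical stand-in for an output archive that may block on a socket.
    interface Sink {
        void write(byte[] data) throws InterruptedException;
    }

    // Copy the contents while holding the node's lock, then perform the
    // potentially blocking write after the lock has been released.
    static void serializeNode(Node node, Sink sink) throws InterruptedException {
        byte[] copy;
        synchronized (node) {
            copy = node.data.clone();
        }
        sink.write(copy); // no monitor is held here
    }

    // Returns true if another thread can lock and update the node
    // while the serializing thread is still stuck inside the write.
    static boolean updateSucceedsDuringBlockedWrite() {
        try {
            Node node = new Node(new byte[] {1, 2, 3});
            CountDownLatch writeStarted = new CountDownLatch(1);
            CountDownLatch releaseWrite = new CountDownLatch(1);

            Sink stalledSink = data -> {
                writeStarted.countDown();
                releaseWrite.await(); // simulates a write that does not return
            };

            Thread serializer = new Thread(() -> {
                try {
                    serializeNode(node, stalledSink);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            serializer.start();
            writeStarted.await(); // the write is now in progress

            CountDownLatch updated = new CountDownLatch(1);
            Thread updater = new Thread(() -> {
                synchronized (node) {
                    node.data = new byte[] {9};
                }
                updated.countDown();
            });
            updater.start();

            boolean ok = updated.await(2, TimeUnit.SECONDS);
            releaseWrite.countDown(); // let the stalled write finish
            serializer.join();
            updater.join();
            return ok;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("update succeeded while write was blocked: "
                + updateSucceedsDuringBlockedWrite());
    }
}
```

If the write were moved inside the synchronized block, the updater thread in this sketch would stay blocked until the sink was released, which mirrors the BLOCKED ProcessThread in the stack trace above. The trade-off is one extra copy of the node's contents per serialization.
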



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
