[jira] [Commented] (ZOOKEEPER-2201) Network issues can cause cluster to hang due to near-deadlock

Hadoop QA (JIRA) Wed, 03 Jun 2015 07:22:10 -0700

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570900#comment-14570900 ]


Hadoop QA commented on ZOOKEEPER-2201:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12737283/ZOOKEEPER-2201-branch-34.patch
  against trunk revision 1683163.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2746//console

This message is automatically generated.

> Network issues can cause cluster to hang due to near-deadlock
> -------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2201
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2201
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.6
>            Reporter: Donny Nadolny
>            Assignee: Donny Nadolny
>            Priority: Critical
>             Fix For: 3.4.7, 3.5.2
>
>         Attachments: ZOOKEEPER-2201-branch-34.patch, ZOOKEEPER-2201.patch, ZOOKEEPER-2201.patch
>
>
> {{DataTree.serializeNode}} synchronizes on the {{DataNode}} it is about to 
> serialize, then writes it out via {{OutputArchive.writeRecord}}, potentially 
> to a network connection. Under default Linux TCP settings, a write to a 
> connection whose other side has completely disappeared will hang (blocking on 
> the {{java.net.SocketOutputStream.socketWrite0}} call) for over 15 minutes. 
> During this time, any attempt to create, delete, or modify the {{DataNode}} will 
> cause the leader to hang at the beginning of the request processor chain:
> {noformat}
> "ProcessThread(sid:5 cport:-1):" prio=10 tid=0x00000000026f1800 nid=0x379c waiting for monitor entry [0x00007fe6c2a8c000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:163)
>         - waiting to lock <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
>         - locked <0x00000000d2ef81d0> (a java.util.ArrayList)
>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:345)
>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:534)
>         at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:131)
> {noformat}
> Additionally, any attempt to send a snapshot to a follower or to write one 
> to disk will hang.
> Because the ping packets are sent by another thread, which is unaffected, 
> followers never time out and elect a new leader, even though the cluster will 
> make no progress until either the leader is killed or the TCP connection times 
> out. This isn't exactly a deadlock, since it will resolve itself eventually, 
> but as mentioned above this will take more than 15 minutes with the default 
> TCP retry settings in Linux.
> A simple solution is for {{DataTree.serializeNode}} to take a copy of the 
> contents of the {{DataNode}} (as is done with its children) inside the 
> synchronized block, then call {{writeRecord}} with that copy outside of the 
> synchronized block on the original {{DataNode}}.
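
The copy-then-write idea in the last paragraph of the description can be sketched roughly as follows. This is only an illustration and not the attached patch: {{Node}}, {{CopyThenWrite}}, and this {{serializeNode}} are hypothetical stand-ins rather than ZooKeeper's real {{DataNode}} and {{DataTree}}, and a plain {{DataOutputStream}} is assumed in place of {{OutputArchive.writeRecord}}.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class CopyThenWrite {

    /** Hypothetical stand-in for a tree node; NOT ZooKeeper's DataNode. */
    static final class Node {
        private byte[] data;
        private long version;

        Node(byte[] data, long version) {
            this.data = data;
            this.version = version;
        }

        /** Writers take the node's monitor, as request processing would. */
        synchronized void setData(byte[] newData) {
            this.data = newData;
            this.version++;
        }
    }

    /**
     * Copies the node's contents while holding its monitor, then performs the
     * potentially blocking write after the monitor has been released.
     */
    static void serializeNode(Node node, OutputStream out) throws IOException {
        byte[] dataCopy;
        long versionCopy;
        synchronized (node) {
            // Only cheap in-memory copies happen under the lock.
            dataCopy = node.data.clone();
            versionCopy = node.version;
        }
        // The write may stall on a dead peer, but no longer blocks other
        // threads that need the node's monitor.
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeLong(versionCopy);
        dos.writeInt(dataCopy.length);
        dos.write(dataCopy);
        dos.flush();
    }

    public static void main(String[] args) throws IOException {
        Node node = new Node(new byte[] {1, 2, 3}, 7L);
        boolean[] lockHeldDuringWrite = {false};
        // Probe stream: records whether the caller holds the node's monitor
        // at the moment bytes are written.
        OutputStream probe = new OutputStream() {
            @Override
            public void write(int b) {
                lockHeldDuringWrite[0] |= Thread.holdsLock(node);
            }
        };
        serializeNode(node, probe);
        System.out.println("lock held during write: " + lockHeldDuringWrite[0]);
    }
}
```

With this shape, a write that stalls on a dead connection delays only the serializing thread. A thread calling {{setData}} on the same node can still acquire the monitor, at the cost of one extra copy of the node's data per serialization.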



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
