[jira] [Updated] (HDFS-5922) DN heartbeat thread can get stuck in tight loop

Arpit Agarwal (JIRA) Sat, 22 Feb 2014 17:36:07 -0800

     [ https://issues.apache.org/jira/browse/HDFS-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Arpit Agarwal updated HDFS-5922:
--------------------------------

    Attachment: HDFS-5922.01.patch

Hi Aaron, sorry about the delayed response. I was away. Here's a preliminary 
patch to get Jenkins results.

The specific bug here could have been avoided by resetting the counter to zero 
when emptying the queues. However, it seems unnecessary to maintain an exact 
count of the pending requests when all we care about is whether or not any 
requests are pending. The patch replaces the counter with a boolean.
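
To illustrate why a boolean is harder to get wrong than an exact count, here is 
a minimal sketch. It is not the code in the patch: the class names, 
{{dropAll}}, {{report}} and {{shouldSkipWait}} are all hypothetical, and the 
"forgot to reset on clear" path is an assumption standing in for the bug 
described above. A stale counter never returns to zero because each report only 
subtracts what was actually sent, whereas a stale flag is cleared by the next 
report attempt.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch; none of these names come from the HDFS source.
class PendingRequestSketch {

    /** Exact-count tracking: correct only if every queue change updates the count. */
    static class CounterTracker {
        private final Queue<String> queue = new ArrayDeque<>();
        private int pendingCount = 0;

        void enqueue(String blockId) { queue.add(blockId); pendingCount++; }

        /** Assumed buggy path: empties the queue but leaves the count stale. */
        void dropAll() { queue.clear(); }

        /** Sends whatever is queued and subtracts only what was actually sent. */
        int report() {
            int sent = queue.size();
            queue.clear();
            pendingCount -= sent;
            return sent;
        }

        boolean shouldSkipWait() { return pendingCount > 0; }
    }

    /** Flag tracking: every report attempt resets the state, so it cannot drift. */
    static class FlagTracker {
        private final Queue<String> queue = new ArrayDeque<>();
        private boolean sendImmediately = false;

        void enqueue(String blockId) { queue.add(blockId); sendImmediately = true; }

        /** Same assumed buggy path: the flag is left stale as well. */
        void dropAll() { queue.clear(); }

        int report() {
            int sent = queue.size();
            queue.clear();
            sendImmediately = false;
            return sent;
        }

        boolean shouldSkipWait() { return sendImmediately; }
    }

    /** Number of back-to-back iterations a loop would run without waiting (capped). */
    static int counterSpins(int cap) {
        CounterTracker t = new CounterTracker();
        t.enqueue("blk_1");
        t.enqueue("blk_2");
        t.dropAll();
        int spins = 0;
        while (t.shouldSkipWait() && spins < cap) { t.report(); spins++; }
        return spins;
    }

    static int flagSpins(int cap) {
        FlagTracker t = new FlagTracker();
        t.enqueue("blk_1");
        t.enqueue("blk_2");
        t.dropAll();
        int spins = 0;
        while (t.shouldSkipWait() && spins < cap) { t.report(); spins++; }
        return spins;
    }

    public static void main(String[] args) {
        System.out.println("counter spins: " + counterSpins(1000)); // hits the cap
        System.out.println("flag spins:    " + flagSpins(1000));    // one extra pass
    }
}
```

In this sketch the counter version spins until the cap is reached, while the 
flag version costs at most one spurious report.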

{quote}
Andrew Wang also pointed out offline that it is perhaps incorrect to be 
subtracting the number of deleted blocks from pendingReceivedRequests in 
BPServiceActor#reportReceivedDeletedBlocks, but the result of that is somewhat 
less serious, since in that case the worst case is just that we send a somewhat 
delayed IBR.
{quote}
This behavior looks odd, but it was probably by design. 
{{pendingReceivedRequests}} was not incremented for deleted requests, so that 
an IBR (incremental block report) containing only deleted blocks would not be 
sent before the timeout interval had elapsed. However, when we failed to send 
an IBR, we reinserted all pending entries into the queue and set 
{{pendingReceivedRequests}} to the count of all pending requests (deleted plus 
received), presumably to avoid waiting for another timeout interval before 
retrying.
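
The accounting described above can be sketched roughly as follows. This is a 
simplified model, not the actual {{BPServiceActor}} code: the class, the method 
names, the string-tagged entries, and the clamped subtraction on success are 
all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the pre-patch accounting.
class IbrAccountingSketch {
    private final List<String> pending = new ArrayList<>();
    private int pendingReceivedRequests = 0;

    /** Received blocks bump the counter, so the next IBR goes out promptly. */
    void notifyReceived(String blockId) {
        pending.add("received:" + blockId);
        pendingReceivedRequests++;
    }

    /** Deleted blocks are queued but do not bump the counter; they wait for the timeout. */
    void notifyDeleted(String blockId) {
        pending.add("deleted:" + blockId);
    }

    boolean shouldSendNow() { return pendingReceivedRequests > 0; }

    int pendingReceivedRequests() { return pendingReceivedRequests; }

    /**
     * Attempts to send the whole batch. On success the batch size, deleted
     * entries included, is subtracted (clamped at zero here as an assumption).
     * On failure everything is re-queued and the counter is set to the total,
     * so the retry does not wait for another timeout interval.
     */
    boolean report(boolean sendSucceeds) {
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        if (sendSucceeds) {
            pendingReceivedRequests = Math.max(0, pendingReceivedRequests - batch.size());
            return true;
        }
        pending.addAll(0, batch);
        pendingReceivedRequests = pending.size();
        return false;
    }

    public static void main(String[] args) {
        IbrAccountingSketch s = new IbrAccountingSketch();
        s.notifyDeleted("blk_1");
        System.out.println("send now after delete only? " + s.shouldSendNow());
        s.notifyReceived("blk_2");
        s.report(false);
        System.out.println("counter after failed send: " + s.pendingReceivedRequests());
    }
}
```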

> DN heartbeat thread can get stuck in tight loop
> -----------------------------------------------
>
>                 Key: HDFS-5922
>                 URL: https://issues.apache.org/jira/browse/HDFS-5922
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.3.0
>            Reporter: Aaron T. Myers
>            Assignee: Arpit Agarwal
>         Attachments: HDFS-5922.01.patch
>
>
> We saw an issue recently on a test cluster where one of the DN threads was 
> consuming 100% of a single CPU. Running jstack indicated that it was the DN 
> heartbeat thread. I believe I've tracked down the cause to a bug in the 
> accounting around the value of {{pendingReceivedRequests}}.
> More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
