[
https://issues.apache.org/jira/browse/HDFS-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arpit Agarwal updated HDFS-5922:
--------------------------------
Attachment: HDFS-5922.01.patch
Hi Aaron, sorry about the delayed response. I was away. Here's a preliminary
patch to get Jenkins results.
The specific bug here could have been avoided by resetting the counter to zero
when emptying the queues. However, it seems unnecessary to maintain an exact
count of the pending requests when all we care about is whether there are any
pending requests at all. The patch replaces the counter with a boolean.
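To illustrate the idea only, here is a minimal sketch of counter-free
bookkeeping. It is not the attached patch: the class {{PendingIbrSketch}}, its
method names, and the flag name {{sendImmediateIbr}} are all hypothetical, and
blocks are modelled as plain strings.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical stand-in for the pending-IBR bookkeeping; not the real BPServiceActor. */
class PendingIbrSketch {
    private final Queue<String> pendingBlocks = new ArrayDeque<>();

    // A flag instead of an int counter: we only need to know whether
    // anything is waiting that justifies an immediate report.
    private boolean sendImmediateIbr = false;

    /** A received block should be reported without waiting for the timeout. */
    synchronized void notifyReceived(String blockId) {
        pendingBlocks.add(blockId);
        sendImmediateIbr = true;
    }

    /** A deleted block is queued but does not, by itself, force a report. */
    synchronized void notifyDeleted(String blockId) {
        pendingBlocks.add(blockId);
    }

    /** Emptying the queue also clears the flag, so no stale state survives. */
    synchronized void clearQueues() {
        pendingBlocks.clear();
        sendImmediateIbr = false;
    }

    synchronized boolean shouldSendImmediately() {
        return sendImmediateIbr;
    }

    /** Drains everything for one report and returns how many entries were taken. */
    synchronized int drainForReport() {
        int drained = pendingBlocks.size();
        pendingBlocks.clear();
        sendImmediateIbr = false;
        return drained;
    }
}
```

With a flag there is no arithmetic to get wrong: every path that empties the
queue just sets it to false, so it cannot be left at a stale non-zero value.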
{quote}
Andrew Wang also pointed out offline that it is perhaps incorrect to be
subtracting the number of deleted blocks from pendingReceivedRequests in
BPServiceActor#reportReceivedDeletedBlocks, but the result of that is somewhat
less serious, since in that case the worst case is just that we send a somewhat
delayed IBR.
{quote}
This behavior looks odd, but it was probably by design.
{{pendingReceivedRequests}} was not incremented for deleted requests, to avoid
sending an IBR for just deleted blocks before the timeout interval has elapsed.
However, when we failed to send an IBR, we reinserted all pending entries into
the queue and set {{pendingReceivedRequests}} to the count of all pending
requests (deleted + received), presumably to avoid waiting for another timeout
interval before retrying.
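The accounting described above can be modelled roughly as follows. This is a
simplified sketch, not the real {{BPServiceActor}} code: the class
{{OldCounterSketch}} and its methods are hypothetical, and taking a batch simply
zeroes the counter here rather than reproducing the original subtraction.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of the old counter-based accounting. */
class OldCounterSketch {
    private final List<String> pending = new ArrayList<>();
    private int pendingReceivedRequests = 0;

    /** Received blocks bump the counter, which triggers an early IBR. */
    synchronized void notifyReceived(String blockId) {
        pending.add("received:" + blockId);
        pendingReceivedRequests++;
    }

    /** Deleted blocks are queued without touching the counter. */
    synchronized void notifyDeleted(String blockId) {
        pending.add("deleted:" + blockId);
    }

    /** Takes everything queued so far for one report attempt. */
    synchronized List<String> takeForReport() {
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        pendingReceivedRequests = 0; // simplification, see the note above
        return batch;
    }

    /** On failure, put the batch back and count every entry, deleted included. */
    synchronized void onSendFailure(List<String> batch) {
        pending.addAll(0, batch);
        pendingReceivedRequests = pending.size();
    }

    synchronized int getPendingReceivedRequests() {
        return pendingReceivedRequests;
    }
}
```

In this model the counter means "received blocks" on the normal path and "all
pending entries" after a failed send, so a retry is attempted right away even
if only deleted blocks are queued.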
> DN heartbeat thread can get stuck in tight loop
> -----------------------------------------------
>
> Key: HDFS-5922
> URL: https://issues.apache.org/jira/browse/HDFS-5922
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.3.0
> Reporter: Aaron T. Myers
> Assignee: Arpit Agarwal
> Attachments: HDFS-5922.01.patch
>
>
> We saw an issue recently on a test cluster where one of the DN threads was
> consuming 100% of a single CPU. Running jstack indicated that it was the DN
> heartbeat thread. I believe I've tracked down the cause to a bug in the
> accounting around the value of {{pendingReceivedRequests}}.
> More details in the first comment.