[
https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669719#action_12669719
]
Raghu Angadi commented on HADOOP-4584:
--------------------------------------
I think this approach might work OK for now. It makes sure the datanode is not
marked dead, but it should be considered mostly a workaround. We should
note that the fundamental problem still remains (if a little less lethal), e.g.
a) new blocks are not reported, b) no new blocks can be written during this
time, c) (not sure) no blocks can be read? etc.
If all the nodes are taking very long to process the block report, many
operations on HDFS will fail. An admin can increase the block report period to
reduce the effect of this problem. The current fix works fine for occasional
delays.
> In step 4. should we wait for receiving a command or for receiving another
> block?
Both would be better.
> In OfferService we process all the commands that are in the queue at once.
> Do you see any issues with it?
Not fundamentally different. One main issue would be that there might
sometimes be thousands of blocks to delete, but that is the same problem as a
long block report.
Regarding a more complete fix, I could file another jira to propose a fix that
I discussed with Sameer and Hairong, one that satisfies all the requirements on
the current block report.
> Slow generation of blockReport at DataNode causes delay of sending heartbeat
> to NameNode
> ----------------------------------------------------------------------------------------
>
> Key: HADOOP-4584
> URL: https://issues.apache.org/jira/browse/HADOOP-4584
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Reporter: Hairong Kuang
> Assignee: Suresh Srinivas
> Fix For: 0.20.0
>
> Attachments: 4584.patch
>
>
> Sometimes, due to disk or other problems, a datanode takes minutes or tens
> of minutes to generate a block report. This leaves the datanode unable to
> send a heartbeat to the NameNode every 3 seconds. In the worst case, the
> NameNode detects a lost heartbeat and wrongly decides that the datanode is
> dead.
> It would be nice to have two threads instead. One thread scans the data
> directories, generates the block report, and executes the requests sent by
> the NameNode; another thread sends heartbeats and block reports and
> picks up the requests from the NameNode. With these two threads, the
> sending of heartbeats will not be delayed by a slow block report or slow
> execution of NameNode requests.
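
For illustration only, the two-thread split proposed in the issue description
could be sketched roughly as below. This is not the attached patch and not
DataNode code: the class name, fields, and methods are all hypothetical, the
slow directory scan is simulated with a sleep, and the RPCs to the NameNode are
replaced by counters. Execution of NameNode requests on the worker thread is
left out to keep the sketch short.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: none of these names come from the Hadoop code base.
class TwoThreadDataNodeSketch {
    final AtomicInteger heartbeatsSent = new AtomicInteger();
    private final AtomicReference<String> pendingReport = new AtomicReference<>();
    private final CountDownLatch reportSent = new CountDownLatch(1);
    private final long heartbeatIntervalMs;
    private final Thread scanner;
    // Daemon thread so a forgotten shutdown() cannot keep the JVM alive.
    private final ScheduledExecutorService sender =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "heartbeat-sender");
            t.setDaemon(true);
            return t;
        });

    TwoThreadDataNodeSketch(long heartbeatIntervalMs, long scanDurationMs) {
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        // Worker thread: the slow part (scanning the data directories).
        this.scanner = new Thread(() -> {
            try {
                Thread.sleep(scanDurationMs);       // stand-in for a slow disk scan
                pendingReport.set("block report");  // hand the result to the sender
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "block-scanner");
        this.scanner.setDaemon(true);
    }

    void start() {
        // Sender thread: heartbeat on a fixed period, plus any finished report.
        sender.scheduleAtFixedRate(() -> {
            heartbeatsSent.incrementAndGet();       // stand-in for the heartbeat RPC
            if (pendingReport.getAndSet(null) != null) {
                reportSent.countDown();             // stand-in for the block report RPC
            }
        }, 0, heartbeatIntervalMs, TimeUnit.MILLISECONDS);
        scanner.start();
    }

    // Blocks until the scan finishes; returns the heartbeats sent by then.
    int heartbeatsDuringScan() {
        try {
            scanner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return heartbeatsSent.get();
    }

    // Waits for the sender thread to pick up and "send" the finished report.
    boolean awaitReportSent(long timeoutMs) {
        try {
            return reportSent.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    void shutdown() {
        sender.shutdownNow();
        scanner.interrupt();
    }
}
```

With, say, a 10 ms heartbeat interval and a 300 ms simulated scan, the
heartbeat counter keeps advancing while the scan is still running, which is the
property the description asks for. The hand-off through a single shared slot is
just one possible choice; a queue would work as well.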