[ 
https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676356#action_12676356
 ] 

dhruba borthakur commented on HADOOP-4584:
------------------------------------------

I think it is imperative that missing blocks be detected by the system more 
aggressively than what is proposed by this JIRA. This JIRA assumes that if a 
block file disappears from the disk, it will be handled by the periodic 
block scanner, but it might be a couple of weeks before the detection occurs. 
This could reduce the reliability of HDFS, couldn't it? Bit rot does not happen 
that often, but a rogue program deleting lots of block files can occur, 
especially when arbitrary user-written map-reduce code can execute on cluster 
nodes.

One option would be to do a brute-force block report (one that scans the 
entire disk) once every day or so. The hourly block reports may skip scanning 
the disk. This might alleviate the problem to some extent, while still 
detecting missing blocks much earlier than what is proposed by the JIRA.
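
To make the option above concrete, here is a minimal sketch of the scheduling 
decision. It is not taken from the patch: the class BlockReportPolicy, the 
ReportKind enum, and the interval parameters are hypothetical names chosen 
for illustration, and the one-hour and one-day values are only the rough 
figures mentioned above.

```java
/**
 * Hypothetical policy object (not part of Hadoop): decides, each time it is
 * polled, whether a block report is due and whether that report should be
 * built from a full disk scan or from the in-memory block map.
 */
final class BlockReportPolicy {
    enum ReportKind { NONE, IN_MEMORY, FULL_DISK_SCAN }

    private final long reportIntervalMs;   // e.g. one hour
    private final long fullScanIntervalMs; // e.g. one day
    private long lastReportMs;
    private long lastFullScanMs;

    BlockReportPolicy(long reportIntervalMs, long fullScanIntervalMs, long nowMs) {
        this.reportIntervalMs = reportIntervalMs;
        this.fullScanIntervalMs = fullScanIntervalMs;
        this.lastReportMs = nowMs;
        this.lastFullScanMs = nowMs;
    }

    /** Returns the kind of block report that is due at time nowMs. */
    ReportKind next(long nowMs) {
        if (nowMs - lastReportMs < reportIntervalMs) {
            return ReportKind.NONE; // no report is due yet
        }
        lastReportMs = nowMs;
        if (nowMs - lastFullScanMs >= fullScanIntervalMs) {
            lastFullScanMs = nowMs;
            return ReportKind.FULL_DISK_SCAN; // the expensive brute-force report
        }
        return ReportKind.IN_MEMORY; // the cheap report that skips the disk
    }
}
```

With intervals of one hour and one day, a caller polling this policy would 
get a cheap report roughly every hour and a full disk scan roughly once a 
day, which bounds the detection delay for a deleted block file to about a 
day instead of weeks.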

In many cases, a datanode has three or four disk devices where it stores data 
blocks. What happens if one of the four configured data directories goes bad? 
If the block scanner never gets to processing a block from that data 
directory, then all the blocks in that data directory might be inaccessible 
for a long time without being detected, wouldn't they?
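
As an illustration of the kind of cheap check that could catch this case 
without waiting for the block scanner, here is a minimal sketch. It is an 
assumption of mine, not what the patch or the datanode does: DataDirChecker 
and findFailedDirs are hypothetical names, and the check only uses the 
standard java.io.File probes.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Hadoop): probes each data directory. */
final class DataDirChecker {
    /**
     * Returns the configured data directories that look unusable: missing,
     * not a directory, or not readable and writable by this process.
     */
    static List<File> findFailedDirs(List<File> dataDirs) {
        List<File> failed = new ArrayList<>();
        for (File dir : dataDirs) {
            if (!dir.isDirectory() || !dir.canRead() || !dir.canWrite()) {
                failed.add(dir);
            }
        }
        return failed;
    }
}
```

A probe like this only notices a directory that has disappeared or lost its 
permissions. It would not notice individual block files deleted from a 
directory that is otherwise healthy, so it complements a full disk scan 
rather than replacing it.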



> Slow generation of blockReport at DataNode causes delay of sending heartbeat 
> to NameNode
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4584
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Hairong Kuang
>            Assignee: Suresh Srinivas
>             Fix For: 0.20.0
>
>         Attachments: 4584.patch, 4584.patch, 4584.patch, 4584.patch, 
> 4584.patch, 4584.patch
>
>
> Sometimes, due to disk or other problems, a datanode takes minutes or tens 
> of minutes to generate a block report. This prevents the datanode from 
> sending a heartbeat to the NameNode every 3 seconds. In the worst case, the 
> NameNode detects a lost heartbeat and wrongly decides that the datanode is 
> dead.
> It would be nice to have two threads instead. One thread scans the data 
> directories, generates the block report, and executes the requests sent by 
> the NameNode; another thread sends heartbeats and block reports and picks 
> up the requests from the NameNode. With these two threads, the sending of 
> heartbeats will not be delayed by a slow block report or slow execution of 
> NameNode requests.
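
The two-thread arrangement described in the issue can be sketched as 
follows. This is only an illustration under simplifying assumptions, not the 
attached patch: HeartbeatDemo and heartbeatsDuringSlowReport are hypothetical 
names, the heartbeat is reduced to incrementing a counter, and the slow block 
report is simulated with a sleep.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo (not part of Hadoop) of heartbeats on their own thread. */
final class HeartbeatDemo {
    /**
     * Runs a simulated slow block report on one thread while a second thread
     * keeps sending heartbeats, and returns how many heartbeats were sent.
     */
    static int heartbeatsDuringSlowReport(long heartbeatMs, long reportMs) {
        AtomicInteger heartbeats = new AtomicInteger();

        // Thread 1: sends a heartbeat at a fixed rate, independent of reports.
        ScheduledExecutorService heartbeatThread =
                Executors.newSingleThreadScheduledExecutor();
        heartbeatThread.scheduleAtFixedRate(
                heartbeats::incrementAndGet, 0, heartbeatMs, TimeUnit.MILLISECONDS);

        // Thread 2: stands in for the slow directory scan and report generation.
        Thread reportThread = new Thread(() -> {
            try {
                Thread.sleep(reportMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "block-report");
        reportThread.start();

        try {
            reportThread.join(); // wait until the slow report has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            heartbeatThread.shutdownNow();
        }
        return heartbeats.get();
    }
}
```

Calling heartbeatsDuringSlowReport(10, 300) should return well above one, 
since the heartbeat thread keeps firing while the report thread is busy. In 
a single-threaded loop the same slow report would have blocked every 
heartbeat for its whole duration.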

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
