[jira] Commented: (HADOOP-4584) Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode

Konstantin Shvachko (JIRA) Fri, 27 Mar 2009 14:39:13 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12690118#action_12690118
 ]


Konstantin Shvachko commented on HADOOP-4584:
---------------------------------------------

I am commenting on the design document.
The description of the algorithm can be simplified. As I understand it, you 
generate two reports, memory_report and disk_report, then compare them to 
produce a (diff) list of suspicious blocks. They are only suspicious because 
they differed at the time the reports were generated, which may no longer be 
true. Then you reconcile each suspicious block under a lock, so that the block 
state cannot be modified while it is being reconciled.
To simplify the algorithm you can completely drop the conditions that reflect 
the state of the block at the time it was chosen as suspicious. That past state 
is irrelevant, because you still need to verify the block's present state and 
act on that rather than on the past.
I see the code in fact does exactly that.
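
A minimal sketch of the two phases as I read them, not the code from the patch. 
The class, the maps of block id to generation stamp, and the returned action 
strings are all hypothetical stand-ins for the real in-memory and on-disk state.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; this is not the DataNode's DirectoryScanner. */
class ReconcileSketch {
    // Block id -> generation stamp. Stand-ins for live memory and disk state.
    final Map<Long, Long> memory = new HashMap<>();
    final Map<Long, Long> disk = new HashMap<>();

    /** Phase 1, no lock held: ids whose entries differ between two snapshots. */
    static Set<Long> suspicious(Map<Long, Long> memReport,
                                Map<Long, Long> diskReport) {
        Set<Long> ids = new HashSet<>(memReport.keySet());
        ids.addAll(diskReport.keySet());
        Set<Long> diff = new TreeSet<>();
        for (Long id : ids) {
            Long m = memReport.get(id);
            Long d = diskReport.get(id);
            if (m == null || !m.equals(d)) {
                diff.add(id);
            }
        }
        return diff;
    }

    /**
     * Phase 2, under the lock: look only at the present state. Nothing about
     * why the block was flagged in phase 1 is passed in or consulted.
     */
    synchronized String reconcile(long id) {
        Long m = memory.get(id);
        Long d = disk.get(id);
        if (m == null && d == null) {
            return "nothing to do";
        }
        if (d == null) {
            memory.remove(id);
            return "removed from memory";
        }
        if (m == null || !m.equals(d)) {
            memory.put(id, d);
            return "memory updated from disk";
        }
        return "already consistent";
    }
}
```

A block flagged in phase 1 may turn out to be consistent by the time 
reconcile() runs; in that case the call simply does nothing.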

Other comments:
- I don't think the directory scan interval in hdfs-default.xml should be 
specified in hours. That is too coarse. At least for testing you should be able 
to run the directory scanner more often.
- {{DirectoryScanner()}} constructor and {{reconcile()}} should not be public. 
Please check other methods that do not need to be public.
- It is better to give a hint in the override annotation which base class is 
overridden, e.g. {{@Override // Object}}
- {{FSDataset.checkAndUpdate()}}: you can make it much more readable by 
returning from inside the if statements. This lets you drop a lot of else 
clauses and linearizes the code, making the logic clearer.
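
To illustrate the last two points, here is a small sketch. It is not 
{{FSDataset.checkAndUpdate()}} itself; the class, the boolean parameters and the 
action strings are hypothetical, chosen only to show the same decision written 
both ways, plus the annotation hint.

```java
/** Hypothetical example; names are illustrative, not taken from FSDataset. */
class EarlyReturnSketch {
    // Nested form: every case adds an else branch and a level of indentation.
    static String nested(boolean inMemory, boolean onDisk, boolean sameLength) {
        String action;
        if (!onDisk) {
            if (inMemory) {
                action = "drop memory entry";
            } else {
                action = "none";
            }
        } else {
            if (!inMemory) {
                action = "add memory entry";
            } else {
                if (!sameLength) {
                    action = "fix length";
                } else {
                    action = "none";
                }
            }
        }
        return action;
    }

    // Linear form: each case returns as soon as it is decided, so no else.
    static String linear(boolean inMemory, boolean onDisk, boolean sameLength) {
        if (!onDisk) {
            return inMemory ? "drop memory entry" : "none";
        }
        if (!inMemory) {
            return "add memory entry";
        }
        if (!sameLength) {
            return "fix length";
        }
        return "none";
    }

    @Override // Object
    public String toString() {
        return "EarlyReturnSketch";
    }
}
```

Both methods return the same action for every combination of inputs; only the 
second reads top to bottom.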

> Slow generation of blockReport at DataNode causes delay of sending heartbeat 
> to NameNode
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4584
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Hairong Kuang
>            Assignee: Suresh Srinivas
>             Fix For: 0.20.0
>
>         Attachments: 4584.brthread.2.patch, 4584.brthread.3.patch, 
> 4584.brthread.3.patch, 4584.brthread.3.patch, 4584.brthread.3.patch, 
> 4584.brthread.3.patch, 4584.hbthread.patch, 4584.patch, 4584.patch, 
> 4584.patch, 4584.patch, 4584.patch, 4584.patch, Design.pdf
>
>
> Sometimes, due to disk or other problems, a datanode takes minutes or tens 
> of minutes to generate a block report. This prevents the datanode from 
> sending a heartbeat to the NameNode every 3 seconds. In the worst case, the 
> NameNode detects a lost heartbeat and wrongly decides that the datanode is 
> dead.
> It would be nice to have two threads instead. One thread scans the data 
> directories, generates the block report, and executes the requests sent by 
> the NameNode. The other thread sends heartbeats and block reports and picks 
> up the requests from the NameNode. With these two threads, the sending of 
> heartbeats will not be delayed by a slow block report or by slow execution 
> of NameNode requests.
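
A minimal sketch of the two-thread idea in the description above, not the 
DataNode code. The class and method names are hypothetical, the slow block 
report is simulated by a thread that blocks on a latch, and the 10 ms period 
stands in for the 3 second heartbeat interval.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; this is not how DataNode is actually structured. */
class TwoThreadSketch {
    /**
     * Returns true if `beats` heartbeats were sent while the simulated
     * block report was still running on its own thread.
     */
    static boolean heartbeatsSurviveSlowReport(int beats) {
        CountDownLatch heartbeats = new CountDownLatch(beats);
        CountDownLatch reportMayFinish = new CountDownLatch(1);

        // Worker thread: stands in for a block report that takes very long.
        Thread reportThread = new Thread(() -> {
            try {
                reportMayFinish.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reportThread.start();

        // Heartbeat thread: fires on a fixed period, independent of the worker.
        ScheduledExecutorService heartbeatThread =
            Executors.newSingleThreadScheduledExecutor();
        heartbeatThread.scheduleAtFixedRate(
            heartbeats::countDown, 0, 10, TimeUnit.MILLISECONDS);

        boolean ok = false;
        try {
            ok = heartbeats.await(5, TimeUnit.SECONDS) && reportThread.isAlive();
            reportMayFinish.countDown();  // let the slow report complete
            reportThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            reportMayFinish.countDown();
            heartbeatThread.shutdownNow();
        }
        return ok;
    }
}
```

Because the heartbeat runs on its own scheduler thread, it keeps firing for as 
long as the report thread is blocked.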

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
