[ https://issues.apache.org/jira/browse/HDFS-14186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742293#comment-16742293 ]

Kihwal Lee commented on HDFS-14186:
-----------------------------------

[~hexiaoqiao], one thing to note is that the rpc processing time can be 
misleading in this case.  You mentioned that the block report processing time 
was low. If that was based on the rpc call processing time, the average may 
look low because the thrown-away calls are recorded with 0 ms service time. 
When a lot of handlers dequeue a call and find the caller is already gone 
(client timeout), those calls artificially pull down the average.  I have 
been misled by this many times.
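
As a toy illustration of how the thrown-away calls skew the metric (the 
numbers below are made up, not from any real cluster):

{code:java}
public class AvgSkew {
  public static void main(String[] args) {
    // 200 block reports that each took 2000 ms of real processing, plus
    // 800 calls the handlers dequeued after the client had already timed
    // out, which get recorded as 0 ms service time.
    long real = 200, realMs = 2000;
    long dropped = 800, droppedMs = 0;
    double avg = (real * realMs + dropped * droppedMs) / (double) (real + dropped);
    // Prints 400.0 -- the "average processing time" looks low even though
    // every real block report took 2 seconds.
    System.out.println("average rpc processing time = " + avg + " ms");
  }
}
{code}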

The bottom line is that the rpc queue grows too big and the NN cannot process 
the calls fast enough. Keeping datanodes from being marked dead during safe 
mode, extending safe mode, and so on are all workarounds for the slow 
processing.

Breaking up block reports into individual storages can be helpful in some 
cases. Each datanode will send a report for a single storage and block until 
the response is returned. The NN can process the smaller reports faster, and 
their queue time will be much shorter. That is, there is less chance of 
datanodes timing out on the NN and resending the big reports multiple times, 
which makes the rpc queue even more clogged. Back in the 2.7 days, we ended up 
configuring datanodes to break up block reports unconditionally, and that 
helped NN startup performance.
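
For reference, a sketch of the knob we used. If memory serves it is 
dfs.blockreport.split.threshold; please verify the exact semantics against 
the hdfs-default.xml of your version before relying on it:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class ForcePerStorageReports {
  public static void main(String[] args) {
    // Assumption: the datanode sends one block report RPC per storage once
    // its total block count reaches dfs.blockreport.split.threshold, so a
    // very small threshold effectively forces per-storage reports.
    Configuration conf = new Configuration();
    conf.setLong("dfs.blockreport.split.threshold", 1L);
    System.out.println(conf.get("dfs.blockreport.split.threshold"));
  }
}
{code}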

Independent of these mitigating strategies, we can pursue adding an option to 
enable what you suggested. However, waiting for 100% replication can be 
troublesome in many cases. The presence of under-replicated blocks is almost 
the norm, not an anomaly: clients run setrep often, decommissioning can 
happen, and there are occasional heartbeat expirations. Maybe we could make 
the NN do something like this:
 1) Check whether the usual safe block count is met. This is a precondition to 
leave safe mode without missing blocks.
 2) Optionally extend safe mode until further conditions are met. Here, we can 
have the NN check whether all storage reports have been received from all 
registered nodes (a rough sketch follows below). It would also be helpful if 
this information could be obtained by an admin command or through the web UI.
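
A rough sketch of the combined check (all names here are hypothetical and do 
not correspond to existing NameNode code):

{code:java}
import java.util.Map;

public class SafeModeExitCheck {

  // 1) the usual safe block count precondition
  static boolean safeBlockThresholdMet(long safeBlocks, long totalBlocks,
                                       double threshold) {
    return totalBlocks == 0 || (double) safeBlocks / totalBlocks >= threshold;
  }

  // 2) optional extension: every registered datanode has reported all of
  //    its storages
  static boolean allStorageReportsReceived(Map<String, Integer> expectedStorages,
                                           Map<String, Integer> reportedStorages) {
    for (Map.Entry<String, Integer> e : expectedStorages.entrySet()) {
      if (reportedStorages.getOrDefault(e.getKey(), 0) < e.getValue()) {
        return false;
      }
    }
    return true;
  }

  static boolean canLeaveSafeMode(long safeBlocks, long totalBlocks,
                                  double threshold,
                                  boolean waitForAllStorageReports,
                                  Map<String, Integer> expectedStorages,
                                  Map<String, Integer> reportedStorages) {
    if (!safeBlockThresholdMet(safeBlocks, totalBlocks, threshold)) {
      return false;
    }
    return !waitForAllStorageReports
        || allStorageReportsReceived(expectedStorages, reportedStorages);
  }
}
{code}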

> blockreport storm slow down namenode restart seriously in large cluster
> -----------------------------------------------------------------------
>
>                 Key: HDFS-14186
>                 URL: https://issues.apache.org/jira/browse/HDFS-14186
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: He Xiaoqiao
>            Assignee: He Xiaoqiao
>            Priority: Major
>         Attachments: HDFS-14186.001.patch
>
>
> In the current implementation, the datanode sends a block report immediately 
> after it successfully registers with the namenode on restart, and the 
> resulting block report storm puts the namenode under high load while 
> processing them. One result is that some received RPCs have to be skipped 
> because their queue time exceeds the timeout. If a datanode's heartbeat RPCs 
> are continually skipped for a long time (default 
> heartbeatExpireInterval=630s), the datanode is marked DEAD; it then has to 
> re-register and send its block report again, which aggravates the block 
> report storm and traps the cluster in a vicious circle, slowing down namenode 
> startup seriously (to more than an hour, or even longer), especially in a 
> large (several thousand datanodes) and busy cluster. Although there has been 
> a lot of work to optimize namenode startup, the issue still exists.
> I propose to postpone the dead datanode check until the namenode has finished 
> startup.
> Any comments and suggestions are welcome.


