[ https://issues.apache.org/jira/browse/HDFS-17090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoqiao He updated HDFS-17090:
-------------------------------
    Description: 
I recently hit a corner case where decommissioning DataNodes impacts NameNode 
performance. After digging into it carefully, I have reproduced the case:
a. Add some DataNodes to the exclude file and prepare to decommission them.
b. Execute bin/hdfs dfsadmin -refreshNodes (this step is optional).
c. Restart the NameNode for an upgrade or another reason before the 
decommission completes.
d. All DataNodes are triggered to register and send full block reports (FBR).
e. The load on the NameNode becomes very high; in particular, the 8040 
CallQueue (the service RPC queue used by DataNodes to talk to the NameNode) 
stays full for a long time because of the flood of register/heartbeat/FBR 
RPCs from DataNodes.
f. A node whose decommission is in progress will not finish decommissioning 
until the next FBR, even though all of its replicas have already been 
processed. The observed request order is register-heartbeat-(blockreport, 
register); the second register is probably a retried RPC from the DataNode 
(there is no further DataNode log to confirm this), and for (blockreport, 
register) the NameNode can process one storage, then the register, then the 
remaining storages in order (FBR is now processed asynchronously).
g. Because of the second register RPC, the affected DataNode is marked 
unhealthy by BlockManager#isNodeHealthyForDecommissionOrMaintenance, so its 
decommission stays stuck until the next FBR. The NameNode therefore has to 
rescan this DataNode in every monitor round to check whether it can complete, 
which holds the global write lock and hurts NameNode performance (a 
simplified sketch of this interaction follows this list).
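
To make steps f and g concrete, here is a minimal, self-contained Java model 
of the interaction. It is an illustration only, not the Hadoop source: the 
names (ToyDatanode, isHealthyForDecommission, and so on) are invented for 
this sketch, and it only assumes that a (re-)register clears the per-storage 
"block report received" state that the decommission health check depends on.

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model (not Hadoop source) of steps f and g: a repeated register resets
 * per-storage block-report state, so the decommission health check keeps
 * failing until the next full block report (FBR).
 */
public class DecommissionStuckDemo {

  /** Minimal stand-in for a DataNode descriptor with per-storage FBR flags. */
  static class ToyDatanode {
    final Map<String, Boolean> storageReported = new HashMap<>();

    void register(String... storageIds) {
      // A (re-)register clears the "block report received" flag of every storage.
      for (String s : storageIds) {
        storageReported.put(s, false);
      }
    }

    void processFullBlockReport(String storageId) {
      storageReported.put(storageId, true);
    }

    /**
     * In the spirit of BlockManager#isNodeHealthyForDecommissionOrMaintenance:
     * the node only counts as healthy once every storage has reported.
     */
    boolean isHealthyForDecommission() {
      return storageReported.values().stream().allMatch(Boolean::booleanValue);
    }
  }

  public static void main(String[] args) {
    ToyDatanode dn = new ToyDatanode();
    dn.register("s1", "s2");          // register after the NameNode restart

    dn.processFullBlockReport("s1");  // first storage of the FBR is processed ...
    dn.register("s1", "s2");          // ... then a retried register RPC ...
    dn.processFullBlockReport("s2");  // ... then the remaining storage of the same FBR.

    // s1 was reset by the second register, so the node looks unhealthy and the
    // decommission monitor keeps rescanning it (under the global write lock)
    // until the next FBR arrives. Prints "false".
    System.out.println("healthy for decommission? " + dn.isHealthyForDecommission());
  }
}
{code}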

To improve this, I think we could filter repeated register RPC requests 
during the startup progress. I have not yet thought through whether filtering 
registers directly introduces other risks, so any further discussion is 
welcome (a rough sketch of the idea follows).
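
A rough sketch of that direction, for discussion only: none of the names 
below are existing Hadoop APIs (the real change would sit somewhere in the 
DataNode registration path), and it simply assumes the NameNode can tell 
whether it is still in startup progress and can compare an incoming 
registration with the last one accepted from the same DataNode.

{code:java}
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative-only sketch of the proposed filtering of repeated register
 * RPCs during NameNode startup. None of these names are existing Hadoop APIs.
 */
public class RegisterFilterSketch {

  // Fingerprint of the last accepted registration, keyed by DataNode UUID.
  private final Map<String, String> lastRegistration = new ConcurrentHashMap<>();

  private final boolean inStartupProgress; // assumption: startup state is queryable

  public RegisterFilterSketch(boolean inStartupProgress) {
    this.inStartupProgress = inStartupProgress;
  }

  /**
   * Decide whether an incoming register should be fully processed. Returns
   * false for a repeated, identical register seen during startup, so the
   * node's block-report state is not reset a second time and an in-progress
   * decommission is not pushed back to the next FBR.
   */
  public boolean shouldProcessRegister(String datanodeUuid, String registrationFingerprint) {
    String previous = lastRegistration.put(datanodeUuid, registrationFingerprint);
    if (!inStartupProgress || previous == null) {
      return true; // first register, or normal operation: process as usual
    }
    return !Objects.equals(previous, registrationFingerprint);
  }
}
{code}

As noted above, skipping a register outright may have side effects (for 
example, a register that looks identical but actually follows a real DataNode 
restart), so any filter like this would have to be careful to skip only 
registers that are genuine retries.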

  was:
I recently hit a corner case where decommissioning DataNodes impacts NameNode 
performance. After digging into it carefully, I have reproduced the case:
a. Add some DataNodes to the exclude file and prepare to decommission them.
b. Execute bin/hdfs dfsadmin -refreshNodes (this step is optional).
c. Restart the NameNode for an upgrade or another reason before the 
decommission completes.
d. All DataNodes are triggered to register and send full block reports (FBR).
e. The load on the NameNode becomes very high; in particular, the 8040 
CallQueue stays full for a long time because of the flood of 
register/heartbeat/FBR RPCs from DataNodes.
f. A node whose decommission is in progress will not finish decommissioning 
until the next FBR, even though all of its replicas have already been 
processed. The observed request order is register-heartbeat-(blockreport, 
register); the second register is probably a retried RPC from the DataNode 
(there is no further DataNode log to confirm this), and for (blockreport, 
register) the NameNode can process one storage, then the register, then the 
remaining storages in order.
g. Because of the second register RPC, the affected DataNode is marked 
unhealthy by BlockManager#isNodeHealthyForDecommissionOrMaintenance, so its 
decommission stays stuck until the next FBR. The NameNode therefore has to 
rescan this DataNode in every monitor round to check whether it can complete, 
which holds the global write lock and hurts NameNode performance.

To improve this, I think we could filter repeated register RPC requests 
during the startup progress. I have not yet thought through whether filtering 
registers directly introduces other risks, so any further discussion is 
welcome.


> Decommission will be stuck for a long time after restart because of 
> overlapped processing of Register and BlockReport.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17090
>                 URL: https://issues.apache.org/jira/browse/HDFS-17090
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>
> I recently hit a corner case where decommissioning DataNodes impacts 
> NameNode performance. After digging into it carefully, I have reproduced 
> the case:
> a. Add some DataNodes to the exclude file and prepare to decommission them.
> b. Execute bin/hdfs dfsadmin -refreshNodes (this step is optional).
> c. Restart the NameNode for an upgrade or another reason before the 
> decommission completes.
> d. All DataNodes are triggered to register and send full block reports (FBR).
> e. The load on the NameNode becomes very high; in particular, the 8040 
> CallQueue (the service RPC queue used by DataNodes to talk to the NameNode) 
> stays full for a long time because of the flood of register/heartbeat/FBR 
> RPCs from DataNodes.
> f. A node whose decommission is in progress will not finish decommissioning 
> until the next FBR, even though all of its replicas have already been 
> processed. The observed request order is register-heartbeat-(blockreport, 
> register); the second register is probably a retried RPC from the DataNode 
> (there is no further DataNode log to confirm this), and for (blockreport, 
> register) the NameNode can process one storage, then the register, then the 
> remaining storages in order (FBR is now processed asynchronously).
> g. Because of the second register RPC, the affected DataNode is marked 
> unhealthy by BlockManager#isNodeHealthyForDecommissionOrMaintenance, so its 
> decommission stays stuck until the next FBR. The NameNode therefore has to 
> rescan this DataNode in every monitor round to check whether it can 
> complete, which holds the global write lock and hurts NameNode performance.
> To improve this, I think we could filter repeated register RPC requests 
> during the startup progress. I have not yet thought through whether 
> filtering registers directly introduces other risks, so any further 
> discussion is welcome.



