[ 
https://issues.apache.org/jira/browse/HDFS-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069811#comment-14069811
 ] 

Ming Ma commented on HDFS-6626:
-------------------------------

Thanks, Andrew. I discussed this further with our admins; they want to identify 
bad nodes quickly in the context of decommission. I agree that such a new state 
doesn't help much, given that the dead nodes UI can provide this information.

> Node is marked decommissioned if it becomes dead when it is being 
> decommissioned
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-6626
>                 URL: https://issues.apache.org/jira/browse/HDFS-6626
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>
> Not sure if this is by design, but it isn't intuitive. The scenario is as 
> follows: you try to decommission a node; while the node is being 
> decommissioned, it becomes dead from the NN's point of view; right after 
> that, the NN marks the node as decommissioned. On the webUI, administrators 
> will conclude that the decommission has completed successfully. That is 
> because decommission is considered done when no blocks are left for the DN.
> {noformat}
> BlockManager.java
>   boolean isReplicationInProgress(DatanodeDescriptor srcNode) {
>     boolean status = false;
> ...
>     final Iterator<? extends Block> it = srcNode.getBlockIterator();
>     while(it.hasNext()) {
> ...
>       // set status if there is a block under replication
>     }
> ...
>     return status;
>   }
> {noformat}
> The question is whether we should mark the dead node as decommission 
> completed (the current behavior) or as "decommission aborted". From the 
> administrators' point of view, when they are decommissioning nodes, they 
> want to know the status of the decommission and the health of the 
> decommission-in-progress nodes. If they can detect a decommission failure 
> earlier, they may be able to act earlier; for example, if the ToR 
> (top-of-rack) switch has issues during decommission, administrators will be 
> able to quickly spot a group of "decommission aborted" nodes from the same 
> rack. People can still find this information by joining the decommission 
> node list with the recent dead node list on the webUI; it is just not as 
> convenient.
> Suggestions?
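
For illustration only, the manual join mentioned at the end of the description 
could be sketched roughly as below. This is not HDFS code: the class name, the 
method name and the example hostnames are all hypothetical, and the sketch 
assumes the two host lists have already been copied out of the webUI by some 
other means.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of HDFS. */
class DecommJoin {

    /**
     * Returns the hosts that appear in both lists, keeping the order of the
     * decommission list. Such hosts are candidates for nodes that died while
     * they were being decommissioned.
     */
    static Set<String> deadAndDecommissioned(List<String> decommissioned,
                                             List<String> recentlyDead) {
        // Start from the decommission list, then keep only the hosts that
        // are also present in the dead node list.
        Set<String> result = new LinkedHashSet<>(decommissioned);
        result.retainAll(recentlyDead);
        return result;
    }

    public static void main(String[] args) {
        // Made-up hostnames standing in for lists copied from the webUI.
        List<String> decommissioned = Arrays.asList(
                "dn1.example.com", "dn2.example.com", "dn3.example.com");
        List<String> recentlyDead = Arrays.asList(
                "dn3.example.com", "dn7.example.com", "dn1.example.com");

        System.out.println(deadAndDecommissioned(decommissioned, recentlyDead));
    }
}
```

A host in the result is only a candidate: the join alone cannot tell whether 
the node died before or after its decommission actually finished.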



--
This message was sent by Atlassian JIRA
(v6.2#6252)
