[jira] [Commented] (HDFS-6626) Node is marked decommissioned if it becomes dead when it is being decommissioned

2014-07-21 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069811#comment-14069811
 ] 

Ming Ma commented on HDFS-6626:
---

Thanks, Andrew. I discussed this further with our admins; they want to identify 
bad nodes quickly while a decommission is in progress. I agree that such a new 
state doesn't help much, given that the dead nodes UI already provides this 
information.

 Node is marked decommissioned if it becomes dead when it is being 
 decommissioned
 

 Key: HDFS-6626
 URL: https://issues.apache.org/jira/browse/HDFS-6626
 Project: Hadoop HDFS
 Issue Type: Bug
 Reporter: Ming Ma

 Not sure if this is by design, but it isn't intuitive. The scenario: you start 
 decommissioning a node; while the decommission is in progress, the node 
 becomes dead from the NN's point of view; right after that, the NN marks the 
 node decommissioned. On the web UI, administrators will conclude that the 
 decommission completed successfully. This happens because decommission is 
 considered done once no blocks are left for the DN.
 {noformat}
 BlockManager.java
   boolean isReplicationInProgress(DatanodeDescriptor srcNode) {
     boolean status = false;
     ...
     final Iterator<? extends Block> it = srcNode.getBlockIterator();
     while (it.hasNext()) {
       ...
       // set status if there is a block under replication
     }
     ...
     return status;
   }
 {noformat}
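To make the effect concrete, here is a minimal, self-contained sketch of the check described above. It is not the real BlockManager code: `DecommissionSketch`, `BlockInfo`, and `looksDecommissioned` are hypothetical names, and it assumes that the NN clears a node's block list once the node is declared dead. Under that assumption the loop body never runs, `status` stays false, and the node looks fully decommissioned.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the NN-side check; not the real BlockManager.
final class DecommissionSketch {

    // Hypothetical block record: replicas on other live nodes vs. the target count.
    static final class BlockInfo {
        final int liveReplicas;
        final int expectedReplicas;

        BlockInfo(int liveReplicas, int expectedReplicas) {
            this.liveReplicas = liveReplicas;
            this.expectedReplicas = expectedReplicas;
        }
    }

    // Same shape as the snippet above: true if any block still associated
    // with the node is short of replicas elsewhere.
    static boolean isReplicationInProgress(List<? extends BlockInfo> blocksOnNode) {
        boolean status = false;
        final Iterator<? extends BlockInfo> it = blocksOnNode.iterator();
        while (it.hasNext()) {
            BlockInfo block = it.next();
            if (block.liveReplicas < block.expectedReplicas) {
                status = true;
            }
        }
        return status;
    }

    // Decommission is treated as done when nothing is left to replicate.
    static boolean looksDecommissioned(List<? extends BlockInfo> blocksOnNode) {
        return !isReplicationInProgress(blocksOnNode);
    }

    public static void main(String[] args) {
        List<BlockInfo> blocksOnNode = new ArrayList<>();
        blocksOnNode.add(new BlockInfo(1, 3)); // still under-replicated elsewhere
        System.out.println(looksDecommissioned(blocksOnNode)); // false

        // Assumption: the node dies and the NN drops its block associations.
        blocksOnNode.clear();
        System.out.println(looksDecommissioned(blocksOnNode)); // true
    }
}
```

In this sketch the empty-list case cannot be told apart from a decommission that really finished, which is the ambiguity the issue describes.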
 The question is whether we should mark the dead node as decommission 
 completed (the current behavior) or as decommission aborted. From the 
 administrators' point of view, while they are running a decommission they 
 want to know its status and the health of the decomm-in-progress nodes. If 
 they can detect a decommission failure earlier, they may be able to act 
 earlier; for example, if the TOR switch has issues during decommission, 
 administrators can quickly spot a group of decommission-aborted nodes on 
 the same rack. People can still get this information by joining the decomm 
 node list with the recent dead node list on the web UI; it is just not as 
 convenient.
 Suggestions?
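
The manual join mentioned above could be scripted along the following lines. This is only a sketch under stated assumptions: the two host lists and the host-to-rack map are taken as already exported from the web UI or topology configuration, and `DecommDeadJoin`, `deadWhileDecommissioning`, and the `/unknown-rack` fallback are hypothetical names, not HDFS APIs.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper; the inputs are assumed to be exported by the admin.
final class DecommDeadJoin {

    // Returns rack -> hosts that appear in both the decommission list and the
    // dead node list, so a rack-wide failure (e.g. a TOR switch) stands out.
    static Map<String, Set<String>> deadWhileDecommissioning(
            Set<String> decommHosts,
            Set<String> deadHosts,
            Map<String, String> rackOfHost) {
        // The join itself: intersection of the two host sets.
        Set<String> both = new TreeSet<>(decommHosts);
        both.retainAll(deadHosts);

        // Group the intersection by rack for easier reading.
        Map<String, Set<String>> byRack = new TreeMap<>();
        for (String host : both) {
            String rack = rackOfHost.getOrDefault(host, "/unknown-rack");
            byRack.computeIfAbsent(rack, r -> new TreeSet<>()).add(host);
        }
        return byRack;
    }
}
```

Several hosts listed under one rack in the result would point at a shared cause rather than at individual node failures.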



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6626) Node is marked decommissioned if it becomes dead when it is being decommissioned

2014-07-16 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064408#comment-14064408
 ] 

Andrew Wang commented on HDFS-6626:
---

Hi Ming,

I think the main goal of decommissioning is to shift blocks off of a DN, which 
is done at low priority to avoid disrupting the cluster. However, if a DN dies 
while decommissioning, HDFS is forced to immediately re-replicate all of its 
blocks at a high priority. Thus, the end result of a successful decommission 
and of an aborted decommission, as you term it, is the same: no blocks on that DN.

What additional actions would the admin be able to take if we also had a 
decommission aborted state? If you're interested in process / host health, 
that's typically handled by dedicated monitoring tools like CM or Ganglia.
