[ 
https://issues.apache.org/jira/browse/HDFS-11609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955393#comment-15955393
 ] 

Wei-Chiu Chuang commented on HDFS-11609:
----------------------------------------

Hello [~kihwal], thanks for posting the patch and for the very nice analysis. I 
am still reviewing the patch. Could you update the comments in the code as well?
For example,
{code}
       // never use already decommissioned nodes, maintenance node not
       // suitable for read or unknown state replicas.
-      if (state == null || state == StoredReplicaState.DECOMMISSIONED
-          || state == StoredReplicaState.MAINTENANCE_NOT_FOR_READ) {
+      if (state == null ||
+          state == StoredReplicaState.MAINTENANCE_NOT_FOR_READ) {
{code}
I think "already decommissioned nodes," should be removed from the comment.
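
To illustrate, here is a minimal standalone sketch of the condition as it reads after the patch, with the comment reworded to match. The class ReplicaSourceFilter, the method isSkipped, and the trimmed-down enum are hypothetical names used only for this example; this is not the BlockManager code itself, and the exact comment wording is just a suggestion.

```java
// Hypothetical, self-contained sketch; not the actual BlockManager code.
class ReplicaSourceFilter {
    // Simplified stand-in for the replica states referenced in the patch.
    enum StoredReplicaState {
        LIVE, DECOMMISSIONING, DECOMMISSIONED, MAINTENANCE_NOT_FOR_READ
    }

    // Mirrors only the condition shown in the diff above.
    static boolean isSkipped(StoredReplicaState state) {
        // never use maintenance nodes not suitable for read or
        // unknown state replicas.
        return state == null
            || state == StoredReplicaState.MAINTENANCE_NOT_FOR_READ;
    }

    public static void main(String[] args) {
        // A decommissioned replica is no longer rejected by this check.
        System.out.println(isSkipped(StoredReplicaState.DECOMMISSIONED)); // false
        // Unknown-state replicas are still rejected.
        System.out.println(isSkipped(null)); // true
    }
}
```

With the DECOMMISSIONED check gone, the old wording of the comment would describe behavior the code no longer has.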

It would also be nice if you could add comments next to the code added in 
BlockManager explaining what it is meant to do.

Thanks!

> Some blocks can be permanently lost if nodes are decommissioned while dead
> --------------------------------------------------------------------------
>
>                 Key: HDFS-11609
>                 URL: https://issues.apache.org/jira/browse/HDFS-11609
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>         Attachments: HDFS-11609.branch-2.patch, HDFS-11609.trunk.patch
>
>
> When all the nodes containing a replica of a block are decommissioned while 
> they are dead, they get decommissioned right away even if there are missing 
> blocks. This behavior was introduced by HDFS-7374.
> The problem starts when those decommissioned nodes are brought back online. 
> The namenode no longer shows missing blocks, which creates a false sense of 
> cluster health. When the decommissioned nodes are removed and reformatted, 
> the block data is permanently lost. The namenode will report missing blocks 
> after the heartbeat recheck interval (e.g. 10 minutes) from the moment the 
> last node is taken down.
> There are multiple issues in the code. As some cause different behaviors in 
> testing vs. production, it took a while to reproduce the problem in a unit 
> test. I will present an analysis and a proposal soon.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

