[ https://issues.apache.org/jira/browse/HDFS-11609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951587#comment-15951587 ]

Kihwal Lee commented on HDFS-11609:
-----------------------------------

h3. Inability to correctly guess the previous replication priority
Guessing the previous replication priority level of a block works most of the 
time, but it is not perfect. Different orders of events can lead to the same 
current state while the previous priority levels differ. We can improve the 
priority update method so that the guessing logic still provides a benefit in 
the majority of cases, yet its correctness is not strictly necessary.

The following shows the problems I encountered and their solutions.

In {{UnderReplicatedBlocks}},
{code}
  private int getPriority(int curReplicas,
                          int readOnlyReplicas,
                          int decommissionedReplicas,
                          int expectedReplicas) {
    assert curReplicas >= 0 : "Negative replicas!";
{code}
This is called from {{update()}}, which calls it with {{curReplicas}} set to 
{{curReplicas-curReplicasDelta}}. When all replica-containing nodes are dead 
({{curReplicas}} is 0) and a decommissioned node holding a replica rejoins, 
{{update()}} ends up passing a {{curReplicas}} of -1, which sets off the 
assert. This causes initial block report processing to stop in the middle. 
The node is live and decommissioned, and the block will appear missing because 
its block report wasn't processed due to the assertion failure.
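
For reference, the relevant part of {{update()}} looks roughly like this 
(paraphrased sketch, not the exact source; parameter names follow the excerpt 
above and vary slightly by release):
{code}
// Paraphrased sketch of UnderReplicatedBlocks#update(); not the exact source.
synchronized void update(BlockInfo block, int curReplicas,
                         int readOnlyReplicas, int decommissionedReplicas,
                         int curExpectedReplicas,
                         int curReplicasDelta, int expectedReplicasDelta) {
  // The current priority comes from the new counts.
  int curPri = getPriority(curReplicas, readOnlyReplicas,
      decommissionedReplicas, curExpectedReplicas);
  // The previous priority is guessed by undoing the deltas. If a replica on a
  // decommissioned node was just counted with curReplicasDelta == 1 while
  // curReplicas is 0, this passes -1 and trips the assert.
  int oldPri = getPriority(curReplicas - curReplicasDelta, readOnlyReplicas,
      decommissionedReplicas, curExpectedReplicas - expectedReplicasDelta);
  // ... entries are then moved between priority queues based on oldPri/curPri.
}
{code}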

This can be avoided by not setting {{curReplicasDelta}} to 1 when the replica 
being added is on a decommissioned node. The value originates from 
{{BlockManager}}'s {{addStoredBlock()}}.
{code}
     if (result == AddBlockResult.ADDED) {
-      curReplicaDelta = 1;
+      curReplicaDelta = (node.isDecommissioned()) ? 0 : 1;
{code}
This fixes this particular issue.

The assert does not fire in a production runtime (Java asserts are disabled 
unless the JVM runs with {{-ea}}), so the behavior differs from testing. 
Without the above fix, instead of block report processing blowing up, the -1 
causes {{getPriority()}} to return {{QUEUE_VERY_UNDER_REPLICATED}}, which is 
incorrect.
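
To see why: with a {{curReplicas}} of -1, none of the earlier branches in 
{{getPriority()}} match and the fall-through test is trivially true. Roughly 
(paraphrased, with an expected replication of 3):
{code}
// Paraphrased decision chain in getPriority() for curReplicas == -1.
if (curReplicas >= expectedReplicas) {             // -1 >= 3 : false
  return QUEUE_REPLICAS_BADLY_DISTRIBUTED;
} else if (curReplicas == 0) {                     // -1 == 0 : false
  // decommissioned-only and corrupt handling lives here
} else if ((curReplicas * 3) < expectedReplicas) { // -3 < 3  : true
  return QUEUE_VERY_UNDER_REPLICATED;              // the incorrect result
}
{code}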

If the previous priority level is guessed incorrectly and the guess happens to 
be identical to the current level, the old entry won't be removed, resulting in 
duplicate entries. The {{remove()}} method is already robust: if a block is not 
found at the specified level, it tries to remove it from the other priority 
levels too. So we can simply call {{remove()}} unconditionally. Guessing the 
old priority is no longer functionally necessary with this change, but it is 
still useful, since the guess is normally correct and limits the removal to a 
single priority level in most cases.

{code}
-    if(oldPri != curPri) {
-      remove(block, oldPri);
-    }
+    // oldPri is mostly correct, but not always. If not found with oldPri,
+    // other levels will be searched until the block is found & removed.
+    remove(block, oldPri);
{code}
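
For reference, the fallback in {{remove()}} looks roughly like this 
(paraphrased sketch; details differ slightly between releases):
{code}
// Paraphrased sketch of UnderReplicatedBlocks#remove(BlockInfo, int).
boolean remove(BlockInfo block, int priLevel) {
  if (priLevel >= 0 && priLevel < LEVEL
      && priorityQueues.get(priLevel).remove(block)) {
    return true;                    // fast path: the guessed level was right
  }
  // The guess was wrong (or out of range): search the other levels.
  for (int i = 0; i < LEVEL; i++) {
    if (i != priLevel && priorityQueues.get(i).remove(block)) {
      return true;
    }
  }
  return false;
}
{code}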

h3. Replication priority level of a block with only decommissioned replicas
With the surrounding bugs fixed, now we can address the real issue.  
{{getPriority()}} explicitly does this:
{code}
    } else if (curReplicas == 0) {
      // If there are zero non-decommissioned replicas but there are
      // some decommissioned replicas, then assign them highest priority
      if (decommissionedReplicas > 0) {
        return QUEUE_HIGHEST_PRIORITY;
      }
{code}

This does not make any sense. Since decommissioned nodes are never chosen as a 
replication source, the block cannot be re-replicated. Sitting at this priority, 
the block won't be recognized as "missing" either. The cluster will appear 
healthy until the decommissioned nodes are taken down, at which point it might 
be too late to recover the data.

There are several possible approaches to this.
1) If all a block has is decommissioned replicas, show it as missing, i.e. 
priority level {{QUEUE_WITH_CORRUPT_BLOCKS}}. {{fsck}} will show the 
decommissioned locations and the admin can recommission/decommission the nodes 
or manually copy the data out. (A rough sketch of this option follows the 
list.)
2) Re-evaluate all replicas when a decommissioned node rejoins. The simplest 
way is to start decommissioning the node again.
3) Allow a decommissioned replica to be picked as a replication source in this 
special case. 1) might still be needed.
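
A minimal sketch of 1), keeping the change inside {{getPriority()}} (the actual 
patch may look different):
{code}
    } else if (curReplicas == 0) {
      if (decommissionedReplicas > 0) {
        // Sketch of option 1): decommissioned nodes are never chosen as a
        // replication source, so a block whose only replicas are on
        // decommissioned nodes cannot be re-replicated. Report it as
        // missing/corrupt instead of hiding it at the highest priority.
        return QUEUE_WITH_CORRUPT_BLOCKS;
      }
{code}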

I have a patch with 1) and a unit test, but want to hear from others before 
posting.

> Some blocks can be permanently lost if nodes are decommissioned while dead
> --------------------------------------------------------------------------
>
>                 Key: HDFS-11609
>                 URL: https://issues.apache.org/jira/browse/HDFS-11609
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>
> When all the nodes containing a replica of a block are decommissioned while 
> they are dead, they get decommissioned right away even if there are missing 
> blocks. This behavior was introduced by HDFS-7374.
> The problem starts when those decommissioned nodes are brought back online. 
> The namenode no longer shows missing blocks, which creates a false sense of 
> cluster health. When the decommissioned nodes are removed and reformatted, 
> the block data is permanently lost. The namenode will report missing blocks 
> after the heartbeat recheck interval (e.g. 10 minutes) from the moment the 
> last node is taken down.
> There are multiple issues in the code. As some cause different behaviors in 
> testing vs. production, it took a while to reproduce it in a unit test. I 
> will present analysis and proposal soon.


