[ https://issues.apache.org/jira/browse/HDFS-11609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951587#comment-15951587 ]
Kihwal Lee commented on HDFS-11609:
-----------------------------------

h3. Inability to correctly guess the previous replication priority

Guessing the previous replication priority level of a block works most of the time, but it is not perfect. Different orders of events can lead to the identical current state while the previous priority levels differ. We can improve the priority update method so that the guessing logic still provides a benefit in the majority of cases, yet its correctness is not strictly required. The following describes the problems I encountered and the solutions.

In {{UnderReplicatedBlocks}},
{code}
  private int getPriority(int curReplicas,
                          int readOnlyReplicas,
                          int decommissionedReplicas,
                          int expectedReplicas) {
    assert curReplicas >= 0 : "Negative replicas!";
{code}
This is called from {{update()}}, which calls it with {{curReplicas}} set to {{curReplicas - curReplicasDelta}}. When all replica-containing nodes are dead ({{curReplicas}} is 0) but a decommissioned node holding a replica joins, {{update()}} is called with a {{curReplicas}} of -1, which sets off the assert. This causes initial block report processing to stop in the middle: the node is live and decommissioned, yet the block appears missing because its block report was never fully processed due to the assertion failure.

This can be avoided by not setting {{curReplicasDelta}} to 1 when the replica is decommissioned. The value originates from {{BlockManager}}'s {{addStoredBlock()}}:
{code}
     if (result == AddBlockResult.ADDED) {
-      curReplicaDelta = 1;
+      curReplicaDelta = (node.isDecommissioned()) ? 0 : 1;
{code}
This fixes this particular issue. Note that the assert is disabled in the real build, so the behavior differs at production runtime: instead of block report processing blowing up, the -1 slips past the {{curReplicas == 0}} and {{curReplicas == 1}} checks and {{getPriority()}} returns {{QUEUE_VERY_UNDER_REPLICATED}} without the above fix, which is incorrect.

If the previous priority level is guessed incorrectly and the guess happens to be identical to the current level, the old entry won't be removed, resulting in duplicate entries. The {{remove()}} method is already robust: if a block is not found at the specified level, it tries to remove it from the other priority levels too. So we can simply call {{remove()}} unconditionally. Guessing the old priority is not functionally necessary with this change, but it is still useful, since the guess is normally correct and removal then only has to visit one priority level in most cases.
{code}
-    if(oldPri != curPri) {
-      remove(block, oldPri);
-    }
+    // oldPri is mostly correct, but not always. If not found with oldPri,
+    // other levels will be searched until the block is found & removed.
+    remove(block, oldPri);
{code}

h3. Replication priority level of a block with only decommissioned replicas

With the surrounding bugs fixed, we can now address the real issue. {{getPriority()}} explicitly does this:
{code}
    } else if (curReplicas == 0) {
      // If there are zero non-decommissioned replicas but there are
      // some decommissioned replicas, then assign them highest priority
      if (decommissionedReplicas > 0) {
        return QUEUE_HIGHEST_PRIORITY;
      }
{code}
This does not make any sense. Since decommissioned nodes are never chosen as a replication source, the block cannot be re-replicated. Sitting at this priority, the block won't be recognized as "missing" either. The cluster will appear healthy until the decommissioned nodes are taken down, at which point it might be too late to recover the data.

There are several possible approaches to this:
1) If all a block has is decommissioned replicas, show it as missing, i.e. priority level {{QUEUE_WITH_CORRUPT_BLOCKS}}. {{fsck}} will show the decommissioned locations and the admin can recommission/decommission the nodes or manually copy the data out. (A rough sketch of this appears at the end of this comment.)
2) Re-evaluate all replicas when a decommissioned node rejoins. The simplest way is to start decommissioning the node again.
3) Allow a decommissioned replica to be picked as a replication source in this special case. 1) might still be needed.

I have a patch with 1) and a unit test, but want to hear from others before posting.
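To make the two failure modes above concrete, here is a minimal, standalone sketch of the decision logic. It is not the actual {{UnderReplicatedBlocks}} code: the class name and constant values are made up for illustration, {{readOnlyReplicas}} is dropped, and the branches not quoted above are paraphrased. Run it without {{-ea}} to mimic the production build, where the assert is disabled.
{code}
public class PriorityGuessSketch {
  // Illustrative queue levels, named after the ones discussed above.
  static final int QUEUE_HIGHEST_PRIORITY = 0;
  static final int QUEUE_VERY_UNDER_REPLICATED = 1;
  static final int QUEUE_UNDER_REPLICATED = 2;
  static final int QUEUE_REPLICAS_BADLY_DISTRIBUTED = 3;
  static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;

  static int getPriority(int curReplicas, int decommissionedReplicas,
                         int expectedReplicas) {
    // In production the assert is disabled, so a negative curReplicas is not
    // caught here and simply falls through the branches below.
    assert curReplicas >= 0 : "Negative replicas!";
    if (curReplicas >= expectedReplicas) {
      return QUEUE_REPLICAS_BADLY_DISTRIBUTED;
    } else if (curReplicas == 0) {
      if (decommissionedReplicas > 0) {
        // A decommissioned-only block gets the highest replication priority,
        // even though no valid replication source exists.
        return QUEUE_HIGHEST_PRIORITY;
      }
      return QUEUE_WITH_CORRUPT_BLOCKS;
    } else if (curReplicas == 1) {
      return QUEUE_HIGHEST_PRIORITY;
    } else if ((curReplicas * 3) < expectedReplicas) {
      // -1 lands here: it is neither 0 nor 1, and (-1 * 3) < expectedReplicas.
      return QUEUE_VERY_UNDER_REPLICATED;
    } else {
      return QUEUE_UNDER_REPLICATED;
    }
  }

  public static void main(String[] args) {
    // Without the addStoredBlock() fix, update() passes a curReplicas of -1:
    System.out.println(getPriority(-1, 1, 3)); // 1 = QUEUE_VERY_UNDER_REPLICATED (wrong)
    // With the fix, curReplicas stays 0, but the block still gets the highest
    // replication priority although it can never be re-replicated:
    System.out.println(getPriority(0, 1, 3));  // 0 = QUEUE_HIGHEST_PRIORITY
  }
}
{code}
The second call is exactly the state this issue is about: the block sits in the highest-priority replication queue forever and never surfaces as missing.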
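For reference, option 1) roughly amounts to a change of this shape in {{getPriority()}}. This is only an illustration of the idea, not the patch I have; the {{fsck}} side of showing the decommissioned locations is a separate piece.
{code}
     } else if (curReplicas == 0) {
-      // If there are zero non-decommissioned replicas but there are
-      // some decommissioned replicas, then assign them highest priority
-      if (decommissionedReplicas > 0) {
-        return QUEUE_HIGHEST_PRIORITY;
-      }
+      // A decommissioned replica is never picked as a replication source, so
+      // a block with only decommissioned replicas cannot be re-replicated.
+      // Report it as missing instead of keeping it in the highest-priority
+      // replication queue forever.
+      return QUEUE_WITH_CORRUPT_BLOCKS;
{code}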
> Some blocks can be permanently lost if nodes are decommissioned while dead
> --------------------------------------------------------------------------
>
>                 Key: HDFS-11609
>                 URL: https://issues.apache.org/jira/browse/HDFS-11609
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>
> When all the nodes containing a replica of a block are decommissioned while
> they are dead, they get decommissioned right away even if there are missing
> blocks. This behavior was introduced by HDFS-7374.
> The problem starts when those decommissioned nodes are brought back online.
> The namenode no longer shows missing blocks, which creates a false sense of
> cluster health. When the decommissioned nodes are removed and reformatted,
> the block data is permanently lost. The namenode will report missing blocks
> after the heartbeat recheck interval (e.g. 10 minutes) from the moment the
> last node is taken down.
> There are multiple issues in the code. As some cause different behaviors in
> testing vs. production, it took a while to reproduce it in a unit test. I
> will present analysis and proposal soon.