balodesecurity opened a new pull request, #8295:
URL: https://github.com/apache/hadoop/pull/8295

   ## Problem
   
   On a standby NameNode, a DataNode can get stuck in the 
`DECOMMISSION_INPROGRESS` state indefinitely when a timing race causes a 
replica to be flagged as **excess** instead of **live** during decommissioning.
   
   Sequence:
   1. File is written to DN-A, DN-B, DN-C (RF=3).
   2. DN-A is marked for decommission.
   3. The block manager schedules re-replication → copies a new replica to DN-D.
   4. On the standby NN, the block report for DN-D arrives *before* the 
decommission state for DN-A is propagated. The standby marks DN-D's replica as 
**excess** (it looks like an over-replicated block).
   5. The decommission monitor on the standby calls `isSufficient()`, which counts only `numLive=2` (DN-B, DN-C) against RF=3. Two live copies do not meet the replication target, so the check fails and decommission stalls.
   6. DN-A therefore remains in `DECOMMISSION_INPROGRESS` indefinitely, because `isSufficient()` never returns true.
   
   The excess replica on DN-D is a **physically present block copy** and 
contributes to durability — ignoring it causes the deadlock.
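   The stall can be sketched as a standalone simplification (hypothetical class and method names, not the actual `DatanodeAdminManager` code): if only live replicas count toward the target, the check can never pass while DN-D's copy is misclassified as excess.
   
   ```java
   public class LiveOnlySufficiencySketch {
     // Hypothetical simplification of the pre-fix check: only live
     // replicas count toward the replication target.
     static boolean isSufficientLiveOnly(int numLive, int replicationFactor) {
       return numLive >= replicationFactor;
     }

     public static void main(String[] args) {
       // The race above: DN-D's copy is flagged excess, so only
       // DN-B and DN-C count as live while the target is RF=3.
       int numLive = 2, rf = 3;
       // Prints false; the monitor re-evaluates this forever.
       System.out.println(isSufficientLiveOnly(numLive, rf));
     }
   }
   ```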
   
   ## Fix
   
   In `DatanodeAdminManager.isSufficient()`, count excess replicas alongside 
live replicas for the sufficiency check on non-under-construction blocks:
   
   ```java
   // Count physically present excess replicas toward the replication target.
   final int numLiveAndExcess = numLive + numberReplicas.excessReplicas();
   if (numLiveAndExcess >= blockManager.getDefaultStorageNum(block)
       // Guard: still require the minimum number of *live* replicas.
       && blockManager.hasMinStorage(block, numLive)) {
     return true;
   }
   ```
   
   The `hasMinStorage` guard (backed by `dfs.namenode.replication.min`, default 1) ensures decommission does not proceed when zero live replicas exist, since excess-only replicas are not guaranteed durable. If the excess replica on DN-D is deleted after decommission completes, the block manager's normal under-replication detection will schedule re-replication.
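   As a hedged, standalone illustration (simplified signatures, not the real `BlockManager` API), the amended predicate behaves like:
   
   ```java
   public class FixedSufficiencySketch {
     // Simplified model of the amended check: excess replicas count
     // toward the RF target, but at least minReplication live copies
     // (the hasMinStorage guard) must still exist.
     static boolean isSufficient(int numLive, int numExcess,
                                 int rf, int minReplication) {
       return numLive + numExcess >= rf && numLive >= minReplication;
     }

     public static void main(String[] args) {
       int min = 1; // default minimum replication, per the guard above
       // The stalled race: live=2, excess=1, RF=3 -> now sufficient.
       System.out.println(isSufficient(2, 1, 3, min)); // true
       // Safety guard: excess-only copies (live=0) never pass.
       System.out.println(isSufficient(0, 2, 2, min)); // false
     }
   }
   ```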
   
   ## Testing
   
   **Unit tests** — `TestDatanodeAdminManagerIsSufficient` (5 tests, no cluster 
required):
   
   | Test | Scenario | Expected |
   |---|---|---|
   | `testExcessReplicaCountsTowardSufficiency` | HDFS-17722 bug: live=1, excess=1, RF=2 | `true` |
   | `testNormalDecommissionStillSufficient` | Baseline: live=2, excess=0, RF=2 | `true` |
   | `testNoLiveReplicaBlocksDecommission` | Safety guard: live=0, excess=2, RF=2 | `false` |
   | `testInsufficientEvenWithExcess` | live=0, excess=1, RF=2; not enough either way | `false` |
   | `testExcessAboveRFWithMinLive` | live=1, excess=2, RF=2; excess over-covers RF | `true` |
   
   ```
   Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
   ```
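   The five scenarios can be replayed through a simplified version of the amended check (a sketch with hypothetical names, not the actual JUnit tests):
   
   ```java
   public class TableScenarioSketch {
     // Simplified model of the amended check with minReplication = 1.
     static boolean isSufficient(int live, int excess, int rf) {
       return live + excess >= rf && live >= 1;
     }

     public static void main(String[] args) {
       System.out.println(isSufficient(1, 1, 2)); // true:  HDFS-17722 bug case
       System.out.println(isSufficient(2, 0, 2)); // true:  baseline
       System.out.println(isSufficient(0, 2, 2)); // false: safety guard
       System.out.println(isSufficient(0, 1, 2)); // false: not enough either way
       System.out.println(isSufficient(1, 2, 2)); // true:  excess over-covers RF
     }
   }
   ```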
   
   **Docker integration** — 3-DataNode cluster with 1 NameNode and RF=3, 5 
scenarios:
   - Scenario 1: Clean decommission (RF=2) — **PASS**
   - Scenario 2: RF=3→2 creates excess replicas, then decommission DN2 — 
**PASS**
   - Scenario 3: Same scenario on DN3 — **PASS**
   - Scenario 4: Repeated decommission + recommission cycles (3 rounds) — 
**PASS**
   - Scenario 5: Data integrity check after decommission — **PASS**
   
   ```
   Results: 0 failure(s) — ALL TESTS PASSED
   ```
   
   ## Related
   
   - JIRA: https://issues.apache.org/jira/browse/HDFS-17722


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

