Ming Ma created HDFS-6178:
-----------------------------

             Summary: Decommission on standby NN couldn't finish
                 Key: HDFS-6178
                 URL: https://issues.apache.org/jira/browse/HDFS-6178
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
            Reporter: Ming Ma


Currently, decommissioning machines in an HA-enabled cluster requires running 
refreshNodes on both the active and standby NNs. Sometimes decommissioning 
never finishes from the standby NN's point of view. Here is a diagnosis of why 
that can happen.

The standby NN's blockManager manages block replication and block invalidation 
as if it were the active NN, even though DNs ignore block commands coming from 
the standby NN. When the standby NN makes block operation decisions, such as 
choosing the target of a block replication or the node to remove excess blocks 
from, those decisions are independent of the active NN. So the active NN and 
standby NN can end up with different states. When we try to decommission nodes 
on the standby NN, this state inconsistency can prevent the standby NN from 
making progress. Here is an example.

Machine A
Machine B
Machine C
Machine D
Machine E
Machine F
Machine G
Machine H

1. For a given block, both active and standby have 5 replicas, on machines A, 
B, C, D, E. The expected replication is 3, so both active and standby decide to 
pick excess nodes to invalidate.

Active picked D and E as excess DNs. After the next block reports from D and E, 
active NN has 3 active replicas (A, B, C) and 0 excess replicas.

{noformat}
2014-03-27 01:50:14,410 INFO BlockStateChange: BLOCK* chooseExcessReplicates: 
(E:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
2014-03-27 01:50:15,539 INFO BlockStateChange: BLOCK* chooseExcessReplicates: 
(D:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
{noformat}

Standby picked C and E as excess DNs. Given that DNs ignore commands from 
standby, only active's choices (D and E) were actually deleted. After the next 
block reports from C, D, E, standby has 2 active replicas (A, B) and 1 excess 
replica (C).

{noformat}
2014-03-27 01:51:49,543 INFO BlockStateChange: BLOCK* chooseExcessReplicates: 
(E:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
2014-03-27 01:51:49,894 INFO BlockStateChange: BLOCK* chooseExcessReplicates: 
(C:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
{noformat}
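
The divergence in step 1 can be reproduced with a small model. This is only a 
sketch and not BlockManager code: classify() and the set names are 
hypothetical, and it assumes each NN counts a stored replica as live unless 
that NN itself marked it excess.

```python
def classify(stored, excess):
    """Split the replicas an NN sees as stored into (live, excess) lists."""
    live = sorted(stored - excess)
    still_excess = sorted(stored & excess)
    return live, still_excess


# Both NNs start from the same 5 reported replicas.
reported = {"A", "B", "C", "D", "E"}

# Each NN picks its excess replicas independently.
active_excess = {"D", "E"}
standby_excess = {"C", "E"}

# DNs obey only the active NN, so only D and E really delete the block.
on_datanodes = reported - active_excess

active_view = classify(on_datanodes, active_excess)
standby_view = classify(on_datanodes, standby_excess)

print("active :", active_view)   # (['A', 'B', 'C'], [])
print("standby:", standby_view)  # (['A', 'B'], ['C'])
```

Both NNs see the same three physical replicas, but standby still treats C as 
excess because nothing ever removes C from its excess set.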


2. A decomm request for machine A was sent to standby. Standby had only one 
live replica (B) and picked machines G and H as targets, but since standby's 
commands were ignored by DNs, G and H remained in the pending replication queue 
until they timed out. At this point, standby had one decommissioning replica 
(A), one active replica (B), and one excess replica (C).
{noformat}
2014-03-27 04:42:52,258 INFO BlockStateChange: BLOCK* ask A:50010 to replicate 
blk_-5207804474559026159_121186764 to datanode(s) G:50010 H:50010
{noformat}

3. A decomm request for machine A was sent to active NN. Active NN picked 
machine F as the target, and the replication finished properly. So active NN 
had 3 active replicas (B, C, F) and one decommissioned replica (A).

{noformat}
2014-03-27 04:44:15,239 INFO BlockStateChange: BLOCK* ask 10.42.246.110:50010 
to replicate blk_-5207804474559026159_121186764 to datanode(s) F:50010
2014-03-27 04:44:16,083 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap 
updated: F:50010 is added to blk_-5207804474559026159_121186764 size 7100065
{noformat}

4. Standby NN picked up F as a new replica. Thus standby had one 
decommissioning replica (A), 2 active replicas (B, F), and one excess replica 
(C). Standby NN kept trying to schedule replication work, but DNs ignored the 
commands, so the block never reached 3 live replicas on standby and the 
decommission of A could not finish there.

{noformat}
2014-03-27 04:44:16,084 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap 
updated: F:50010 is added to blk_-5207804474559026159_121186764 size 7100065

2014-03-28 23:06:11,970 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: 
blk_-5207804474559026159_121186764, Expected Replicas: 3, live replicas: 2, 
corrupt replicas: 0, decommissioned replicas: 1, excess replicas: 1, Is Open 
File: false, Datanodes having this block: C:50010 B:50010 A:50010 F:50010 , 
Current Datanode: A:50010, Is current datanode decommissioning: true
{noformat}
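
The final state can be checked against a simplified completion rule. This is a 
sketch under assumptions, not the actual NN logic: BlockView and 
blocks_decommission() are hypothetical names, and the rule assumed here is that 
a decommissioning node cannot finish while one of its blocks has fewer live 
replicas than expected. The numbers come from the log line above.

```python
from dataclasses import dataclass, field


@dataclass
class BlockView:
    """One NN's view of a single block (hypothetical model)."""
    expected: int
    live: set = field(default_factory=set)
    decommissioning: set = field(default_factory=set)
    excess: set = field(default_factory=set)


def blocks_decommission(view):
    """True while the block lacks enough live replicas (assumed rule)."""
    return len(view.live) < view.expected


# Views of blk_-5207804474559026159 after step 4.
active = BlockView(expected=3, live={"B", "C", "F"}, decommissioning={"A"})
standby = BlockView(expected=3, live={"B", "F"},
                    decommissioning={"A"}, excess={"C"})

print("active blocked :", blocks_decommission(active))   # False
print("standby blocked:", blocks_decommission(standby))  # True

# The replica on C exists on disk; standby just does not count it as live.
print("standby live + excess:", len(standby.live | standby.excess))  # 3
```

Under this model standby stays blocked even though three usable replicas (B, C, 
F) exist, because C remains classified as excess on standby and standby's own 
replication commands are never executed.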



--
This message was sent by Atlassian JIRA
(v6.2#6252)
