[ https://issues.apache.org/jira/browse/HDFS-6178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ming Ma updated HDFS-6178:
--------------------------
    Attachment: HDFS-6178-2.patch

Thanks, Jing. Updated patch per suggestion.

> Decommission on standby NN couldn't finish
> ------------------------------------------
>
>                 Key: HDFS-6178
>                 URL: https://issues.apache.org/jira/browse/HDFS-6178
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Ming Ma
>         Attachments: HDFS-6178-2.patch, HDFS-6178.patch
>
>
> Currently, decommissioning machines in an HA-enabled cluster requires running
> refreshNodes on both the active and standby NNs. Sometimes decommissioning
> won't finish from the standby NN's point of view. Here is the diagnosis of
> why this can happen.
>
> The standby NN's blockManager manages block replication and block
> invalidation as if it were the active NN, even though DNs ignore block
> commands coming from the standby NN. When the standby NN makes block
> operation decisions, such as the target of a block replication or the node
> to remove excess blocks from, the decision is independent of the active NN.
> So the active NN and standby NN can have different states. When we try to
> decommission nodes on the standby NN, such state inconsistency might prevent
> the standby NN from making progress. Here is an example.
>
> Machine A
> Machine B
> Machine C
> Machine D
> Machine E
> Machine F
> Machine G
> Machine H
>
> 1. For a given block, both active and standby have 5 replicas, on machines
> A, B, C, D, E. So both active and standby decide to pick excess nodes to
> invalidate.
> Active picked D and E as excess DNs. After the next block reports from D and
> E, active NN has 3 active replicas (A, B, C), 0 excess replicas.
> {noformat}
> 2014-03-27 01:50:14,410 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (E:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> 2014-03-27 01:50:15,539 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (D:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> {noformat}
> Standby picked C and E as excess DNs. Because DNs ignore commands from
> standby, after the next block reports from C, D, E, standby has 2 active
> replicas (A, B), 1 excess replica (C).
> {noformat}
> 2014-03-27 01:51:49,543 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (E:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> 2014-03-27 01:51:49,894 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (C:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> {noformat}
>
> 2. The decommission request for machine A was sent to standby. Standby had
> only one live replica and picked machines G and H as targets, but because
> standby commands were ignored by DNs, the replications to G and H remained
> in the pending replication queue until they timed out. At this point,
> standby has 1 decommissioning replica (A), 1 active replica (B), 1 excess
> replica (C).
> {noformat}
> 2014-03-27 04:42:52,258 INFO BlockStateChange: BLOCK* ask A:50010 to replicate blk_-5207804474559026159_121186764 to datanode(s) G:50010 H:50010
> {noformat}
>
> 3. The decommission request for machine A was sent to active NN. Active NN
> picked machine F as the target, and the replication finished properly. So
> active NN had 3 active replicas (B, C, F), 1 decommissioned replica (A).
> {noformat}
> 2014-03-27 04:44:15,239 INFO BlockStateChange: BLOCK* ask 10.42.246.110:50010 to replicate blk_-5207804474559026159_121186764 to datanode(s) F:50010
> 2014-03-27 04:44:16,083 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: F:50010 is added to blk_-5207804474559026159_121186764 size 7100065
> {noformat}
>
> 4. Standby NN picked up F as a new replica. Thus standby had 1
> decommissioning replica (A), 2 active replicas (B, F), 1 excess replica (C).
> Standby NN kept trying to schedule replication work, but DNs ignored the
> commands.
> {noformat}
> 2014-03-27 04:44:16,084 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: F:50010 is added to blk_-5207804474559026159_121186764 size 7100065
> 2014-03-28 23:06:11,970 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_-5207804474559026159_121186764, Expected Replicas: 3, live replicas: 2, corrupt replicas: 0, decommissioned replicas: 1, excess replicas: 1, Is Open File: false, Datanodes having this block: C:50010 B:50010 A:50010 F:50010 , Current Datanode: A:50010, Is current datanode decommissioning: true
> {noformat}
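
The four steps in the description can be replayed with a toy model to show where the two NameNodes' bookkeeping diverges. This is only an illustrative sketch, not HDFS code: `NameNodeView`, `invalidate`, `replicate`, and `replay` are hypothetical names, and the model assumes that a replica counts as live when it is stored, not marked excess by that NN, and not on a decommissioning node, and that decommission completes once live replicas reach the expected count of 3.

```python
from dataclasses import dataclass, field

REPLICATION = 3  # "Expected Replicas: 3" in the final log line


@dataclass
class NameNodeView:
    """One NameNode's private bookkeeping for a single block (hypothetical model)."""
    name: str
    is_active: bool
    excess: set = field(default_factory=set)
    decommissioning: set = field(default_factory=set)

    def counts(self, stored):
        """Classify the DataNodes that currently report the block."""
        excess = self.excess & stored
        decom = self.decommissioning & stored
        live = stored - excess - decom
        return {
            "live": sorted(live),
            "excess": sorted(excess),
            "decommissioning": sorted(decom),
        }

    def decommission_done(self, stored):
        """Assumed completion rule: enough live replicas exist elsewhere."""
        return len(self.counts(stored)["live"]) >= REPLICATION


def invalidate(nn, stored, nodes):
    """NN marks nodes as excess; DataNodes delete only on the active NN's command."""
    nn.excess |= set(nodes)
    return stored - set(nodes) if nn.is_active else stored


def replicate(nn, stored, targets):
    """NN schedules replication; DataNodes copy only on the active NN's command."""
    return stored | set(targets) if nn.is_active else stored


def replay():
    active = NameNodeView("active", is_active=True)
    standby = NameNodeView("standby", is_active=False)
    stored = set("ABCDE")  # DataNodes that actually hold the block

    # Step 1: each NN picks its own excess replicas.
    stored = invalidate(active, stored, "DE")   # executed: D and E drop the block
    stored = invalidate(standby, stored, "CE")  # ignored by the DataNodes

    # Step 2: decommission A on standby; its replication to G, H is ignored.
    standby.decommissioning.add("A")
    stored = replicate(standby, stored, "GH")

    # Step 3: decommission A on active; its replication to F succeeds.
    active.decommissioning.add("A")
    stored = replicate(active, stored, "F")

    # Step 4: both NNs now see F through block reports.
    return (
        active.counts(stored),
        standby.counts(stored),
        active.decommission_done(stored),
        standby.decommission_done(stored),
    )


if __name__ == "__main__":
    for item in replay():
        print(item)
```

Under these assumptions the active view ends with three live replicas (B, C, F), while the standby view ends with two live replicas (B, F), one excess replica (C), and one decommissioning replica (A). Those are the same counts as in the last BlockManager log line, and the standby's decommission check never passes because nothing it schedules is executed.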