[ https://issues.apache.org/jira/browse/HDFS-6178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957308#comment-13957308 ]
Ming Ma commented on HDFS-6178:
-------------------------------

Thanks, Jing and Fengdong. It sounds like we can go with the "only decomm ANN" approach; correctness can be guaranteed. However, it would be useful to further simplify the operations and improve the quality of the SBN web UI. To summarize the operational steps:

Option 1 - No code change; people have to ignore the SBN web UI because its data is misleading.
1. Update the exclude files on both the ANN and the SBN.
2. Run "dfsadmin -refreshNodes" only on the ANN. Wait for it to complete.
3. If decomm finishes before any failover, do nothing. The SBN web UI won't have updated node status.
4. If there is a failover before decomm finishes, someone or some script external to HDFS has to run "dfsadmin -refreshNodes" on the new ANN so that decomm can continue.

Option 2 - Code change to simplify the process and the SBN web UI (a rough sketch of steps 1 and 2 follows below).
1. When the old SBN becomes the new ANN, it calls refreshNodes in FSNamesystem.startActiveServices. With this, option 1's step 4 can be skipped.
2. The SBN can throw an exception when someone tries to run "dfsadmin -refreshNodes" against it. That makes it clear the command should not be run on the SBN.
3. Make the SBN web UI correct. For example, it can choose not to display the number of dead/live/decommissioning/decommissioned nodes; such data can become stale over time as people update the include and exclude files but only run "dfsadmin -refreshNodes" on the ANN.

Separately, I can open another jira to disable the replication monitor on the SBN.
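To make steps 1 and 2 of option 2 concrete, here is a minimal sketch of the FSNamesystem changes I have in mind. It is an illustration of the idea, not a tested patch; it assumes FSNamesystem#startActiveServices, FSNamesystem#checkOperation, FSNamesystem#checkSuperuserPrivilege and DatanodeManager#refreshNodes(Configuration) keep roughly their current shapes:

{code}
// Sketch of the two FSNamesystem changes proposed in option 2.
// Method bodies are abbreviated; elided parts are marked with comments.

import java.io.IOException;

import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.server.namenode.NameNode.OperationCategory;

public class FSNamesystem /* ... existing declaration ... */ {

  // ... existing fields, constructors and other methods ...

  /** Invoked when this NN transitions from standby to active. */
  void startActiveServices() throws IOException {
    // ... existing active-state startup work ...

    // Proposed addition (option 2, step 1): re-read the include and
    // exclude files on failover, so the new ANN picks up any pending
    // decommission work without an operator having to re-run
    // "dfsadmin -refreshNodes". This removes option 1's step 4.
    getBlockManager().getDatanodeManager()
        .refreshNodes(new HdfsConfiguration());
  }

  /** Backs the "dfsadmin -refreshNodes" RPC. */
  void refreshNodes() throws IOException {
    // Proposed change (option 2, step 2): classify refreshNodes as a
    // WRITE operation, so on the SBN this call throws StandbyException
    // back to the dfsadmin client instead of refreshing state whose
    // resulting block commands the DNs will ignore anyway.
    checkOperation(OperationCategory.WRITE);
    checkSuperuserPrivilege();
    getBlockManager().getDatanodeManager()
        .refreshNodes(new HdfsConfiguration());
  }
}
{code}

With step 1 in place, a failover automatically re-applies the current exclude file on the new ANN; with step 2, running "dfsadmin -refreshNodes" against the SBN fails fast with StandbyException instead of silently updating state that DNs will ignore.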
Any comments?

> Decommission on standby NN couldn't finish
> -------------------------------------------
>
>                 Key: HDFS-6178
>                 URL: https://issues.apache.org/jira/browse/HDFS-6178
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Ming Ma
>
> Currently, decommissioning machines in an HA-enabled cluster requires running refreshNodes on both the active and the standby NN. Sometimes decommissioning won't finish from the standby NN's point of view. Here is the diagnosis of why that can happen.
>
> The standby NN's blockManager manages block replication and block invalidation as if it were the active NN, even though DNs ignore block commands coming from a standby NN. When the standby NN makes block operation decisions, such as the target of a block replication or the node to remove excess replicas from, those decisions are independent of the active NN's. So the active NN and the standby NN can end up in different states. When we try to decommission nodes on the standby NN, that state inconsistency might prevent the standby NN from making progress. Here is an example.
>
> Machine A
> Machine B
> Machine C
> Machine D
> Machine E
> Machine F
> Machine G
> Machine H
>
> 1. For a given block, both active and standby have 5 replicas, on machines A, B, C, D and E, so both active and standby decide to pick excess nodes to invalidate.
>
> The active picked D and E as excess DNs. After the next block reports from D and E, the active NN has 3 active replicas (A, B, C) and 0 excess replicas.
> {noformat}
> 2014-03-27 01:50:14,410 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (E:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> 2014-03-27 01:50:15,539 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (D:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> {noformat}
> The standby picked C and E as excess DNs. Given that DNs ignore commands from the standby, after the next block reports from C, D and E, the standby has 2 active replicas (A, B) and 1 excess replica (C).
> {noformat}
> 2014-03-27 01:51:49,543 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (E:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> 2014-03-27 01:51:49,894 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (C:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> {noformat}
> 2. Machine A's decomm request was sent to the standby. The standby had only one live replica left and picked machines G and H as replication targets, but since the standby's commands are ignored by DNs, G and H remained in the pending replication queue until they timed out. At this point the standby has one decommissioning replica (A), one active replica (B) and one excess replica (C).
> {noformat}
> 2014-03-27 04:42:52,258 INFO BlockStateChange: BLOCK* ask A:50010 to replicate blk_-5207804474559026159_121186764 to datanode(s) G:50010 H:50010
> {noformat}
> 3. Machine A's decomm request was sent to the active NN. The active NN picked machine F as the target and the replication finished properly, so the active NN had 3 active replicas (B, C, F) and one decommissioned replica (A).
> {noformat}
> 2014-03-27 04:44:15,239 INFO BlockStateChange: BLOCK* ask 10.42.246.110:50010 to replicate blk_-5207804474559026159_121186764 to datanode(s) F:50010
> 2014-03-27 04:44:16,083 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: F:50010 is added to blk_-5207804474559026159_121186764 size 7100065
> {noformat}
> 4. The standby NN picked up F as a new replica. Thus the standby had one decommissioning replica (A), 2 active replicas (B, F) and one excess replica (C). The standby NN kept trying to schedule replication work, but DNs ignored its commands, so decommissioning of A never finished from the standby's point of view.
> {noformat}
> 2014-03-27 04:44:16,084 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: F:50010 is added to blk_-5207804474559026159_121186764 size 7100065
> 2014-03-28 23:06:11,970 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_-5207804474559026159_121186764, Expected Replicas: 3, live replicas: 2, corrupt replicas: 0, decommissioned replicas: 1, excess replicas: 1, Is Open File: false, Datanodes having this block: C:50010 B:50010 A:50010 F:50010 , Current Datanode: A:50010, Is current datanode decommissioning: true
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.2#6252)