[ https://issues.apache.org/jira/browse/HDFS-6178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957308#comment-13957308 ]
Ming Ma commented on HDFS-6178:
-------------------------------

Thanks, Jing and Fengdong. It sounds like we can go with the "only decomm ANN" approach; correctness can be guaranteed. However, it would be useful to further simplify the operations and improve the quality of the SBN web UI. To summarize the operational steps:

Option 1 - No code change; people have to ignore the SBN web UI because its data is misleading.
1. Update the exclude files on both the ANN and the SBN.
2. Run "dfsadmin -refreshNodes" only on the ANN. Wait for it to complete.
3. If decomm finishes before any failover, do nothing. The SBN web UI won't have updated node status.
4. If there is a failover before decomm finishes, someone or some script external to HDFS has to run "dfsadmin -refreshNodes" on the new ANN so that decomm can continue.

Option 2 - Code change to simplify the process and the SBN web UI (a rough sketch of steps 1 and 2 follows below).
1. When the old SBN becomes the new ANN, it calls refreshNodes in FSNamesystem.startActiveServices. With this, option 1's step 4 can be skipped.
2. The SBN can throw an exception when someone tries to run "dfsadmin -refreshNodes" against it. That makes it clear the command should not be run on the SBN.
3. Make the SBN web UI correct. For example, it can choose not to display the number of dead/live/decommissioning/decommissioned nodes; such data can become stale over time as people update the include and exclude files but only run "dfsadmin -refreshNodes" on the ANN.

Separately, I can open another jira to disable the replication monitor on the SBN.
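To make steps 1 and 2 of option 2 concrete, here is a minimal sketch of the FSNamesystem changes I have in mind. It is an illustration of the idea, not a tested patch; it assumes FSNamesystem#startActiveServices, FSNamesystem#checkOperation, FSNamesystem#checkSuperuserPrivilege and DatanodeManager#refreshNodes(Configuration) keep roughly their current shapes:

{code}
// Sketch of the two FSNamesystem changes proposed in option 2.
// Method bodies are abbreviated; elided parts are marked with comments.

import java.io.IOException;

import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.server.namenode.NameNode.OperationCategory;

public class FSNamesystem /* ... existing declaration ... */ {

  // ... existing fields, constructors and other methods ...

  /** Invoked when this NN transitions from standby to active. */
  void startActiveServices() throws IOException {
    // ... existing active-state startup work ...

    // Proposed addition (option 2, step 1): re-read the include and
    // exclude files on failover, so the new ANN picks up any pending
    // decommission work without an operator having to re-run
    // "dfsadmin -refreshNodes". This removes option 1's step 4.
    getBlockManager().getDatanodeManager()
        .refreshNodes(new HdfsConfiguration());
  }

  /** Backs the "dfsadmin -refreshNodes" RPC. */
  void refreshNodes() throws IOException {
    // Proposed change (option 2, step 2): classify refreshNodes as a
    // WRITE operation, so on the SBN this call throws StandbyException
    // back to the dfsadmin client instead of refreshing state whose
    // resulting block commands the DNs will ignore anyway.
    checkOperation(OperationCategory.WRITE);
    checkSuperuserPrivilege();
    getBlockManager().getDatanodeManager()
        .refreshNodes(new HdfsConfiguration());
  }
}
{code}

With step 1 in place, a failover automatically re-applies the current exclude file on the new ANN; with step 2, running "dfsadmin -refreshNodes" against the SBN fails fast with StandbyException instead of silently updating state that DNs will ignore.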
Any comments?

> Decommission on standby NN couldn't finish
> -------------------------------------------
>
>                 Key: HDFS-6178
>                 URL: https://issues.apache.org/jira/browse/HDFS-6178
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Ming Ma
>
> Currently, decommissioning machines in an HA-enabled cluster requires running refreshNodes on both the active and the standby NN. Sometimes decommissioning won't finish from the standby NN's point of view. Here is the diagnosis of why that can happen.
>
> The standby NN's blockManager manages block replication and block invalidation as if it were the active NN, even though DNs ignore block commands coming from a standby NN. When the standby NN makes block operation decisions, such as the target of a block replication or the node to remove excess replicas from, those decisions are independent of the active NN's. So the active NN and the standby NN can end up in different states. When we try to decommission nodes on the standby NN, that state inconsistency might prevent the standby NN from making progress. Here is an example.
>
> Machine A
> Machine B
> Machine C
> Machine D
> Machine E
> Machine F
> Machine G
> Machine H
>
> 1. For a given block, both active and standby have 5 replicas, on machines A, B, C, D and E, so both active and standby decide to pick excess nodes to invalidate.
>
> The active picked D and E as excess DNs. After the next block reports from D and E, the active NN has 3 active replicas (A, B, C) and 0 excess replicas.
> {noformat}
> 2014-03-27 01:50:14,410 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (E:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> 2014-03-27 01:50:15,539 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (D:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> {noformat}
> The standby picked C and E as excess DNs. Given that DNs ignore commands from the standby, after the next block reports from C, D and E, the standby has 2 active replicas (A, B) and 1 excess replica (C).
> {noformat}
> 2014-03-27 01:51:49,543 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (E:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> 2014-03-27 01:51:49,894 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (C:50010, blk_-5207804474559026159_121186764) is added to invalidated blocks set
> {noformat}
> 2. Machine A's decomm request was sent to the standby. The standby had only one live replica left and picked machines G and H as replication targets, but since the standby's commands are ignored by DNs, G and H remained in the pending replication queue until they timed out. At this point the standby has one decommissioning replica (A), one active replica (B) and one excess replica (C).
> {noformat}
> 2014-03-27 04:42:52,258 INFO BlockStateChange: BLOCK* ask A:50010 to replicate blk_-5207804474559026159_121186764 to datanode(s) G:50010 H:50010
> {noformat}
> 3. Machine A's decomm request was sent to the active NN. The active NN picked machine F as the target and the replication finished properly, so the active NN had 3 active replicas (B, C, F) and one decommissioned replica (A).
> {noformat}
> 2014-03-27 04:44:15,239 INFO BlockStateChange: BLOCK* ask 10.42.246.110:50010 to replicate blk_-5207804474559026159_121186764 to datanode(s) F:50010
> 2014-03-27 04:44:16,083 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: F:50010 is added to blk_-5207804474559026159_121186764 size 7100065
> {noformat}
> 4. The standby NN picked up F as a new replica. Thus the standby had one decommissioning replica (A), 2 active replicas (B, F) and one excess replica (C). The standby NN kept trying to schedule replication work, but DNs ignored its commands, so decommissioning of A never finished from the standby's point of view.
> {noformat}
> 2014-03-27 04:44:16,084 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: F:50010 is added to blk_-5207804474559026159_121186764 size 7100065
> 2014-03-28 23:06:11,970 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_-5207804474559026159_121186764, Expected Replicas: 3, live replicas: 2, corrupt replicas: 0, decommissioned replicas: 1, excess replicas: 1, Is Open File: false, Datanodes having this block: C:50010 B:50010 A:50010 F:50010 , Current Datanode: A:50010, Is current datanode decommissioning: true
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.2#6252)