[ 
https://issues.apache.org/jira/browse/HDFS-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Mankude updated HDFS-2753:
-------------------------------

    Description: Write traffic initiated from the client. Manual failover is 
done by killing NN and converting a  different standby to active. NN is 
restarted as standby. The restarted standby stays in safemode forever. More 
information in the description.  (was: teragen is run to generate write 
traffic. A manual failover is initiated by killing namenode process. namenode 
which was killed is then restarted and it comes up as a standby. However, the 
standby never exits out of the safemode. Looking at the standby logs, it looks 
like the standby namenode gets addStoredBlock immediately after data node 
registration which results in numBlocks getting incremented. There is an 
optimization in processReport() which causes the follow on complete block 
report to be ignored when NN is in safemode. So, NN never exits out of the 
safemode.

2012-01-05 18:57:46,030 INFO  hdfs.StateChange 
(DatanodeManager.java:registerDatanode(573)) - BLOCK* 
NameSystem.registerDatanode: node registration from 98.137.233.235:50010 
storage DS-526656430-98.137.233.235-50010-1325723536492
2012-01-05 18:57:46,033 INFO  net.NetworkTopology 
(NetworkTopology.java:add(344)) - Adding a new node: 
/default-rack/98.137.233.235:50010
2012-01-05 18:57:46,033 INFO  namenode.FSNamesystem 
(FSNamesystem.java:checkMode(3411)) - DID NOT call initialize at 1
2012-01-05 18:57:46,034 INFO  hdfs.StateChange 
(DatanodeManager.java:registerDatanode(573)) - BLOCK* 
NameSystem.registerDatanode: node registration from 98.137.233.237:50010 
storage DS-1961520590-98.137.233.237-50010-1325725253057
2012-01-05 18:57:46,034 INFO  net.NetworkTopology 
(NetworkTopology.java:add(344)) - Adding a new node: 
/default-rack/98.137.233.237:50010
2012-01-05 18:57:46,042 INFO  namenode.FSNamesystem 
(FSNamesystem.java:checkMode(3411)) - DID NOT call initialize at 1
2012-01-05 18:57:46,045 INFO  hdfs.StateChange 
(BlockManager.java:addStoredBlock(1775)) - BLOCK* addStoredBlock: blockMap 
updated: 98.137.233.235:50010 is added to 
blk_-3183325095022454724_1172{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[98.137.233.235:50010|FINALIZED]]} size 0
2012-01-05 18:57:46,046 INFO  hdfs.StateChange 
(BlockManager.java:addStoredBlock(1775)) - BLOCK* addStoredBlock: blockMap 
updated: 98.137.233.235:50010 is added to 
blk_5617057825952660916_1173{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[98.137.233.235:50010|FINALIZED]]} size 0
2012-01-05 18:57:46,046 INFO  hdfs.StateChange 
(BlockManager.java:addStoredBlock(1775)) - BLOCK* addStoredBlock: blockMap 
updated: 98.137.233.237:50010 is added to 
blk_-3183325095022454724_1172{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[98.137.233.235:50010|FINALIZED], 
ReplicaUnderConstruction[98.137.233.237:50010|FINALIZED]]} size 0
2012-01-05 18:57:46,046 INFO  hdfs.StateChange 
(BlockManager.java:addStoredBlock(1775)) - BLOCK* addStoredBlock: blockMap 
updated: 98.137.233.237:50010 is added to 
blk_5617057825952660916_1173{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[98.137.233.235:50010|FINALIZED], 
ReplicaUnderConstruction[98.137.233.237:50010|FINALIZED]]} size 0
2012-01-05 18:57:46,049 INFO  hdfs.StateChange 
(BlockManager.java:processReport(1365)) - BLOCK* processReport: discarded 
non-initial block report from 98.137.233.235:50010 because namenode still in 
startup phase
2012-01-05 18:57:46,049 INFO  hdfs.StateChange 
(BlockManager.java:processReport(1365)) - BLOCK* processReport: discarded 
non-initial block report from 98.137.233.237:50010 because namenode still in 
startup phase
2012-01-05 18:58:05,167 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1175, 
Queuing blockReceivedAndDeleted message
2012-01-05 18:58:05,168 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1175, 
Queuing blockReceivedAndDeleted message
2012-01-05 18:58:06,634 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1176, 
Queuing blockReceivedAndDeleted message
2012-01-05 18:58:06,636 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1176, 
Queuing blockReceivedAndDeleted message
2012-01-05 18:58:08,097 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1177, 
Queuing blockReceivedAndDeleted message
2012-01-05 18:58:08,097 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1177, 
Queuing blockReceivedAndDeleted message)

teragen is run to generate write traffic. A manual failover is initiated by 
killing namenode process. namenode which was killed is then restarted and it 
comes up as a standby. However, the standby never exits out of the safemode. 
Looking at the standby logs, it looks like the standby namenode gets 
addStoredBlock immediately after data node registration which results in 
numBlocks getting incremented. There is an optimization in processReport() 
which causes the follow on complete block report to be ignored when NN is in 
safemode. So, NN never exits out of the safemode.

2012-01-05 18:57:46,030 INFO  hdfs.StateChange 
(DatanodeManager.java:registerDatanode(573)) - BLOCK* 
NameSystem.registerDatanode: node registration from 98.137.233.235:50010 
storage DS-526656430-98.137.233.235-50010-1325723536492
2012-01-05 18:57:46,033 INFO  net.NetworkTopology 
(NetworkTopology.java:add(344)) - Adding a new node: 
/default-rack/98.137.233.235:50010
2012-01-05 18:57:46,033 INFO  namenode.FSNamesystem 
(FSNamesystem.java:checkMode(3411)) - DID NOT call initialize at 1
2012-01-05 18:57:46,034 INFO  hdfs.StateChange 
(DatanodeManager.java:registerDatanode(573)) - BLOCK* 
NameSystem.registerDatanode: node registration from 98.137.233.237:50010 
storage DS-1961520590-98.137.233.237-50010-1325725253057
2012-01-05 18:57:46,034 INFO  net.NetworkTopology 
(NetworkTopology.java:add(344)) - Adding a new node: 
/default-rack/98.137.233.237:50010
2012-01-05 18:57:46,042 INFO  namenode.FSNamesystem 
(FSNamesystem.java:checkMode(3411)) - DID NOT call initialize at 1
2012-01-05 18:57:46,045 INFO  hdfs.StateChange 
(BlockManager.java:addStoredBlock(1775)) - BLOCK* addStoredBlock: blockMap 
updated: 98.137.233.235:50010 is added to 
blk_-3183325095022454724_1172{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[98.137.233.235:50010|FINALIZED]]} size 0
2012-01-05 18:57:46,046 INFO  hdfs.StateChange 
(BlockManager.java:addStoredBlock(1775)) - BLOCK* addStoredBlock: blockMap 
updated: 98.137.233.235:50010 is added to 
blk_5617057825952660916_1173{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[98.137.233.235:50010|FINALIZED]]} size 0
2012-01-05 18:57:46,046 INFO  hdfs.StateChange 
(BlockManager.java:addStoredBlock(1775)) - BLOCK* addStoredBlock: blockMap 
updated: 98.137.233.237:50010 is added to 
blk_-3183325095022454724_1172{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[98.137.233.235:50010|FINALIZED], 
ReplicaUnderConstruction[98.137.233.237:50010|FINALIZED]]} size 0
2012-01-05 18:57:46,046 INFO  hdfs.StateChange 
(BlockManager.java:addStoredBlock(1775)) - BLOCK* addStoredBlock: blockMap 
updated: 98.137.233.237:50010 is added to 
blk_5617057825952660916_1173{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[98.137.233.235:50010|FINALIZED], 
ReplicaUnderConstruction[98.137.233.237:50010|FINALIZED]]} size 0
2012-01-05 18:57:46,049 INFO  hdfs.StateChange 
(BlockManager.java:processReport(1365)) - BLOCK* processReport: discarded 
non-initial block report from 98.137.233.235:50010 because namenode still in 
startup phase
2012-01-05 18:57:46,049 INFO  hdfs.StateChange 
(BlockManager.java:processReport(1365)) - BLOCK* processReport: discarded 
non-initial block report from 98.137.233.237:50010 because namenode still in 
startup phase
2012-01-05 18:58:05,167 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1175, 
Queuing blockReceivedAndDeleted message
2012-01-05 18:58:05,168 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1175, 
Queuing blockReceivedAndDeleted message
2012-01-05 18:58:06,634 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1176, 
Queuing blockReceivedAndDeleted message
2012-01-05 18:58:06,636 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1176, 
Queuing blockReceivedAndDeleted message
2012-01-05 18:58:08,097 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1177, 
Queuing blockReceivedAndDeleted message
2012-01-05 18:58:08,097 INFO  namenode.NameNode 
(NameNodeRpcServer.java:blockReceivedAndDeleted(894)) - Required GS=1177, 
Queuing blockReceivedAndDeleted message
                
> Standby namenode stuck in safemode during a failover
> ----------------------------------------------------
>
>                 Key: HDFS-2753
>                 URL: https://issues.apache.org/jira/browse/HDFS-2753
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Hari Mankude
>            Assignee: Hari Mankude
>
> Write traffic initiated from the client. Manual failover is done by killing 
> NN and converting a  different standby to active. NN is restarted as standby. 
> The restarted standby stays in safemode forever. More information in the 
> description.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to