[ 
https://issues.apache.org/jira/browse/HDFS-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934384#comment-13934384
 ] 

Jing Zhao commented on HDFS-6094:
---------------------------------

I can also reproduce the issue on my local machine. Looks like the issue is:
1. After the standby NN restarts, DN1 first sends the incremental block report 
and then the complete block report to the SBN.
2. DN2 sends the incremental block report to the SBN. This block report does 
not change the replica count in the SBN because the corresponding storage ID 
has not been added to the SBN yet (the storage ID is only added during full 
block report processing). However, the SBN still checks the current live 
replica count (which is 1, because the SBN has already received the full block 
report from DN1) and uses that number to update the safe block count, so the 
same block is counted a second time.

So maybe a simple fix can be:
{code}
@@ -2277,7 +2277,7 @@ private Block addStoredBlock(final BlockInfo block,
     if(storedBlock.getBlockUCState() == BlockUCState.COMMITTED &&
         numLiveReplicas >= minReplication) {
       storedBlock = completeBlock(bc, storedBlock, false);
-    } else if (storedBlock.isComplete()) {
+    } else if (storedBlock.isComplete() && added) {
       // check whether safe replication is reached for the block
       // only complete blocks are counted towards that
       // Is no-op if not in safe mode.
{code}
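
To make the sequence above concrete, here is a toy model of the accounting. It 
is only a sketch, not the actual BlockManager code: the class 
SafeBlockCountSketch, its methods, and the rule "count a block when its live 
replica count equals the minimum replication" are simplifying assumptions made 
for illustration. The requireAdded flag stands in for the proposed "&& added" 
check.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the SBN safe block accounting. Hypothetical, not HDFS code. */
class SafeBlockCountSketch {
    // Assumed minimum replication for a block to count as safe.
    static final int MIN_REPLICATION = 1;

    // Storages the SBN knows about (registered by a full block report).
    private final Set<String> knownStorages = new HashSet<>();
    // Block ID -> storages currently recorded as holding a replica.
    private final Map<Long, Set<String>> replicas = new HashMap<>();
    // When true, apply the proposed fix: only count if a replica was added.
    private final boolean requireAdded;
    private int safeBlockCount = 0;

    SafeBlockCountSketch(boolean requireAdded) {
        this.requireAdded = requireAdded;
    }

    /** A full block report registers the storage, then reports its blocks. */
    void fullBlockReport(String storageId, long... blockIds) {
        knownStorages.add(storageId);
        for (long id : blockIds) {
            addStoredBlock(storageId, id);
        }
    }

    /** An incremental report does not register the storage. */
    void incrementalBlockReport(String storageId, long blockId) {
        addStoredBlock(storageId, blockId);
    }

    private void addStoredBlock(String storageId, long blockId) {
        Set<String> holders =
            replicas.computeIfAbsent(blockId, k -> new HashSet<>());
        // The replica is recorded only if the storage is already known.
        boolean added =
            knownStorages.contains(storageId) && holders.add(storageId);
        int numLiveReplicas = holders.size();
        if (!requireAdded || added) {
            // Assumed rule: count the block when it reaches the minimum.
            if (numLiveReplicas == MIN_REPLICATION) {
                safeBlockCount++;
            }
        }
    }

    int getSafeBlockCount() {
        return safeBlockCount;
    }

    /** Replays the report sequence from steps 1 and 2 for a single block. */
    static int replay(boolean requireAdded) {
        SafeBlockCountSketch sbn = new SafeBlockCountSketch(requireAdded);
        long block = 1L;
        sbn.incrementalBlockReport("DN1", block); // storage unknown, ignored
        sbn.fullBlockReport("DN1", block);        // replica added, counted
        sbn.incrementalBlockReport("DN2", block); // not added, live count is 1
        sbn.fullBlockReport("DN2", block);        // added, live count is 2
        return sbn.getSafeBlockCount();
    }

    public static void main(String[] args) {
        System.out.println("without added check: " + replay(false)); // 2
        System.out.println("with added check:    " + replay(true));  // 1
    }
}
```

In this model the single block is counted twice without the check and once 
with it, which matches the "reported blocks 7 ... of total blocks 6" symptom in 
the test failure.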

> The same block can be counted twice towards safe mode threshold
> ---------------------------------------------------------------
>
>                 Key: HDFS-6094
>                 URL: https://issues.apache.org/jira/browse/HDFS-6094
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.0
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>
> {{BlockManager#addStoredBlock}} can cause the same block to be counted twice 
> towards the safe mode threshold. We see this manifest via 
> {{TestHASafeMode#testBlocksAddedWhileStandbyIsDown}} failures on Ubuntu. More 
> details to follow in a comment.
> Exception details:
> {code}
>   Time elapsed: 12.874 sec  <<< FAILURE!
> java.lang.AssertionError: Bad safemode status: 'Safe mode is ON. The reported 
> blocks 7 has reached the threshold 0.9990 of total blocks 6. The number of 
> live datanodes 3 has reached the minimum number 0. Safe mode will be turned 
> off automatically in 28 seconds.'
>         at org.junit.Assert.fail(Assert.java:93)
>         at org.junit.Assert.assertTrue(Assert.java:43)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.assertSafeMode(TestHASafeMode.java:493)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestHASafeMode.testBlocksAddedWhileStandbyIsDown(TestHASafeMode.java:660)
> {code}



