[ 
https://issues.apache.org/jira/browse/HDFS-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936250#comment-14936250
 ] 

Arpit Agarwal commented on HDFS-4015:
-------------------------------------

bq. Please let me know if you think this assumption is not valid or breaks down 
in special cases. Could this happen with V1 vs V2 generation stamps?
Hi [~anu], your assumption is correct. I was just referring to this statement 
"This means blocks have been reported which do not correspond to any file in 
the namespace". It's a minor point.

bq. With this patch we are slightly changing the behavior of SafeMode. Even if 
we find the threshold blocks we will not exit if we find blocks with future 
generation stamps, under the assumption that NN meta-data has been modified.
Agreed, it's the right behavior. I meant the timing of displaying the new safe 
mode tip. It would be better to display it after the thresholds are checked, so 
we know that it is a safe time to run the {{-forceExit}} command, assuming the 
correct metadata is being used. I also like [~liuml07]'s suggestion of combining 
the two messages if feasible, so that the message explains both problems when 
applicable (e.g. _there are X missing blocks and there are Y blocks with 
generation stamps in the future..._).
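
To illustrate the combined message, here is a rough sketch only. The class {{SafeModeTipSketch}}, the method {{buildTip}}, and the exact wording of the strings are hypothetical and are not taken from the patch. It also simplifies "thresholds are met" to "no missing blocks", which is an assumption made for brevity.

```java
// Hypothetical sketch, not the actual FSNamesystem/SafeModeInfo code.
final class SafeModeTipSketch {

  // Builds one tip that covers both problems when both apply.
  // Assumption: "thresholds met" is approximated by missingBlocks == 0.
  static String buildTip(long missingBlocks, long blocksInFuture) {
    StringBuilder tip = new StringBuilder();
    if (missingBlocks > 0) {
      tip.append("There are ").append(missingBlocks).append(" missing blocks");
    }
    if (blocksInFuture > 0) {
      tip.append(tip.length() == 0 ? "There are " : " and there are ")
         .append(blocksInFuture)
         .append(" blocks with generation stamps in the future");
    }
    if (tip.length() == 0) {
      return "No problems to report.";
    }
    tip.append('.');
    // Only mention -forceExit once nothing else is holding safe mode,
    // so the hint appears at a point where running it is meaningful.
    if (missingBlocks == 0 && blocksInFuture > 0) {
      tip.append(" If the correct metadata is loaded, -forceExit can be used"
          + " to leave safe mode.");
    }
    return tip.toString();
  }

  public static void main(String[] args) {
    System.out.println(buildTip(3, 5));
    System.out.println(buildTip(0, 5));
  }
}
```

With both counts non-zero the tip lists both problems and leaves out the {{-forceExit}} hint. The hint only appears once the missing-block count reaches zero.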

bq. As the error message says, we are refusing to leave the safe mode – we 
want the users to send up -forceExit to restart NN with right Metadata files 
before we will move out of safe mode.
{code}
+      case SAFEMODE_FORCE_EXIT:
+        if (isInStartupSafeMode() && (blockManager.getBytesInFuture() > 0)) {
+          LOG.warn("Leaving safe mode due to forceExit. This will cause a data " +
+              "loss of " + blockManager.getBytesInFuture() + " byte(s).");
+          // we should leave safe mode before clearing bytes, otherwise
+          // there is a race condition where bytes in future may not be zero.
+          leaveSafeMode();
+          blockManager.clearBytesInFuture();
{code}
So it looks like this call to {{leaveSafeMode}} is guaranteed to fail, since the 
bytes in future are still non-zero when it runs, and we can remove it. The next 
iteration of {{SafeModeMonitor}} will bring us out of safe mode.
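
A toy model of that ordering, under the assumption that {{leaveSafeMode}} refuses to leave while any bytes in future remain (as the quoted comment describes). The class {{ForceExitOrderingSketch}} and the methods {{forceExitAsInPatch}} and {{monitorTick}} are hypothetical stand-ins, not the real {{FSNamesystem}} or {{SafeModeMonitor}} code.

```java
// Hypothetical toy model of the call order in the quoted patch.
final class ForceExitOrderingSketch {
  private long bytesInFuture;
  private boolean inSafeMode = true;

  ForceExitOrderingSketch(long bytesInFuture) {
    this.bytesInFuture = bytesInFuture;
  }

  // Assumed behavior: refuse to leave while bytes in future remain.
  boolean leaveSafeMode() {
    if (bytesInFuture > 0) {
      return false;
    }
    inSafeMode = false;
    return true;
  }

  void clearBytesInFuture() {
    bytesInFuture = 0;
  }

  // Same order as the patch: try to leave first, then clear the counter.
  boolean forceExitAsInPatch() {
    boolean left = leaveSafeMode();
    clearBytesInFuture();
    return left;
  }

  // Stand-in for one iteration of the safe mode monitor.
  void monitorTick() {
    if (inSafeMode) {
      leaveSafeMode();
    }
  }

  boolean isInSafeMode() {
    return inSafeMode;
  }

  public static void main(String[] args) {
    ForceExitOrderingSketch nn = new ForceExitOrderingSketch(1024);
    System.out.println("left during forceExit: " + nn.forceExitAsInPatch());
    nn.monitorTick();
    System.out.println("in safe mode after monitor tick: " + nn.isInSafeMode());
  }
}
```

In this model the explicit call never succeeds, and it is the later monitor iteration that actually leaves safe mode once the counter has been cleared.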

> Safemode should count and report orphaned blocks
> ------------------------------------------------
>
>                 Key: HDFS-4015
>                 URL: https://issues.apache.org/jira/browse/HDFS-4015
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Todd Lipcon
>            Assignee: Anu Engineer
>         Attachments: HDFS-4015.001.patch, dfsAdmin-report_with_forceExit.png, 
> dfsHealth.html.message.png
>
>
> The safemode status currently reports the number of unique reported blocks 
> compared to the total number of blocks referenced by the namespace. However, 
> it does not report the inverse: blocks which are reported by datanodes but 
> not referenced by the namespace.
> In the case that an admin accidentally starts up from an old image, this can 
> be confusing: safemode and fsck will show "corrupt files", which are the 
> files which actually have been deleted but got resurrected by restarting from 
> the old image. This will convince them that they can safely force leave 
> safemode and remove these files -- after all, they know that those files 
> should really have been deleted. However, they're not aware that leaving 
> safemode will also unrecoverably delete a bunch of other block files which 
> have been orphaned due to the namespace rollback.
> I'd like to consider reporting something like: "900000 of expected 1000000 
> blocks have been reported. Additionally, 10000 blocks have been reported 
> which do not correspond to any file in the namespace. Forcing exit of 
> safemode will unrecoverably remove those data blocks"
> Whether this statistic is also used for some kind of "inverse safe mode" is 
> the logical next step, but just reporting it as a warning seems easy enough 
> to accomplish and worth doing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
