[ https://issues.apache.org/jira/browse/HDFS-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936226#comment-14936226 ]
Mingliang Liu commented on HDFS-4015:
-------------------------------------

# This patch looks good overall to me. The first assumption you made, namely that the _Generation Stamp_ of the blocks reported by a rejoining DN will be less than the current highest generation stamp known to the NN, makes sense to me.
# I agree with [~arpitagarwal] that this tip may not show up until the thresholds are reached. Because it is returned in place of the threshold message that follows it, an administrator who sees this warning may think it is the right time to run {{forceExit}} even before the block thresholds are reached. Alternatively, we may need to combine this warning with the threshold message.
{code:title=FSNamesystem.java}
+      if(blockManager.getBytesInFuture() > 0) {
+        String msg = "Name node detected blocks with generation stamps " ...
+        return msg;
+      }
+
{code}
# I suppose {{reached}} should be 0 when we enter safemode, which stands for {{safe mode is on, and threshold is not reached yet}}.
{code:title=FSNamesystem.java}
+  @VisibleForTesting
+  synchronized void enableSafeModeForTesting(Configuration conf) {
+    SafeModeInfo newSafemode = new SafeModeInfo(conf);
+    newSafemode.reached = 1;
+    this.safeMode = newSafemode;
+  }
{code}

> Safemode should count and report orphaned blocks
> ------------------------------------------------
>
>                 Key: HDFS-4015
>                 URL: https://issues.apache.org/jira/browse/HDFS-4015
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Todd Lipcon
>            Assignee: Anu Engineer
>         Attachments: HDFS-4015.001.patch, dfsAdmin-report_with_forceExit.png,
> dfsHealth.html.message.png
>
>
> The safemode status currently reports the number of unique reported blocks
> compared to the total number of blocks referenced by the namespace. However,
> it does not report the inverse: blocks which are reported by datanodes but
> not referenced by the namespace.
> In the case that an admin accidentally starts up from an old image, this can
> be confusing: safemode and fsck will show "corrupt files", which are the
> files which actually have been deleted but got resurrected by restarting from
> the old image. This will convince them that they can safely force leave
> safemode and remove these files -- after all, they know that those files
> should really have been deleted. However, they're not aware that leaving
> safemode will also unrecoverably delete a bunch of other block files which
> have been orphaned due to the namespace rollback.
> I'd like to consider reporting something like: "900000 of expected 1000000
> blocks have been reported. Additionally, 10000 blocks have been reported
> which do not correspond to any file in the namespace. Forcing exit of
> safemode will unrecoverably remove those data blocks"
> Whether this statistic is also used for some kind of "inverse safe mode" is
> the logical next step, but just reporting it as a warning seems easy enough
> to accomplish and worth doing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
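
One way to read the suggestion to combine the orphaned-block warning with the threshold message is sketched below. This is only an illustration built from the wording proposed in the issue description. The class {{SafeModeReportSketch}}, the method {{buildTip}}, and its three parameters are hypothetical names, not part of HDFS-4015.001.patch or of {{FSNamesystem}}.

```java
/**
 * Hypothetical sketch, not code from the patch: builds a single safemode
 * status line that always carries the threshold information and appends
 * the orphaned-block warning instead of returning it on its own.
 */
public class SafeModeReportSketch {

    // All parameter names are assumptions made for this illustration.
    static String buildTip(long reportedBlocks, long expectedBlocks,
                           long orphanedBlocks) {
        StringBuilder msg = new StringBuilder();

        // Threshold part: always present, so the administrator can see
        // whether the block threshold has been reached.
        msg.append(reportedBlocks)
           .append(" of expected ")
           .append(expectedBlocks)
           .append(" blocks have been reported.");

        // Warning part: appended only when datanodes reported blocks that
        // the namespace does not reference.
        if (orphanedBlocks > 0) {
            msg.append(" Additionally, ")
               .append(orphanedBlocks)
               .append(" blocks have been reported which do not correspond")
               .append(" to any file in the namespace. Forcing exit of")
               .append(" safemode will unrecoverably remove those data blocks.");
        }
        return msg.toString();
    }

    public static void main(String[] args) {
        // Figures taken from the example in the issue description.
        System.out.println(buildTip(900000L, 1000000L, 10000L));
    }
}
```

Appending the warning rather than returning early keeps the threshold status visible, which addresses the concern that an administrator might run {{forceExit}} before the block thresholds are reached.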