[ 
https://issues.apache.org/jira/browse/HDFS-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936086#comment-14936086
 ] 

Arpit Agarwal commented on HDFS-4015:
-------------------------------------

Hi [~anu], thanks for this improvement. A few comments below; I haven't 
reviewed the test case yet.

# ClientProtocol.java:729: Perhaps we can describe it as "bytes that are at 
risk for deletion."?
# DFSAdmin.java:474: This can happen even without blocks with future generation 
stamps, e.g. when a DN is restarted after a long downtime and reports blocks 
for deleted files.
# FSNamesystem.java:4438: For the turn-off tip, should we check 
{{getBytesInFuture}} after the threshold of reported blocks is reached? One 
potential issue is that the administrator may see this message and immediately 
run {{-forceExit}} even before block thresholds are reached.
# FSNamesystem.java:4445: "you are ok with data loss." might also be confusing. 
Perhaps we can say "if you are certain that the NameNode was started with the 
correct FsImage and edit logs."
# FSNamesystem.java:4631: Not sure how this works. {{leaveSafeMode}} will just 
return if {{isInStartupSafeMode() && (blockManager.getBytesInFuture() > 0)}}.
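
A minimal sketch of the ordering raised in comments 3 and 5, under stated 
assumptions. This is not the patch's code: the class {{SafeModeSketch}}, its 
fields, the {{report}} method, the {{force}} parameter and the message texts 
are all hypothetical and only illustrate the idea. The tip mentions 
{{-forceExit}} only once the block threshold is reached, and a plain leave 
request is refused while bytes in future remain.

```java
/**
 * Hypothetical illustration only; none of these names are taken from
 * FSNamesystem or BlockManager.
 */
public class SafeModeSketch {
    private final long blockThreshold;        // blocks required before safe mode may end
    private long blocksReported;              // blocks reported by DataNodes so far
    private long bytesInFuture;               // bytes in blocks the namespace does not know about
    private boolean inStartupSafeMode = true; // still in the safe mode entered at startup

    public SafeModeSketch(long blockThreshold) {
        this.blockThreshold = blockThreshold;
    }

    /** Records a (simplified) block report from a DataNode. */
    public void report(long blocks, long futureBytes) {
        blocksReported += blocks;
        bytesInFuture += futureBytes;
    }

    public boolean thresholdReached() {
        return blocksReported >= blockThreshold;
    }

    public boolean isInStartupSafeMode() {
        return inStartupSafeMode;
    }

    /** Builds the turn-off tip; -forceExit is mentioned only after the threshold is reached. */
    public String turnOffTip() {
        if (!thresholdReached()) {
            return "Waiting for block reports: " + blocksReported + " of "
                + blockThreshold + " blocks reported.";
        }
        if (bytesInFuture > 0) {
            return "Block threshold reached, but " + bytesInFuture
                + " bytes are at risk for deletion. Use -forceExit only if you are"
                + " certain that the NameNode was started with the correct FsImage"
                + " and edit logs.";
        }
        return "Safe mode can be turned off.";
    }

    /** Returns true if safe mode was left; a non-forced request is refused while bytes are at risk. */
    public boolean leaveSafeMode(boolean force) {
        if (inStartupSafeMode && bytesInFuture > 0 && !force) {
            return false; // the early return questioned in comment 5
        }
        inStartupSafeMode = false;
        return true;
    }
}
```

With this shape, a leave request without {{force}} stays a no-op while bytes 
in future remain, which is the early return that comment 5 asks about. A 
forced exit needs a path that skips that check.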

Comments also posted at 
https://github.com/arp7/hadoop/commit/f16f4525a9a814f0945e76af55ad06b5fc18ecb7

> Safemode should count and report orphaned blocks
> ------------------------------------------------
>
>                 Key: HDFS-4015
>                 URL: https://issues.apache.org/jira/browse/HDFS-4015
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Todd Lipcon
>            Assignee: Anu Engineer
>         Attachments: HDFS-4015.001.patch, dfsAdmin-report_with_forceExit.png, 
> dfsHealth.html.message.png
>
>
> The safemode status currently reports the number of unique reported blocks 
> compared to the total number of blocks referenced by the namespace. However, 
> it does not report the inverse: blocks which are reported by datanodes but 
> not referenced by the namespace.
> In the case that an admin accidentally starts up from an old image, this can 
> be confusing: safemode and fsck will show "corrupt files", which are the 
> files which actually have been deleted but got resurrected by restarting from 
> the old image. This will convince them that they can safely force leave 
> safemode and remove these files -- after all, they know that those files 
> should really have been deleted. However, they're not aware that leaving 
> safemode will also unrecoverably delete a bunch of other block files which 
> have been orphaned due to the namespace rollback.
> I'd like to consider reporting something like: "900000 of expected 1000000 
> blocks have been reported. Additionally, 10000 blocks have been reported 
> which do not correspond to any file in the namespace. Forcing exit of 
> safemode will unrecoverably remove those data blocks"
> Whether this statistic is also used for some kind of "inverse safe mode" is 
> the logical next step, but just reporting it as a warning seems easy enough 
> to accomplish and worth doing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
