[ https://issues.apache.org/jira/browse/HDFS-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936147#comment-14936147 ]

Anu Engineer commented on HDFS-4015:
------------------------------------

Hi [~arpitagarwal], thanks for the review and comments. I will wait for the 
rest of the review comments and post a new patch.

bq. ClientProtocol.java:729: Perhaps we can describe it as "bytes that are at 
risk for deletion."?
Makes sense, I will modify this.

bq. DFSAdmin.java:474: This can happen even without blocks with future 
generation stamps e.g. DN is restarted after a long downtime and reports blocks 
for deleted files.
In this patch we track blocks whose generation stamp is greater than the 
highest generation stamp currently known to the NN. I have made the assumption 
that if a DN comes back online and reports blocks for files that have been 
deleted, the generation stamps of those blocks will be less than the NN's 
current generation stamp. Please let me know if you think this assumption is 
not valid or breaks down in special cases. Could this happen with V1 vs V2 
generation stamps?
bq. FSNamesystem.java:4438: For turn-off tip, should we check getBytesInFuture 
after the threshold of reported blocks is reached? One potential issue is that 
the administrator may see this message and immediately run -forceExit even 
before block thresholds are reached.

With this patch we are slightly changing the behavior of safemode: even if the 
reported-block threshold is reached, we will not exit startup safemode while 
blocks with future generation stamps are present, under the assumption that 
the NN metadata has been modified.
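In other words, the startup safemode exit condition gains a second term. A 
minimal sketch of the decision, with illustrative field names:

{code:java}
/** Illustrative model only -- not the actual FSNamesystem safemode code. */
class StartupSafeModeCheck {
  long blockSafe;       // blocks reported so far
  long blockThreshold;  // blocks required to satisfy the safemode threshold
  long bytesInFuture;   // from the tracker sketched earlier

  /** Exit only when the threshold is met AND no reported bytes are at risk. */
  boolean canLeaveStartupSafeMode() {
    return blockSafe >= blockThreshold && bytesInFuture == 0;
  }
}
{code}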

bq. FSNamesystem.java:4445: "you are ok with data loss." might also be 
confusing. Perhaps we can say "if you are certain that the NameNode was started 
with the correct FsImage and edit logs."
Agreed, I will modify this warning. But we also have the case where someone is 
deliberately replacing the NN metadata and is ok with the data loss.

bq. FSNamesystem.java:4631: Not sure how this works. leaveSafeMode will just 
return if (isInStartupSafeMode() && (blockManager.getBytesInFuture() > 0))
As the error message says, we are refusing to leave safemode: we want the user 
to either restart the NN with the right metadata files or explicitly run 
-forceExit before we move out of safemode.
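To spell out the flow I have in mind (again only a sketch; the names below are 
illustrative, the real change lives in FSNamesystem and DFSAdmin):

{code:java}
/** Illustrative model only -- not the actual FSNamesystem.leaveSafeMode(). */
class SafeModeExit {
  boolean inStartupSafeMode = true;
  long bytesInFuture; // bytes belonging to blocks with future generation stamps

  /** @param force true only on the explicit -forceExit path from DFSAdmin */
  boolean leaveSafeMode(boolean force) {
    if (inStartupSafeMode && bytesInFuture > 0 && !force) {
      // Refuse: the admin should restart the NN with the correct fsimage and
      // edits, or explicitly acknowledge the data loss via -forceExit.
      System.err.println("Refusing to leave safemode: " + bytesInFuture
          + " bytes are at risk of deletion.");
      return false;
    }
    inStartupSafeMode = false;
    return true;
  }
}
{code}

So a plain "leave" keeps returning false, and only the new forceExit option 
(see the attached dfsAdmin-report_with_forceExit.png) takes the force path.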



> Safemode should count and report orphaned blocks
> ------------------------------------------------
>
>                 Key: HDFS-4015
>                 URL: https://issues.apache.org/jira/browse/HDFS-4015
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Todd Lipcon
>            Assignee: Anu Engineer
>         Attachments: HDFS-4015.001.patch, dfsAdmin-report_with_forceExit.png, 
> dfsHealth.html.message.png
>
>
> The safemode status currently reports the number of unique reported blocks 
> compared to the total number of blocks referenced by the namespace. However, 
> it does not report the inverse: blocks which are reported by datanodes but 
> not referenced by the namespace.
> In the case that an admin accidentally starts up from an old image, this can 
> be confusing: safemode and fsck will show "corrupt files", which are the 
> files which actually have been deleted but got resurrected by restarting from 
> the old image. This will convince them that they can safely force leave 
> safemode and remove these files -- after all, they know that those files 
> should really have been deleted. However, they're not aware that leaving 
> safemode will also unrecoverably delete a bunch of other block files which 
> have been orphaned due to the namespace rollback.
> I'd like to consider reporting something like: "900000 of expected 1000000 
> blocks have been reported. Additionally, 10000 blocks have been reported 
> which do not correspond to any file in the namespace. Forcing exit of 
> safemode will unrecoverably remove those data blocks"
> Whether this statistic is also used for some kind of "inverse safe mode" is 
> the logical next step, but just reporting it as a warning seems easy enough 
> to accomplish and worth doing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
