[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923315#action_12923315
 ] 

dhruba borthakur commented on HDFS-903:
---------------------------------------

I agree with Konstantin/Hairong that the MD5 signature should be part of the 
CheckpointSignature. 

It would have been nice if the contents of the VERSION file were stored as a 
header record at the beginning of the fsimage file itself. (I now remember the 
initial reason why the VERSION file exists separately from the fsimage: the 
datanode needs the VERSION file too for its block directories, and the datanode 
does not have an fsimage file.) Given that, it should be fine to store the 
checksum in the VERSION file. Also, the algorithm used to compute the checksum 
need not be configurable; it could be hardcoded to generate an MD5 checksum.
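
A minimal sketch of the idea, not the actual HDFS code: it assumes the VERSION 
file is a Java properties file and uses a hypothetical key, imageMD5Digest, to 
hold the checksum. The class and method names are also made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Properties;

class ImageChecksumSketch {
    // Hypothetical property name; the real VERSION file may use a different key.
    static final String MD5_KEY = "imageMD5Digest";

    /** Computes the MD5 digest of a file and returns it as lowercase hex. */
    static String md5Hex(Path file) throws IOException {
        MessageDigest md;
        try {
            // The algorithm is hardcoded rather than read from configuration.
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Records the image checksum in VERSION, keeping any existing properties. */
    static void saveChecksum(Path image, Path versionFile) throws IOException {
        Properties props = new Properties();
        if (Files.exists(versionFile)) {
            try (InputStream in = Files.newInputStream(versionFile)) {
                props.load(in);
            }
        }
        props.setProperty(MD5_KEY, md5Hex(image));
        try (OutputStream out = Files.newOutputStream(versionFile)) {
            props.store(out, null);
        }
    }

    /** Returns true only if the image matches the checksum stored in VERSION. */
    static boolean verifyChecksum(Path image, Path versionFile) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(versionFile)) {
            props.load(in);
        }
        String expected = props.getProperty(MD5_KEY);
        // A missing checksum is treated as a failure in this sketch.
        return expected != null && expected.equals(md5Hex(image));
    }
}
```

Under these assumptions, saveChecksum would run whenever a new image is written, 
and verifyChecksum would run at startup so that a mismatching image is rejected 
instead of being loaded.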

> NN should verify images and edit logs on startup
> ------------------------------------------------
>
>                 Key: HDFS-903
>                 URL: https://issues.apache.org/jira/browse/HDFS-903
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>            Reporter: Eli Collins
>            Assignee: Hairong Kuang
>            Priority: Critical
>             Fix For: 0.22.0
>
>
> I was playing around with corrupting fsimage and edits logs when there are 
> multiple dfs.name.dirs specified. I noticed that:
> * As long as your corruption does not make the image invalid (e.g. by changing 
> an opcode to an invalid one), HDFS doesn't notice and happily uses a 
> corrupt image or applies the corrupt edit.
> * If the first image in dfs.name.dir is "valid", it replaces the other copies 
> in the other name.dirs with this first image, even if they are different. That 
> is, if the first image is actually invalid/old/corrupt metadata, then you've lost 
> your valid metadata, which can result in data loss if the namenode garbage 
> collects blocks that it thinks are no longer used.
> How about we maintain a checksum as part of the image and edit log, check 
> those on startup, and refuse to start up if they are different? Or at least 
> provide a configuration option to do so if people are worried about the 
> overhead of maintaining checksums of these files. Even if we assume 
> dfs.name.dir is reliable storage this guards against operator errors.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
