[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036705#comment-13036705
 ] 

Hari commented on HDFS-903:
---------------------------

With this change, the Backupnode downloads the image and edits files from the 
Namenode on every checkpoint, because the checkpoint times of the Namenode and 
the Backupnode always differ. This happens because the Namenode resets its 
checkpoint time every time: we ignore renewCheckpointTime and explicitly pass 
true to rollFsimage during endcheckpoint. Isn't this a problem, or 
am I missing something?
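
To make the concern concrete, here is a toy model of the behaviour described 
above. It is not Hadoop code: the class, the Node type and the runCheckpoint 
method are hypothetical, and the assumption that the Backupnode downloads 
whenever the two checkpoint times differ is taken only from this comment.

```java
// Toy model only; none of these names exist in Hadoop.
class CheckpointTimeSketch {
    /** Stand-in for a node's storage; it tracks nothing but a checkpoint time. */
    static class Node {
        long checkpointTime;
        Node(long checkpointTime) { this.checkpointTime = checkpointTime; }
    }

    /** Logical clock used to hand out fresh checkpoint times. */
    static long clock = 1000;

    /**
     * Runs one checkpoint cycle and reports whether the backup had to
     * download the image and edits (assumed: it does so when the times differ).
     */
    static boolean runCheckpoint(Node namenode, Node backup, boolean renewCheckpointTime) {
        boolean downloaded = namenode.checkpointTime != backup.checkpointTime;
        if (downloaded) {
            // After downloading, the backup is in sync with the namenode.
            backup.checkpointTime = namenode.checkpointTime;
        }
        // End of checkpoint: the namenode optionally renews its checkpoint time.
        if (renewCheckpointTime) {
            namenode.checkpointTime = ++clock;
        }
        return downloaded;
    }

    public static void main(String[] args) {
        Node nn = new Node(1);
        Node bn = new Node(0);
        for (int i = 1; i <= 3; i++) {
            System.out.println("renew=true,  cycle " + i + ": downloaded=" + runCheckpoint(nn, bn, true));
        }
        Node nn2 = new Node(1);
        Node bn2 = new Node(0);
        for (int i = 1; i <= 3; i++) {
            System.out.println("renew=false, cycle " + i + ": downloaded=" + runCheckpoint(nn2, bn2, false));
        }
    }
}
```

In this model, always passing true leaves the two times different after every 
cycle, so every cycle downloads. Passing false lets the times converge after 
the first download.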

> NN should verify images and edit logs on startup
> ------------------------------------------------
>
>                 Key: HDFS-903
>                 URL: https://issues.apache.org/jira/browse/HDFS-903
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>            Reporter: Eli Collins
>            Assignee: Hairong Kuang
>            Priority: Critical
>             Fix For: 0.22.0
>
>         Attachments: trunkChecksumImage.patch, trunkChecksumImage1.patch, 
> trunkChecksumImage2.patch, trunkChecksumImage3.patch, 
> trunkChecksumImage4.patch
>
>
> I was playing around with corrupting fsimage and edits logs when there are 
> multiple dfs.name.dirs specified. I noticed that:
> * As long as your corruption does not make the image invalid (e.g. by 
> changing an opcode to an invalid one), HDFS doesn't notice and happily uses 
> a corrupt image or applies the corrupt edit.
> * If the first image in dfs.name.dir is "valid", it replaces the other copies 
> in the other name.dirs, even if they are different, with this first image, 
> i.e. if the first image is actually invalid/old/corrupt metadata then you've 
> lost your valid metadata, which can result in data loss if the namenode 
> garbage collects blocks that it thinks are no longer used.
>
> How about we maintain a checksum as part of the image and edit log, check 
> those on startup, and refuse to start up if they are different? Or at least 
> provide a configuration option to do so if people are worried about the 
> overhead of maintaining checksums of these files. Even if we assume 
> dfs.name.dir is reliable storage, this guards against operator errors.
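
The checksum proposal in the description can be sketched roughly as follows. 
This is a minimal illustration, not the attached patch: the class and method 
names are hypothetical, MD5 is an assumed choice of digest, and the recorded 
checksum is simply passed in rather than read from any real HDFS metadata file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; these names are not part of HDFS.
class ImageChecksumSketch {
    /** Streams the file through an MD5 digest and returns it as lowercase hex. */
    static String md5Hex(Path file) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // Reading drives the digest; nothing else to do.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Refuses to proceed (throws) if the file does not match the recorded checksum. */
    static void verifyOrRefuse(Path file, String recordedChecksum) throws IOException {
        String actual = md5Hex(file);
        if (!actual.equals(recordedChecksum)) {
            throw new IOException("Checksum mismatch for " + file
                    + ": expected " + recordedChecksum + " but computed " + actual);
        }
    }

    public static void main(String[] args) throws IOException {
        Path image = Files.createTempFile("fsimage", ".tmp");
        Files.write(image, "namespace-v1".getBytes(StandardCharsets.UTF_8));
        String recorded = md5Hex(image);      // checksum saved when the image is written
        verifyOrRefuse(image, recorded);      // an intact image passes

        Files.write(image, "namespace-vX".getBytes(StandardCharsets.UTF_8)); // simulate corruption
        try {
            verifyOrRefuse(image, recorded);
        } catch (IOException e) {
            System.out.println("Refusing to start: " + e.getMessage());
        }
        Files.delete(image);
    }
}
```

A corruption that still parses as a valid image is caught here because the 
check compares bytes against the recorded digest instead of relying on the 
loader to notice.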

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
