[ https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036705#comment-13036705 ]
Hari commented on HDFS-903:
---------------------------

With this change, the Backupnode downloads the image and edits files from the Namenode every time, because the checkpoint times of the Namenode and the Backupnode always differ. This happens because the Namenode resets its checkpoint time on every checkpoint: we ignore renewCheckpointTime and explicitly pass true to rollFsimage during endcheckpoint. Isn't this a problem, or am I missing something?

> NN should verify images and edit logs on startup
> ------------------------------------------------
>
>                 Key: HDFS-903
>                 URL: https://issues.apache.org/jira/browse/HDFS-903
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>            Reporter: Eli Collins
>            Assignee: Hairong Kuang
>            Priority: Critical
>             Fix For: 0.22.0
>
>         Attachments: trunkChecksumImage.patch, trunkChecksumImage1.patch,
>                      trunkChecksumImage2.patch, trunkChecksumImage3.patch,
>                      trunkChecksumImage4.patch
>
>
> I was playing around with corrupting fsimage and edits logs when there are
> multiple dfs.name.dirs specified. I noticed that:
> * As long as your corruption does not make the image invalid, e.g. changes an
> opcode so it's an invalid opcode, HDFS doesn't notice and happily uses a
> corrupt image or applies the corrupt edit.
> * If the first image in dfs.name.dir is "valid" it replaces the other copies
> in the other name.dirs, even if they are different, with this first image, i.e.
> if the first image is actually invalid/old/corrupt metadata then you've lost
> your valid metadata, which can result in data loss if the namenode garbage
> collects blocks that it thinks are no longer used.
> How about we maintain a checksum as part of the image and edit log and check
> those on startup and refuse to start up if they are different. Or at least
> provide a configuration option to do so if people are worried about the
> overhead of maintaining checksums of these files. Even if we assume
> dfs.name.dir is reliable storage this guards against operator errors.
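
For readers following the proposal in the description, here is a rough sketch of the "check a stored checksum on startup and refuse to start if it differs" idea. It is not taken from any of the attached patches: the class name ImageChecksumSketch, the method names, and the choice of MD5 are all assumptions made for illustration, and a real implementation inside the namenode would likely look quite different.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not part of HDFS: checks a file against a recorded digest. */
final class ImageChecksumSketch {

    /** Streams the file through MD5 (an assumed algorithm) and returns lowercase hex. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // Reading through the DigestInputStream updates the digest.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Throws instead of continuing when the file does not match the recorded digest. */
    static void verifyOrFail(Path file, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        String actual = md5Hex(file);
        if (!actual.equalsIgnoreCase(expectedHex)) {
            throw new IOException("Checksum mismatch for " + file
                    + ": expected " + expectedHex + ", got " + actual);
        }
    }
}
```

A caller would record the digest when the image is written, then call verifyOrFail on each copy under dfs.name.dir before loading it, so that a silently altered copy is rejected rather than propagated to the other directories.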