[ https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927749#action_12927749 ]
Konstantin Shvachko commented on HDFS-903:
------------------------------------------

Sounds like a plan.

> NN should verify images and edit logs on startup
> ------------------------------------------------
>
>                 Key: HDFS-903
>                 URL: https://issues.apache.org/jira/browse/HDFS-903
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>            Reporter: Eli Collins
>            Assignee: Hairong Kuang
>            Priority: Critical
>             Fix For: 0.22.0
>
>         Attachments: trunkChecksumImage.patch, trunkChecksumImage1.patch
>
>
> I was playing around with corrupting the fsimage and edit logs when there are multiple dfs.name.dirs specified. I noticed that:
> * As long as the corruption does not make the image invalid (e.g. it changes an opcode to another valid opcode), HDFS doesn't notice and happily uses the corrupt image or applies the corrupt edit.
> * If the first image in dfs.name.dir is "valid", it replaces the copies in the other name.dirs with this first image, even if they differ. So if the first image actually contains invalid/old/corrupt metadata, then you've lost your valid metadata, which can result in data loss if the namenode garbage collects blocks that it thinks are no longer used.
> How about we maintain a checksum as part of the image and edit log, check it on startup, and refuse to start up if they differ? Or at least provide a configuration option to do so, for people worried about the overhead of maintaining checksums of these files. Even if we assume dfs.name.dir is reliable storage, this guards against operator errors.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
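The idea above — checksum the image, verify on startup, refuse to load a copy whose digest doesn't match — can be sketched as follows. This is only an illustration, not the implementation in the attached patches; the `ImageChecksum` class name, the choice of MD5, and the sample opcode bytes are all assumptions for the example.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ImageChecksum {

    // Digest an in-memory byte buffer with MD5 (illustrative choice of hash).
    static byte[] digest(byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(data);
    }

    // Stream a file through MD5 so a large fsimage need not fit in memory.
    static byte[] digest(File f) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new FileInputStream(f)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        // Write a tiny stand-in for an image file (hypothetical opcode bytes).
        File good = File.createTempFile("fsimage", null);
        try (FileOutputStream out = new FileOutputStream(good)) {
            out.write("OP_ADD /user/foo".getBytes("UTF-8"));
        }
        // Simulate a single-byte corruption of the same image.
        byte[] corrupt = "OP_ADD /user/fo0".getBytes("UTF-8");

        // On startup the NN would compare the stored checksum against a
        // freshly computed one and reject any copy that doesn't match.
        boolean ok = MessageDigest.isEqual(digest(good), digest(corrupt));
        System.out.println(ok
                ? "image verified"
                : "checksum mismatch: refuse to load this copy");
        // prints "checksum mismatch: refuse to load this copy"
        good.delete();
    }
}
```

With a per-directory check like this, a corrupt first image in dfs.name.dir would be rejected rather than silently propagated over the valid copies in the other name.dirs.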