[ https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923084#action_12923084 ]
Eli Collins commented on HDFS-903:
----------------------------------

Hey Hairong, I haven't had a chance to work on this yet; feel free to grab it. Agreed, this would work well with HDFS-1458. Thanks, Eli

> NN should verify images and edit logs on startup
> ------------------------------------------------
>
>                 Key: HDFS-903
>                 URL: https://issues.apache.org/jira/browse/HDFS-903
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>            Priority: Critical
>
> I was playing around with corrupting the fsimage and edit logs when multiple dfs.name.dirs are specified. I noticed that:
> * As long as the corruption does not make the image invalid (e.g. changing an opcode so it's an invalid opcode), HDFS doesn't notice and happily uses the corrupt image or applies the corrupt edit.
> * If the first image in dfs.name.dir is "valid", it replaces the copies in the other name.dirs with this first image, even if they differ. That is, if the first image actually contains invalid/old/corrupt metadata, then you've lost your valid metadata, which can result in data loss if the namenode garbage collects blocks that it thinks are no longer used.
> How about we maintain a checksum as part of the image and edit log, check those on startup, and refuse to start up if they differ? Or at least provide a configuration option to do so, for people worried about the overhead of maintaining checksums of these files. Even if we assume dfs.name.dir is reliable storage, this guards against operator errors.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
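The checksum idea proposed above could be sketched roughly as follows. This is only a minimal illustration of the verify-on-load approach, not the eventual HDFS implementation; the `ImageChecksum` class and its method names are hypothetical, and MD5 is just one possible digest choice. The key point is that the digest saved at write time is recomputed at load time, and a mismatch means the copy must be refused rather than propagated to the other name.dirs.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: store a digest alongside the image/edit-log bytes at
// save time, recompute it on startup, and refuse to load on a mismatch.
public class ImageChecksum {

    // Compute an MD5 digest of the file contents (written out at save time).
    static byte[] digestOf(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available in the JDK
        }
    }

    // True when the recomputed digest matches the one saved with the image.
    static boolean verify(byte[] imageBytes, byte[] savedDigest) {
        return MessageDigest.isEqual(digestOf(imageBytes), savedDigest);
    }

    public static void main(String[] args) {
        byte[] image = "fsimage-contents".getBytes();
        byte[] saved = digestOf(image);             // persisted at save time

        System.out.println(verify(image, saved));   // intact copy -> true

        byte[] corrupt = image.clone();
        corrupt[0] ^= 0x01;                         // flip one bit
        System.out.println(verify(corrupt, saved)); // corrupt copy -> false
    }
}
```

With a check like this, a name.dir whose image fails verification would be skipped (or the namenode would refuse to start, per the proposal) instead of being treated as the authoritative first copy.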