[jira] [Commented] (HDFS-903) NN should verify images and edit logs on startup

2011-05-20 Thread Hari (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036705#comment-13036705
 ] 

Hari commented on HDFS-903:
---

With this change, the Backupnode downloads the image and edit files from the 
namenode every time, because a difference in checkpoint times is always 
maintained between the Namenode and the Backupnode. This happens because the 
Namenode resets its checkpoint time every time: we ignore renewCheckpointTime 
and pass true explicitly to rollFsimage during endCheckpoint. Isn't this a 
problem, or am I missing something?

> NN should verify images and edit logs on startup
> 
>
> Key: HDFS-903
> URL: https://issues.apache.org/jira/browse/HDFS-903
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Reporter: Eli Collins
>Assignee: Hairong Kuang
>Priority: Critical
> Fix For: 0.22.0
>
> Attachments: trunkChecksumImage.patch, trunkChecksumImage1.patch, 
> trunkChecksumImage2.patch, trunkChecksumImage3.patch, 
> trunkChecksumImage4.patch
>
>
> I was playing around with corrupting the fsimage and edit logs when multiple 
> dfs.name.dirs are specified. I noticed that:
> * As long as the corruption does not make the image invalid (e.g. by 
> changing an opcode to an invalid one), HDFS doesn't notice and happily uses 
> the corrupt image or applies the corrupt edit.
> * If the first image in dfs.name.dir is "valid", it replaces the copies in 
> the other name.dirs with this first image, even if they differ; i.e. if the 
> first image actually holds invalid/old/corrupt metadata, then you've lost 
> your valid metadata, which can result in data loss if the namenode garbage 
> collects blocks that it thinks are no longer used.
> How about we maintain a checksum as part of the image and edit log, check 
> them on startup, and refuse to start up if they differ? Or at least provide 
> a configuration option to do so, if people are worried about the overhead of 
> maintaining checksums of these files. Even if we assume dfs.name.dir is 
> reliable storage, this guards against operator errors.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-19 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922773#action_12922773
 ] 

Hairong Kuang commented on HDFS-903:


Any progress on this?

If this is done, HDFS-1458 could check whether the primary NN has changed its 
image by simply fetching the primary NN's image checksum and comparing it to 
the secondary NN's image checksum.




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-20 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923084#action_12923084
 ] 

Eli Collins commented on HDFS-903:
--

Hey Hairong,

I haven't had a chance to work on this yet; feel free to grab it. I agree this 
would work well with HDFS-1458.

Thanks,
Eli




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-20 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923096#action_12923096
 ] 

Hairong Kuang commented on HDFS-903:


Here is the plan:
1. Generate an MD5 digest when saving an image;
2. Store the MD5 digest in the VERSION file;
3. When loading an image, generate an MD5 digest as well and compare it to the 
one stored in the VERSION file.
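A minimal sketch of this plan (the class and method names below are 
hypothetical illustrations, not code from the actual patch), using the JDK's 
DigestOutputStream/DigestInputStream to compute the digest while the image 
bytes are written and read:

{code:java}
import java.io.*;
import java.security.DigestInputStream;
import java.security.DigestOutputStream;
import java.security.MessageDigest;

class ImageChecksumSketch {

  // Write the image while feeding every byte through an MD5 digest.
  static byte[] saveWithDigest(File imageFile, byte[] imageBytes)
      throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    try (OutputStream out = new DigestOutputStream(
        new BufferedOutputStream(new FileOutputStream(imageFile)), md5)) {
      out.write(imageBytes);
    }
    return md5.digest(); // this value would be stored, e.g. in VERSION
  }

  // Re-read the image, recompute the MD5, and compare with the stored value.
  static void loadAndVerify(File imageFile, byte[] storedDigest)
      throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    try (InputStream in = new DigestInputStream(
        new BufferedInputStream(new FileInputStream(imageFile)), md5)) {
      byte[] buf = new byte[8192];
      while (in.read(buf) != -1) {
        // the image would normally be parsed here; reading the bytes
        // is enough to feed the digest
      }
    }
    if (!MessageDigest.isEqual(md5.digest(), storedDigest)) {
      throw new IOException("fsimage is corrupt: MD5 mismatch");
    }
  }
}
{code}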




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-20 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923202#action_12923202
 ] 

Hairong Kuang commented on HDFS-903:


Do we want to provide a configuration option for checksumming the image or 
not? I tend to say no. But does anybody have a use case where you would not 
want the image to be checksummed, aside from the performance concern?




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-20 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923248#action_12923248
 ] 

Hairong Kuang commented on HDFS-903:


Thought more about this. Instead of storing the image's MD5 digest in the 
VERSION file, I am thinking of storing it at the end of the image file. The 
advantage is that the image is self-contained; we do not need to worry about 
atomicity when rolling images.




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-20 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923250#action_12923250
 ] 

Todd Lipcon commented on HDFS-903:
--

I think it's better to put it in the VERSION file, since then you can use the 
command-line "md5sum" utility to check for corruption.

The atomicity issue is a little annoying, I agree. I can't think of a good 
short-term solution... HDFS-1073 will help with that, at least.
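For that command-line check to work, the digest stored in VERSION would need 
to be in the same lowercase-hex form that md5sum prints. A generic snippet 
(not from the patch) showing the usual conversion:

{code:java}
import java.security.MessageDigest;

class Md5Hex {
  // Render an MD5 digest in the lowercase-hex form that `md5sum` prints.
  static String toHex(byte[] digest) {
    StringBuilder sb = new StringBuilder(32);
    for (byte b : digest) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    byte[] d = MessageDigest.getInstance("MD5").digest("demo".getBytes());
    System.out.println(toHex(d)); // same as: echo -n demo | md5sum
  }
}
{code}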





[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923255#action_12923255
 ] 

Konstantin Shvachko commented on HDFS-903:
--

Hairong, we should also add the image MD5 digest into the CheckpointSignature. 
I believe it will be easier to do that if the digest is in VERSION. Also, 
until VERSION is written it does not matter what the image state is; once 
VERSION is written, the MD5 is there, so everything is consistent.
Would you consider adding the MD5 into CheckpointSignature?
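A rough illustration of that suggestion (the field and method names are 
hypothetical; the real CheckpointSignature carries additional storage fields):

{code:java}
import java.io.IOException;

class CheckpointSignatureSketch {
  long checkpointTime;  // identifies the checkpoint in progress
  String imageDigest;   // hex MD5 of the fsimage the checkpoint started from

  // NN-side check that it is still talking to the same checkpointer.
  void validate(CheckpointSignatureSketch other) throws IOException {
    if (checkpointTime != other.checkpointTime
        || !imageDigest.equals(other.imageDigest)) {
      throw new IOException("Inconsistent checkpoint signature");
    }
  }
}
{code}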




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-20 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923263#action_12923263
 ] 

Hairong Kuang commented on HDFS-903:


@Todd, you made a valid point.

@Konstantin, sooo good to see that you are back! Yes, I will add MD5 into 
CheckpointSignature so that HDFS-1458 can happen.




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-20 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923315#action_12923315
 ] 

dhruba borthakur commented on HDFS-903:
---

I agree with Konstantin/Hairong that the MD5 signature should be part of the 
CheckpointSignature.

It would have been nice if the contents of the VERSION file were stored as a 
header record at the beginning of the fsimage file itself (I now remember the 
initial reason why the VERSION file exists separately from the fsimage: the 
datanode needs the VERSION file too for its block directories, and the 
datanode does not have an fsimage file). Given that, it should be fine to 
store the checksum in the VERSION file. Also, the algorithm to compute the 
checksum need not be configurable; it could be hardcoded to generate an MD5 
checksum.




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-21 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923567#action_12923567
 ] 

Allen Wittenauer commented on HDFS-903:
---

> I think it's better to put it in the VERSION file, since then you can use 
> the command-line "md5sum" utility to check for corruption.

+1

This is much more operations-friendly. If an alternative is picked (which is 
fine), just keep in mind we'll need a tool built to go with this change.




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-22 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923930#action_12923930
 ] 

Suresh Srinivas commented on HDFS-903:
--

> It would have been nice if the contents of the VERSION file were stored as a 
> header record at the beginning of the fsimage file itself
Currently, VERSION creation signals the end of snapshot creation, independent 
of fsimage and edits creation. Moving VERSION into fsimage would complicate 
the current design.




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-10-29 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926498#action_12926498
 ] 

Konstantin Shvachko commented on HDFS-903:
--

The image part of the patch looks good. I liked the seamless calculation of 
the checksum with DigestInputStream.

The checkpoint part is a bit inconsistent.
The CheckpointSignature is an invariant of the current checkpoint process. It 
is the way for the NN to recognize that it is still talking with the same 
checkpointer that started the process. Therefore, we should never alter the 
signature within the same checkpoint process.
We should probably make the JavaDoc for CheckpointSignature more explicit on 
that.

The current implementation is inconsistent in two ways.
On the one hand, rollImage for the SNN sends back a new checkpoint signature 
with the digest of the new image, and this value is recorded by the NN as the 
new digest. Instead:
# The SNN should send back the digest of the original (old) image, and the NN 
should verify it against the original signature by calling 
validateStorageInfo(), as is done in endCheckpoint().
# The NN should calculate the digest by itself, not relying on the value 
passed by the SNN. It could probably be done while uploading the image from 
the SNN.

On the other hand, the BN/CN sends back the original checkpoint signature with 
the old image digest, and the NN records it as the new digest, which should 
lead to an error during reloading. Again, the NN should always calculate the 
digest by itself; it should be the only authority for its own image.

A couple of nits:
- At the end of {{FSImage.loadFSImage(File)}} the same LOG message is printed 
twice.
- An empty line was added in {{FSImage.resetVersion()}}.





[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-11-01 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927039#action_12927039
 ] 

Hairong Kuang commented on HDFS-903:


Konstantin, thank you so much for your review comments.

> The NN should calculate the digest by itself, not relying on the value 
> passed by the SNN. It could probably be done while uploading the image from 
> the SNN.
As I pointed out in HDFS-1382, an image on disk or in transmission might get 
corrupted, so a checksum the NN calculates while the image is uploaded from 
the SNN is not reliable by itself. The NN should depend on the SNN for the 
new image's checksum.

> the NN records it as the new digest, which should lead to an error during 
> reloading.
I did not get this point. Why?




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-11-01 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927041#action_12927041
 ] 

Hairong Kuang commented on HDFS-903:


Oops, I got the wrong jira number. It should be "as I pointed out in 
HDFS-1481".




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-11-01 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927072#action_12927072
 ] 

Hairong Kuang commented on HDFS-903:


I forgot to mention that I agree with comment 1. I will change rollFsImage to 
take two parameters: one representing the old checkpoint signature and the 
other representing the new image signature.




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-11-02 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927743#action_12927743
 ] 

Konstantin Shvachko commented on HDFS-903:
--

I didn't know you had discussed it already. I agree the image can be corrupted 
during transmission.
It seems logical to include the verification logic in the transmission 
process. That is, the SNN sends the checksum via the servlet; the NN, while 
receiving the uploaded image, calculates its checksum on the fly and matches 
it against the one sent by the SNN. The checksum verification can be done by 
validateCheckpointUpload(), which is already there and just needs to be 
extended.
I don't think it would be a good idea to separate the upload from the 
verification, which is what happens if you first upload and then send the 
checksum via rollFSImage() and verify inside it.
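
A sketch of that verify-while-receiving flow (the names here are illustrative, 
not the actual HDFS transfer code): the expected MD5 arrives with the 
transfer, and every received byte is fed through a digest before the image is 
accepted.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

class VerifyingReceive {
  // Persist the uploaded image and digest it in one pass, then compare
  // against the checksum the sender supplied.
  static byte[] receive(InputStream fromSender, OutputStream toDisk,
                        byte[] expectedMD5) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    DigestInputStream in = new DigestInputStream(fromSender, md5);
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
      toDisk.write(buf, 0, n);
    }
    byte[] actual = md5.digest();
    if (!MessageDigest.isEqual(actual, expectedMD5)) {
      throw new IOException("image corrupted in transit: MD5 mismatch");
    }
    return actual; // the receiving NN records the digest it computed itself
  }
}
{code}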




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-11-02 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927747#action_12927747
 ] 

Hairong Kuang commented on HDFS-903:


> It seems logical to include the verification logic in the transmission 
> process.
This seems good to me. This was one of the options that I discussed with 
Dhruba for HDFS-1481.

Probably in this jira I will send the checksum along while uploading the 
image; then in HDFS-1481 I will add the verification part.




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-11-02 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927749#action_12927749
 ] 

Konstantin Shvachko commented on HDFS-903:
--

Sounds like a plan.




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-11-04 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928515#action_12928515
 ] 

Konstantin Shvachko commented on HDFS-903:
--

Do you also need to set "&newChecksum=" in Checkpointer.uploadCheckpoint()? It 
should be the same as in SNN.putFSImage().




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-11-05 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928735#action_12928735
 ] 

Hairong Kuang commented on HDFS-903:


antPatch.sh passed except for
 [exec] -1 release audit.  The applied patch generated 97 release audit 
warnings (more than the trunk's current 1 warning).
The release warnings are all about license headers, but my patch does not add 
any new files, so I do think there is a bug in the script.

All unit tests passed except for the known failures.




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-11-07 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929439#action_12929439
 ] 

Konstantin Shvachko commented on HDFS-903:
--

+1 Looks good.




[jira] Commented: (HDFS-903) NN should verify images and edit logs on startup

2010-01-15 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801031#action_12801031
 ] 

Allen Wittenauer commented on HDFS-903:
---

One of the discussions we had about two years ago was about keeping a minimum 
of 3 versions of the fsimage and edits files so that we could do parity 
checking.
