[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix
[ https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990355#comment-12990355 ]
Boris Shkolnik commented on HDFS-1496:
Looks like the problem is that when we try to restore a storage dir, we format it, which always saves the current in-memory state into a _new_ fsimage. Instead, we should restore the storage dir without saving the state or creating a new fsimage; the image will be copied there during the checkpoint anyway. I've attached a patch to HDFS-1602 (the patch is for trunk). Please look at it and comment.
TestStorageRestore is failing after HDFS-903 fix
Key: HDFS-1496
URL: https://issues.apache.org/jira/browse/HDFS-1496
Project: Hadoop HDFS
Issue Type: Bug
Components: test
Affects Versions: 0.22.0, 0.23.0
Reporter: Konstantin Boudnik
Assignee: Hairong Kuang
Priority: Blocker
Fix For: 0.22.0
Attachments: HDFS-1496.sh, HDFS-1496.sh, HDFS-1496.sh
TestStorageRestore seems to be failing after HDFS-903 commit. Running git bisect confirms it.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
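The alternative proposed in this comment (re-admit a recovered directory without formatting it, and let the next checkpoint bring its image up to date) can be sketched as a toy model. This is only an illustration under stated assumptions: `RestoreWithoutSaveSketch`, `Dir`, `markFailed`, `attemptRestore`, and `checkpoint` are hypothetical names, not the HDFS `FSImage` or `Storage` API, and the actual patch on HDFS-1602 may work differently.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model only: hypothetical classes, not the HDFS FSImage/Storage API.
final class RestoreWithoutSaveSketch {

    /** A storage directory reduced to a name, the image it holds, and a reachability flag. */
    static final class Dir {
        final String name;
        String image;              // identifier of the fsimage currently on disk here
        boolean reachable = true;

        Dir(String name, String image) {
            this.name = name;
            this.image = image;
        }
    }

    final List<Dir> active = new ArrayList<>();
    final List<Dir> removed = new ArrayList<>();

    /** Drop a directory from the active set after an I/O failure. */
    void markFailed(Dir dir) {
        dir.reachable = false;
        if (active.remove(dir)) {
            removed.add(dir);
        }
    }

    /** Re-admit reachable directories WITHOUT writing a new image into them. */
    void attemptRestore() {
        for (Iterator<Dir> it = removed.iterator(); it.hasNext();) {
            Dir dir = it.next();
            if (dir.reachable) {
                it.remove();
                active.add(dir);   // the on-disk image is left untouched on purpose
            }
        }
    }

    /** A checkpoint writes the same merged image to every active directory. */
    void checkpoint(String mergedImage) {
        for (Dir dir : active) {
            dir.image = mergedImage;
        }
    }

    public static void main(String[] args) {
        RestoreWithoutSaveSketch storage = new RestoreWithoutSaveSketch();
        Dir good = new Dir("name1", "image-1");
        Dir flaky = new Dir("name2", "image-1");
        storage.active.add(good);
        storage.active.add(flaky);

        storage.markFailed(flaky);
        flaky.reachable = true;    // e.g. the mount comes back
        storage.attemptRestore();
        System.out.println(flaky.name + " after restore: " + flaky.image);

        storage.checkpoint("image-2");
        System.out.println(good.name + "=" + good.image + ", " + flaky.name + "=" + flaky.image);
    }
}
```

In this model no directory ever holds an image that differs from what the other active directories hold, because the only writer is the checkpoint. The sketch does not model edit logs or the window between restoration and the next checkpoint.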
[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix
[ https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989099#comment-12989099 ]
Hairong Kuang commented on HDFS-1496:
Here is the inconsistency introduced by HADOOP-4885. After failed directories are added back, each previously failed directory contains a new image plus an empty edit log, while each old good directory contains the old image plus the old edits. If the secondary namenode happens to fetch the image from a previously failed directory but the edits from an old good directory, how would checkpointing work?
[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix
[ https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989103#comment-12989103 ]
Konstantin Shvachko commented on HDFS-1496:
How do we fix it?
[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix
[ https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989104#comment-12989104 ]
Hairong Kuang commented on HDFS-1496:
As I said before, I could not figure out a good way to fix it. All I could do is disable the feature introduced by HADOOP-4885.
[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix
[ https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988372#action_12988372 ]
Konstantin Boudnik commented on HDFS-1496:
Hairong, what I am seeing on a real (0.20.2-based) cluster is that an NN storage volume which has once been removed (e.g. because of a faulty NFS mount or something) is emptied as soon as the SNN starts the checkpoint process. This happens because {{FSEditLog.synchronized void rollEditLog}} calls {{FSImage.attemptRestoreRemovedStorage}}, which effectively formats a faulty volume if it becomes available again. I guess it is possible that a checkpoint could happen before rollEditLog is called, and then the inconsistency you've mentioned might be introduced. I think it won't happen, though, because {{SecondaryNameNode.doMerge}} iterates through Storage.storageDirs, which won't contain a failed volume unless it has been restored and formatted. If all this is true, then we have a test which is failing not because the feature doesn't work but because the test needs to be changed in light of HDFS-903. Please let me know if my analysis is incorrect.
[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix
[ https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982246#action_12982246 ]
Konstantin Boudnik commented on HDFS-1496:
Oh, sorry - I've misread you. +1 on disabling the feature, because it isn't fully compatible with the existing semantics expressed by the functional tests.
[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix
[ https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979699#action_12979699 ]
Hairong Kuang commented on HDFS-1496:
I do not think that the storage directory restoration scheme introduced in HADOOP-4885 works well, because it introduces inconsistent states among the fsimage/edits directories. Each old good directory contains the old image plus the old edits, but each restored directory contains a new image with an empty edit log. This has the potential to corrupt the fsimage if the secondary NN happens to download the empty edit log from a newly restored edits directory. I could not figure out a better way to fix this problem. Is it OK if I disable this feature for now so that the unit test can pass? It is good that Dhruba already enhanced saveNameSpace in HDFS-1509; it could be used as an alternative way to restore the failed image directories.
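The mixed-directory scenario described in this comment can be illustrated with a small toy model. It is a sketch under simplifying assumptions, not HDFS code: `MixedCheckpointSketch`, `StorageDir`, and `merge` are hypothetical names, the namespace is reduced to a set of paths, and every edit is assumed to create exactly one path.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: hypothetical names, not the HDFS checkpoint code.
final class MixedCheckpointSketch {

    /** A storage directory: an image snapshot plus the edits logged since that snapshot. */
    static final class StorageDir {
        final Set<String> image;
        final List<String> edits;

        StorageDir(Set<String> image, List<String> edits) {
            this.image = image;
            this.edits = edits;
        }
    }

    /** Checkpoint result: the image with every edit replayed (each edit creates one path). */
    static Set<String> merge(Set<String> image, List<String> edits) {
        Set<String> merged = new TreeSet<>(image);
        merged.addAll(edits);
        return merged;
    }

    public static void main(String[] args) {
        // The namespace the namenode currently holds in memory.
        Set<String> live = new TreeSet<>(Arrays.asList("/a", "/b"));

        // Old good directory: old image plus the edit that created /b.
        StorageDir oldGood = new StorageDir(
                new TreeSet<>(Arrays.asList("/a")), Arrays.asList("/b"));
        // Restored directory: freshly saved image, empty edit log.
        StorageDir restored = new StorageDir(
                new TreeSet<>(live), Collections.<String>emptyList());

        // Each directory is consistent on its own.
        System.out.println("old image + old edits:   "
                + merge(oldGood.image, oldGood.edits).equals(live));
        System.out.println("new image + empty edits: "
                + merge(restored.image, restored.edits).equals(live));
        // Mixing them loses the edit that created /b.
        System.out.println("old image + empty edits: "
                + merge(oldGood.image, restored.edits).equals(live));
    }
}
```

The first two lines print `true` and the last prints `false`: either directory alone reproduces the live namespace, but the old image paired with the restored directory's empty edit log silently drops `/b`.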
[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix
[ https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931589#action_12931589 ]
Hairong Kuang commented on HDFS-1496:
This turns out to be a bug in storage directory restoration; image validation exposes the error. Currently the NN uses rollFSEdits to trigger storage directory recovery. The recovery may trigger a save of the namespace to the newly restored directory, which as a result changes the in-memory image digest. Later on, however, the image/edits are fetched from an old storage directory, which causes the checksum mismatch. The problem with this storage restoration scheme is that it makes the on-disk state of the storage directories inconsistent with one another.
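The checksum mismatch described in this comment can be reproduced in miniature. The following is a minimal sketch, assuming the image is just a byte array and the in-memory digest is its MD5; `DigestMismatchSketch`, `md5`, and `imageMatchesExpectedDigest` are hypothetical helpers and do not correspond to the HDFS-903 implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Toy model only: hypothetical helpers, not the HDFS-903 validation code.
final class DigestMismatchSketch {

    /** MD5 of a byte array (MD5 is available on every standard Java platform). */
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Validation step: does the fetched image hash to the digest we expect? */
    static boolean imageMatchesExpectedDigest(byte[] fetchedImage, byte[] expectedDigest) {
        return Arrays.equals(md5(fetchedImage), expectedDigest);
    }

    public static void main(String[] args) {
        byte[] oldImage = "namespace@checkpoint-1".getBytes(StandardCharsets.UTF_8);
        byte[] newImage = "namespace@checkpoint-1+edits".getBytes(StandardCharsets.UTF_8);

        // Before restoration, the in-memory digest describes the old image.
        byte[] inMemoryDigest = md5(oldImage);
        System.out.println("before restore: "
                + imageMatchesExpectedDigest(oldImage, inMemoryDigest));

        // Restoration saves the current namespace into the recovered directory,
        // which replaces the in-memory digest with that of the new image.
        inMemoryDigest = md5(newImage);

        // An image fetched from an old directory no longer validates.
        System.out.println("after restore:  "
                + imageMatchesExpectedDigest(oldImage, inMemoryDigest));
    }
}
```

The first check prints `true` and the second prints `false`: nothing is wrong with the old image itself, but the digest it is compared against now belongs to a different image.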
[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix
[ https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931344#action_12931344 ]
Konstantin Boudnik commented on HDFS-1496:
According to git bisect
{noformat}
23f1ecd155ba2cb6b22eed42541cad1d1bff329a is the first bad commit
commit 23f1ecd155ba2cb6b22eed42541cad1d1bff329a
Author: Hairong Kuang hair...@apache.org
Date: Mon Nov 8 06:49:32 2010 +

    HDFS-903. Support fsimage validation using MD5 checksum. Contributed by Hairong Kuang.

    git-svn-id: https://svn.apache.org/repos/asf/hadoop/hdfs/tr...@1032470 13f79535-47bb-0310-9956-ffa450edef68

:100644 100644 602f09709e3d09a62d5a77cfbd010d9c476b77c7 506f045ff0c05b10140c890b06f4e8d8405491fd M CHANGES.txt
:04 04 d31ddaa937084215931b28da37a4a3e81e9ec487 4cbbdb1597c2c66ce9da5e2a1dccdd5daa45ab6d M src
{noformat}
From offline conversation with Hairong it seems totally possible.