[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix

2011-02-03 Thread Boris Shkolnik (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990355#comment-12990355
 ] 

Boris Shkolnik commented on HDFS-1496:
--

Looks like the problem is that when we try to restore a storage dir, we 
format it, which always saves the current in-memory state into a _new_ 
fsimage. Instead, we should restore the storage dir without saving the state 
or creating a new fsimage. The image will be copied there during the 
checkpoint anyway. I've attached the patch to HDFS-1602 (the patch is for 
trunk). Please look at it and comment.
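
The difference between the two restore strategies can be sketched with a toy 
model. This is only an illustration in Python, not Hadoop code: StorageDir, 
restore_with_save, restore_without_save and checkpoint are hypothetical names, 
images are reduced to string labels, and the assumption that a checkpoint 
writes the merged image to every active directory is taken from the comment 
above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StorageDir:
    """Toy stand-in for a NameNode storage directory (hypothetical)."""
    name: str
    image: Optional[str] = None        # label of the fsimage stored here
    edits: Optional[List[str]] = None  # edit log entries stored here


def restore_with_save(d: StorageDir, in_memory_state: str) -> None:
    # Behaviour described in the comment: formatting the restored directory
    # saves the current in-memory state as a new image with an empty edit log.
    d.image = in_memory_state
    d.edits = []


def restore_without_save(d: StorageDir) -> None:
    # Proposed behaviour: bring the directory back without writing an image.
    d.image = None
    d.edits = None


def checkpoint(dirs: List[StorageDir], merged_image: str) -> None:
    # Assumption: a checkpoint copies the merged image to every active dir.
    for d in dirs:
        d.image = merged_image
        d.edits = []


def distinct_images(dirs: List[StorageDir]) -> set:
    # Collect the image labels that currently exist on disk.
    return {d.image for d in dirs if d.image is not None}


good = StorageDir("good", image="image-v1", edits=["op1", "op2"])

# Restore with save: two different images coexist on disk.
restored_a = StorageDir("restored")
restore_with_save(restored_a, "image-v1+op1+op2")
images_with_save = distinct_images([good, restored_a])

# Restore without save: only the old image exists until the next checkpoint.
restored_b = StorageDir("restored")
restore_without_save(restored_b)
images_without_save = distinct_images([good, restored_b])

# The checkpoint then brings every directory to the same image.
checkpoint([good, restored_b], "image-v2")
images_after_checkpoint = distinct_images([good, restored_b])

print(images_with_save, images_without_save, images_after_checkpoint)
```

In this model, restoring without a save never leaves two different images on 
disk at the same time, which is the property the proposal relies on.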

 TestStorageRestore is failing after HDFS-903 fix
 

 Key: HDFS-1496
 URL: https://issues.apache.org/jira/browse/HDFS-1496
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Affects Versions: 0.22.0, 0.23.0
Reporter: Konstantin Boudnik
Assignee: Hairong Kuang
Priority: Blocker
 Fix For: 0.22.0

 Attachments: HDFS-1496.sh, HDFS-1496.sh, HDFS-1496.sh


 TestStorageRestore seems to be failing after HDFS-903 commit. Running git 
 bisect confirms it.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix

2011-02-01 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989099#comment-12989099
 ] 

Hairong Kuang commented on HDFS-1496:
-

Here is the inconsistency introduced by HADOOP-4885. After failed 
directories are added back, each failed directory contains a new image + an 
empty edit log, while each old good directory contains the old image + old 
edits. If the secondary namenode happens to fetch the image from a failed 
directory but the edits from an old good directory, how would checkpointing 
work?





[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix

2011-02-01 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989103#comment-12989103
 ] 

Konstantin Shvachko commented on HDFS-1496:
---

How do we fix it?





[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix

2011-02-01 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989104#comment-12989104
 ] 

Hairong Kuang commented on HDFS-1496:
-

As I said before, I could not figure out a good way to fix it. All I could do 
is to disable the feature introduced by HADOOP-4885.





[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix

2011-01-28 Thread Konstantin Boudnik (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988372#action_12988372
 ] 

Konstantin Boudnik commented on HDFS-1496:
--

Hairong, what I am seeing on a real cluster (0.20.2 based) is that an NN 
storage volume which has once been removed (e.g. because of a faulty NFS 
mount or something) is emptied as soon as the SNN starts the checkpoint 
process. This happens because {{FSEditLog.synchronized void rollEditLog}} 
calls {{FSImage.attemptRestoreRemovedStorage}} and effectively formats a 
faulty volume if it becomes available.

I guess it is possible that a checkpoint happens before rollEditLog is 
called, and then the inconsistency you've mentioned might be introduced. I 
think it won't happen, though, because {{SecondaryNameNode.doMerge}} iterates 
through Storage.storageDirs, which won't contain the failed volume unless it 
has been restored and formatted. If all of this is true, then we have a test 
which is failing not because the feature doesn't work but rather because the 
test needs to be changed in light of HDFS-903.

Please let me know if my analysis is incorrect.
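
The argument about Storage.storageDirs can be illustrated with a toy model. 
This is a sketch in Python under stated assumptions, not the Hadoop 
implementation: the Storage class, report_failure and attempt_restore are 
hypothetical names, and directories are plain strings.

```python
from typing import Callable, List


class Storage:
    """Toy model of active vs. removed storage directories (hypothetical)."""

    def __init__(self, dirs: List[str]) -> None:
        self.storage_dirs: List[str] = list(dirs)  # directories in use
        self.removed_dirs: List[str] = []          # directories that failed

    def report_failure(self, d: str) -> None:
        # A failed directory leaves the active list.
        self.storage_dirs.remove(d)
        self.removed_dirs.append(d)

    def attempt_restore(self, is_available: Callable[[str], bool]) -> List[str]:
        # Move every removed directory that is reachable again back to the
        # active list. Per the comment above, this is also the point where
        # the directory would be formatted.
        restored = [d for d in self.removed_dirs if is_available(d)]
        for d in restored:
            self.removed_dirs.remove(d)
            self.storage_dirs.append(d)
        return restored


storage = Storage(["/data/1", "/data/2"])
storage.report_failure("/data/2")

# Anything iterating over the active list does not see the failed directory.
seen_while_failed = list(storage.storage_dirs)

# Once the volume is reachable again, restoration puts it back.
storage.attempt_restore(lambda d: True)
seen_after_restore = list(storage.storage_dirs)

print(seen_while_failed, seen_after_restore)
```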




[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix

2011-01-16 Thread Konstantin Boudnik (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982246#action_12982246
 ] 

Konstantin Boudnik commented on HDFS-1496:
--

Oh, sorry - I misread you. +1 on disabling the feature because it isn't 
fully compatible with the existing semantics expressed by the functional 
tests.




[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix

2011-01-10 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979699#action_12979699
 ] 

Hairong Kuang commented on HDFS-1496:
-

I do not think that the storage directory restoration scheme introduced in 
HADOOP-4885 works well because it introduces inconsistent states among 
fsimage/edits directories. Each old good directory contains old image + old 
edit, but each restored directory contains new image with an empty edit log. 
This has the potential to corrupt fsimage if secondary NN happens to download 
the empty edit log from a newly restored edit log directory.

I could not figure out a better way to fix this problem. Is it OK if I 
disable this feature for now so that the unit test can pass? Fortunately, 
Dhruba already enhanced saveNameSpace in HDFS-1509, which could be used as an 
alternative way to restore the failed image directories.
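
The corruption scenario described above (old image combined with the empty 
edit log of a restored directory) can be sketched with a toy model. This is 
an illustration in Python, not Hadoop code: the namespace is reduced to a set 
of paths, and apply_edits, good_dir and restored_dir are hypothetical names.

```python
from typing import List, Set, Tuple

Edit = Tuple[str, str]  # (operation, path); operations are "add" or "delete"


def apply_edits(image: Set[str], edits: List[Edit]) -> Set[str]:
    """Replay an edit log on top of an image (namespace = set of paths)."""
    namespace = set(image)
    for op, path in edits:
        if op == "add":
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


old_image = {"/a"}
old_edits = [("add", "/b"), ("delete", "/a")]
in_memory = apply_edits(old_image, old_edits)  # the true current namespace

# On-disk state after restoration, as described in the comment.
good_dir = {"image": old_image, "edits": old_edits}
restored_dir = {"image": in_memory, "edits": []}

# Consistent checkpoint: image and edits come from the same directory.
consistent = apply_edits(good_dir["image"], good_dir["edits"])

# Mixed checkpoint: old image, but the empty edit log of the restored dir.
mixed = apply_edits(good_dir["image"], restored_dir["edits"])
lost_edits = mixed != in_memory

print(consistent == in_memory, lost_edits)
```

In this model the mixed checkpoint silently drops every edit made since the 
old image was written, which is the kind of fsimage corruption the comment 
warns about.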




[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix

2010-11-12 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931589#action_12931589
 ] 

Hairong Kuang commented on HDFS-1496:
-

This turns out to be a bug in storage directory restoration. Image validation 
exposes the error.

Currently the NN uses rollFSEdits to trigger storage directory recovery. The 
recovery may trigger saving the namespace to the newly restored directory, 
which as a result changes the in-memory image digest. However, later on the 
image and edits were fetched from an old storage directory, thus causing the 
checksum mismatch.
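
The checksum mismatch can be reproduced in miniature. The sketch below is 
only an illustration in Python, not the HDFS-903 code: the image contents are 
made-up byte strings and md5_digest is a hypothetical helper built on the 
standard hashlib module.

```python
import hashlib


def md5_digest(image_bytes: bytes) -> str:
    """Return the MD5 digest of an image as a hex string."""
    return hashlib.md5(image_bytes).hexdigest()


# Made-up image contents for illustration only.
old_image = b"namespace as of the last checkpoint"
new_image = b"namespace as saved into the restored directory"

# Restoration saves the namespace, so the in-memory digest now describes
# the new image.
expected_digest = md5_digest(new_image)

# The image is later fetched from an old storage directory instead.
fetched_image = old_image
digest_matches = md5_digest(fetched_image) == expected_digest

print(digest_matches)
```

Validation fails here even though neither image is damaged; the digest and 
the fetched image simply belong to different saves.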

The problem with this storage restoration scheme is that it makes the on-disk 
state of all storage directories inconsistent.




[jira] Commented: (HDFS-1496) TestStorageRestore is failing after HDFS-903 fix

2010-11-11 Thread Konstantin Boudnik (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931344#action_12931344
 ] 

Konstantin Boudnik commented on HDFS-1496:
--

According to git bisect
{noformat}
23f1ecd155ba2cb6b22eed42541cad1d1bff329a is the first bad commit
commit 23f1ecd155ba2cb6b22eed42541cad1d1bff329a
Author: Hairong Kuang hair...@apache.org
Date:   Mon Nov 8 06:49:32 2010 +

HDFS-903.  Support fsimage validation using MD5 checksum. Contributed by 
Hairong Kuang.


git-svn-id: https://svn.apache.org/repos/asf/hadoop/hdfs/tr...@1032470 
13f79535-47bb-0310-9956-ffa450edef68

:100644 100644 602f09709e3d09a62d5a77cfbd010d9c476b77c7 
506f045ff0c05b10140c890b06f4e8d8405491fd M  CHANGES.txt
:04 04 d31ddaa937084215931b28da37a4a3e81e9ec487 
4cbbdb1597c2c66ce9da5e2a1dccdd5daa45ab6d M  src
{noformat}

From an offline conversation with Hairong, it seems totally possible.
