[ https://issues.apache.org/jira/browse/HDFS-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407544#comment-13407544 ]
Andy Isaacson commented on HDFS-3597:
-------------------------------------

{quote}
I think this should take {{StorageInfo}} as a parameter instead, and you would pass {{image.getStorage()}} in.
{quote}

Sounds good, thanks.

{quote}
I'm not 100% convinced of the logic. I think we should always verify that it's the same NN, but just loosen the validateStorageInfo check here to not check the versioning info. For example, if I accidentally point my 2NN at the wrong NN, it won't start, even if that NN happens to be from a different version. It should only blow its local storage away if it's the same NN (namespace/cluster) but a different version.
{quote}

Fair enough, but we don't want to loosen the check in {{validateStorageInfo}} itself, because I think it's used in a half dozen other places that want the full check. I'll refactor the checks instead; a sketch follows at the end of this comment.

bq. Instead, can you use {{FSImageTestUtil.corruptVersionFile}} here?

Great, I didn't know about that!

bq. No need for these...?

Indeed, leftover from a previous test design.

bq. Can you change this test to not need any datanodes? ... mkdir

A fine plan, done.

bq. It seems odd that you print out all of the checkpoint dirs, but then only corrupt the property in one of them. Shouldn't you be corrupting it in all of them?

That's an issue I was confused about too. I don't understand why the test has multiple checkpoint dirs, nor why my 2NN is running in {{snn.getCheckpointDirs().get(1)}} rather than {{.get(0)}}. (If I corrupt the first checkpoint dir, there is no perceptible effect on the test case.) The println is a leftover from when I was still attempting to exercise the upgrade code.

bq. The spelling fix in NNStorage is unrelated. Cleanup's good, but try not to do so in files that aren't otherwise touched by your patch.

Dropped. At some point during development my fix touched NNStorage.
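For concreteness, here is a minimal standalone sketch of one way the checks could be split: an identity-only check (the name {{isSameCluster}} is illustrative) alongside the strict {{validateStorageInfo}}, which keeps its current semantics for the other callers. The classes below are reduced stand-ins for the real {{StorageInfo}} and {{CheckpointSignature}}, not the attached patch:

{code}
import java.io.IOException;

// Reduced stand-ins for the real StorageInfo/CheckpointSignature so the
// sketch compiles on its own; field names follow the real classes.
class StorageInfo {
  int layoutVersion;   // LV: -19 before the upgrade, -40 after
  int namespaceID;
  String clusterID;
  long cTime;
}

class CheckpointSignature extends StorageInfo {
  String blockpoolID;

  // Identity only: same NN (namespace/cluster/block pool), ignoring the
  // version fields (layoutVersion, cTime). A pre-federation (1.x) local
  // checkpoint has no clusterID/blockpoolID yet, so an empty local value
  // is treated as compatible, matching the "Expecting: ...; ; ." log.
  boolean isSameCluster(StorageInfo local, String localBpid) {
    return namespaceID == local.namespaceID
        && (local.clusterID.isEmpty() || clusterID.equals(local.clusterID))
        && (localBpid.isEmpty() || blockpoolID.equals(localBpid));
  }

  // Strict check, unchanged semantics for the other callers:
  // identity *and* version must both match.
  void validateStorageInfo(StorageInfo local, String localBpid)
      throws IOException {
    if (!isSameCluster(local, localBpid)
        || layoutVersion != local.layoutVersion
        || cTime != local.cTime) {
      throw new IOException("Inconsistent checkpoint fields. LV = "
          + layoutVersion + " namespaceID = " + namespaceID
          + " cTime = " + cTime);
    }
  }
}
{code}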
> SNN can fail to start on upgrade
> --------------------------------
>
>                 Key: HDFS-3597
>                 URL: https://issues.apache.org/jira/browse/HDFS-3597
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andy Isaacson
>            Assignee: Andy Isaacson
>            Priority: Minor
>         Attachments: hdfs-3597.txt
>
>
> When upgrading from 1.x to 2.0.0, the SecondaryNameNode can fail to start up:
> {code}
> 2012-06-16 09:52:33,812 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
> java.io.IOException: Inconsistent checkpoint fields.
> LV = -40 namespaceID = 64415959 cTime = 1339813974990 ; clusterId = CID-07a82b97-8d04-4fdd-b3a1-f40650163245 ; blockpoolId = BP-1792677198-172.29.121.67-1339813967723.
> Expecting respectively: -19; 64415959; 0; ; .
>     at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:120)
>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:454)
>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:334)
>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$2.run(SecondaryNameNode.java:301)
>     at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:438)
>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:297)
>     at java.lang.Thread.run(Thread.java:662)
> {code}
> The error check we're hitting came from HDFS-1073, and it's intended to verify that we're connecting to the correct NN. But the check is too strict: it treats "different metadata version" the same as "different clusterID".
>
> I believe the check in {{doCheckpoint}} simply needs to explicitly check for and handle the upgrade case.
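To illustrate that last point, a rough sketch of upgrade-aware handling in {{doCheckpoint}}, reusing the stand-in classes from the comment above. The control flow and helper name are hypothetical, not the attached patch:

{code}
// Sketch of an upgrade-aware check for doCheckpoint(): "sig" is the
// CheckpointSignature returned by the NN, "local" is the 2NN's on-disk
// storage info. Uses the stand-in classes from the earlier sketch.
static void checkCheckpointSignature(CheckpointSignature sig,
    StorageInfo local, String localBpid) throws IOException {
  if (!sig.isSameCluster(local, localBpid)) {
    // Pointed at the wrong NN entirely: refuse, and never touch
    // local storage in this case.
    throw new IOException("Inconsistent checkpoint fields: signature is"
        + " from a different namespace/cluster");
  }
  if (sig.layoutVersion != local.layoutVersion
      || sig.cTime != local.cTime) {
    // Same NN, different version: the NN was upgraded, so the 2NN's old
    // checkpoint dirs are stale. Adopt the NN's storage info and start
    // fresh instead of aborting the checkpoint.
    local.layoutVersion = sig.layoutVersion;
    local.cTime = sig.cTime;
    local.clusterID = sig.clusterID;
    // (then: recreate the local checkpoint directories and download a
    // fresh fsimage + edits from the NN, as on a first checkpoint)
  }
}
{code}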