An update on this: around the time the secondary namenode crashed we had been setting up the config to run a secondary namenode process on a separate machine from the namenode. About 30 minutes before the crash we had added the new node to the conf/masters list and started the secondarynamenode on that node. Our checkpoint period is 15 minutes, so it looks like one period was processed and then the original secondary failed at the next period.

The documentation appears to indicate that having multiple secondaries is fine, but the timing here seems to indicate otherwise.

To recover here, would it be best simply to swap edits.new in for edits and then attempt to start the namenode?
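
Whatever the answer, it seems prudent to copy the whole namenode
metadata directory aside before touching edits or edits.new, so that any
swap can be undone. A minimal Python sketch, assuming the namenode
process is stopped; the function name and the example paths are
hypothetical and not taken from our setup:

```python
import shutil
import time
from pathlib import Path


def backup_name_dir(name_dir, backup_root):
    """Copy the namenode metadata directory (current/, previous.checkpoint/,
    and anything else in it) to a timestamped directory under backup_root.

    Returns the path of the new copy.
    """
    src = Path(name_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"no such metadata directory: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-backup-{stamp}"
    # copy2 preserves modification times, which are useful later when
    # comparing edits, edits.new and fsimage.
    shutil.copytree(src, dest, copy_function=shutil.copy2)
    return dest


# Example call (hypothetical paths; use the directory set in dfs.name.dir):
# backup_name_dir("/data/dfs/name", "/data/dfs/backups")
```

This only protects against making things worse; it does not by itself
answer whether swapping the files is safe.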

Thanks
- Adam

On 1/12/11 10:43 AM, Adam Phelps wrote:
We were restarting the namenode and datanode processes on our cluster
(due to changing some configuration options); however, the namenode
failed to restart with the error I've pasted below. (If it matters,
we're running the CDH3B3 release.)

From the looks of it, the files causing the problem were the output of
an MR job that was running, and I guess the job's infrastructure was
renaming them from temporary output to the final output location.

We scanned through the hadoop logs and discovered that the secondary
namenode process had died the previous day with that same error (copied
that as well).

In the namenode's metadata directory, we see this:

./current:
total 546964
-rw-r--r-- 1 hdfs hdfs 477954043 2011-01-12 16:07 edits.new
-rw-r--r-- 1 hdfs hdfs 2865644 2011-01-11 00:46 edits
-rw-r--r-- 1 hdfs hdfs 8 2011-01-11 00:32 fstime
-rw-r--r-- 1 hdfs hdfs 100 2011-01-11 00:32 VERSION
-rw-r--r-- 1 hdfs hdfs 79595687 2011-01-11 00:32 fsimage

./previous.checkpoint:
total 77116
-rw-r--r-- 1 hdfs hdfs 1051329 2011-01-06 23:47 edits
-rw-r--r-- 1 hdfs hdfs 8 2011-01-06 23:40 fstime
-rw-r--r-- 1 hdfs hdfs 100 2011-01-06 23:40 VERSION
-rw-r--r-- 1 hdfs hdfs 78913323 2011-01-06 23:40 fsimage

The current/edits file contains references to the missing files, so it's
my assumption that there was some error which caused those files to be
deleted (we're still investigating this), leading to the failure of the
secondary namenode. So the secondary namenode process never rolled those
edits into fsimage, and the namenode then started writing to edits.new.
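
To make the ordering explicit, here is a small Python sketch that sorts
the ./current listing above by modification time. It only restates the
listing; the parsing assumes the eight-column `ls -l` layout shown, and
the function name is just for illustration:

```python
from datetime import datetime

# Listing of ./current, copied from the message above.
LISTING = """\
-rw-r--r-- 1 hdfs hdfs 477954043 2011-01-12 16:07 edits.new
-rw-r--r-- 1 hdfs hdfs 2865644 2011-01-11 00:46 edits
-rw-r--r-- 1 hdfs hdfs 8 2011-01-11 00:32 fstime
-rw-r--r-- 1 hdfs hdfs 100 2011-01-11 00:32 VERSION
-rw-r--r-- 1 hdfs hdfs 79595687 2011-01-11 00:32 fsimage
"""


def parse_listing(text):
    """Return (mtime, name, size) tuples sorted by modification time."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip "total ..." lines and anything unexpected
        size = int(fields[4])
        mtime = datetime.strptime(f"{fields[5]} {fields[6]}", "%Y-%m-%d %H:%M")
        entries.append((mtime, fields[7], size))
    return sorted(entries)


timeline = parse_listing(LISTING)
for mtime, name, size in timeline:
    print(f"{mtime:%Y-%m-%d %H:%M}  {name:<10} {size:>12} bytes")
```

The last write to edits (00:46 on 2011-01-11) lines up with the time of
the secondary namenode failure in the log below, and everything since
then has gone into edits.new.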

Does anyone know how we'd go about getting the namenode running again?
We're fine with discarding those specific files, but would rather not
have to revert to the earlier image since it's now quite out of date
(and I imagine that could cause problems with other things we had
running, such as HBase).

Thanks
- Adam


Namenode error on restart:
2011-01-12 16:51:30,571 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename XXX/_temporary/_attempt_201101062358_2180_m_000291_0/part-m-00291.lzo to XXX/part-m-00291.lzo because source does not exist
2011-01-12 16:51:30,571 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename XXX/_temporary/_attempt_201101062358_2180_m_000327_0/part-m-00327.lzo to XXX/part-m-00327.lzo because source does not exist
2011-01-12 16:51:30,572 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1088)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1100)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1003)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:206)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:637)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1034)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:845)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:379)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:343)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:317)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:214)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:394)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1148)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1157)

Secondary namenode error:
2011-01-11 00:46:35,935 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename XXX/_temporary/_attempt_201101062358_2180_m_000327_0/part-m-00327.lzo to XXX/part-m-00327.lzo because source does not exist
2011-01-11 00:46:35,936 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Throwable Exception in doCheckpoint:
2011-01-11 00:46:35,936 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1088)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1100)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1003)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:206)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:637)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1034)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:678)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$500(SecondaryNameNode.java:577)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:454)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:418)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:313)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:276)
    at java.lang.Thread.run(Thread.java:662)
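
Since we're fine with discarding the affected files, it helps to have
the full list of renames that failed. A rough Python sketch for pulling
them out of a log, assuming the "failed to rename ... because source
does not exist" wording shown above; the function name is made up for
illustration, and XXX is the same placeholder used in the pasted logs:

```python
import re

# Matches the WARN message from FSDirectory.unprotectedRenameTo shown above.
RENAME_FAILED = re.compile(
    r"failed to rename\s+(?P<src>\S+)\s+to\s+(?P<dst>\S+)"
    r"\s+because source does not exist"
)


def failed_renames(log_text):
    """Return unique (source, destination) pairs, in first-seen order."""
    # Collapse whitespace so messages wrapped at a space still match.
    # A path that was itself split mid-word by wrapping will not match.
    flat = " ".join(log_text.split())
    seen = []
    for match in RENAME_FAILED.finditer(flat):
        pair = (match.group("src"), match.group("dst"))
        if pair not in seen:
            seen.append(pair)
    return seen


SAMPLE = """\
2011-01-12 16:51:30,571 WARN org.apache.hadoop.hdfs.StateChange: DIR*
FSDirectory.unprotectedRenameTo: failed to rename
XXX/_temporary/_attempt_201101062358_2180_m_000291_0/part-m-00291.lzo to
XXX/part-m-00291.lzo because source does not exist
2011-01-12 16:51:30,571 WARN org.apache.hadoop.hdfs.StateChange: DIR*
FSDirectory.unprotectedRenameTo: failed to rename
XXX/_temporary/_attempt_201101062358_2180_m_000327_0/part-m-00327.lzo to
XXX/part-m-00327.lzo because source does not exist
"""

for src, dst in failed_renames(SAMPLE):
    print(f"{src} -> {dst}")
```

Running this over the real namenode log rather than the pasted excerpt
would show whether more files than these two are involved.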




