We were restarting the namenode and datanode processes on our cluster after changing some configuration options, but the namenode failed to restart with the error I've pasted below. (In case it matters, we're running the CDH3B3 release.)

From the looks of it, the files causing the problem were the output of a MapReduce job that was running at the time, and I guess the job's infrastructure was renaming them from the temporary output location to the final output location.

We scanned through the Hadoop logs and discovered that the secondary namenode process had died the previous day with the same error (also copied below).

In the namenode's metadata directory, we see this:

./current:
total 546964
-rw-r--r-- 1 hdfs hdfs 477954043 2011-01-12 16:07 edits.new
-rw-r--r-- 1 hdfs hdfs   2865644 2011-01-11 00:46 edits
-rw-r--r-- 1 hdfs hdfs         8 2011-01-11 00:32 fstime
-rw-r--r-- 1 hdfs hdfs       100 2011-01-11 00:32 VERSION
-rw-r--r-- 1 hdfs hdfs  79595687 2011-01-11 00:32 fsimage

./previous.checkpoint:
total 77116
-rw-r--r-- 1 hdfs hdfs  1051329 2011-01-06 23:47 edits
-rw-r--r-- 1 hdfs hdfs        8 2011-01-06 23:40 fstime
-rw-r--r-- 1 hdfs hdfs      100 2011-01-06 23:40 VERSION
-rw-r--r-- 1 hdfs hdfs 78913323 2011-01-06 23:40 fsimage

The current/edits file contains references to the missing files, so my assumption is that some error caused those files to be deleted (we're still investigating this), which in turn caused the secondary namenode to fail. As a result, the secondary namenode never rolled those edits into fsimage, and the namenode started writing to edits.new.
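
In case it helps anyone reproduce the list of affected files, here is a rough sketch of how the paths could be pulled out of the logs. It assumes the WARN line format shown in the errors pasted below; the function name failed_renames is just something made up for illustration, not part of Hadoop.

```python
import re

# Matches the WARN message written by FSDirectory.unprotectedRenameTo
# (format assumed from the log excerpts in this post).
RENAME_FAILURE = re.compile(
    r"failed to rename (?P<src>\S+) to (?P<dst>\S+) "
    r"because source does not exist"
)


def failed_renames(log_text):
    """Return (source, destination) pairs for each failed rename in log_text."""
    return [
        (match.group("src"), match.group("dst"))
        for match in RENAME_FAILURE.finditer(log_text)
    ]


if __name__ == "__main__":
    # One WARN line copied from the namenode log in this post.
    sample = (
        "2011-01-12 16:51:30,571 WARN org.apache.hadoop.hdfs.StateChange: "
        "DIR* FSDirectory.unprotectedRenameTo: failed to rename "
        "XXX/_temporary/_attempt_201101062358_2180_m_000291_0/part-m-00291.lzo "
        "to XXX/part-m-00291.lzo because source does not exist"
    )
    for src, dst in failed_renames(sample):
        print(src, "->", dst)
```

This only reads the log text, so it is safe to run against a copy of the namenode or secondary namenode log; it does not touch the edits file itself.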

Does anyone know how we'd go about getting the namenode running again? We're fine with discarding those specific files, but would rather not have to revert to the earlier image, since it's now quite out of date (and I imagine that could cause problems with other things we had running, such as HBase).

Thanks
- Adam


Namenode error on restart:
2011-01-12 16:51:30,571 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename XXX/_temporary/_attempt_201101062358_2180_m_000291_0/part-m-00291.lzo to XXX/part-m-00291.lzo because source does not exist
2011-01-12 16:51:30,571 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename XXX/_temporary/_attempt_201101062358_2180_m_000327_0/part-m-00327.lzo to XXX/part-m-00327.lzo because source does not exist
2011-01-12 16:51:30,572 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1088)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1100)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1003)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:206)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:637)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1034)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:845)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:379)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:343)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:317)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:214)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:394)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1148)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1157)

Secondary namenode error:
2011-01-11 00:46:35,935 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename XXX/_temporary/_attempt_201101062358_2180_m_000327_0/part-m-00327.lzo to XXX/part-m-00327.lzo because source does not exist
2011-01-11 00:46:35,936 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Throwable Exception in doCheckpoint:
2011-01-11 00:46:35,936 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1088)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1100)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1003)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:206)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:637)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1034)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:678)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$500(SecondaryNameNode.java:577)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:454)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:418)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:313)
        at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:276)
        at java.lang.Thread.run(Thread.java:662)


