Hi Erik,

A few things to try:
- does this FS store sensitive data, or would it be possible to bzip2 the files and upload them somewhere?
- can you add logging to the replay of FSEditLog so as to be aware of what byte offset is causing the issue?
- DO take a backup of all of the state, immediately, before you accidentally lose the data.
- Do you have a metadata backup from earlier this week?

-Todd

On Wed, Dec 16, 2009 at 6:54 AM, Erik Bernhardsson <[email protected]> wrote:
> Hi,
> we just encountered some problems when restarting our namenode. I'd really
> appreciate it if anyone has any clue about what is going on here.
>
> The error message is as follows:
>
> 09/12/16 14:25:03 INFO namenode.NameNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG: host = ida.int.sto.spotify.net/78.31.15.83
> STARTUP_MSG: args = []
> STARTUP_MSG: version = 0.20.0
> STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504; compiled by 'ndaley' on Thu Apr 9 05:18:40 UTC 2009
> ************************************************************/
> 09/12/16 14:25:03 DEBUG conf.Configuration: java.io.IOException: config()
>   at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:210)
>   at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:197)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:937)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:964)
>
> 09/12/16 14:25:03 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=54310
> 09/12/16 14:25:03 INFO namenode.NameNode: Namenode up at: ida.int.sto.spotify.net/78.31.15.83:54310
> 09/12/16 14:25:03 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
> 09/12/16 14:25:03 INFO metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
> 09/12/16 14:25:03 DEBUG security.UserGroupInformation: Unix Login: root,root
> 09/12/16 14:25:03 INFO namenode.FSNamesystem: fsOwner=root,root
> 09/12/16 14:25:03 INFO namenode.FSNamesystem: supergroup=supergroup
> 09/12/16 14:25:03 INFO namenode.FSNamesystem: isPermissionEnabled=true
> 09/12/16 14:25:03 INFO metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
> 09/12/16 14:25:03 INFO namenode.FSNamesystem: Registered FSNamesystemStatusMBean
> 09/12/16 14:25:03 INFO common.Storage: Number of files = 835872
> 09/12/16 14:25:20 INFO common.Storage: Number of files under construction = 26
> 09/12/16 14:25:20 INFO common.Storage: Image file of size 128698266 loaded in 17 seconds.
> 09/12/16 14:25:20 ERROR namenode.NameNode: java.lang.NullPointerException
>   at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
>   at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:992)
>   at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:966)
>   at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:953)
>   at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:696)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
>   at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>   at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:302)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:283)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:955)
>   at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:964)
>
> 09/12/16 14:25:21 INFO namenode.NameNode: SHUTDOWN_MSG:
>
> For some reason the edits.new seems to have become surprisingly large. At
> half a GB, it's more than five times as large as fsimage.
>
> # ls -al
> drwxr-xr-x 3 root root      4096 2009-12-16 10:19 .
> drwxr-xr-x 4 root root      4096 2009-12-16 10:36 ..
> -rwxr-xr-x 1 root root   1615894 2009-12-16 10:19 edits
> -rwxr-xr-x 1 root root 598054491 2009-12-16 10:19 edits.new
> -rwxr-xr-x 1 root root 128698266 2009-12-16 10:19 fsimage
> -rwxr-xr-x 1 root root         8 2009-12-16 10:19 fstime
> -rwxr-xr-x 1 root root       113 2009-12-16 10:19 VERSION
>
> Before I accidentally changed the timestamps of these files, one could
> clearly see that
> 1. None of the files except edits.new had been modified since 2009-11-16
> (the time of the last restart)
> 2. edits.new was modified this morning (2009-12-16)
>
> Unfortunately the secondary namenode contains no data more recent than
> 2009-11-16. I'm not sure why this is the case - it might be because it was
> configured badly. However, the fsimage files for the namenode and
> secondary namenode are exactly identical. The edits file for the secondary
> namenode is almost empty:
>
> drwxr-xr-x 2 root root      4096 2009-12-16 10:18 .
> drwxr-xr-x 4 root root      4096 2009-12-16 10:18 ..
> -rwxr-xr-x 1 root root         4 2009-12-16 10:18 edits
> -rwxr-xr-x 1 root root 128698266 2009-12-16 10:18 fsimage
> -rwxr-xr-x 1 root root         8 2009-12-16 10:18 fstime
> -rwxr-xr-x 1 root root       113 2009-12-16 10:18 VERSION
>
> Tailing edits.new reveals some junk at the end of the file, which I tried
> trimming away according to the format (I was taking a look at
> FSEditLog.java), but I still get the NPE. Even removing edits.new and
> restarting the namenode leads to the same NPE. I haven't seen any obvious
> corruption in edits, though there might be something there that I can't
> spot with my eyes.
>
> I also tried restarting the namenode with -importCheckpoint. It seems to
> work, but starts outputting warning messages about blocks not belonging
> to any specific file. I suppose this makes sense, since the secondary
> namenode contains data from 2009-11-16. I'm really afraid of losing data
> from the past month (2009-11-16 - 2009-12-16), so I haven't let the
> -importCheckpoint finish. Is there a risk of losing new blocks when
> running -importCheckpoint? If not, it seems like we should probably start
> with that, then try to merge the edits.new somehow.
>
> Does anyone know how to approach this problem? We've had similar problems
> before, typically when the namenode ran out of disk space and the edits
> file got damaged. However, the disk seems far from full right now, so I'm
> not sure why we ran into it, though I can't swear the disk hasn't been
> full during the past week or so.
>
> HDFS has also seemed quite slow over the past week; I'm not sure whether
> that's random or whether it has anything to do with this.
>
> Regards,
> Erik Bernhardsson
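Regarding Todd's suggestion to log which byte offset the edit-log replay dies at: one minimal, self-contained sketch (this is NOT Hadoop code - the class name, the counting wrapper, and the fake byte content are all made up for illustration) is to wrap the input stream in a counting filter, so that whatever exception replay throws, the last consumed offset is known:

```java
import java.io.*;

// Hypothetical sketch of offset tracking during a binary replay.
// In real Hadoop this logic would have to be patched into FSEditLog's
// reader; here a stand-in loop over fake bytes shows the idea.
public class EditLogReplayDemo {

    // Counting wrapper: tracks how many bytes have been consumed so far.
    static class CountingInputStream extends FilterInputStream {
        long count = 0;
        CountingInputStream(InputStream in) { super(in); }
        @Override public int read() throws IOException {
            int b = super.read();
            if (b >= 0) count++;
            return b;
        }
        @Override public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeEdits = {1, 2, 3, 4, 5, 6};    // stand-in for edits.new bytes
        CountingInputStream counter =
            new CountingInputStream(new ByteArrayInputStream(fakeEdits));
        DataInputStream edits = new DataInputStream(counter);
        try {
            while (true) {
                edits.readByte();                 // stand-in for opcode decoding
                System.out.println("replayed record ending at offset " + counter.count);
            }
        } catch (EOFException e) {
            // In a real replay this catch would cover the NPE site too,
            // logging counter.count before rethrowing.
            System.out.println("stopped at offset " + counter.count);
        }
    }
}
```

Printing the offset just before each record is decoded (rather than after) would pinpoint the record that triggers the NPE.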
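On Erik's attempt to trim the junk from the tail of edits.new: if an offset for the start of the junk can be identified, one conservative way to experiment is to truncate a copy of the file - never the original, and only after the backup Todd insists on. A sketch, with a made-up offset and a temp file standing in for the copy:

```java
import java.io.*;

// Illustrative only: truncate a COPY of an edit log at a chosen byte
// offset. The 40-byte offset and temp file below are invented for the
// example; a real attempt would use the offset found during replay.
public class TruncateCopyDemo {

    static void truncateAt(File copy, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(copy, "rw")) {
            if (offset > raf.length()) {
                throw new IllegalArgumentException("offset past end of file");
            }
            raf.setLength(offset);    // drops everything after the offset
        }
    }

    public static void main(String[] args) throws IOException {
        File copy = File.createTempFile("edits-copy", ".bin");
        copy.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(copy)) {
            out.write(new byte[100]); // pretend this is the copied edits.new
        }
        truncateAt(copy, 40);         // keep only the first 40 bytes
        System.out.println("new length: " + copy.length()); // prints 40
    }
}
```

Trimming only helps if the cut lands exactly on a record boundary in the edit-log format; cutting mid-record would just move the failure, which may be why the earlier trimming attempt still hit the NPE.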
