Hi Erik,

A few things to try:

- does this FS store sensitive data or would it be possible to bzip2 the
files and upload them somewhere?
- can you add logging to the replay of FSEditLog so you can tell which
byte offset is causing the issue?
- DO take a backup of all of the state, immediately, before you accidentally
lose the data.
- Do you have a metadata backup from earlier this week?
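
Regarding the logging suggestion: here's a minimal, self-contained sketch of
the kind of byte-offset instrumentation I mean. The classes below are
hypothetical stand-ins, not the actual FSEditLog code; on branch-0.20 you'd
wrap the stream that FSEditLog.loadFSEdits reads from in something like this
and log the counter before rethrowing.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: counts every byte consumed, so a failed replay can
// report exactly how far into the edit log it got.
class CountingInputStream extends FilterInputStream {
    private long count = 0;

    CountingInputStream(InputStream in) { super(in); }

    @Override public int read() throws IOException {
        int b = super.read();
        if (b != -1) count++;
        return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }

    long getByteOffset() { return count; }
}

public class EditLogOffsetDemo {
    // Reads 4-byte records until the stream runs out; readInt() here is a
    // stand-in for "read one edit-log record" in loadFSEdits.
    static long findFailureOffset(InputStream raw) throws IOException {
        CountingInputStream counting = new CountingInputStream(raw);
        DataInputStream data = new DataInputStream(counting);
        try {
            while (true) {
                data.readInt();
            }
        } catch (EOFException e) {
            // This is where loadFSEdits could log before rethrowing.
            return counting.getByteOffset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeEdits = new byte[] {0, 0, 0, 1, 0, 0, 0, 2};
        System.out.println("replay stopped at byte offset "
            + findFailureOffset(new ByteArrayInputStream(fakeEdits)));
    }
}
```

On the real namenode the same idea applies: wrap the edits FileInputStream
before handing it to the loader, catch the exception, and log
getByteOffset() so you know exactly where in edits.new to look.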

-Todd

On Wed, Dec 16, 2009 at 6:54 AM, Erik Bernhardsson <[email protected]> wrote:

> Hi,
> we just encountered some problems when restarting our namenode. I'd really
> appreciate if anyone has any clue of what is going on here.
>
> The error message is as follows:
>
> 09/12/16 14:25:03 INFO namenode.NameNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = ida.int.sto.spotify.net/78.31.15.83
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.20.0
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r
> 763504;
> compiled by 'ndaley' on Thu Apr  9 05:18:40 UTC 2009
> ************************************************************/
> 09/12/16 14:25:03 DEBUG conf.Configuration: java.io.IOException: config()
>    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:210)
>    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:197)
>    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:937)
>    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:964)
>
> 09/12/16 14:25:03 INFO metrics.RpcMetrics: Initializing RPC Metrics with
> hostName=NameNode, port=54310
> 09/12/16 14:25:03 INFO namenode.NameNode: Namenode up at:
> ida.int.sto.spotify.net/78.31.15.83:54310
> 09/12/16 14:25:03 INFO
> jvm.JvmMetrics: Initializing JVM Metrics with
> processName=NameNode, sessionId=null
> 09/12/16 14:25:03 INFO metrics.NameNodeMetrics: Initializing
> NameNodeMeterics using context
> object:org.apache.hadoop.metrics.spi.NullContext
> 09/12/16 14:25:03 DEBUG security.UserGroupInformation: Unix Login:
> root,root
> 09/12/16 14:25:03 INFO namenode.FSNamesystem: fsOwner=root,root
> 09/12/16 14:25:03 INFO namenode.FSNamesystem: supergroup=supergroup
> 09/12/16 14:25:03 INFO namenode.FSNamesystem: isPermissionEnabled=true
> 09/12/16 14:25:03 INFO metrics.FSNamesystemMetrics: Initializing
> FSNamesystemMetrics using context
> object:org.apache.hadoop.metrics.spi.NullContext
> 09/12/16 14:25:03 INFO namenode.FSNamesystem: Registered
> FSNamesystemStatusMBean
> 09/12/16 14:25:03 INFO common.Storage: Number of files = 835872
> 09/12/16 14:25:20 INFO common.Storage: Number of files under construction =
> 26
> 09/12/16 14:25:20 INFO common.Storage: Image file of size 128698266 loaded
> in 17 seconds.
> 09/12/16 14:25:20 ERROR namenode.NameNode: java.lang.NullPointerException
>    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
>    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:992)
>    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:966)
>    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:953)
>    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:696)
>    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
>    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
>    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
>    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:302)
>    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:283)
>    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
>    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
>    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:955)
>    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:964)
>
> 09/12/16 14:25:21 INFO namenode.NameNode: SHUTDOWN_MSG:
>
> For some reason the edits.new seems to have become surprisingly large. At
> half a GB, it's more than five times as large as fsimage.
>
> # ls -al
> drwxr-xr-x 3 root root      4096 2009-12-16 10:19 .
> drwxr-xr-x 4 root root      4096 2009-12-16 10:36 ..
> -rwxr-xr-x 1 root root   1615894 2009-12-16 10:19 edits
> -rwxr-xr-x 1 root root 598054491 2009-12-16 10:19 edits.new
> -rwxr-xr-x 1 root root 128698266 2009-12-16 10:19 fsimage
> -rwxr-xr-x 1 root root         8 2009-12-16 10:19 fstime
> -rwxr-xr-x 1 root root       113 2009-12-16 10:19 VERSION
>
> Before I accidentally changed the timestamps of these files, one could
> clearly see that
> 1. None of the files except edits.new had been modified since 2009-11-16
> (time of last restart)
> 2. edits.new was modified this morning (2009-12-16)
>
> Unfortunately the secondary namenode contains no data that is more recent
> than 2009-11-16. I'm not sure why this is the case - it might be cause it
> was configured in a bad way. However, the fsimage file for the namenode and
> secondary namenode are exactly identical. The edits file for the secondary
> namenode is almost empty:
>
> drwxr-xr-x 2 root root      4096 2009-12-16 10:18 .
> drwxr-xr-x 4 root root      4096 2009-12-16 10:18 ..
> -rwxr-xr-x 1 root root         4 2009-12-16 10:18 edits
> -rwxr-xr-x 1 root root 128698266 2009-12-16 10:18 fsimage
> -rwxr-xr-x 1 root root         8 2009-12-16 10:18 fstime
> -rwxr-xr-x 1 root root       113 2009-12-16 10:18 VERSION
>
> Tailing edits.new reveals some junk at the end of the file, which I think I
> tried trimming away according to the format (I was taking a look in
> FSEditLog.java), but I still get the NPE problem. Even removing edits.new
> and restarting the namenode leads to the same NPE. I haven't seen any
> obvious corruption in edits, though there might be something there that I
> can't spot with my eyes.
>
> I also tried restarting the namenode with -importCheckpoint. It seems to
> work, but starts outputting warning messages about blocks not belonging to
> any specific file. I suppose this makes sense since the secondary namenode
> contains data from 2009-11-16. I'm really afraid of losing data from the
> past month (2009-11-16 - 2009-12-16) so I haven't let the -importCheckpoint
> finish. Is there a risk of losing new blocks when running
> -importCheckpoint?
> If not, it seems like we should probably start with that, then try to merge
> the edits.new somehow.
>
> Does anyone know how to approach this problem? We've had similar problems
> before, typically when the namenode ran out of disk space and the edits
> file got damaged. However, the disk seems far from full right now, so I'm
> not sure why we ran into it, though I can't swear the disk hasn't been
> full during the past week or so.
>
> HDFS has also seemed quite slow over the past week; I'm not sure whether
> that's a coincidence or related to this.
>
> Regards,
> Erik Bernhardsson
>
