On Sun, Feb 17, 2013 at 4:41 PM, Mohammad Tariq <donta...@gmail.com> wrote:
> Hello Robert,
>
> It seems that your edit logs and fsimage have become corrupted somehow.
> It looks somewhat similar to this one:
> https://issues.apache.org/jira/browse/HDFS-686

Similar, but the trace is different.

> Have you made any changes to the 'dfs.name.dir' directory lately?

No.

> Do you have enough space where metadata is getting stored?

Yes. All 3 locations have plenty of space (hundreds of GB).

> You can make use of the offline image viewer to diagnose the fsimage file.
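I will give the image viewer a try. For anyone else following along: if I have the invocation right, it can be pointed at a copy of the image file like so (the input path below is a placeholder, and I am assuming the oiv tool is present in the 1.0.4 build):

    bin/hadoop oiv -i /backup/copy-of-fsimage -o fsimage-dump.txt -p Indented

That should write a human-readable dump of the namespace to fsimage-dump.txt; leaving off "-p Indented" should give an ls-style listing from the default Ls processor.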
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
> On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <psyb...@gmail.com> wrote:
>
>> It just happened again. This was after a fresh format of HDFS/HBase,
>> and I am attempting to re-import the (backed-up) data.
>>
>> http://pastebin.com/3fsWCNQY
>>
>> So now if I restart the namenode, I will lose the data from the past
>> 3 hours.
>>
>> What is causing this? How can I avoid it in the future? Is there an
>> easy way to monitor the checkpoints (other than a script grepping the
>> logs) to see when this happens?
>>
>> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <psyb...@gmail.com> wrote:
>>
>>> Forgot to mention: Hadoop 1.0.4
>>>
>>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <psyb...@gmail.com> wrote:
>>>
>>>> I am at a bit of a wit's end here. Every single time I restart the
>>>> namenode, I get this crash:
>>>>
>>>> 2013-02-16 14:32:42,616 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058 loaded in 0 seconds.
>>>> 2013-02-16 14:32:42,618 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>>>
>>>> I am following best practices here, as far as I know. I have the
>>>> namenode writing into 3 directories (2 local, 1 NFS), and all 3 of
>>>> these dirs have exactly the same files in them.
>>>>
>>>> I also run a secondary checkpoint node. That one appears to have
>>>> started failing a week ago, so checkpoints were *not* being done
>>>> since then. Thus I can get the NN up and running, but with week-old
>>>> data!
>>>>
>>>> What is going on here? Why does my NN data *always* wind up causing
>>>> this exception over time? Is there some easy way to get notified
>>>> when the checkpointing starts to fail?

--

Robert Dyer
rd...@iastate.edu
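P.S. For the archives, here is the sort of cron check I am planning to use so this does not go unnoticed again. It rests on one assumption: on Hadoop 1.x, a successful checkpoint rewrites ${dfs.name.dir}/current/fsimage on the namenode, so a stale mtime on that file means checkpointing has stopped. The directory paths and the age threshold below are placeholders for my setup.

#!/usr/bin/env python
# checkpoint_watch.py - a rough sketch, not battle-tested.
# Alerts when the namenode's fsimage has not been rewritten recently,
# which (on Hadoop 1.x) suggests the secondary has stopped checkpointing.

import os
import sys
import time

# The dfs.name.dir locations (placeholder paths for my 2-local + 1-NFS layout).
NAME_DIRS = [
    "/data1/hdfs/name",
    "/data2/hdfs/name",
    "/mnt/nfs/hdfs/name",
]
MAX_AGE_HOURS = 2.0  # alert if no checkpoint for this long

def fsimage_age_hours(name_dir):
    """Hours since the fsimage in this directory was last rewritten."""
    path = os.path.join(name_dir, "current", "fsimage")
    return (time.time() - os.path.getmtime(path)) / 3600.0

def main():
    problems = []
    for d in NAME_DIRS:
        try:
            age = fsimage_age_hours(d)
        except OSError as e:
            problems.append("%s: cannot stat fsimage (%s)" % (d, e))
            continue
        if age > MAX_AGE_HOURS:
            problems.append("%s: last checkpoint %.1f hours ago" % (d, age))
    if problems:
        print("WARNING: HDFS checkpointing looks stale:")
        for p in problems:
            print("  " + p)
        sys.exit(1)  # nonzero exit so wrappers/monitors can see the failure

if __name__ == "__main__":
    main()

Run it from cron every few minutes; cron's MAILTO will mail the WARNING output, and a Nagios-style check could key off the exit code instead.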