It just happened again. This was after a fresh format of HDFS/HBase and I am attempting to re-import the (backed up) data.
http://pastebin.com/3fsWCNQY

So now if I restart the namenode, I will lose data from the past 3 hours.

What is causing this? How can I avoid it in the future? Is there an easy way to monitor (other than a script grep'ing the logs) the checkpoints to see when this happens?

On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <[email protected]> wrote:

> Forgot to mention: Hadoop 1.0.4
>
>
> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <[email protected]> wrote:
>
>> I am at a bit of wits end here. Every single time I restart the
>> namenode, I get this crash:
>>
>> 2013-02-16 14:32:42,616 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058 loaded in 0 seconds.
>> 2013-02-16 14:32:42,618 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>
>> I am following best practices here, as far as I know. I have the
>> namenode writing into 3 directories (2 local, 1 NFS). All 3 of these dirs
>> have the exact same files in them.
>>
>> I also run a secondary checkpoint node. This one appears to have started
>> failing a week ago. So checkpoints were *not* being done since then. Thus
>> I can get the NN up and running, but with a week old data!
>>
>> What is going on here? Why does my NN data *always* wind up causing
>> this exception over time? Is there some easy way to get notified when the
>> checkpointing starts to fail?
>>
>
>
>
> --
>
> Robert Dyer
> [email protected]
>

--
Robert Dyer
[email protected]
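
On the monitoring question, one possible approach that avoids grepping logs is to watch the age of the fsimage file itself. The sketch below is only an illustration and rests on assumptions that are not confirmed anywhere in this thread: that on Hadoop 1.x the `current/fsimage` file under a `dfs.name.dir` directory is replaced after each successful checkpoint, so its modification time roughly tracks the last good checkpoint. The path, the threshold, and the function names are all hypothetical and would need to be adapted to the actual cluster.

```python
import os
import time

# Hypothetical values: point FSIMAGE_PATH at one of your dfs.name.dir
# directories and set MAX_AGE_SECONDS comfortably above your checkpoint period.
FSIMAGE_PATH = "/data/dfs/name/current/fsimage"
MAX_AGE_SECONDS = 2 * 60 * 60


def checkpoint_age_seconds(fsimage_path, now=None):
    """Return the number of seconds since fsimage_path was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(fsimage_path)


def is_checkpoint_stale(fsimage_path, max_age_seconds, now=None):
    """Return True if fsimage_path is older than max_age_seconds."""
    return checkpoint_age_seconds(fsimage_path, now) > max_age_seconds


def main(argv):
    """Print a status line and return a Nagios-style code (0 ok, 1 stale, 2 error)."""
    path = argv[1] if len(argv) > 1 else FSIMAGE_PATH
    try:
        age = checkpoint_age_seconds(path)
    except OSError as exc:
        # The file is missing or unreadable, which is itself worth an alert.
        print("CRITICAL: cannot stat %s: %s" % (path, exc))
        return 2
    if age > MAX_AGE_SECONDS:
        print("WARNING: %s last modified %.0f seconds ago" % (path, age))
        return 1
    print("OK: %s last modified %.0f seconds ago" % (path, age))
    return 0
```

Run from cron or a monitoring agent, for example with `sys.exit(main(sys.argv))` as the script's entry point, this would flag a stalled secondary within a few hours rather than a week. It only detects that checkpoints have stopped; it says nothing about why the edits replay hits the NullPointerException.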
