Have you ever considered keeping a backup copy of the FSImage on an NFS share? The best practice is to have reliable NFS storage mounted on the namenode and to configure hdfs-site.xml so the NameNode keeps a copy of its metadata on the NFS mount. This will prevent FSImage loss.
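For example, dfs.name.dir takes a comma-separated list of directories and the NameNode writes its image and edit log to every one of them, so a current copy survives on the NFS side even if the local disk is lost. The paths below are just placeholders for whatever layout you use:

  <!-- hdfs-site.xml: placeholder paths, adjust to your own layout -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/dfs/name,/mnt/nfs/namenode-backup/dfs/name</value>
  </property>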
-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, October 24, 2011 5:01 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: NameNode corruption: NPE addChild at start up

A subsequent test under similar conditions showed an additional NameNode log line prior to shutdown:

2011-10-25 00:00:03,996 INFO org.apache.hadoop.fs.Trash: Created trash checkpoint: /user/systems/.Trash/1110250000

The following restart was without issues. Because the operations were manual, I might have misspelled -skipTrash on one occasion.

> Log output prior to first shutdown:
>
> Many StateChange lines:
> ..
> 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 178.21.118.71:50010 to delete blk_-6474991874786417864_110366 blk_4835119182915024496_110378 blk_5400176265071942037_110378 blk_3844651748475104977_110378 blk_-1608821730784127393_110376 blk_6509383172138875668_110378 blk_5451246938053636538_110378 blk_2897026915704967576_110378 blk_-8171354591956681743_110358 blk_8312308457484333010_110378 blk_-3230829712632219383_110378
> 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 178.21.118.127:50010 to delete blk_-3399604157806337594_110378 blk_6747099268224653868_110330 blk_-8319971768460712343_110366 blk_-2961731871717799971_110378 blk_7869501598834020846_110374 blk_8093290740979129346_110370 blk_-6013635848766803390_110378 blk_7993259237237421138_110378 blk_5759478854526777476_110378 blk_-2495357027905458427_110366 blk_-1854083208115601228_110366
> 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 178.21.118.106:50010 to delete blk_-7802030530595759204_110378 blk_8721277224606661791_110374 blk_6747099268224653868_110330 blk_5656147162240510324_110378 blk_2897026915704967576_110378 blk_-240991845576310137_110378 blk_5938453705724038271_110378 blk_8312308457484333010_110378 blk_-1854083208115601228_110366 blk_4693563029097482715_110378 blk_2642639928334327569_110374 blk_-1087530163655362136_110378 blk_-330417000415464230_110378 blk_5451246938053636538_110378 blk_6300864378343690857_110378 blk_9188777915480180330_110378 blk_6198541037144409710_110378
> 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 178.21.118.102:50010 to delete blk_-8874651062228528404_110378 blk_-3399604157806337594_110378 blk_5374073877837004061_110338 blk_-3734465814030860048_110370 blk_4355408303621436967_110378 blk_4924117654799449604_110378 blk_1289214837876543883_110378 blk_-240991845576310137_110378 blk_5938453705724038271_110378 blk_-829459405706536971_110370 blk_-7851154197821762387_110322 blk_2792433696712651711_110378 blk_-361468857299887395_110378 blk_534511269269471506_110342 blk_-7018473240249214767_110378 blk_5393329649654239848_110370
> 2011-10-24 10:10:12,065 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 1281 Total time for transactions(ms): 21Number of transactions batched in Syncs: 10 Number of syncs: 870 SyncTimes(ms): 13578
> 2011-10-24 10:10:26,250 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at namenode01.openindex.io/178.21.118.98
> ************************************************************/
>
> Following startup:
>
> ..
> 2011-10-24 10:11:58,601 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
> 2011-10-24 10:12:00,478 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean
> 2011-10-24 10:12:00,592 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
> 2011-10-24 10:12:00,609 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 58901
> 2011-10-24 10:12:17,101 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 1
> 2011-10-24 10:12:17,373 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 8688539 loaded in 16 seconds.
> 2011-10-24 10:12:17,376 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1113)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1125)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1028)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:205)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:613)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1009)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:827)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:365)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:353)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:434)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162)
>
> > Hi,
> >
> > We are on Hadoop 0.20.203 and use HDFS and MapRed for crawling using
> > Apache Nutch. This morning we had our first encounter with an unwilling
> > NameNode due to a very similar issue described on the list earlier [1].
> >
> > We replaced with a backed-up checkpoint, restarted the daemon and moved
> > files with missing blocks to lost+found. Only transient data was
> > affected.
> >
> > Just now it happened again, same remedy and same results, but it occurred
> > to me that there may be a reason besides coincidence or bad luck. After
> > processing several GBs of prepared data we delete (-skipTrash) what we
> > don't need and hurry on.
> >
> > Prior to both corruptions several GBs were deleted and HDFS was stopped
> > only seconds later. Is there a possible relation here? It's not
> > something I can test because we are not willing to risk corruption of
> > important data; we're moving it to other locations as we speak.
> >
> > Any tips on how to prevent this from happening again or finding the
> > source of the problem are very much appreciated.
> >
> > [1]:
> > http://hadoop-common.472056.n3.nabble.com/addChild-NullPointerException-
> > when-starting-namenode-and-reading-edits-file-td92226.html
> >
> > Thanks
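
For anyone searching the archives later: the remedy described in the quoted message (restoring a backed-up checkpoint and moving files with missing blocks to lost+found) roughly corresponds to the commands below on 0.20. Paths and order are only a sketch of the general procedure, not verified against this cluster; in particular, -importCheckpoint expects a usable SecondaryNameNode checkpoint in fs.checkpoint.dir and no valid image already in dfs.name.dir.

  # stop the NameNode before touching anything under dfs.name.dir
  bin/hadoop-daemon.sh stop namenode

  # either copy a known-good image back into dfs.name.dir by hand, or start
  # the NameNode with -importCheckpoint to load the last SecondaryNameNode
  # checkpoint from fs.checkpoint.dir
  bin/hadoop namenode -importCheckpoint

  # once it is up, report files with missing blocks, then move the corrupt
  # ones to /lost+found instead of leaving them half-readable
  bin/hadoop fsck /
  bin/hadoop fsck / -move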