Have you ever considered keeping a backup copy of the FSImage on an NFS share? The best practice is to have reliable NFS storage mounted on the namenode and to configure hdfs-site.xml so the NameNode keeps a copy of its metadata on the NFS mount. This will prevent FSImage loss.
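For example, dfs.name.dir takes a comma-separated list of directories and the NameNode writes its image and edit log to every one of them, so a current copy survives on the NFS side even if the local disk is lost. The paths below are just placeholders for whatever layout you use:

  <!-- hdfs-site.xml: placeholder paths, adjust to your own layout -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/dfs/name,/mnt/nfs/namenode-backup/dfs/name</value>
  </property>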
-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, October 24, 2011 5:01 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: NameNode corruption: NPE addChild at start up

A subsequent test under similar conditions showed an additional NameNode log line prior to shutdown:

2011-10-25 00:00:03,996 INFO org.apache.hadoop.fs.Trash: Created trash checkpoint: /user/systems/.Trash/1110250000

The following restart was without issues. Because the operations were manual, I might have misspelled -skipTrash on one occasion.

> Log output prior to first shutdown:
>
> Many StateChange lines:
> ..
> 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 178.21.118.71:50010 to delete blk_-6474991874786417864_110366 blk_4835119182915024496_110378 blk_5400176265071942037_110378 blk_3844651748475104977_110378 blk_-1608821730784127393_110376 blk_6509383172138875668_110378 blk_5451246938053636538_110378 blk_2897026915704967576_110378 blk_-8171354591956681743_110358 blk_8312308457484333010_110378 blk_-3230829712632219383_110378
> 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 178.21.118.127:50010 to delete blk_-3399604157806337594_110378 blk_6747099268224653868_110330 blk_-8319971768460712343_110366 blk_-2961731871717799971_110378 blk_7869501598834020846_110374 blk_8093290740979129346_110370 blk_-6013635848766803390_110378 blk_7993259237237421138_110378 blk_5759478854526777476_110378 blk_-2495357027905458427_110366 blk_-1854083208115601228_110366
> 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 178.21.118.106:50010 to delete blk_-7802030530595759204_110378 blk_8721277224606661791_110374 blk_6747099268224653868_110330 blk_5656147162240510324_110378 blk_2897026915704967576_110378 blk_-240991845576310137_110378 blk_5938453705724038271_110378 blk_8312308457484333010_110378 blk_-1854083208115601228_110366 blk_4693563029097482715_110378 blk_2642639928334327569_110374 blk_-1087530163655362136_110378 blk_-330417000415464230_110378 blk_5451246938053636538_110378 blk_6300864378343690857_110378 blk_9188777915480180330_110378 blk_6198541037144409710_110378
> 2011-10-24 10:09:56,665 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 178.21.118.102:50010 to delete blk_-8874651062228528404_110378 blk_-3399604157806337594_110378 blk_5374073877837004061_110338 blk_-3734465814030860048_110370 blk_4355408303621436967_110378 blk_4924117654799449604_110378 blk_1289214837876543883_110378 blk_-240991845576310137_110378 blk_5938453705724038271_110378 blk_-829459405706536971_110370 blk_-7851154197821762387_110322 blk_2792433696712651711_110378 blk_-361468857299887395_110378 blk_534511269269471506_110342 blk_-7018473240249214767_110378 blk_5393329649654239848_110370
> 2011-10-24 10:10:12,065 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 1281 Total time for transactions(ms): 21Number of transactions batched in Syncs: 10 Number of syncs: 870 SyncTimes(ms): 13578
> 2011-10-24 10:10:26,250 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at namenode01.openindex.io/178.21.118.98
> ************************************************************/
>
> Following startup:
>
> ..
> 2011-10-24 10:11:58,601 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
> 2011-10-24 10:12:00,478 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean
> 2011-10-24 10:12:00,592 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
> 2011-10-24 10:12:00,609 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 58901
> 2011-10-24 10:12:17,101 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 1
> 2011-10-24 10:12:17,373 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 8688539 loaded in 16 seconds.
> 2011-10-24 10:12:17,376 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1113)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1125)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1028)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:205)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:613)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1009)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:827)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:365)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:353)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:434)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162)
>
> > Hi,
> >
> > We are on Hadoop 0.20.203 and use HDFS and MapRed for crawling using
> > Apache Nutch. This morning we had our first encounter with an unwilling
> > NameNode due to a very similar issue described on the list earlier [1].
> >
> > We replaced with a backed-up checkpoint, restarted the daemon and moved
> > files with missing blocks to lost+found. Only transient data was
> > affected.
> >
> > Just now it happened again, same remedy and same results, but it occurred
> > to me that there may be a reason besides coincidence or bad luck. After
> > processing several GBs of prepared data we delete (-skipTrash) what we
> > don't need and hurry on.
> >
> > Prior to both corruptions several GBs were deleted and HDFS was stopped
> > only seconds later. Is there a possible relation here? It's not
> > something I can test because we are not willing to risk corruption of
> > important data; we're moving it to other locations as we speak.
> >
> > Any tips on how to prevent this from happening again or finding the
> > source of the problem are very much appreciated.
> >
> > [1]:
> > http://hadoop-common.472056.n3.nabble.com/addChild-NullPointerException-
> > when-starting-namenode-and-reading-edits-file-td92226.html
> >
> > Thanks
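
For anyone searching the archives later: the remedy described in the quoted message (restoring a backed-up checkpoint and moving files with missing blocks to lost+found) roughly corresponds to the commands below on 0.20. Paths and order are only a sketch of the general procedure, not verified against this cluster; in particular, -importCheckpoint expects a usable SecondaryNameNode checkpoint in fs.checkpoint.dir and no valid image already in dfs.name.dir.

  # stop the NameNode before touching anything under dfs.name.dir
  bin/hadoop-daemon.sh stop namenode

  # either copy a known-good image back into dfs.name.dir by hand, or start
  # the NameNode with -importCheckpoint to load the last SecondaryNameNode
  # checkpoint from fs.checkpoint.dir
  bin/hadoop namenode -importCheckpoint

  # once it is up, report files with missing blocks, then move the corrupt
  # ones to /lost+found instead of leaving them half-readable
  bin/hadoop fsck /
  bin/hadoop fsck / -move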