According to the Hadoop tutorial on the Yahoo Developer Network and the Apache Hadoop documentation, a simple way to back up the NameNode metadata and recover from a single-point NameNode failure is to add a second directory to dfs.name.dir: a folder that is mounted on the NameNode machine but actually lives on a different machine, so the DFS metadata is written there in addition to the local folder on the NameNode, as follows:
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name,/mnt/namenode-backup</value>
  <final>true</final>
</property>

where /mnt/namenode-backup is mounted on the NameNode machine. I followed this approach, but not on a fresh cluster: the cluster had already been running for a while, so HDFS already contains data. Either the method or my deployment failed, and the NameNode simply would not start. My setup differs in one detail: instead of mounting the backup directory under /mnt, I mounted it under "/". The folder "/namenode-backup" is owned by the account "hadoop", under which the cluster runs, so there should be no access-restriction issue. I got the following errors in the NameNode log on the NameNode machine:

/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = namenodedomainname/#.#.#.#
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2+228
STARTUP_MSG:   build = -r cfc3233ece0769b11af9add328261295aaf4d1ad; compiled by 'root' on Mon Mar 22 03:11:39 EDT 2010
************************************************************/
2010-06-14 16:46:53,879 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=50001
2010-06-14 16:46:53,886 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: namenodedomainname/#.#.#.#:50001
2010-06-14 16:46:53,888 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2010-06-14 16:46:53,889 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2010-06-14 16:46:53,934 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
2010-06-14 16:46:53,934 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2010-06-14 16:46:53,934 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
isPermissionEnabled=true
2010-06-14 16:46:53,940 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
2010-06-14 16:46:53,942 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
2010-06-14 16:47:23,974 INFO org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: No locks available
        at sun.nio.ch.FileChannelImpl.lock0(Native Method)
        at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:881)
        at java.nio.channels.FileChannel.tryLock(FileChannel.java:962)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:527)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:505)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:363)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:285)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:312)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:293)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:224)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:306)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1004)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1013)
        [same stack trace repeated]
2010-06-14 16:47:23,976 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: No locks available
        [same stack trace as above]
2010-06-14 16:47:23,976 INFO org.apache.hadoop.ipc.Server: Stopping server on 50001
2010-06-14 16:47:23,977 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: No locks available
        [same stack trace as above]
2010-06-14 16:47:23,978 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at namenodedomainname/#.#.#.#
************************************************************/

Thanks for your help!
-Michael
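For reference, the stack trace shows the failure happens in FileChannel.tryLock(), which the NameNode calls (via Storage$StorageDirectory.tryLock) to take an exclusive lock on a lock file inside each dfs.name.dir entry. "No locks available" on an NFS-backed directory typically means the mount does not support POSIX file locking. Below is a minimal, self-contained Java sketch (not Hadoop code; the file name and path argument are assumptions for illustration) that reproduces the same lock call, so a candidate backup directory can be checked before pointing dfs.name.dir at it:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

// Mirrors the lock attempt from the stack trace: open a file in the target
// directory and call FileChannel.tryLock(). On an NFS mount without working
// lock support (e.g. the lock daemon not running, or the share mounted with
// "nolock"), this throws java.io.IOException: No locks available -- the same
// error the NameNode logged above.
public class LockCheck {
    public static void main(String[] args) throws Exception {
        // Directory is a placeholder: pass the candidate dfs.name.dir entry
        // (e.g. /namenode-backup); defaults to the temp dir for a local test.
        File dir = new File(args.length > 0 ? args[0]
                : System.getProperty("java.io.tmpdir"));
        File lockFile = new File(dir, "lock.check");
        try (RandomAccessFile raf = new RandomAccessFile(lockFile, "rws");
             FileChannel channel = raf.getChannel()) {
            FileLock lock = channel.tryLock();
            if (lock != null) {
                System.out.println("lock acquired on " + lockFile.getPath());
                lock.release();
            } else {
                System.out.println("lock held by another process");
            }
        } finally {
            lockFile.delete();
        }
    }
}
```

If this small program fails on the mounted directory with the same "No locks available" IOException, the problem is the mount's locking support rather than Hadoop or the configuration itself.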