Hi,

NameNode storage on our cluster suddenly failed with the following errors:
2011-12-30 18:24:59,857 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to open edit log file /data/d1/hadoop-data/hadoop-hdfs/dfs/name/current/edits
2011-12-30 18:24:59,858 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to open edit log file /data/d2/hadoop-data/hadoop-hdfs/dfs/name/current/edits

dfshealth.jsp reports the storage to be unhealthy:

NameNode Storage:
Storage Directory                          Type             State
/data/d1/hadoop-data/hadoop-hdfs/dfs/name  IMAGE_AND_EDITS  Failed
/data/d2/hadoop-data/hadoop-hdfs/dfs/name  IMAGE_AND_EDITS  Failed

While the cluster is still functional, the edit logs are no longer being actively written to. It looks like, if the cluster were restarted, we would lose the changes made before this error occurred. It seems the image update from the secondary NameNode caused this; previous pushes from the SNN were OK, though.

2011-12-30 18:24:59,855 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll FSImage from 10.3.0.161
2011-12-30 18:24:59,855 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 5552 Total time for transactions(ms): 32 Number of transactions batched in Syncs: 694 Number of syncs: 2690 SyncTimes(ms): 1789 1669
2011-12-30 18:24:59,857 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to open edit log file /data/d1/hadoop-data/hadoop-hdfs/dfs/name/current/edits
2011-12-30 18:24:59,858 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to open edit log file /data/d2/hadoop-data/hadoop-hdfs/dfs/name/current/edits

dfs.name.dir contents:
----------------------------
/data/d1/hadoop-data/hadoop-hdfs/dfs/name:
total 12
drwxr-xr-x 2 hdfs hdfs 4096 2011-12-30 18:24 current
drwxr-xr-x 2 hdfs hdfs 4096 2011-11-10 18:41 image
-rw-r--r-- 1 hdfs hdfs    0 2011-11-28 18:43 in_use.lock
drwxr-xr-x 2 hdfs hdfs 4096 2011-11-23 10:51 previous.checkpoint

/data/d1/hadoop-data/hadoop-hdfs/dfs/name/current:
total 1938828
-rw-r--r-- 1 hdfs hdfs    660736 2011-12-30 18:24 edits
-rw-r--r-- 1 hdfs hdfs 991278946 2011-12-30 18:17 fsimage
-rw-r--r-- 1 hdfs hdfs 991452881 2011-12-30 18:24 fsimage.ckpt
-rw-r--r-- 1 hdfs hdfs         8 2011-12-30 18:17 fstime
-rw-r--r-- 1 hdfs hdfs       101 2011-12-30 18:17 VERSION

/data/d1/hadoop-data/hadoop-hdfs/dfs/name/image:
total 4
-rw-r--r-- 1 hdfs hdfs 157 2011-12-30 18:17 fsimage

/data/d1/hadoop-data/hadoop-hdfs/dfs/name/previous.checkpoint:
total 9013076
-rw-r--r-- 1 hdfs hdfs 8705120607 2011-11-28 17:31 edits
-rw-r--r-- 1 hdfs hdfs  516045025 2011-11-23 10:51 fsimage
-rw-r--r-- 1 hdfs hdfs          8 2011-11-23 10:51 fstime
-rw-r--r-- 1 hdfs hdfs        101 2011-11-23 10:51 VERSION

The getImage servlet on the NameNode returns the following error:

2011-12-31 13:12:34,686 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.lang.NullPointerException
	at org.apache.hadoop.hdfs.server.namenode.FSImage.getImageFile(FSImage.java:219)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.getFsImageName(FSImage.java:1584)
	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:75)
	at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:70)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)

What are the causes of this error? What can be done to restart the cluster without loss of data? Any solutions/pointers are greatly appreciated.

Regards,
Srikanth Sundarrajan