Hi,
    The NameNode storage on our cluster failed suddenly with the following error:

2011-12-30 18:24:59,857 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to open edit log file /data/d1/hadoop-data/hadoop-hdfs/dfs/name/current/edits
2011-12-30 18:24:59,858 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to open edit log file /data/d2/hadoop-data/hadoop-hdfs/dfs/name/current/edits


dfshealth.jsp reports the storage as unhealthy:

NameNode Storage:
Storage Directory                          Type             State
/data/d1/hadoop-data/hadoop-hdfs/dfs/name  IMAGE_AND_EDITS  Failed
/data/d2/hadoop-data/hadoop-hdfs/dfs/name  IMAGE_AND_EDITS  Failed


While the cluster is still functional, the edit logs are no longer being actively written to. It looks like if the cluster were to be restarted, we would lose the changes made before this error occurred.

It seems like the image update from the secondary namenode caused this; previous pushes from the SNN were fine, though.

2011-12-30 18:24:59,855 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll FSImage from 10.3.0.161
2011-12-30 18:24:59,855 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 5552 Total time for transactions(ms): 32Number of transactions batched in Syncs: 694 Number of syncs: 2690 SyncTimes(ms): 1789 1669
2011-12-30 18:24:59,857 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to open edit log file /data/d1/hadoop-data/hadoop-hdfs/dfs/name/current/edits
2011-12-30 18:24:59,858 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to open edit log file /data/d2/hadoop-data/hadoop-hdfs/dfs/name/current/edits
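
The WARN lines above make me think the NameNode failed to reopen the edits files when the roll completed. From my (possibly imperfect) reading of the 0.20.x FSEditLog.open() code, any name directory whose edits file cannot be opened is simply dropped from the active list, which would explain both the "Failed" state in dfshealth.jsp and why nothing is being journaled any more. The following is only a simplified, self-contained stand-in for that behaviour as I understand it; the class and variable names are mine, not Hadoop's:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for what I believe FSEditLog.open() does in 0.20.x:
// a name dir whose edits file cannot be opened for writing is dropped from
// the active list, so the NN keeps running but stops journaling to it.
public class EditLogOpenSketch {
    public static void main(String[] args) {
        List<File> nameDirs = new ArrayList<File>();
        nameDirs.add(new File("/data/d1/hadoop-data/hadoop-hdfs/dfs/name"));
        nameDirs.add(new File("/data/d2/hadoop-data/hadoop-hdfs/dfs/name"));

        for (Iterator<File> it = nameDirs.iterator(); it.hasNext();) {
            File dir = it.next();
            File edits = new File(dir, "current/edits");
            try {
                // The real code opens an EditLogFileOutputStream here.
                new FileOutputStream(edits, true).close();
            } catch (IOException e) {
                System.out.println("Unable to open edit log file " + edits);
                it.remove();  // this directory is now reported as "Failed"
            }
        }
        // If both opens fail the list is empty: edits are written nowhere,
        // which matches what we are seeing.
        System.out.println("Active name dirs left: " + nameDirs.size());
    }
}

If that reading is right, the directories were only dropped from the NameNode's in-memory list; the on-disk contents below look intact.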

dfs.name.dir contents:
----------------------------
/data/d1/hadoop-data/hadoop-hdfs/dfs/name:
total 12
drwxr-xr-x 2 hdfs hdfs 4096 2011-12-30 18:24 current
drwxr-xr-x 2 hdfs hdfs 4096 2011-11-10 18:41 image
-rw-r--r-- 1 hdfs hdfs    0 2011-11-28 18:43 in_use.lock
drwxr-xr-x 2 hdfs hdfs 4096 2011-11-23 10:51 previous.checkpoint

/data/d1/hadoop-data/hadoop-hdfs/dfs/name/current:
total 1938828
-rw-r--r-- 1 hdfs hdfs    660736 2011-12-30 18:24 edits
-rw-r--r-- 1 hdfs hdfs 991278946 2011-12-30 18:17 fsimage
-rw-r--r-- 1 hdfs hdfs 991452881 2011-12-30 18:24 fsimage.ckpt
-rw-r--r-- 1 hdfs hdfs         8 2011-12-30 18:17 fstime
-rw-r--r-- 1 hdfs hdfs       101 2011-12-30 18:17 VERSION

/data/d1/hadoop-data/hadoop-hdfs/dfs/name/image:
total 4
-rw-r--r-- 1 hdfs hdfs 157 2011-12-30 18:17 fsimage

/data/d1/hadoop-data/hadoop-hdfs/dfs/name/previous.checkpoint:
total 9013076
-rw-r--r-- 1 hdfs hdfs 8705120607 2011-11-28 17:31 edits
-rw-r--r-- 1 hdfs hdfs  516045025 2011-11-23 10:51 fsimage
-rw-r--r-- 1 hdfs hdfs          8 2011-11-23 10:51 fstime
-rw-r--r-- 1 hdfs hdfs        101 2011-11-23 10:51 VERSION

The getImage servlet on the NN returns the following error:

2011-12-31 13:12:34,686 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSImage.getImageFile(FSImage.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.getFsImageName(FSImage.java:1584)
        at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:75)
        at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:70)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
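
My guess at the NPE is the following. From my reading of the 0.20.x FSImage source (paraphrased from memory, so details may be off), getFsImageName() walks the IMAGE-type storage directories and hands the last one to getImageFile(); with both of our directories already marked Failed and removed from that list, the reference stays null and getImageFile() dereferences it. Again a simplified, self-contained stand-in, with names of my own choosing rather than Hadoop's:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Stand-in for my reading of FSImage.getFsImageName()/getImageFile():
// with no surviving IMAGE directory the loop never assigns sd, and the
// later dereference throws the NullPointerException seen from /getimage.
public class GetImageNpeSketch {

    // Rough analogue of FSImage.getImageFile(sd, NameNodeFile.IMAGE)
    static File getImageFile(File storageDir) {
        // The real code asks the storage directory for its "current" dir;
        // with a null storageDir this is where the NPE surfaces.
        return new File(storageDir.getPath(), "current/fsimage");
    }

    // Rough analogue of FSImage.getFsImageName()
    static File getFsImageName(List<File> imageDirs) {
        File sd = null;
        for (File dir : imageDirs) {
            sd = dir;             // keep the last surviving IMAGE dir
        }
        return getImageFile(sd);  // sd is null when no dirs survive
    }

    public static void main(String[] args) {
        // Both dfs.name.dir entries are in the Failed state, so the list
        // the servlet effectively iterates over is empty.
        List<File> activeImageDirs = new ArrayList<File>();
        System.out.println(getFsImageName(activeImageDirs)); // NullPointerException
    }
}

So the /getimage failure looks like a symptom of the storage directories being dropped, not a separate problem.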


What are the possible causes of this error, and what can be done to restart the cluster without losing data? Any solutions/pointers are greatly appreciated.

Regards
Srikanth Sundarrajan
