Yesterday, I bounced my DFS cluster.  We realized that "ulimit -u" was, in 
extreme cases, preventing the name node from creating threads.  This had only 
started occurring within the last day or so.  When I brought the name node back 
up, it had essentially been rolled back by one week, and I lost all changes 
which had been made since then.
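
For anyone wanting to check the same limit, here is a minimal sketch in Python, 
assuming a Linux host, where threads count against the per-user process limit 
that "ulimit -u" reports.  The helper names are made up for illustration, and 
the numbers only mean something when run as the user that owns the name node 
process.

```python
import resource

def nproc_limit():
    """Return the (soft, hard) per-user process/thread limits for this user."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return soft, hard

def describe(value):
    # RLIM_INFINITY is how the kernel reports "unlimited".
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

if __name__ == "__main__":
    soft, hard = nproc_limit()
    print("max user processes: soft=%s hard=%s" % (describe(soft), describe(hard)))
```

The soft value is the one "ulimit -u" prints by default in a shell.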

There are a few other factors to consider.

  1.  I had 3 locations for dfs.name.dir: one local and two NFS.  (I 
originally thought this was 2 local and one NFS when I set it up.)  On 1/24, 
the day which we effectively rolled back to, the second NFS mount started 
showing as FAILED on dfshealth.jsp.  We had seen this before without issue, so 
I didn't consider it critical.
  2.  When I brought the name node back up, having discovered the above, I 
changed dfs.name.dir to 2 local drives and one NFS, excluding the one which 
had failed.

Reviewing the name node log from the day with the NFS outage, I see:

2013-01-24 16:33:11,794 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit log.
java.io.IOException: Input/output error
        at sun.nio.ch.FileChannelImpl.force0(Native Method)
        at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:348)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog$EditLogFileOutputStream.flushAndSync(FSEditLog.java:215)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:89)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:1015)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1666)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:718)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
2013-01-24 16:33:11,794 WARN org.apache.hadoop.hdfs.server.common.Storage: Removing storage dir /rdisks/xxxxxxxxxxxxxx


Unfortunately, since I wasn't expecting anything terrible to happen, I didn't 
look too closely at the file system while the name node was down.  When I 
brought it up, the time stamp on the previous checkpoint directory in the 
dfs.name.dir was right around the time of the above error message.  The 
current directory basically had only an fsimage and an empty edits log, both 
with current time stamps.
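
For anyone in the same position, here is a minimal sketch of the kind of check 
that could be run while the name node is still down: list every file under 
each dfs.name.dir location with its size and modification time, so the copies 
can be compared before anything gets overwritten.  The paths in the example 
are hypothetical placeholders, and the script makes no assumptions about which 
files a given Hadoop version keeps in a storage directory; it only reports 
what it finds.

```python
import os
import time

def summarize_storage_dir(root):
    """Return a sorted list of (relative_path, size_bytes, mtime) under root."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                # For example a stale NFS handle; skip the file and keep going.
                continue
            rows.append((os.path.relpath(full, root), st.st_size, st.st_mtime))
    return sorted(rows)

def newest_mtime(rows):
    """Return the most recent modification time in rows, or None if empty."""
    return max((mtime for _path, _size, mtime in rows), default=None)

if __name__ == "__main__":
    # Hypothetical paths: substitute the actual dfs.name.dir entries.
    name_dirs = ["/data/1/dfs/name", "/data/2/dfs/name", "/mnt/nfs/dfs/name"]
    for d in name_dirs:
        print(d)
        for path, size, mtime in summarize_storage_dir(d):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
            print("  %-40s %12d  %s" % (path, size, stamp))
```

A location whose newest file is much older than the others, or whose edits 
file is unexpectedly empty, would be worth copying aside before a restart.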

So: what happened?  Should this failure have led to my data loss?  I would have 
thought the local directory would be fine in this scenario.  Did I have any 
other options for data recovery?

Thanks.


--Pete
