Advice on post mortem of data loss (v 1.0.3)

Peter Sheridan Fri, 01 Feb 2013 08:40:54 -0800

Yesterday, I bounced my DFS cluster.  We realized that "ulimit -u" was, in 
extreme cases, preventing the name node from creating threads.  This had only 
started occurring within the last day or so.  When I brought the name node back 
up, it had essentially been rolled back by one week, and I lost all changes 
which had been made since then.

There are a few other factors to consider.

1. I had 3 locations for dfs.name.dir: one local and two NFS. (When I set it
up, I thought it was two local and one NFS.) On 1/24, the day we were
effectively rolled back to, the second NFS mount started showing as FAILED on
dfshealth.jsp. We had seen this before without issue, so I didn't consider it
critical.
2. Having discovered the above, I changed dfs.name.dir before bringing the
name node back up: it now pointed at two local drives and one NFS mount, and
excluded the mount which had failed.

Reviewing the name node log from the day with the NFS outage, I see:

2013-01-24 16:33:11,794 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit log.
java.io.IOException: Input/output error
    at sun.nio.ch.FileChannelImpl.force0(Native Method)
    at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:348)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog$EditLogFileOutputStream.flushAndSync(FSEditLog.java:215)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:89)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:1015)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1666)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:718)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
2013-01-24 16:33:11,794 WARN org.apache.hadoop.hdfs.server.common.Storage: Removing storage dir /rdisks/xxxxxxxxxxxxxx

Unfortunately, since I wasn't expecting anything terrible to happen, I didn't
look too closely at the file system while the name node was down. When I
brought it up, the time stamp on the previous checkpoint directory in the
dfs.name.dir was right around the above error message. The current directory
basically had an fsimage and an empty edits log with the current time stamps.
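
The inspection that was skipped above can be sketched roughly as follows. This
is only an illustration, not something that was run on the cluster. It assumes
the Hadoop 1.x layout in which each dfs.name.dir holds current/fsimage,
current/edits and current/fstime, and the function names
(summarize_name_dirs, newest_edits_dir, stale_dirs) are made up for the
example.

```python
import os

# Files assumed to live under <name dir>/current/ in a Hadoop 1.x layout.
TRACKED = ("fsimage", "edits", "fstime")


def summarize_name_dirs(name_dirs):
    """Return {dir: {file: (size_bytes, mtime) or None}} for each directory."""
    report = {}
    for d in name_dirs:
        entries = {}
        for name in TRACKED:
            path = os.path.join(d, "current", name)
            try:
                st = os.stat(path)
            except OSError:
                # Missing file or unreachable mount: record it as absent.
                entries[name] = None
            else:
                entries[name] = (st.st_size, st.st_mtime)
        report[d] = entries
    return report


def newest_edits_dir(report):
    """Return the directory whose current/edits was modified last, or None."""
    best, best_mtime = None, None
    for d, entries in report.items():
        info = entries.get("edits")
        if info is None:
            continue
        if best_mtime is None or info[1] > best_mtime:
            best, best_mtime = d, info[1]
    return best


def stale_dirs(report, tolerance_s=60):
    """Return directories whose edits file is missing or lags the newest one."""
    mtimes = {d: e["edits"][1] for d, e in report.items()
              if e.get("edits") is not None}
    if not mtimes:
        return sorted(report)
    newest = max(mtimes.values())
    return sorted(d for d in report
                  if d not in mtimes or newest - mtimes[d] > tolerance_s)
```

Run against every configured dfs.name.dir while the name node is down, a
directory reported as stale, or one whose edits file is unexpectedly small,
would be worth copying aside before restarting.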

So: what happened? Should this failure have led to my data loss? I would have
thought the local directory would be fine in this scenario. Did I have any
other options for data recovery?

Thanks.

--Pete
