[ https://issues.apache.org/jira/browse/HADOOP-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535990 ]

Michael Bieniosek commented on HADOOP-2073:
-------------------------------------------

There's no exception in my logs: I'm assuming the Linux OOM killer sends 
the JVM a SIGKILL (http://lxr.linux.no/source/mm/oom_kill.c#L271), so the 
JVM prints the shutdown message and exits without giving exception 
handlers a chance to do anything.
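
A SIGKILL can't be caught or handled from Java at all; even a registered 
shutdown hook never runs.  A minimal sketch demonstrating that (the class 
name SigkillDemo is my own, not anything in Hadoop):

{{{
public class SigkillDemo {
  public static void main(String[] args) throws InterruptedException {
    // Shutdown hooks run on normal exit and on SIGTERM (plain "kill"),
    // but never on SIGKILL ("kill -9"), which is what the OOM killer
    // delivers -- so nothing below ever reaches the logs in that case.
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
      public void run() {
        System.err.println("shutdown hook ran");
      }
    }));
    System.out.println("sleeping; try 'kill' vs 'kill -9' on this pid");
    Thread.sleep(60000);
  }
}
}}}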

There aren't any upgrades involved here.

FWIW, my logs look like this (repeated over and over again):

2007-10-17 00:19:07,051 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = (the hostname)
STARTUP_MSG:   args = []
************************************************************/
2007-10-17 00:19:08,338 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
2007-10-17 00:19:08,439 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at (the hostname)
************************************************************/

Note that it didn't take long before the process was killed.

> Datanode corruption if machine dies while writing VERSION file
> --------------------------------------------------------------
>
>                 Key: HADOOP-2073
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2073
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.0
>            Reporter: Michael Bieniosek
>
> Yesterday, due to a bad mapreduce job, some of my machines went on OOM 
> killing sprees and killed a bunch of datanodes, among other processes.  Since 
> my monitoring software kept trying to bring up the datanodes, only to have 
> the kernel kill them off again, each machine's datanode was probably killed 
> many times.  A large percentage of these datanodes will not come up now; 
> instead, they write this message to the logs:
> 2007-10-18 00:23:28,076 ERROR org.apache.hadoop.dfs.DataNode: org.apache.hadoop.dfs.InconsistentFSStateException: Directory /hadoop/dfs/data is in an inconsistent state: file VERSION is invalid.
> When I check, /hadoop/dfs/data/current/VERSION is an empty file.  
> Consequently, I have to delete all the blocks on the datanode and start 
> over.  Since the OOM killing sprees happened simultaneously on several 
> datanodes, this could have crippled my whole DFS cluster.
> I checked the Hadoop code, and in org.apache.hadoop.dfs.Storage, I see this:
> {{{
>     /**
>      * Write version file.
>      * 
>      * @throws IOException
>      */
>     void write() throws IOException {
>       corruptPreUpgradeStorage(root);
>       write(getVersionFile());
>     }
>     void write(File to) throws IOException {
>       Properties props = new Properties();
>       setFields(props, this);
>       RandomAccessFile file = new RandomAccessFile(to, "rws");
>       FileOutputStream out = null;
>       try {
>         file.setLength(0);
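>         // a SIGKILL arriving here, after setLength(0) but before
>         // props.store() completes, leaves VERSION empty on disk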
>         file.seek(0);
>         out = new FileOutputStream(file.getFD());
>         props.store(out, null);
>       } finally {
>         if (out != null) {
>           out.close();
>         }
>         file.close();
>       }
>     }
> }}}
> So if the datanode dies after file.setLength(0) but before props.store(out, 
> null) completes, the VERSION file is left truncated, which is exactly the 
> corrupted state I see.  Maybe it would be better if this method first wrote 
> a temporary file VERSION.tmp, then copied it to VERSION, and then deleted 
> VERSION.tmp?  That way, if VERSION were detected to be corrupt, the 
> datanode could look at VERSION.tmp to recover the data.
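>
> A rough sketch of that idea (the method name writeViaTemp is hypothetical; 
> it uses an atomic rename instead of the copy-and-delete described above, 
> since a POSIX rename either fully succeeds or leaves the old file intact):
> {{{
>     void writeViaTemp(File to) throws IOException {
>       Properties props = new Properties();
>       setFields(props, this);
>       // write the new contents to VERSION.tmp first
>       File tmp = new File(to.getParentFile(), to.getName() + ".tmp");
>       FileOutputStream out = new FileOutputStream(tmp);
>       try {
>         props.store(out, null);
>         out.getFD().sync();   // force the bytes to disk before renaming
>       } finally {
>         out.close();
>       }
>       // rename over VERSION: a crash at any point leaves either the
>       // complete old file or the complete new file, never a truncated one
>       if (!tmp.renameTo(to)) {
>         throw new IOException("Cannot rename " + tmp + " to " + to);
>       }
>     }
> }}}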

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
