[
https://issues.apache.org/jira/browse/HADOOP-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Raghu Angadi reassigned HADOOP-2073:
------------------------------------
Assignee: Raghu Angadi
> Datanode corruption if machine dies while writing VERSION file
> --------------------------------------------------------------
>
> Key: HADOOP-2073
> URL: https://issues.apache.org/jira/browse/HADOOP-2073
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.14.0
> Reporter: Michael Bieniosek
> Assignee: Raghu Angadi
>
> Yesterday, due to a bad mapreduce job, some of my machines went on OOM
> killing sprees and killed a bunch of datanodes, among other processes. Since
> my monitoring software kept trying to bring up the datanodes, only to have
> the kernel kill them off again, each machine's datanode was probably killed
> many times. A large percentage of these datanodes will not come up now, and
> write this message to the logs:
> 2007-10-18 00:23:28,076 ERROR org.apache.hadoop.dfs.DataNode:
> org.apache.hadoop.dfs.InconsistentFSStateException: Directory
> /hadoop/dfs/data is in an inconsistent state: file VERSION is invalid.
> When I check, /hadoop/dfs/data/current/VERSION is an empty file.
> Consequently, I have to delete all the blocks on the datanode and start over.
> Since the OOM killing sprees happened simultaneously on several
> machines, this could have crippled my DFS cluster.
> I checked the Hadoop code, and in org.apache.hadoop.dfs.Storage I see
> this:
> {{{
>   /**
>    * Write version file.
>    *
>    * @throws IOException
>    */
>   void write() throws IOException {
>     corruptPreUpgradeStorage(root);
>     write(getVersionFile());
>   }
>
>   void write(File to) throws IOException {
>     Properties props = new Properties();
>     setFields(props, this);
>     RandomAccessFile file = new RandomAccessFile(to, "rws");
>     FileOutputStream out = null;
>     try {
>       file.setLength(0);
>       file.seek(0);
>       out = new FileOutputStream(file.getFD());
>       props.store(out, null);
>     } finally {
>       if (out != null) {
>         out.close();
>       }
>       file.close();
>     }
>   }
> }}}
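> To make the window concrete, here is a minimal standalone sketch (my
> own test, not Hadoop code; the property value is just illustrative)
> showing that setLength(0) empties the file on disk before anything is
> rewritten:
> {{{
> import java.io.File;
> import java.io.FileOutputStream;
> import java.io.IOException;
> import java.io.RandomAccessFile;
> import java.util.Properties;
>
> public class TruncateWindowDemo {
>   public static void main(String[] args) throws IOException {
>     File version = new File("VERSION");
>     Properties props = new Properties();
>     props.setProperty("storageType", "DATA_NODE"); // illustrative only
>
>     RandomAccessFile file = new RandomAccessFile(version, "rws");
>     file.setLength(0);                    // VERSION is now 0 bytes on disk
>     file.seek(0);
>     System.out.println(version.length()); // prints 0
>     // A kill -9 (or the OOM killer) striking here leaves the empty
>     // file behind permanently: exactly the corruption shown above.
>     FileOutputStream out = new FileOutputStream(file.getFD());
>     props.store(out, null);               // content is only written here
>     out.close();
>     file.close();
>   }
> }
> }}}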
> So if the datanode dies after file.setLength(0) but before
> props.store(out, null), the VERSION file is left in exactly the
> corrupted state I see. Maybe it would be better if this method wrote a
> temporary file VERSION.tmp first, then copied it to VERSION, and only
> then deleted VERSION.tmp? That way, if VERSION were detected to be
> corrupt, the datanode could look at VERSION.tmp to recover the data.
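> Something like this (untested; it reuses the existing setFields
> helper, and the recovery half would live wherever VERSION is read):
> {{{
>   void write(File to) throws IOException {
>     File tmp = new File(to.getParentFile(), to.getName() + ".tmp");
>
>     Properties props = new Properties();
>     setFields(props, this);
>
>     // 1. Write and flush the complete contents to VERSION.tmp.
>     FileOutputStream tmpOut = new FileOutputStream(tmp);
>     try {
>       props.store(tmpOut, null);
>       tmpOut.getFD().sync();   // force the bytes to disk
>     } finally {
>       tmpOut.close();
>     }
>
>     // 2. Only now rewrite VERSION; a crash here still leaves a
>     //    complete VERSION.tmp to recover from.
>     RandomAccessFile file = new RandomAccessFile(to, "rws");
>     FileOutputStream out = null;
>     try {
>       file.setLength(0);
>       file.seek(0);
>       out = new FileOutputStream(file.getFD());
>       props.store(out, null);
>     } finally {
>       if (out != null) {
>         out.close();
>       }
>       file.close();
>     }
>
>     // 3. Both files now hold complete copies; drop the temporary.
>     tmp.delete();
>   }
> }}}
> On startup, if VERSION is empty or unparseable but VERSION.tmp parses
> cleanly, the datanode could copy VERSION.tmp back over VERSION instead
> of throwing InconsistentFSStateException and making me delete all the
> blocks and start over.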
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.