[ 
https://issues.apache.org/jira/browse/HADOOP-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536356
 ] 

Raghu Angadi commented on HADOOP-2073:
--------------------------------------

>> It still leaves the problem with multiple data directories?
>What is that problem? This is not intended to solve all reliability problems, 
>just one.

Yes, we want to fix just this problem ("Inconsistent VERSION file, caused by 
frequent restarts of the datanode"). If there are multiple data directories (a 
common case in many installations), I was wondering whether we will still see 
the same problem/symptoms from the same root cause even with this patch, when 
only the datanode's configuration is different.
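
Roughly what I mean, as a hypothetical sketch (not the actual DataStorage code): 
each directory listed in dfs.data.dir gets its own VERSION file, and each one 
goes through the same truncate-then-write sequence, so a crashed restart can 
leave a zero-length VERSION in any (or several) of the directories.

{{{
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Properties;

public class VersionWriteSketch {
  // Hypothetical sketch only: one VERSION write per configured data directory,
  // each using the same non-atomic truncate-then-store sequence as Storage.write().
  static void writeAllVersionFiles(String[] dataDirs, Properties props) throws IOException {
    for (String dir : dataDirs) {
      File version = new File(new File(dir, "current"), "VERSION");
      RandomAccessFile file = new RandomAccessFile(version, "rws");
      FileOutputStream out = null;
      try {
        file.setLength(0);          // a crash here leaves a 0-byte VERSION in this directory
        file.seek(0);
        out = new FileOutputStream(file.getFD());
        props.store(out, null);     // the file only becomes valid once this completes
      } finally {
        if (out != null) out.close();
        file.close();
      }
    }
  }
}
}}}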


> Datanode corruption if machine dies while writing VERSION file
> --------------------------------------------------------------
>
>                 Key: HADOOP-2073
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2073
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.0
>            Reporter: Michael Bieniosek
>            Assignee: Raghu Angadi
>         Attachments: versionFileSize.patch
>
>
> Yesterday, due to a bad mapreduce job, some of my machines went on OOM 
> killing sprees and killed a bunch of datanodes, among other processes.  Since 
> my monitoring software kept trying to bring up the datanodes, only to have 
> the kernel kill them off again, each machine's datanode was probably killed 
> many times.  A large percentage of these datanodes will not come up now, and 
> write this message to the logs:
> 2007-10-18 00:23:28,076 ERROR org.apache.hadoop.dfs.DataNode: 
> org.apache.hadoop.dfs.InconsistentFSStateException: Directory 
> /hadoop/dfs/data is in an inconsistent state: file VERSION is invalid.
> When I check, /hadoop/dfs/data/current/VERSION is an empty file.  
> Consequently, I have to delete all the blocks on the datanode and start over. 
>  Since the OOM killing sprees happened simultaneously on several datanodes, 
> this could have crippled my DFS cluster.
> I checked the hadoop code, and in org.apache.hadoop.dfs.Storage, I see this:
> {{{
>     /**
>      * Write version file.
>      * 
>      * @throws IOException
>      */
>     void write() throws IOException {
>       corruptPreUpgradeStorage(root);
>       write(getVersionFile());
>     }
>     void write(File to) throws IOException {
>       Properties props = new Properties();
>       setFields(props, this);
>       RandomAccessFile file = new RandomAccessFile(to, "rws");
>       FileOutputStream out = null;
>       try {
>         file.setLength(0);
>         file.seek(0);
>         out = new FileOutputStream(file.getFD());
>         props.store(out, null);
>       } finally {
>         if (out != null) {
>           out.close();
>         }
>         file.close();
>       }
>     }
> }}}
> So if the datanode dies after file.setLength(0), but before props.store(out, 
> null), the VERSION file will get trashed in the corrupted state I see.  Maybe 
> it would be better if this method created a temporary file VERSION.tmp, and 
> then copied it to VERSION, then deleted VERSION.tmp?  That way, if VERSION 
> was detected to be corrupt, the datanode could look at VERSION.tmp to recover 
> the data.
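
For reference, a minimal sketch of the VERSION.tmp approach suggested above, using an 
atomic rename rather than the copy-then-delete described; the method name and details 
are illustrative, and this is not the attached versionFileSize.patch:

{{{
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class AtomicVersionWriteSketch {
  // Illustrative sketch only: write the new contents to VERSION.tmp, force them
  // to disk, then rename over VERSION. A crash at any point leaves either the
  // old VERSION or a complete new one on disk, never a truncated file.
  static void writeVersionAtomically(File versionFile, Properties props) throws IOException {
    File tmp = new File(versionFile.getPath() + ".tmp");
    FileOutputStream out = new FileOutputStream(tmp);
    try {
      props.store(out, null);   // write the complete property set to the temp file
      out.getFD().sync();       // make sure the bytes reach the disk before the rename
    } finally {
      out.close();
    }
    // rename() is atomic on POSIX file systems; if it fails, the old VERSION is untouched.
    if (!tmp.renameTo(versionFile)) {
      throw new IOException("Could not rename " + tmp + " to " + versionFile);
    }
  }
}
}}}

One tradeoff: with a rename there is no leftover VERSION.tmp to recover from, but there 
is also no window in which VERSION itself is empty, which is the failure seen here.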

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
