[ https://issues.apache.org/jira/browse/HDFS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022905#comment-13022905 ]

Bharath Mundlapudi commented on HDFS-1848:
------------------------------------------

Thanks, Eli, for explaining the use case. I briefly talked to Koji about this 
Jira. 

Some more thoughts on this. 

1. If fs.data.dir.critical is not defined, then the implementation should fall back 
to the existing tolerate-a-volume-failure behavior. 

2. If fs.data.dir.critical is defined, then fail-fast and fail-stop as you 
described (see the sketch below). 
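
Just to make the two modes concrete, here is a minimal sketch of how the startup 
decision could look, assuming the proposed key is read through the normal 
Configuration object. The key names are only illustrative; 
dfs.datanode.failed.volumes.tolerated is, I believe, the HDFS-1161 knob.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: how DataNode startup might pick between the two modes above.
public class CriticalVolumePolicy {
  // Proposed key from this discussion; the name is illustrative.
  static final String CRITICAL_DIRS_KEY = "fs.data.dir.critical";
  // Existing volume-failure toleration key (HDFS-1161).
  static final String TOLERATED_KEY = "dfs.datanode.failed.volumes.tolerated";

  /** Case 2: key present, so fail-fast/fail-stop on any listed volume failing. */
  static boolean failFastEnabled(Configuration conf) {
    String[] critical = conf.getStrings(CRITICAL_DIRS_KEY);
    return critical != null && critical.length > 0;
  }

  /** Case 1: key absent, so fall back to the existing tolerated-failures behavior. */
  static int toleratedFailures(Configuration conf) {
    return conf.getInt(TOLERATED_KEY, 0);
  }
}
{code}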

The case 2 you mentioned is interesting too. Today, the datanode is not aware of 
that volume, since it may not be part of the dfs.data.dir config. 

I see the key benefit of this Jira as fail-fast: if any of the critical volumes 
fail, we let the namenode know immediately and the datanode will exit. Replication 
will then be taken care of, and cluster/datanode restarts should see fewer issues 
with missing blocks. 
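
A rough sketch of that fail-fast path is below. The notifier interface is just a 
stand-in for whatever DN-to-NN error-report RPC ends up being used; nothing here 
is tied to the current DataNode code.

{code:java}
// Sketch of the fail-fast path: report the fatal volume error, then stop the
// datanode so the namenode starts re-replicating its blocks promptly.
public class FailFastOnCriticalVolume {
  /** Stand-in for the DN-to-NN error-report RPC. */
  interface NamenodeNotifier {
    void reportFatalVolumeError(String volume, String reason);
  }

  static void onCriticalVolumeFailure(NamenodeNotifier nn, String volume, String reason) {
    try {
      nn.reportFatalVolumeError(volume, reason);   // let the namenode know immediately
    } finally {
      // Fail-stop: exit rather than limp along with a broken critical volume.
      Runtime.getRuntime().exit(1);
    }
  }
}
{code}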

W.r.t. the case 2 you mentioned, these are the possible failures, right? (A simple 
write-probe sketch follows the list.)

1. Data is stored on the root partition disk, say /root/hadoop (binaries, conf, logs) 
and /root/data0.
Failures: /root read-only filesystem or failure, /root/data0 read-only filesystem 
or failure, complete disk0 failure.

2. Data is NOT stored on the root partition disk: /root (disk1), /data0 (disk2).
Failures: /root read-only filesystem or failure, /data0 (disk2) read-only 
filesystem or failure.

3. Swap partition failure
How will this be detected?
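
For the read-only-filesystem and dead-disk cases in 1 and 2, a simple write probe 
is usually enough to detect them. HDFS already has a DiskChecker for its data 
dirs; the snippet below is only an illustration of the idea, not the real code.

{code:java}
import java.io.File;
import java.io.IOException;

// Illustration only: a read-only remount or a dead disk usually shows up as an
// inability to create and delete a small probe file in the directory.
public class VolumeProbe {
  static boolean isWritable(File dir) {
    File probe = new File(dir, ".dn-probe-" + System.nanoTime());
    try {
      if (!probe.createNewFile()) {
        return false;              // could not create the probe -> treat as failed
      }
      return probe.delete();       // could not clean up -> also treat as failed
    } catch (IOException e) {
      return false;                // EROFS or other I/O error surfaces here
    }
  }
}
{code}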

I am wondering whether the datanode should worry about all of these health issues 
itself, or whether a configuration like the TaskTracker's health check script, 
which would let the datanode know about disk issues, network issues, etc., is the 
better option?
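
For reference, the TaskTracker health check runs an admin-supplied script 
periodically and treats any output line starting with ERROR as unhealthy. A 
datanode equivalent could be as simple as the sketch below; only the ERROR 
convention is borrowed from the existing feature, the rest is illustrative.

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Sketch of a TaskTracker-style health check for the datanode: run an
// admin-supplied script and look for an output line starting with "ERROR".
public class HealthScriptRunner {
  static boolean isHealthy(String scriptPath) throws IOException, InterruptedException {
    Process p = new ProcessBuilder(scriptPath).redirectErrorStream(true).start();
    BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
    boolean healthy = true;
    String line;
    while ((line = r.readLine()) != null) {
      if (line.startsWith("ERROR")) {
        healthy = false;           // script flagged a disk/network problem
      }
    }
    r.close();
    p.waitFor();
    return healthy;
  }
}
{code}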

> Datanodes should shutdown when a critical volume fails
> ------------------------------------------------------
>
>                 Key: HDFS-1848
>                 URL: https://issues.apache.org/jira/browse/HDFS-1848
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Eli Collins
>             Fix For: 0.23.0
>
>
> A DN should shutdown when a critical volume (eg the volume that hosts the OS, 
> logs, pid, tmp dir etc.) fails. The admin should be able to specify which 
> volumes are critical, eg they might specify the volume that lives on the boot 
> disk. A failure in one of these volumes would not be subject to the threshold 
> (HDFS-1161) or result in host decommissioning (HDFS-1847) as the 
> decommissioning process would likely fail.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
