[ https://issues.apache.org/jira/browse/HDFS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022931#comment-13022931 ]

Eli Collins commented on HDFS-1848:
-----------------------------------

Good points, Bharath.

I think the DN should explicitly check its volumes for health, as it does today, 
and either fail fast or tolerate failures, as appropriate for the volume 
that failed. This may require help from an admin in the form of specifying 
critical volumes, or maybe we could detect these automatically.
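A minimal Java sketch of that policy, assuming the admin supplies the set of critical volumes as mount paths. The class, enum, and method names are hypothetical and not existing DataNode APIs. A failure on any critical volume triggers a shutdown regardless of the tolerated-failure threshold (HDFS-1161). Failures on other volumes are tolerated up to that threshold.

```java
import java.util.List;
import java.util.Set;

// Hypothetical policy object; not part of the actual DataNode code.
class VolumeFailurePolicy {
    enum Action { CONTINUE, SHUTDOWN }

    // Admin-specified critical volumes, e.g. the one on the boot disk
    // (assumption: volumes are identified by mount path).
    private final Set<String> criticalVolumes;
    // Maximum number of failed non-critical volumes to tolerate
    // (an HDFS-1161-style threshold).
    private final int failedVolumesTolerated;

    VolumeFailurePolicy(Set<String> criticalVolumes, int failedVolumesTolerated) {
        this.criticalVolumes = Set.copyOf(criticalVolumes);
        this.failedVolumesTolerated = failedVolumesTolerated;
    }

    /** Decides what to do given the volumes the health check found failed. */
    Action onVolumeFailures(List<String> failedVolumes) {
        for (String volume : failedVolumes) {
            // A critical volume failure bypasses the threshold entirely.
            if (criticalVolumes.contains(volume)) {
                return Action.SHUTDOWN;
            }
        }
        // Otherwise tolerate failures up to the configured threshold.
        return failedVolumes.size() > failedVolumesTolerated
                ? Action.SHUTDOWN
                : Action.CONTINUE;
    }
}
```

With `Set.of("/")` and a threshold of 1, a single failed data volume is tolerated, two are not, and a failure of `/` shuts the node down immediately.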

In general, the DN and TT need to fail fast when they face unrecoverable 
failures; e.g., if you turn off volume checking and make the root disk read-only, 
the DN and TT should not try to soldier on. That is, some exception-handling 
situations should result in termination of the service and, if possible, a shutdown.
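The fail-fast idea could be sketched roughly as follows. This assumes a hypothetical `UnrecoverableException` that the service raises for errors it cannot work around, such as the root disk going read-only. Recoverable I/O errors are logged and the service keeps running. An unrecoverable one stops the service instead of letting it soldier on. To keep the sketch testable, it only clears a flag. A real daemon would run its shutdown path at that point.

```java
import java.io.IOException;

// Hypothetical wrapper; not part of the actual DataNode/TaskTracker code.
class FailFastRunner {
    // Hypothetical marker for errors the service cannot recover from.
    static class UnrecoverableException extends IOException {
        UnrecoverableException(String msg) { super(msg); }
    }

    interface Task { void run() throws IOException; }

    private volatile boolean running = true;

    boolean isRunning() { return running; }

    /** Runs one unit of work; returns false once the service has stopped. */
    boolean runOnce(Task task) {
        if (!running) {
            return false;
        }
        try {
            task.run();
            return true;
        } catch (UnrecoverableException e) {
            // Fail fast: stop serving rather than continue in a bad state.
            System.err.println("FATAL: " + e.getMessage() + "; stopping service");
            running = false;
            return false;
        } catch (IOException e) {
            // Recoverable: log it and keep going.
            System.err.println("WARN: " + e.getMessage());
            return true;
        }
    }
}
```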

> Datanodes should shutdown when a critical volume fails
> ------------------------------------------------------
>
>                 Key: HDFS-1848
>                 URL: https://issues.apache.org/jira/browse/HDFS-1848
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Eli Collins
>             Fix For: 0.23.0
>
>
> A DN should shut down when a critical volume (e.g., the volume that hosts the OS, 
> logs, pid files, tmp dir, etc.) fails. The admin should be able to specify which 
> volumes are critical; e.g., they might specify the volume that lives on the boot 
> disk. A failure in one of these volumes would not be subject to the threshold 
> (HDFS-1161) or result in host decommissioning (HDFS-1847), as the 
> decommissioning process would likely fail.

