[ https://issues.apache.org/jira/browse/HDFS-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490031#comment-15490031 ]
Akira Ajisaka commented on HDFS-10777:
--------------------------------------

Therefore just logging or incrementing a metric is fine.

> DataNode should report & remove volume failures if DU cannot access files
> -------------------------------------------------------------------------
>
>                 Key: HDFS-10777
>                 URL: https://issues.apache.org/jira/browse/HDFS-10777
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-10777.01.patch
>
>
> HADOOP-12973 refactored DU and made it pluggable. The refactoring has a
> side effect: if DU encounters an exception, the exception is caught,
> logged, and ignored, which essentially fixes HDFS-9908 (where runaway
> exceptions prevented DataNodes from handshaking with NameNodes).
> However, this "fix" is not ideal: if the disk is bad, the DataNode takes
> no immediate action other than logging the exception.
> The existing {{FsDatasetSpi#checkDataDir}} has been reduced to blindly
> checking only a small number of directories. When a disk goes bad, often
> only a few files are bad initially, so checking only a small number of
> directories makes it easy to overlook the degraded disk.
> I propose: in addition to logging the exception, the DataNode should
> proactively verify that the files are inaccessible, remove the volume,
> and make the failure visible by showing it in JMX, so that
> administrators can spot the failure via monitoring systems.
> A different fix, based on HDFS-9908, is needed before Hadoop 2.8.0.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
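The proposed handling could look roughly like the sketch below. This is a minimal illustration, not Hadoop code: the class and method names (VolumeFailureSketch, probeAccessible, handleDuFailure) and the counter are hypothetical stand-ins for the actual DataNode/FsDatasetSpi machinery. It shows the idea of re-checking file accessibility when DU throws, and incrementing a counter that a JMX bean could expose, instead of only logging.

```java
import java.io.File;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the proposed behavior; names are illustrative,
// not actual Hadoop APIs.
public class VolumeFailureSketch {

    // Counter that a JMX bean could expose to monitoring systems.
    private static final AtomicLong volumeFailures = new AtomicLong();

    // Verify the path is really inaccessible before declaring the volume
    // failed, rather than trusting a single DU error.
    static boolean probeAccessible(File path) {
        return path.exists() && path.canRead();
    }

    // Invoked when the DU refresh catches an IOException: instead of only
    // logging, re-check the files and record a visible failure.
    static void handleDuFailure(File volumeDir, IOException cause) {
        if (!probeAccessible(volumeDir)) {
            volumeFailures.incrementAndGet(); // surfaced via JMX in the proposal
            System.out.println("Removing failed volume: " + volumeDir
                + " (" + cause.getMessage() + ")");
        } else {
            System.out.println("Transient DU error, volume still readable: "
                + cause.getMessage());
        }
    }

    // JMX-style getter for the failure count.
    static long getVolumeFailures() {
        return volumeFailures.get();
    }

    public static void main(String[] args) {
        File bad = new File("/definitely/missing/volume");
        handleDuFailure(bad, new IOException("du: cannot access files"));
        System.out.println("volumeFailures=" + getVolumeFailures());
    }
}
```

The key design point matching the proposal: a DU exception alone is treated as a hint, and only an independent accessibility probe promotes it to a volume failure that is removed and made visible to administrators.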