[ https://issues.apache.org/jira/browse/HDFS-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490031#comment-15490031 ]
Akira Ajisaka commented on HDFS-10777:
--------------------------------------

Therefore just logging or incrementing a metric is fine.

> DataNode should report & remove volume failures if DU cannot access files
> -------------------------------------------------------------------------
>
>                 Key: HDFS-10777
>                 URL: https://issues.apache.org/jira/browse/HDFS-10777
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-10777.01.patch
>
>
> HADOOP-12973 refactored DU and made it pluggable. The refactoring has a
> side effect: if DU encounters an exception, the exception is caught,
> logged, and ignored, which essentially fixes HDFS-9908 (where runaway
> exceptions prevented DataNodes from handshaking with NameNodes).
> However, this "fix" is not ideal: if the disk is bad, the DataNode takes
> no immediate action other than logging the exception.
> The existing {{FsDatasetSpi#checkDataDir}} has been reduced to blindly
> checking only a small number of directories. When a disk goes bad, often
> only a few files are bad initially, so checking only a small number of
> directories makes it easy to overlook the degraded disk.
> I propose: in addition to logging the exception, the DataNode should
> proactively verify that the files are inaccessible, remove the volume,
> and make the failure visible by showing it in JMX, so that
> administrators can spot the failure via monitoring systems.
> A different fix, based on HDFS-9908, is needed before Hadoop 2.8.0.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
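The proposed handling could look roughly like the sketch below. This is a minimal illustration, not Hadoop code: the class and method names (VolumeFailureSketch, probeAccessible, handleDuFailure) and the counter are hypothetical stand-ins for the actual DataNode/FsDatasetSpi machinery. It shows the idea of re-checking file accessibility when DU throws, and incrementing a counter that a JMX bean could expose, instead of only logging.

```java
import java.io.File;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the proposed behavior; names are illustrative,
// not actual Hadoop APIs.
public class VolumeFailureSketch {

    // Counter that a JMX bean could expose to monitoring systems.
    private static final AtomicLong volumeFailures = new AtomicLong();

    // Verify the path is really inaccessible before declaring the volume
    // failed, rather than trusting a single DU error.
    static boolean probeAccessible(File path) {
        return path.exists() && path.canRead();
    }

    // Invoked when the DU refresh catches an IOException: instead of only
    // logging, re-check the files and record a visible failure.
    static void handleDuFailure(File volumeDir, IOException cause) {
        if (!probeAccessible(volumeDir)) {
            volumeFailures.incrementAndGet(); // surfaced via JMX in the proposal
            System.out.println("Removing failed volume: " + volumeDir
                + " (" + cause.getMessage() + ")");
        } else {
            System.out.println("Transient DU error, volume still readable: "
                + cause.getMessage());
        }
    }

    // JMX-style getter for the failure count.
    static long getVolumeFailures() {
        return volumeFailures.get();
    }

    public static void main(String[] args) {
        File bad = new File("/definitely/missing/volume");
        handleDuFailure(bad, new IOException("du: cannot access files"));
        System.out.println("volumeFailures=" + getVolumeFailures());
    }
}
```

The key design point matching the proposal: a DU exception alone is treated as a hint, and only an independent accessibility probe promotes it to a volume failure that is removed and made visible to administrators.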