[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872990#comment-13872990 ]
Hadoop QA commented on HDFS-4239: --------------------------------- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12623271/hdfs-4239.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/5891//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5891//console This message is automatically generated. > Means of telling the datanode to stop using a sick disk > ------------------------------------------------------- > > Key: HDFS-4239 > URL: https://issues.apache.org/jira/browse/HDFS-4239 > Project: Hadoop HDFS > Issue Type: Improvement > Reporter: stack > Assignee: Jimmy Xiang > Attachments: hdfs-4239.patch > > > If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing > occasionally, or just exhibiting high latency -- your choices are: > 1. Decommission the total datanode. If the datanode is carrying 6 or 12 > disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- > the rereplication of the downed datanode's data can be pretty disruptive, > especially if the cluster is doing low latency serving: e.g. hosting an hbase > cluster. > 2. Stop the datanode, unmount the bad disk, and restart the datanode (You > can't unmount the disk while it is in use). This latter is better in that > only the bad disk's data is rereplicated, not all datanode data. > Is it possible to do better, say, send the datanode a signal to tell it stop > using a disk an operator has designated 'bad'. This would be like option #2 > above minus the need to stop and restart the datanode. Ideally the disk > would become unmountable after a while. > Nice to have would be being able to tell the datanode to restart using a disk > after its been replaced. -- This message was sent by Atlassian JIRA (v6.1.5#6160)