[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212757#comment-14212757
 ] 

Yongjun Zhang commented on HDFS-4239:
-------------------------------------

Hi [~qwertymaniac],

My bad that I did not notice your earlier comment 
{quote}
I just noticed Steve's comment referring the same - should've gone through 
properly before spending google cycles. I feel HDFS-1362 implemented would 
solve half of this - and the other half would be to make the removals 
automatic. Right now the checkDiskError does not eject if its slow - as long as 
its succeed, which would have to be done via this JIRA I think. The re-add 
would be possible via HDFS-1362.
{quote}
until now. So we need to use the functionality provided by HDFS-1362 to 
automatically remove a sick disk. It seems the original goal of HDFS-4239 is 
the same as that of HDFS-1362 (right?), so perhaps we should create a new jira 
for automatically removing a sick disk?
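
For context, the operator workflow being discussed (dropping a sick volume 
without restarting the datanode) can be sketched roughly as follows. This is a 
sketch, not the patch attached to this JIRA: it assumes a Hadoop release with 
datanode reconfiguration support, and {{dn-host:50020}} is a placeholder for 
the datanode's IPC address.

```shell
# 1. Edit hdfs-site.xml on the affected datanode, dropping the sick
#    volume from dfs.datanode.data.dir, e.g. from
#      /data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn
#    to
#      /data/1/dfs/dn,/data/2/dfs/dn

# 2. Ask the running datanode to reload its volume list (no restart):
hdfs dfsadmin -reconfig datanode dn-host:50020 start

# 3. Poll until the reconfiguration task reports completion:
hdfs dfsadmin -reconfig datanode dn-host:50020 status
```

Once the volume is dropped, only the blocks that lived on the sick disk get 
rereplicated from other replicas, as in option #2 of the issue description.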

Thanks.


> Means of telling the datanode to stop using a sick disk
> -------------------------------------------------------
>
>                 Key: HDFS-4239
>                 URL: https://issues.apache.org/jira/browse/HDFS-4239
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Yongjun Zhang
>         Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
> hdfs-4239_v4.patch, hdfs-4239_v5.patch
>
>
> If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
> occasionally, or just exhibiting high latency -- your choices are:
> 1. Decommission the entire datanode.  If the datanode is carrying 6 or 12 
> disks of data, especially on a smallish cluster -- 5 to 20 nodes -- the 
> rereplication of the downed datanode's data can be pretty disruptive, 
> especially if the cluster is doing low-latency serving: e.g. hosting an hbase 
> cluster.
> 2. Stop the datanode, unmount the bad disk, and restart the datanode (you 
> can't unmount the disk while it is in use).  The latter is better in that 
> only the bad disk's data is rereplicated, not all of the datanode's data.
> Is it possible to do better -- say, send the datanode a signal telling it to 
> stop using a disk an operator has designated 'bad'?  This would be like 
> option #2 above minus the need to stop and restart the datanode.  Ideally 
> the disk would become unmountable after a while.
> Nice to have would be being able to tell the datanode to resume using a disk 
> after it's been replaced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
