[ 
https://issues.apache.org/jira/browse/HDFS-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150220#comment-15150220
 ] 

Vinayakumar B commented on HDFS-5522:
-------------------------------------

bq. So, if one node is down (eg due to a rolling restart or a crash) all of the 
other nodes are very soon running checkDiskError for no particularly good 
reason. Coupled with HDFS-7489, this failure can also cascade
Yes, the same thing has been experienced in one of our customers' clusters. 
Due to network issues on some nodes, all other datanodes (connected in the 
pipeline) started disk checks. And without HDFS-8845 (2.7.2), disk I/O on all 
datanodes hit 100%.
By the time the first round of disk checks was done, some other exception had 
requested a disk check again. This continued for more than 40 hours, slowing 
down every other application.
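
To illustrate the failure mode above, here is a minimal sketch of how 
back-to-back disk check requests could be coalesced and rate-limited. It is 
not the actual DataNode code: the class name ThrottledDiskCheck, the method 
requestCheck() and the minIntervalMs parameter are all hypothetical, and the 
real check is stood in for by a plain Runnable.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

/** Hypothetical helper, not part of Hadoop: throttles an expensive check. */
final class ThrottledDiskCheck {
    private final long minIntervalMs;      // assumed quiet period after a check finishes
    private final LongSupplier clock;      // injectable time source, in milliseconds
    private final Runnable check;          // stand-in for the real disk check
    private final AtomicBoolean running = new AtomicBoolean(false);
    private volatile long nextAllowedMs = Long.MIN_VALUE; // first request always runs

    ThrottledDiskCheck(long minIntervalMs, LongSupplier clock, Runnable check) {
        this.minIntervalMs = minIntervalMs;
        this.clock = clock;
        this.check = check;
    }

    /**
     * Runs the check unless one is already in progress or the previous one
     * finished less than minIntervalMs ago.
     *
     * @return true if the check ran, false if the request was dropped
     */
    boolean requestCheck() {
        // Coalesce: a request arriving while a check is running is dropped.
        if (!running.compareAndSet(false, true)) {
            return false;
        }
        try {
            if (clock.getAsLong() < nextAllowedMs) {
                return false; // still inside the quiet period
            }
            check.run();
            // Measure the quiet period from completion, so a slow check
            // cannot be followed immediately by another one.
            nextAllowedMs = clock.getAsLong() + minIntervalMs;
            return true;
        } finally {
            running.set(false);
        }
    }
}
```

With something like this, an exception handler would call requestCheck() 
instead of starting a disk check directly, so a burst of pipeline errors 
leads to at most one check per interval. The right interval is a tuning 
decision that this sketch does not address.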


> Datanode disk error check may be incorrectly skipped
> ----------------------------------------------------
>
>                 Key: HDFS-5522
>                 URL: https://issues.apache.org/jira/browse/HDFS-5522
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.23.9, 2.2.0
>            Reporter: Kihwal Lee
>            Assignee: Rushabh S Shah
>             Fix For: 2.5.0
>
>         Attachments: HDFS-5522-v2.patch, HDFS-5522-v3.patch, HDFS-5522.patch
>
>
> After HDFS-4581 and HDFS-4699, {{checkDiskError()}} is not called when 
> network errors occur while processing datanode requests.  This appears to 
> create problems when a disk is having problems but does not fail I/O 
> promptly. If I/O hangs for a long time, the network read/write may time out 
> first and the peer may close the connection. Although the error was caused 
> by a faulty local disk, the disk check is not carried out in this case. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
