[ https://issues.apache.org/jira/browse/HDFS-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150220#comment-15150220 ]
Vinayakumar B commented on HDFS-5522:
-------------------------------------

bq. So, if one node is down (eg due to a rolling restart or a crash) all of the other nodes are very soon running checkDiskError for no particularly good reason. Coupled with HDFS-7489, this failure can also cascade

Yes, the same thing has been experienced in one of our customers' clusters. Because of a network issue on some nodes, all the other datanodes (connected in the pipeline) started disk checks. And without HDFS-8845 (2.7.2), disk I/O on every datanode hit 100%. By the time the first round of disk checks was done, some other exception had requested a disk check again. This continued for more than 40 hours, slowing down every other application.

> Datanode disk error check may be incorrectly skipped
> ----------------------------------------------------
>
>                 Key: HDFS-5522
>                 URL: https://issues.apache.org/jira/browse/HDFS-5522
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.23.9, 2.2.0
>            Reporter: Kihwal Lee
>            Assignee: Rushabh S Shah
>             Fix For: 2.5.0
>
>         Attachments: HDFS-5522-v2.patch, HDFS-5522-v3.patch, HDFS-5522.patch
>
>
> After HDFS-4581 and HDFS-4699, {{checkDiskError()}} is not called when
> network errors occur during processing data node requests. This appears to
> create problems when a disk is having problems, but not failing I/O soon.
> If I/O hangs for a long time, network read/write may timeout first and the
> peer may close the connection. Although the error was caused by a faulty
> local disk, disk check is not being carried out in this case.
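
The cascade described in the comment (each new exception requesting another full disk check as soon as the previous one finished) is the kind of behaviour a simple throttle can damp. The following is only a minimal sketch under stated assumptions, not the actual HDFS code or the fix from any of the issues mentioned: the class name {{ThrottledDiskCheck}}, its constructor parameters and the injected clock are all hypothetical and exist here purely for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper, NOT part of HDFS: runs the supplied disk check at most
 * once per interval and never runs two checks at the same time.
 */
final class ThrottledDiskCheck {
    private final long minIntervalMs;      // minimum gap between two checks (assumed setting)
    private final LongSupplier clockMs;    // injected clock so the behaviour is testable
    private final Runnable diskCheck;      // the expensive check itself
    private final AtomicBoolean running = new AtomicBoolean(false);
    private volatile boolean everRan = false;
    private volatile long lastFinishedMs = 0L;

    ThrottledDiskCheck(long minIntervalMs, LongSupplier clockMs, Runnable diskCheck) {
        this.minIntervalMs = minIntervalMs;
        this.clockMs = clockMs;
        this.diskCheck = diskCheck;
    }

    /** Returns true if this request actually ran a check, false if it was skipped. */
    boolean requestCheck() {
        // Skip if another thread is already checking.
        if (!running.compareAndSet(false, true)) {
            return false;
        }
        try {
            long now = clockMs.getAsLong();
            // Skip if the previous check finished too recently.
            if (everRan && now - lastFinishedMs < minIntervalMs) {
                return false;
            }
            diskCheck.run();
            // Measure the interval from the END of the check, so a slow
            // check cannot be followed immediately by another one.
            lastFinishedMs = clockMs.getAsLong();
            everRan = true;
            return true;
        } finally {
            running.set(false);
        }
    }
}
```

In this sketch every error handler would call {{requestCheck()}} instead of starting a check directly, so a burst of pipeline exceptions results in at most one scan per interval. Whether skipping a request is acceptable depends on how quickly a genuinely failing disk must be detected, which is the concern raised in the original issue description.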