[ 
https://issues.apache.org/jira/browse/HADOOP-13738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606549#comment-15606549
 ] 

Xiaoyu Yao commented on HADOOP-13738:
-------------------------------------

Thanks [~arpitagarwal] for working on this, and [~kihwal] and [~anu] for the 
discussion. 

I can see some benefits of using a random file name. The DiskChecker may run 
multiple times, and a random file name is not affected by a failed deletion 
in a previous run. If we want to use a pattern for test file naming, we should 
clean up files left over from previous runs before the disk check, as 
[~arpitagarwal] has already done in the unit test. 
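
To make the two naming options concrete, here is a minimal sketch. It is not 
taken from the patch: the class name, the "disk-check-" prefix and both helper 
methods are hypothetical and only illustrate the trade-off.

```java
import java.io.File;
import java.util.UUID;

public class DiskCheckFileNaming {
    // Hypothetical prefix, assumed for illustration; the patch may name files differently.
    static final String PREFIX = "disk-check-";

    /** Option A: a random name, so a leftover file from an earlier run cannot collide. */
    static File randomCheckFile(File dir) {
        return new File(dir, PREFIX + UUID.randomUUID());
    }

    /**
     * Option B: patterned names. Delete leftovers from earlier runs before
     * starting a new check. Returns the number of files actually deleted.
     */
    static int cleanUpStaleFiles(File dir) {
        File[] stale = dir.listFiles((d, name) -> name.startsWith(PREFIX));
        if (stale == null) {
            return 0; // dir is not a directory, or an I/O error occurred
        }
        int deleted = 0;
        for (File f : stale) {
            if (f.delete()) {
                deleted++;
            }
        }
        return deleted;
    }
}
```

With option A, a failed delete only leaves an orphan behind. With option B, the 
cleanup call would have to run before every check. The two can also be 
combined: random names that share a common prefix, plus a cleanup pass.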

Can we have a timer/threshold (at the ms level) on the expected execution time 
of each diskIoCheckWithoutNativeIo() call, so that we can break out of the 
retry loop? This way, we won't have to wait forever, even with the current 
serialized disk check in the datanode. 
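
Roughly what I have in mind, as a sketch only. The IoCheck interface is a 
hypothetical stand-in for one diskIoCheckWithoutNativeIo() attempt (I am not 
assuming its real signature), and maxAttempts and budgetMs are made-up 
parameter names.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class BoundedDiskCheck {
    /** Hypothetical stand-in for a single write-and-sync attempt. */
    interface IoCheck {
        void run() throws IOException;
    }

    /**
     * Retries the check up to maxAttempts times, but stops retrying once
     * budgetMs has elapsed. Returns the attempt number that succeeded.
     */
    static int checkWithDeadline(IoCheck check, int maxAttempts, long budgetMs)
            throws IOException {
        // nanoTime is monotonic, so wall-clock adjustments do not affect the deadline.
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMs);
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                check.run();
                return attempt;
            } catch (IOException e) {
                last = e; // remember the failure and consider retrying
            }
            if (System.nanoTime() - deadline >= 0) {
                break; // time budget exhausted: give up early
            }
        }
        throw new IOException("disk check did not succeed within " + budgetMs + " ms", last);
    }
}
```

Note that this only checks the clock between attempts, so it does not 
interrupt an attempt that is itself hung inside a write or sync. Covering that 
case would need the check to run on a separate thread with a timed wait.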

> DiskChecker should perform some disk IO
> ---------------------------------------
>
>                 Key: HADOOP-13738
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13738
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>         Attachments: HADOOP-13738.01.patch, HADOOP-13738.02.patch, 
> HADOOP-13738.03.patch
>
>
> DiskChecker can fail to detect total disk/controller failures indefinitely. 
> We have seen this in real clusters. DiskChecker performs simple 
> permissions-based checks on directories which do not guarantee that any disk 
> IO will be attempted.
> A simple improvement is to write some data and flush it to the disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

