[ https://issues.apache.org/jira/browse/HADOOP-13738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606549#comment-15606549 ]
Xiaoyu Yao commented on HADOOP-13738:
-------------------------------------

Thanks [~arpitagarwal] for working on this, and [~kihwal] and [~anu] for the discussion.

I can see some benefits of using a random file name. The DiskChecker may run multiple times, and a random file name is not affected by a failed deletion from a previous run. If we want to use a pattern for test file naming, we should clean up files from previous runs before the disk check, as [~arpitagarwal] has already done in the unit test.

Can we have a timer/threshold (at the millisecond level) on the expected execution time of each diskIoCheckWithoutNativeIo() test to break out of the retry loop? That way we won't have to wait forever, even with the current serialized disk check in the datanode.

> DiskChecker should perform some disk IO
> ---------------------------------------
>
>                 Key: HADOOP-13738
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13738
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>         Attachments: HADOOP-13738.01.patch, HADOOP-13738.02.patch,
>                      HADOOP-13738.03.patch
>
>
> DiskChecker can fail to detect total disk/controller failures indefinitely.
> We have seen this in real clusters. DiskChecker performs simple
> permissions-based checks on directories, which do not guarantee that any disk
> IO will be attempted.
> A simple improvement is to write some data and flush it to the disk.
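
To make the two suggestions in the comment concrete (a random test file name, and a time budget that breaks out of the retry loop), here is a minimal sketch. It is not the code from the attached patches: the class name DiskIoCheckSketch, the methods diskIoCheckOnce and checkWithDeadline, the "disk-check-" prefix, and the maxAttempts/budgetMs parameters are all hypothetical names chosen for illustration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only; not the implementation in the HADOOP-13738 patches.
final class DiskIoCheckSketch {
  private DiskIoCheckSketch() {}

  /** Writes one byte to a randomly named file in dir, syncs it, then deletes it. */
  static void diskIoCheckOnce(File dir) throws IOException {
    // A random name cannot collide with a file left behind by a failed
    // deletion in an earlier run.
    File file = new File(dir, "disk-check-" + UUID.randomUUID());
    try (FileOutputStream fos = new FileOutputStream(file)) {
      fos.write(1);
      fos.flush();
      fos.getFD().sync(); // ask the OS to push the data to the device
    } finally {
      Files.deleteIfExists(file.toPath());
    }
  }

  /**
   * Retries the check up to maxAttempts times, but stops retrying once
   * budgetMs milliseconds have elapsed. Rethrows the last failure.
   */
  static void checkWithDeadline(File dir, int maxAttempts, long budgetMs)
      throws IOException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMs);
    IOException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        diskIoCheckOnce(dir);
        return; // the disk accepted a write, so the check passes
      } catch (IOException e) {
        last = e;
      }
      if (System.nanoTime() - deadline >= 0) {
        break; // time budget exhausted; give up instead of retrying further
      }
    }
    throw last != null ? last : new IOException("No check attempted for " + dir);
  }
}
```

Note that in this sketch the deadline is only evaluated between attempts. A single write or sync that blocks on a hung device would not be interrupted; bounding that case would need the check to run on a separate thread with a timed wait.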