Haibin Huang created HDFS-15744:
-----------------------------------
Summary: Use cumulative counting way to improve the accuracy of
slow disk detection
Key: HDFS-15744
URL: https://issues.apache.org/jira/browse/HDFS-15744
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Haibin Huang
Assignee: Haibin Huang
Attachments: image-2020-12-22-11-37-14-734.png,
image-2020-12-22-11-37-35-280.png, image-2020-12-22-11-46-48-817.png
HDFS-11461 added datanode disk outlier detection, which we can use to find slow
disks via the SlowDiskReport (HDFS-11551). However, I found that the slow disk
information may not be accurate enough in practice, because a large number of
short-term writes can lead to misjudgment. Here is an example: this disk is
healthy, but when it encounters heavy writing for a few minutes, its write I/O
does get slow and it is considered a slow disk. The disk is only slow for a few
minutes, but the SlowDiskReport keeps it until the information becomes invalid.
This scenario confuses us, since we want to use the SlowDiskReport to detect
genuinely bad disks.
!image-2020-12-22-11-37-14-734.png!
!image-2020-12-22-11-37-35-280.png!
To improve the detection accuracy, we use a cumulative counting approach to
detect slow disks. If, within the reportValidityMs interval, a disk is flagged
as an outlier in over 50% of the checks, then it should be a genuinely bad disk.
Here is an example: if reportValidityMs is one hour and the detection interval
is five minutes, there will be 12 disk outlier detections per hour. If a disk
is flagged as an outlier more than 6 times in that window, it should be a
genuinely bad disk. Using this approach to detect bad disks in our cluster, we
reach over 90% accuracy.
!image-2020-12-22-11-46-48-817.png!
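The cumulative counting idea above could be sketched as follows. This is a minimal illustrative sketch, not the actual patch: the class name, the sliding-window representation, and the `record`/`isSlowDisk` methods are all hypothetical. It keeps one boolean per detection interval inside the validity window and only reports a disk as slow when more than half of the checks flagged it as an outlier.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of cumulative outlier counting (names are illustrative,
// not taken from the HDFS code base). One instance tracks one disk.
public class CumulativeSlowDiskDetector {
    // Number of detection intervals inside the report validity window,
    // e.g. reportValidityMs / detectionIntervalMs = 1 hour / 5 min = 12.
    private final int checksPerWindow;
    private final Deque<Boolean> recentChecks = new ArrayDeque<>();
    private int outlierCount = 0;

    public CumulativeSlowDiskDetector(int checksPerWindow) {
        this.checksPerWindow = checksPerWindow;
    }

    // Record one detection-interval result; evict results that fall
    // outside the validity window.
    public void record(boolean isOutlier) {
        recentChecks.addLast(isOutlier);
        if (isOutlier) {
            outlierCount++;
        }
        while (recentChecks.size() > checksPerWindow) {
            if (recentChecks.removeFirst()) {
                outlierCount--;
            }
        }
    }

    // The disk is reported slow only if it was an outlier in more than
    // 50% of the checks in the window (e.g. more than 6 of 12).
    public boolean isSlowDisk() {
        return outlierCount * 2 > checksPerWindow;
    }
}
```

With 12 checks per window, a disk flagged 7 times would be reported slow, while a disk flagged only 2 or 3 times during a short write burst would not, which is exactly the miscounting the cumulative approach is meant to avoid.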
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]