On 24/06/2024 13:16, Achim Rehor wrote:
well ... not necessarily 😄
but on the disk ... just as i expected ... taking it out helps a lot.

Now, taking it out automatically when it raises too many errors is a discussion I have had several times with GNR development. The issue really is that I/O errors on disks (as seen in the mmlsrecoverygroupevents logs) can be due to several things (the disk itself, the expander, the IOM, the adapter, the cable ...).
In the case of a more general part serving, say, 5 or more pdisks, taking them out automatically would put the FT at risk.
Thus ... we don't do that.


When smartctl for the disk says

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    33839        32         0          0     137434.705          32
write:         0       36         0         0          0     178408.893           0

Non-medium error count:        0


A disk with 32 read errors in smartctl is fubar, no ifs no buts. Whatever the balance in ejecting bad disks is, IMHO currently it's in the wrong place because it failed to eject an actual bad disk.
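
For anyone who ends up rolling their own check, here is a rough sketch of pulling the "Total uncorrected errors" column out of the smartctl error counter log shown above. It assumes the SCSI/SAS layout with seven numeric columns per row; the function names and the zero threshold are my own illustration, not anything GNR or mmhealth provides.

```python
import re

# Matches the "read:", "write:" and "verify:" rows of the SCSI error counter log.
ROW = re.compile(r"^\s*(read|write|verify):\s+(.*)$")


def uncorrected_errors(smartctl_text):
    """Return {operation: total uncorrected errors} parsed from smartctl output."""
    result = {}
    for line in smartctl_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        fields = match.group(2).split()
        # Assumed layout: fast, delayed, rereads/rewrites, total corrected,
        # algorithm invocations, gigabytes processed, total uncorrected.
        if len(fields) == 7:
            result[match.group(1)] = int(fields[6])
    return result


def is_suspect(smartctl_text, threshold=0):
    """True if any operation reports more uncorrected errors than the threshold.

    The default threshold of 0 is an assumption; pick whatever suits your site.
    """
    return any(n > threshold for n in uncorrected_errors(smartctl_text).values())


SAMPLE = """\
Error counter log:
read:          0    33839        32         0          0     137434.705          32
write:         0       36         0         0          0     178408.893           0
"""

if __name__ == "__main__":
    print(uncorrected_errors(SAMPLE))
    print(is_suspect(SAMPLE))
```

Feed it the text of smartctl -a for each drive and alert on anything where is_suspect() returns True.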

At an absolute bare minimum mmhealth should not be saying everything is fine and dandy, because clearly it was not. That's the bigger issue. I can live with them not being taken out automatically, but it is unacceptable that mmhealth was giving false and inaccurate information about the state of the filesystem. Had it even just changed something to a "degraded" state, the problems could have been picked up much, much sooner.

Presumably the disk category was still good because the vdisks were theoretically good. I suggest renaming that to VDISK to more accurately reflect what it is about, and adding a PDISK category. Then when a pdisk starts showing IO errors you can increment the number of disks in a degraded state, and it can be picked up without end users having to roll their own monitoring.

The idea is to improve the disk hospital more and more, so that the decision to switch a disk back to OK becomes more accurate over time.

Until then .. it might always be a good idea to scan the event log for pdisk errors ...


That is my conclusion, that mmhealth is as useful as a chocolate teapot because you can't rely on it to provide correct information and I need to do my own health monitoring of the system.


JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
