Re: [gpfsug-discuss] Bad disk but not failed in DSS-G

Jonathan Buzzard Mon, 24 Jun 2024 05:54:50 -0700

On 24/06/2024 13:16, Achim Rehor wrote:

well ... not necessarily 😄
but on the disk ... just as i expected ... taking it out helps a lot.
Now on taking it out automatically when raising too many errors was a discussion i had several times with the GNR development. The issue really is .. I/O errors on disks (as seen in the mmlsrecoverygroupevent logs) can be due to several issues (the disk itself,
the expander, the IOM, the adapter, the cable ... )
in case of a more general part serving like 5 or more pdisks, that would risk the FT, if we took them out automatically.
Thus ... we don't do that ..


When smartctl for the disk says

Error counter log:

           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    33839        32         0          0     137434.705          32
write:         0       36         0         0          0     178408.893           0


Non-medium error count:        0

A disk with 32 read errors in smartctl is fubar, no ifs no buts. Whatever the balance in ejecting bad disks is, IMHO currently it's in the wrong place because it failed to eject an actual bad disk.
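For anyone wanting to catch this sort of thing themselves, here is a minimal sketch of pulling the uncorrected error counts out of the "Error counter log" section of smartctl's text output. It assumes the seven-column SCSI/SAS layout shown in the figures quoted above; the field names, the function names and the "any non-zero uncorrected count is bad" rule are my own assumptions for illustration, not anything GNR or mmhealth does.

```python
import re

# Sample copied from the smartctl figures quoted in this thread.
SAMPLE = """\
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    33839        32         0          0     137434.705          32
write:         0       36         0         0          0     178408.893           0
"""

# Hypothetical field names for the seven columns, in printed order.
FIELDS = (
    "ecc_fast",
    "ecc_delayed",
    "rereads_rewrites",
    "total_corrected",
    "correction_invocations",
    "gigabytes_processed",
    "total_uncorrected",
)

# Data rows start with the operation name followed by a colon.
ROW = re.compile(r"^(read|write|verify):\s+(.*)$")


def parse_error_counters(text):
    """Return {operation: {field: value}} for each counter row found."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        values = match.group(2).split()
        if len(values) != len(FIELDS):
            continue  # unexpected layout, skip rather than guess
        row = {}
        for name, value in zip(FIELDS, values):
            # Only the gigabytes column is fractional; the rest are counts.
            row[name] = float(value) if name == "gigabytes_processed" else int(value)
        counters[match.group(1)] = row
    return counters


def uncorrected_errors(text):
    """Return {operation: count} where the uncorrected count is non-zero.

    Assumption: any non-zero uncorrected count is worth flagging.
    """
    return {
        op: row["total_uncorrected"]
        for op, row in parse_error_counters(text).items()
        if row["total_uncorrected"] > 0
    }


if __name__ == "__main__":
    print(uncorrected_errors(SAMPLE))
```

Fed the text captured from smartctl for each physical disk, a non-empty result from uncorrected_errors() is something to alert on, independently of what mmhealth reports.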

At an absolute bare minimum mmhealth should not be saying everything is fine and dandy, because clearly it was not. That's the bigger issue. I can live with them not being taken out automatically; it is unacceptable that mmhealth was giving false and inaccurate information about the state of the filesystem. Had it even just changed something to a "degraded" state, the problems could have been picked up much, much sooner.

Presumably the disk category was still good because the vdisks were theoretically good. I suggest renaming that to VDISK to more accurately reflect what it is about, and adding a PDISK category. Then when a pdisk starts showing IO errors you can increment the number of disks in a degraded state and it can be picked up without end users having to roll their own monitoring.

The idea is to improve the disk hospital more and more, so that the decision to switch a disk back to OK is more accurate, over time.
Until then .. it might always be a good idea to scan the event log for pdisk errors ...

That is my conclusion, that mmhealth is as useful as a chocolate teapot because you can't rely on it to provide correct information and I need to do my own health monitoring of the system.



JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
