Have you opened a ticket with Lenovo and/or IBM about this?

If there is a genuine bug here (and it seems there might be), that's the way to 
get it fixed.

We generally find the disk hospital very reliable, and it takes disks out of 
"rotation" (pun intended) for slow performance long before they cause any 
problems. But we have yet to see it fail the other way round - although if 
it's not reporting things, we could simply be missing them...

Cheers,

Luke
-- 
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don’t work on Monday.

-----Original Message-----
From: gpfsug-discuss <gpfsug-discuss-boun...@gpfsug.org> On Behalf Of Jonathan 
Buzzard
Sent: Monday, June 24, 2024 1:52 PM
To: Achim Rehor <achim.re...@de.ibm.com>; gpfsug-discuss@gpfsug.org
Subject: Re: [gpfsug-discuss] Bad disk but not failed in DSS-G



On 24/06/2024 13:16, Achim Rehor wrote:
> well ... not necessarily 😄
> but on the disk ... just as I expected ... taking it out helps a lot.
>
> Taking a disk out automatically when it raises too many errors is a
> discussion I have had several times with GNR development.
> The issue really is that I/O errors on disks (as seen in the
> mmlsrecoverygroupevents log) can be due to several things: the disk
> itself, the expander, the IOM, the adapter, the cable ...
> If the cause is a more general part serving, say, 5 or more pdisks,
> taking them all out automatically would risk the fault tolerance (FT).
> Thus ... we don't do that ..
>

When smartctl for the disk says

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    33839        32         0          0     137434.705          32
write:         0       36         0         0          0     178408.893           0

Non-medium error count:        0


A disk with 32 uncorrected read errors in smartctl is fubar, no ifs, no buts.
Whatever the right balance is for ejecting bad disks, IMHO it is currently in
the wrong place, because it failed to eject an actual bad disk.
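
For anyone wanting to catch that themselves, something along these lines is a
reasonable first pass - only a minimal sketch, assuming smartctl is installed,
the drives are SAS devices exposing the SCSI error counter log shown above,
and that walking /dev/sd* is close enough (on a DSS-G you would really want to
iterate over the pdisk device paths instead):

#!/usr/bin/env python3
# Flag disks whose SCSI "Error counter log" shows uncorrected errors.
# Minimal sketch: walks /dev/sd* (skipping partitions) and parses the
# read:/write:/verify: rows, whose last field is "Total uncorrected errors".
import glob
import subprocess

def uncorrected_errors(device):
    """Return {operation: uncorrected error count} for one device."""
    out = subprocess.run(["smartctl", "-l", "error", device],
                         capture_output=True, text=True, check=False).stdout
    counts = {}
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0].rstrip(":") in ("read", "write", "verify"):
            counts[fields[0].rstrip(":")] = int(fields[-1])
    return counts

if __name__ == "__main__":
    for dev in sorted(d for d in glob.glob("/dev/sd*") if not d[-1].isdigit()):
        bad = {op: n for op, n in uncorrected_errors(dev).items() if n > 0}
        if bad:
            print(f"{dev}: uncorrected errors {bad}")

Run on each IO server, any output at all is a disk worth looking at.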

At an absolute bare minimum, mmhealth should not be saying everything is fine
and dandy, because clearly it was not. That's the bigger issue. I can live
with bad disks not being taken out automatically; it is unacceptable that
mmhealth was giving false and inaccurate information about the state of the
filesystem. Had it even just changed something to a "degraded" state, the
problems could have been picked up much, much sooner.

Presumably the disk category was still good because the vdisks were
theoretically good. I suggest renaming that category to VDISK, to more
accurately reflect what it is about, and adding a PDISK category. Then when a
pdisk starts showing I/O errors you can increment the number of disks in a
degraded state, and it can be picked up without end users having to roll
their own monitoring.

> The idea is to improve the disk hospital more and more, so that the
> decision to switch a disk back to OK is more accurate,   over time.
>
> Until then .. it might always be a good idea to scan the event log for
> pdisk errors ...
>

That is my conclusion: mmhealth is about as useful as a chocolate teapot,
because you can't rely on it to provide correct information, and I need
to do my own health monitoring of the system.
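
As a stopgap, something like the following could do the event-log scan Achim
suggests - a rough sketch only, not an mmhealth replacement: recovery group
names are passed on the command line (e.g. taken from mmlsrecoverygroup), and
both the --days window and the keyword match for pdisk errors are assumptions
to tune against what your mmlsrecoverygroupevents output actually says:

#!/usr/bin/env python3
# Scan GNR recovery group event logs for pdisk errors - a rough sketch,
# not an mmhealth replacement. Pass recovery group names on the command
# line; the keyword match below is an assumption, adjust it to whatever
# wording your mmlsrecoverygroupevents output actually uses.
import subprocess
import sys

KEYWORDS = ("pdisk", "error")   # assumed wording; tune for your site

def pdisk_error_events(rg, days=1):
    """Return event-log lines from the last `days` days matching KEYWORDS."""
    out = subprocess.run(["mmlsrecoverygroupevents", rg, "--days", str(days)],
                         capture_output=True, text=True, check=False).stdout
    return [line for line in out.splitlines()
            if all(word in line.lower() for word in KEYWORDS)]

if __name__ == "__main__":
    hits = [f"{rg}: {line}"
            for rg in sys.argv[1:] for line in pdisk_error_events(rg)]
    if hits:
        print("\n".join(hits))
        sys.exit(1)   # non-zero exit so a cron/Nagios wrapper can alert on it

Wired into cron or Nagios on each IO server, the non-zero exit is enough to
raise an alert.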


JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
