Fred is most probably correct here; the two errors are not necessarily the same.
I would guess that looking at

# mmlsrecoverygroupevents dssg2

or

# mmvdisk recoverygroup list --recovery-group dssg2 --events

you would see e1d2s25 multiple times, changing its state from ok to diagnosing and back to ok.

If you feel this is recurring too often (and I tend to agree, given the number of IOErrors), you can always '--simulate-failing' this pdisk and then replace it:

# mmvdisk pdisk change --recovery-group dssg2 --pdisk e1d2s25 --simulate-failing

--
Mit freundlichen Grüßen / Kind regards

Achim Rehor
Technical Support Specialist Spectrum Scale and ESS (SME)
Advisory Product Services Professional
IBM Systems Storage Support - EMEA
achim.re...@de.ibm.com
+49-170-4521194
IBM Deutschland GmbH

-----Original Message-----
From: Fred Stock <sto...@us.ibm.com>
Reply-To: gpfsug main discussion list <gpfsug-discuss@gpfsug.org>
To: gpfsug main discussion list <gpfsug-discuss@gpfsug.org>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Bad disk but not failed in DSS-G
Date: Thu, 20 Jun 2024 21:02:43 +0000

I think you are seeing two different errors. The backup is failing due to a stale file handle error, which usually means the file system was unmounted while the file handle was open. The write error on the physical disk may have contributed to the stale file handle, but I doubt that is the case. As I understand it, a single IO error on a physical disk in an ESS (DSS) system will not cause the disk to be considered bad. This is likely why the system considers the disk to be ok. I suggest you track down the source of the stale file handle and correct that issue to see if your backups will then again be successful.
Fred

Fred Stock, Spectrum Scale Development Advocacy
sto...@us.ibm.com | 720-430-8821

From: gpfsug-discuss <gpfsug-discuss-boun...@gpfsug.org> on behalf of Jonathan Buzzard <jonathan.buzz...@strath.ac.uk>
Date: Thursday, June 20, 2024 at 4:16 PM
To: gpfsug-discuss@gpfsug.org <gpfsug-discuss@gpfsug.org>
Subject: [EXTERNAL] [gpfsug-discuss] Bad disk but not failed in DSS-G

So this came to light because I was checking the mmbackup logs and found that we had not been getting any successful backups for several days, and was seeing lots of errors like this:

Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E] Error on gpfs_iopen([/gpfs/users/xxxyyyyy/.swr],68050746): Stale file handle
Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E] Summary of errors:: _dirscan failures:3, _serious unclassified errors:3.

After some digging around wondering what was going on, I came across these being logged on one of the DSS-G nodes:

[Wed Jun 12 22:22:05 2024] blk_update_request: I/O error, dev sdbv, sector 9144672512 op 0x1:(WRITE) flags 0x700 phys_seg 17 prio class 0

Yikes, looks like I have a failed disk! However, if I do

[root@gpfs2 ~]# mmvdisk pdisk list --recovery-group all --not-ok
mmvdisk: All pdisks are ok.

Clearly that's a load of rubbish. After a lot more prodding:

[root@gpfs2 ~]# mmvdisk pdisk list --recovery-group dssg2 --pdisk e1d2s25 -L

pdisk:
   replacementPriority = 1000
   name = "e1d2s25"
   device = "//gpfs1/dev/sdft(notEnabled),//gpfs1/dev/sdfu(notEnabled),//gpfs2/dev/sdfb,//gpfs2/dev/sdbv"
   recoveryGroup = "dssg2"
   declusteredArray = "DA1"
   state = "ok"
   IOErrors = 444
   IOTimeouts = 8958
   mediaErrors = 15

What on earth gives? Why has the disk not been failed? It's not great that a clearly bad disk is allowed to stick around in the file system and cause problems, IMHO.

When I try to prepare the disk for removal I get

[root@gpfs2 ~]# mmvdisk pdisk replace --prepare --rg dssg2 --pdisk e1d2s25
mmvdisk: Pdisk e1d2s25 of recovery group dssg2 is not currently scheduled for replacement.
mmvdisk:
mmvdisk:
mmvdisk: Command failed. Examine previous error messages to determine cause.

Do I have to use the --force option? I would like to get this disk out of the file system ASAP.

JAB.
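For what it's worth, the way out of this state appears to be the one Achim suggests above rather than --force: mark the pdisk as failing so that GNR drains its data, at which point the replace commands should go through. A minimal sketch, reusing the recovery group and pdisk names from this thread; exact messages, and whether --prepare succeeds immediately or only once the drain completes, may vary by release:

# mmvdisk pdisk change --recovery-group dssg2 --pdisk e1d2s25 --simulate-failing
# mmvdisk pdisk replace --prepare --recovery-group dssg2 --pdisk e1d2s25
(physically swap the drive)
# mmvdisk pdisk replace --recovery-group dssg2 --pdisk e1d2s25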
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss