Fred is most probably correct here; the two errors are not necessarily the same.
I would guess that looking at

# mmlsrecoverygroupevents dssg2

or

# mmvdisk recoverygroup list --recovery-group dssg2 --events

you would see e1d2s25 multiple times, changing its state from ok to diagnosing and back to ok.

If you feel this is recurring too often (and I tend to agree, given the number of IOErrors), you can always '--simulate-failing' this pdisk and then replace it:

# mmvdisk pdisk change --recovery-group dssg2 --pdisk e1d2s25 --simulate-failing

--
Mit freundlichen Grüßen / Kind regards

Achim Rehor
Technical Support Specialist Spectrum Scale and ESS (SME)
Advisory Product Services Professional
IBM Systems Storage Support - EMEA
achim.re...@de.ibm.com
+49-170-4521194
IBM Deutschland GmbH

-----Original Message-----
From: Fred Stock <sto...@us.ibm.com>
Reply-To: gpfsug main discussion list <gpfsug-discuss@gpfsug.org>
To: gpfsug main discussion list <gpfsug-discuss@gpfsug.org>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Bad disk but not failed in DSS-G
Date: Thu, 20 Jun 2024 21:02:43 +0000

I think you are seeing two different errors. The backup is failing due to a stale file handle error, which usually means the file system was unmounted while the file handle was open. The write error on the physical disk may have contributed to the stale file handle, but I doubt that is the case. As I understand it, a single IO error on a physical disk in an ESS (DSS) system will not cause the disk to be considered bad. This is likely why the system considers the disk to be ok. I suggest you track down the source of the stale file handle and correct that issue to see if your backups will then again be successful.
Fred

Fred Stock, Spectrum Scale Development Advocacy
sto...@us.ibm.com | 720-430-8821

From: gpfsug-discuss <gpfsug-discuss-boun...@gpfsug.org> on behalf of Jonathan Buzzard <jonathan.buzz...@strath.ac.uk>
Date: Thursday, June 20, 2024 at 4:16 PM
To: gpfsug-discuss@gpfsug.org <gpfsug-discuss@gpfsug.org>
Subject: [EXTERNAL] [gpfsug-discuss] Bad disk but not failed in DSS-G

So this came to light because I was checking the mmbackup logs and found that we had not been getting any successful backups for several days, and was seeing lots of errors like this:

Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E] Error on gpfs_iopen([/gpfs/users/xxxyyyyy/.swr],68050746): Stale file handle
Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E] Summary of errors:: _dirscan failures:3, _serious unclassified errors:3.

After some digging around wondering what was going on, I came across these being logged on one of the DSS-G nodes:

[Wed Jun 12 22:22:05 2024] blk_update_request: I/O error, dev sdbv, sector 9144672512 op 0x1:(WRITE) flags 0x700 phys_seg 17 prio class 0

Yikes, looks like I have a failed disk! However, if I do

[root@gpfs2 ~]# mmvdisk pdisk list --recovery-group all --not-ok
mmvdisk: All pdisks are ok.

Clearly that's a load of rubbish. After a lot more prodding:

[root@gpfs2 ~]# mmvdisk pdisk list --recovery-group dssg2 --pdisk e1d2s25 -L

pdisk:
   replacementPriority = 1000
   name = "e1d2s25"
   device = "//gpfs1/dev/sdft(notEnabled),//gpfs1/dev/sdfu(notEnabled),//gpfs2/dev/sdfb,//gpfs2/dev/sdbv"
   recoveryGroup = "dssg2"
   declusteredArray = "DA1"
   state = "ok"
   IOErrors = 444
   IOTimeouts = 8958
   mediaErrors = 15

What on earth gives? Why has the disk not been failed? It's not great that a clearly bad disk is allowed to stick around in the file system and cause problems, IMHO.

When I try to prepare the disk for removal I get

[root@gpfs2 ~]# mmvdisk pdisk replace --prepare --rg dssg2 --pdisk e1d2s25
mmvdisk: Pdisk e1d2s25 of recovery group dssg2 is not currently scheduled for replacement.
mmvdisk:
mmvdisk:
mmvdisk: Command failed. Examine previous error messages to determine cause.

Do I have to use the --force option? I would like to get this disk out of the file system ASAP.

JAB.
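For what it's worth, the way out of this state appears to be the one Achim suggests above rather than --force: mark the pdisk as failing so that GNR drains its data, at which point the replace commands should go through. A minimal sketch, reusing the recovery group and pdisk names from this thread; exact messages, and whether --prepare succeeds immediately or only once the drain completes, may vary by release:

# mmvdisk pdisk change --recovery-group dssg2 --pdisk e1d2s25 --simulate-failing
# mmvdisk pdisk replace --prepare --recovery-group dssg2 --pdisk e1d2s25
(physically swap the drive)
# mmvdisk pdisk replace --recovery-group dssg2 --pdisk e1d2s25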
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss