During the installation of RHEL 8.7 and Lustre 2.15.2, it turned out that the iDRAC cards had "OS to iDRAC Pass-through" enabled by default. As a result, LNet mistakenly picked up the iDRAC pass-through interface (the link-local 169.254.1.2 address visible in the logs below) as its network interface and tried to use it instead of the actual data network.

To resolve this, we manually removed the iDRAC connection and put the correct NID into the modprobe file, which eliminated the disk failure issue.
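
For reference, the modprobe change amounts to pinning LNet to the correct interface instead of letting it autodetect one. A minimal sketch, assuming the data network lives on eno1 (substitute your actual interface name; the file location is the conventional one, not necessarily what your install uses):

  # /etc/modprobe.d/lustre.conf
  options lnet networks="tcp0(eno1)"

After unloading the Lustre modules (lustre_rmmod) and remounting, LNet should bind to that interface rather than the iDRAC pass-through one.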

Here are the steps we took to disconnect the iDRAC connection:
1. We ran nmcli connection show to list all connections and confirm how the iDRAC interface was named (ours appeared as "Wired connection 1", but it may be listed under a different NAME).
2. We then ran nmcli connection delete "Wired connection 1" (substituting the NAME assigned to the iDRAC) to delete the connection.
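
To confirm the fix took, a few quick checks (standard NetworkManager/iproute2 commands; lctl needs the Lustre modules loaded):

  nmcli device status   # the iDRAC pass-through interface should no longer have a connection
  ip addr               # no 169.254.x.x address should remain on that interface
  lctl list_nids        # should now report the data-network NID, not 169.254.1.2@tcp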

Jane


On 2023-05-13 09:25, John Hearns wrote:
Can you say more about these networking issues?
Good to make a note of them in case anyone sees similar in the future.


On Fri, 12 May 2023, 20:40 Jane Liu via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hi Jeff,

Thanks for your response. We discovered later that the network issues originating from the iDRAC IP were causing the SAS driver to hang or experience timeouts when trying to access the drives. This resulted in the drives being kicked out.

Once we resolved this issue, both the mkfs and mount operations started working fine.

Thanks,
Jane

On 2023-05-10 12:43, Jeff Johnson wrote:
Jane,

You're having hardware errors. The codes in those mpt3sas errors decode as "PL_LOGINFO_SUB_CODE_OPEN_FAILURE_ORR_TIMEOUT"; in other words, your SAS HBA cannot open a command dialogue with your disk. I'd suspect backplane or cabling issues, as an internal disk failure would be reported by the target disk with its own error code. In this case your HBA can't even talk to it properly.
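
For anyone decoding these later, my reading of how the mpt3sas driver unpacks that 32-bit log_info value (field layout taken from the driver source, so treat this as a decoding aid rather than gospel):

  0x3112011a
    bits 31-28: bus_type   = 0x3    (SAS)
    bits 27-24: originator = 0x1    (PL, the SAS protocol layer)
    bits 23-16: code       = 0x12   (open failure)
    bits 15-0 : sub_code   = 0x011a (open failure ORR timeout)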

Is sdah the partner mpath device to sdef? Or is sdah a second failing disk interface?

Looking at this, I don't think your hardware is deploy-ready.

--Jeff

On Wed, May 10, 2023 at 9:29 AM Jane Liu via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hi,

We recently attempted to add several new OSS servers (RHEL 8.7 and Lustre 2.15.2). While creating new OSTs, I noticed that mdstat reported some disk failures after the mkfs, even though the disks were functional before the mkfs command. Our hardware admins managed to resolve the mdstat issue and restore the disks to normal operation. However, when I ran the mount OST command (while the network had a problem and the mount command timed out), similar problems occurred, and several disks were kicked out. The relevant /var/log/messages entries are provided below.

This problem was consistent across all our OSS servers. Any insights into the possible cause would be appreciated.

Jane

-----------------------------

May  9 13:33:15 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro
May  9 13:33:15 sphnxoss47 systemd[1]: tmp-mntmirJ5z.mount: Succeeded.
May  9 13:33:16 sphnxoss47 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 72, npartitions: 2
May  9 13:33:16 sphnxoss47 kernel: alg: No test for adler32 (adler32-zlib)
May  9 13:33:16 sphnxoss47 kernel: Key type ._llcrypt registered
May  9 13:33:16 sphnxoss47 kernel: Key type .llcrypt registered
May  9 13:33:16 sphnxoss47 kernel: Lustre: Lustre: Build Version: 2.15.2
May  9 13:33:16 sphnxoss47 kernel: LNet: Added LNI 169.254.1.2@tcp [8/256/0/180]
May  9 13:33:16 sphnxoss47 kernel: LNet: Accept secure, port 988
May  9 13:33:17 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
May  9 13:33:17 sphnxoss47 kernel: Lustre: sphnx01-OST0244-osd: enabled 'large_dir' feature on device /dev/md0
May  9 13:33:25 sphnxoss47 systemd-logind[8609]: New session 7 of user root.
May  9 13:33:25 sphnxoss47 systemd[1]: Started Session 7 of user root.
May  9 13:34:36 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0244: cannot register this server with the MGS: rc = -110. Is the MGS running?
May  9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
May  9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0244
May  9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0244 not registered
May  9 13:34:39 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0244 complete
May  9 13:34:39 sphnxoss47 kernel: LustreError: 45314:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
May  9 13:34:40 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
May  9 13:34:40 sphnxoss47 systemd[1]: tmp-mntXT85fz.mount: Succeeded.
May  9 13:34:41 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
May  9 13:34:41 sphnxoss47 kernel: Lustre: sphnx01-OST0245-osd: enabled 'large_dir' feature on device /dev/md1
May  9 13:36:00 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0245: cannot register this server with the MGS: rc = -110. Is the MGS running?
May  9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
May  9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0245
May  9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0245 not registered
May  9 13:36:08 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0245 complete
May  9 13:36:08 sphnxoss47 kernel: LustreError: 46127:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
May  9 13:36:08 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
May  9 13:36:08 sphnxoss47 systemd[1]: tmp-mnt17IOaq.mount: Succeeded.
May  9 13:36:09 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc

-----------------------------

It just repeats for all of the md RAIDs, then the errors start and the drive fails and is disabled:

May  9 13:44:31 sphnxoss47 kernel: LustreError: 48069:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
....
....
May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#1102 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#1102 CDB: Read(10) 28 00 00 00 87 79 00 00 01 00
May  9 13:44:33 sphnxoss47 kernel: blk_update_request: I/O error, dev sdef, sector 277448 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#6800 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
May  9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#6800 CDB: Read(10) 28 00 00 00 87 dd 00 00 01 00
May  9 13:44:33 sphnxoss47 kernel: blk_update_request: I/O error, dev sdef, sector 278248 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
May  9 13:44:33 sphnxoss47 kernel: device-mapper: multipath: 253:52: Failing path 128:112.
May  9 13:44:33 sphnxoss47 multipathd[6051]: sdef: mark as failed
May  9 13:44:33 sphnxoss47 multipathd[6051]: mpathae: remaining active paths: 1
...
...
May  9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:34 sphnxoss47 kernel: md: super_written gets error=-5
May  9 13:44:34 sphnxoss47 kernel: md/raid:md8: Disk failure on dm-55, disabling device.
May  9 13:44:34 sphnxoss47 kernel: md: super_written gets error=-5
May  9 13:44:34 sphnxoss47 kernel: md/raid:md8: Operation continuing on 9 devices.
May  9 13:44:34 sphnxoss47 multipathd[6051]: sdah: mark as failed

--

------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out
Storage



