Have you had a close look at the logs from your subnet manager?
Assuming you run OpenSM on a server, this is opensm.log (typically
/var/log/opensm.log).
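
If it helps, here is a minimal sketch for pulling the interesting lines
out of that log (port/link state changes, SM errors, re-sweeps); the
default path and the keyword list are assumptions, so adjust them for
your fabric:

#!/usr/bin/env python3
# Rough sketch: scan an OpenSM log for lines that often accompany
# fabric trouble (port up/down events, errors, re-sweeps).
# The default path and the keyword list are assumptions -- adjust
# them for your environment.
import re
import sys

log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/opensm.log"

# Keywords commonly seen around link flaps and re-sweeps (assumed list)
pattern = re.compile(r"\bERR\b|LinkDown|port.*down|SUBNET UP|heavy sweep",
                     re.IGNORECASE)

with open(log_path, errors="replace") as fh:
    for line in fh:
        if pattern.search(line):
            print(line.rstrip())

Correlating the timestamps of any hits with the client-side "Connection
restored" / status -5 bulk callback messages should tell you quickly
whether the fabric is dropping out from under Lustre.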

On Fri, 21 Jun 2024 at 16:35, Kurt Strosahl via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Good Morning,
>
>     We've been experiencing a fairly nasty issue with our clients
> following our move to Alma 9.  It seems to occur at random intervals
> (anywhere from a few days to over a week): the clients with ConnectX-3
> cards start getting LNet network errors and seeing hangs that move
> across random OSTs spread over our OSS systems, as well as issues
> talking to the MGS.  This can then trigger crash cycles on the OSS
> systems themselves (again in the LNet layer).  The only fix we have
> found so far is to power down all the impacted clients and let the
> impacted OSS systems reboot.
>
> Here is a snippet of the error as we see it on the client:
> [Jun21 08:16] Lustre: lustre19-OST0020-osc-ffff934c22a29800: Connection
> restored to 172.17.0.97@o2ib (at 172.17.0.97@o2ib)
> [  +0.000006] Lustre: Skipped 2 previous similar messages
> [  +3.079695] Lustre: lustre19-MDT0000-mdc-ffff934c22a29800: Connection
> restored to 172.17.0.37@o2ib (at 172.17.0.37@o2ib)
> [  +0.223480] LustreError: 4478:0:(events.c:211:client_bulk_callback())
> event type 2, status -5, desc 00000000784c6e4f
> [  +0.000007] LustreError: 4478:0:(events.c:211:client_bulk_callback())
> Skipped 3 previous similar messages
> [ +22.955501] Lustre:
> 3935794:0:(client.c:2289:ptlrpc_expire_one_request()) @@@ Request sent has
> failed due to network error: [sent 1718972176/real 1718972176]
>  req@000000008c377199 x1801581392820160/t0(0)
> o13->lustre24-OST0006-osc-ffff934b8f4a7000@172.17.1.42@o2ib:7/4 lens
> 224/368 e 0 to 1 dl 1718972183 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1
> job:'lfs.7953'
> [  +0.000006] Lustre:
> 3935794:0:(client.c:2289:ptlrpc_expire_one_request()) Skipped 21 previous
> similar messages
> [ +20.333921] Lustre: lustre19-OST000a-osc-ffff934c22a29800: Connection
> restored to 172.17.0.39@o2ib (at 172.17.0.39@o2ib)
> [Jun21 08:17] LustreError: 166-1: MGC172.17.0.36@o2ib: Connection to MGS
> (at 172.17.0.37@o2ib) was lost; in progress operations using this service
> will fail
>
> [  +0.000302] Lustre: lustre19-OST0046-osc-ffff934c22a29800: Connection to
> lustre19-OST0046 (at 172.17.0.103@o2ib) was lost; in progress operations
> using this service will wait for recovery to complete
>
> [  +0.000005] Lustre: Skipped 6 previous similar messages
> [  +6.144196] Lustre: MGC172.17.0.36@o2ib: Connection restored to
> 172.17.0.37@o2ib (at 172.17.0.37@o2ib)
> [  +0.000006] Lustre: Skipped 1 previous similar message
>
> We have a mix of client hardware, but the systems are uniform in their
> kernel and Lustre client versions.
>
> Here are the client software versions:
> kernel-modules-core-5.14.0-362.24.1.el9_3.x86_64
> kernel-core-5.14.0-362.24.1.el9_3.x86_64
> kernel-modules-5.14.0-362.24.1.el9_3.x86_64
> kernel-5.14.0-362.24.1.el9_3.x86_64
> kernel-modules-core-5.14.0-362.24.2.el9_3.x86_64
> kernel-core-5.14.0-362.24.2.el9_3.x86_64
> kernel-modules-5.14.0-362.24.2.el9_3.x86_64
> kernel-tools-libs-5.14.0-362.24.2.el9_3.x86_64
> kernel-tools-5.14.0-362.24.2.el9_3.x86_64
> kernel-5.14.0-362.24.2.el9_3.x86_64
> kernel-headers-5.14.0-362.24.2.el9_3.x86_64
>
> and Lustre:
> kmod-lustre-client-2.15.4-1.el9.jlab.x86_64
> lustre-client-2.15.4-1.el9.jlab.x86_64
>
> Our OSS systems are running EL7 with MOFED for their InfiniBand stack,
> and have ConnectX-3 cards:
> kernel-tools-libs-3.10.0-1160.76.1.el7.x86_64
> kernel-tools-3.10.0-1160.76.1.el7.x86_64
> kernel-headers-3.10.0-1160.76.1.el7.x86_64
> kernel-abi-whitelists-3.10.0-1160.76.1.el7.noarch
> kernel-devel-3.10.0-1160.76.1.el7.x86_64
> kernel-3.10.0-1160.76.1.el7.x86_64
>
> and Lustre version:
> lustre-2.12.9-1.el7.x86_64
> kmod-lustre-osd-zfs-2.12.9-1.el7.x86_64
> lustre-osd-zfs-mount-2.12.9-1.el7.x86_64
> lustre-resource-agents-2.12.9-1.el7.x86_64
> kmod-lustre-2.12.9-1.el7.x86_64
>
> w/r,
>
> Kurt J. Strosahl (he/him)
> System Administrator: Lustre, HPC
> Scientific Computing Group, Thomas Jefferson National Accelerator Facility
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
