thanks, i'll see if the printk turns up anything. no, lnet pings from client->router don't work once the client is knocked offline; regular ping via ethernet does work fine. and for the record, the lnet routers are not going down. i have hundreds of other machines connected without issue.
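in case it helps anyone else hitting this, here's a rough sketch of how i'm planning to watch for the moment the routes flip: parse the gateway state out of `lnetctl route show -v`. the YAML layout and the gateway NIDs below are assumptions from my setup (the $sample stands in for live `lnetctl route show -v` output so the parsing can be checked offline); adjust to taste.

```shell
#!/bin/sh
# Sketch: extract the gateways that LNet currently marks "down" from
# `lnetctl route show -v` output. The YAML layout here is an assumption
# based on 2.15.x output; $sample stands in for the live command so the
# awk can be tested without a Lustre client. NIDs are made up.
sample='route:
    - net: o2ib
      gateway: 192.168.1.1@tcp
      state: down
    - net: o2ib
      gateway: 192.168.1.2@tcp
      state: up'

# remember the last gateway seen; print it when its state line says down
down_gateways=$(printf '%s\n' "$sample" |
    awk '/gateway:/ {gw = $2} /state: down/ {print gw}')

echo "down: $down_gateways"
```

on a real client you'd swap the $sample for `lnetctl route show -v` and run it from cron or a loop, so you get a timestamp on the first failure instead of finding out hours later.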
On Thu, Oct 30, 2025 at 7:34 PM Horn, Chris <[email protected]> wrote:
>
> As a further troubleshooting step, I would suggest enabling neterror in the
> printk mask on the client and LNet routers:
>
> lctl set_param printk=+neterror
>
> This may surface additional information around the routes going down.
>
> Another thing you ought to try is checking connectivity between client and
> routers after the routes get marked down. Do pings over the LNet interface
> work?
>
> ping -I <client_ip> <router_ip>
> lnetctl ping --source <client_nid> <router_nid>
>
> There were only a handful of LNet changes, so it is unlikely to be some
> regression in LNet.
>
> > git -P le 2.15.7 ^2.15.6 lnet
> 17fc6dbcd6 LU-17784 build: improve wiretest for flexible arrays
> 8535cfe29a LU-18572 lnet: Uninitialized var in lnet_peer_add
> c00bb50624 LU-18697 lnet: lnet_peer_del_nid refcount loss
> 9d8dbed27c LU-16594 build: get_random_u32_below, get_acl with dentry
> 247ae64877 LU-17081 build: compatibility for 6.5 kernels
>
> Chris Horn
>
> From: lustre-discuss <[email protected]> on behalf of
> Michael DiDomenico via lustre-discuss <[email protected]>
> Date: Thursday, October 30, 2025 at 2:08 PM
> To: lustre-discuss <[email protected]>
> Subject: [lustre-discuss] client failing off network
>
> our network is running 2.15.6 everywhere on rhel9.5. we recently built
> a new machine using 2.15.7 on rhel9.6 and i'm seeing a strange
> problem. the client is ethernet connected to ten lnet routers which
> bridge ethernet to infiniband.
>
> i can mount the client just fine and read/write data, but then several
> hours later the client marks all the routers offline. the only
> recovery is to lazy unmount, lustre_rmmod, and then restart the lustre
> mount.
>
> nothing unusual comes out in the journal/dmesg logs. to lustre it
> "looks" like someone pulled the network cable, but there's no evidence
> that this has happened physically, or even at the switch/software
> layers.
>
> we upgraded two other machines to see if the problem replicates, but so
> far it hasn't. the only significant difference between the three
> machines is that the one with the problem has heavy container (podman)
> usage; the others have zero. i'm not sure if this is a cause or just
> a red herring.
>
> any suggestions?
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
