Michael,

It might be a long shot, but is there any chance another machine has the same 
IP address as the one having problems?
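
One rough way to check for this, as a sketch only: take a few snapshots of
`ip neigh show` on another host on the same ethernet segment (a router, say)
and see whether the client's IP ever resolves to more than one MAC address.
The helper names and the sample addresses below are made up for illustration,
and the parsing assumes the usual `<ip> dev <if> lladdr <mac> <state>` line
layout.

```python
from collections import defaultdict


def parse_ip_neigh(text):
    """Yield (ip, mac) pairs from `ip neigh show` style output."""
    for line in text.splitlines():
        fields = line.split()
        # Entries without a resolved MAC (e.g. FAILED) have no "lladdr" field.
        if "lladdr" in fields:
            idx = fields.index("lladdr")
            if idx + 1 < len(fields):
                yield fields[0], fields[idx + 1].lower()


def find_duplicate_ips(snapshots):
    """Return {ip: set_of_macs} for every IP seen with more than one MAC."""
    seen = defaultdict(set)
    for text in snapshots:
        for ip, mac in parse_ip_neigh(text):
            seen[ip].add(mac)
    return {ip: macs for ip, macs in seen.items() if len(macs) > 1}


if __name__ == "__main__":
    # Hypothetical snapshots taken a few hours apart (addresses are invented).
    earlier = "192.0.2.10 dev eth0 lladdr aa:bb:cc:00:00:01 REACHABLE\n"
    later = "192.0.2.10 dev eth0 lladdr aa:bb:cc:00:00:02 STALE\n"
    for ip, macs in find_duplicate_ips([earlier, later]).items():
        print(ip, "seen with", sorted(macs))
```

A changing MAC is only a hint, not proof: a NIC swap or a bonding failover
would look the same, so anything this flags still needs checking by hand.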

--Rick



On 10/30/25, 3:09 PM, "lustre-discuss on behalf of Michael DiDomenico via 
lustre-discuss" wrote:
our network is running 2.15.6 everywhere on rhel9.5. we recently built a new 
machine using 2.15.7 on rhel9.6 and i'm seeing a strange problem. the client is 
ethernet connected to ten lnet routers which bridge ethernet to infiniband. i 
can mount the client just fine and read/write data, but then several hours 
later the client marks all the routers offline. the only recovery is to lazy 
unmount, run lustre_rmmod, and then restart the lustre mount.

nothing unusual comes out in the journal/dmesg logs. to lustre it "looks" like 
someone pulled the network cable, but there's no evidence that this has 
happened physically or even at the switch/software layers.

we upgraded two other machines to see if the problem replicates, but so far it 
hasn't. the only significant difference between the three machines is that the 
one with the problem has heavy container (podman) usage; the others have zero. 
i'm not sure if this is a cause or just a red herring.

any suggestions?
 

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
