Good Morning,

     I'm currently working on an issue with my OPA network that seems to be 
having an unusual impact on Lustre.  When one of the nodes on the OPA fabric 
reboots, it sometimes has trouble reaching one of the four LNet routers that 
we have set up.  This isn't, in itself, a Lustre problem, as the nodes 
experiencing this issue can't even ping that LNet router's OPA interface.  
The impact is that the Lustre file system can't mount, even though the other 
three LNet routers are available.  Eventually the issue clears up and Lustre 
is able to mount, but I'm wondering why having one of the four LNet routers 
down would prevent the Lustre file system from mounting.  Is it because the 
LNet router is only down from the OPA side, while the IB side is still up?


Here is what the system records in dmesg:

[   45.839655] Lustre: Server MGS version (2.5.42.4) is much older than client. Consider upgrading server (2.10.4)
[   52.847699] Lustre: 3469:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555330354/real 1555330354]  req@ffff882ef0930300 x1630882080227408/t0(0) o501->MGC172.17.4.125@o2ib@172.17.4.125@o2ib:26/25 lens 296/272 e 0 to 1 dl 1555330361 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[   52.847735] LustreError: 166-1: MGC172.17.4.125@o2ib: Connection to MGS (at 172.17.4.125@o2ib) was lost; in progress operations using this service will fail
[   52.847933] LustreError: 15c-8: MGC172.17.4.125@o2ib: The configuration from log 'lustre2-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[   52.848876] Lustre: Unmounted lustre2-client
[   52.849424] Lustre: MGC172.17.4.125@o2ib: Connection restored to MGC172.17.4.125@o2ib_0 (at 172.17.4.125@o2ib)
[   52.857803] LustreError: 3469:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-5)
[   56.828713] LNet: 3604:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.20.0.3@o2ib1: 4294722 seconds
[  106.829847] LNet: 3604:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.20.0.3@o2ib1: 4294772 seconds

The 172.20.0.3 address is the LNet router in question.
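
In case it helps anyone compare against their own nodes, here is a minimal Python sketch for counting which peers show up in the "Timed out tx" messages of a saved dmesg capture. It assumes the messages look like the kiblnd_check_conns() lines above; the function name timed_out_peers and the regular expression are my own illustration, not part of Lustre or LNet.

```python
import re
from collections import Counter

# Assumed message shape, taken from the dmesg excerpt in this mail:
#   "... Timed out tx for <ipv4>@<net>: <N> seconds"
# \s+ is used between words so that mail-wrapped lines still match.
TIMEOUT_RE = re.compile(
    r"Timed out\s+tx for\s+(\d+(?:\.\d+){3}@\w+):\s+(\d+)\s+seconds"
)


def timed_out_peers(dmesg_text):
    """Return a Counter mapping each peer NID to its number of tx timeouts."""
    return Counter(nid for nid, _seconds in TIMEOUT_RE.findall(dmesg_text))


if __name__ == "__main__":
    sample = (
        "[   56.828713] LNet: 3604:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) "
        "Timed out \ntx for 172.20.0.3@o2ib1: 4294722 seconds\n"
        "[  106.829847] LNet: 3604:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) "
        "Timed out tx for 172.20.0.3@o2ib1: 4294772 seconds\n"
    )
    for nid, count in timed_out_peers(sample).items():
        print(nid, count)
```

Running it over the output of dmesg from an affected node and from a healthy one should show whether only the one router's NID is timing out.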


w/r,

Kurt J. Strosahl

System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
