Progress.  I did another round of "tunefs.lustre --writeconf" to take out the IB, 
so we are on Ethernet only.  I think the MDS/MGS failover worked properly: 
note the "Connection restored to MGC192.52.98.30@tcp_1 (at 192.52.98.31@tcp)" 
message in the OSS logs below, and that the ptlrpc_expire_one_request messages 
stop once the connections are restored.  The info in /proc/fs/lustre/mgc still 
appears to point to the original MGS IP, though.  Ahh, but I see that while the 
path /proc/fs/lustre/mgc/MGC192.52.98.30@tcp/import still names the primary, 
the *contents* indicate that it is really pointed at the secondary (see 
current_connection at the very bottom).  I probably need to put the IB back in 
the mix and test this again...

It's still not clear to me whether this is a setup error on my part or a Lustre 
bug.  I'm kind of thinking it's a bug, since the setup is the same and all I've 
really done is remove the IB network.  But maybe I'm doing something wrong with 
the multiple-fabric setup.  Thoughts appreciated.  

Details – all after the failover. 

OSS00 logs:

Jan 13 23:54:27 hpfs-fsl-oss00 kernel: Lustre: 
27683:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484373260/real 1484373260]  req@ffff881e9e8a3c00 
x1556477804282864/t0(0) 
o400->[email protected]@tcp:12/10 lens 224/224 e 0 to 1 
dl 1484373267 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:54:27 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: 
Connection to hpfs-fsl-MDT0000 (at 192.52.98.30@tcp) was lost; in progress 
operations using this service will wait for recovery to complete
Jan 13 23:54:33 hpfs-fsl-oss00 kernel: Lustre: 
27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484373267/real 1484373267]  req@ffff881e9e8a3f00 
x1556477804282880/t0(0) 
o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 
dl 1484373273 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:54:58 hpfs-fsl-oss00 kernel: Lustre: 
27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484373292/real 1484373292]  req@ffff881e9e8a4500 
x1556477804282912/t0(0) 
o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 
dl 1484373298 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:04 hpfs-fsl-oss00 kernel: Lustre: 
27682:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484373260/real 1484373260]  req@ffff881e9e8a3900 
x1556477804282848/t0(0) o400->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
224/224 e 0 to 1 dl 1484373304 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:04 hpfs-fsl-oss00 kernel: LustreError: 166-1: MGC192.52.98.30@tcp: 
Connection to MGS (at 192.52.98.30@tcp) was lost; in progress operations using 
this service will fail
Jan 13 23:55:07 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection 
restored to MGC192.52.98.30@tcp_1 (at 192.52.98.31@tcp)
Jan 13 23:55:10 hpfs-fsl-oss00 kernel: Lustre: 
27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484373304/real 1484373304]  req@ffff881e9e8a4800 
x1556477804282928/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484373310 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:29 hpfs-fsl-oss00 kernel: Lustre: Evicted from MGS (at 
MGC192.52.98.30@tcp_1) after server handle changed from 0x9027cb7bbd974ef1 to 
0x1bd9753462f57d48
Jan 13 23:55:29 hpfs-fsl-oss00 kernel: Lustre: MGC192.52.98.30@tcp: Connection 
restored to MGC192.52.98.30@tcp_1 (at 192.52.98.31@tcp)
Jan 13 23:55:40 hpfs-fsl-oss00 kernel: Lustre: 
27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484373329/real 1484373329]  req@ffff881e9e8a4e00 
x1556477804282960/t0(0) 
o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 
dl 1484373340 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:46 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: deleting 
orphan objects from 0x0:16904081 to 0x0:16904417
Jan 13 23:55:54 hpfs-fsl-oss00 kernel: LustreError: 167-0: 
hpfs-fsl-MDT0000-lwp-OST0000: This client was evicted by hpfs-fsl-MDT0000; in 
progress operations using this service will fail.
Jan 13 23:55:54 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: 
Connection restored to 192.52.98.31@tcp (at 192.52.98.31@tcp)
Jan 14 00:30:01 hpfs-fsl-oss00 systemd: Starting Session 644 of user root.
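
For anyone wanting to check the same thing on their own logs, here is a rough Python sketch that reports any ptlrpc_expire_one_request timeouts logged after the last "Connection restored" message. It assumes unwrapped syslog lines (one message per line, as in /var/log/messages). The function name and the SAMPLE list are made up for illustration; the sample lines are shortened copies of three messages from the log above.

```python
import re

# Leading syslog timestamp, e.g. "Jan 13 23:55:10"
TS = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s")

def timeouts_after_last_restore(lines):
    """Return timestamps of request timeouts seen after the last restore message."""
    last_restore = -1
    timeout_idx = []
    for i, line in enumerate(lines):
        if "Connection restored" in line:
            last_restore = i
        elif "ptlrpc_expire_one_request" in line:
            timeout_idx.append(i)
    result = []
    for i in timeout_idx:
        m = TS.match(lines[i])
        if i > last_restore and m:
            result.append(m.group(1))
    return result

# Shortened copies of three messages from the OSS00 log (illustrative only).
SAMPLE = [
    "Jan 13 23:55:07 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: "
    "Connection restored to MGC192.52.98.30@tcp_1 (at 192.52.98.31@tcp)",
    "Jan 13 23:55:10 hpfs-fsl-oss00 kernel: Lustre: "
    "27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out",
    "Jan 13 23:55:54 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: "
    "Connection restored to 192.52.98.31@tcp (at 192.52.98.31@tcp)",
]

if __name__ == "__main__":
    print(timeouts_after_last_restore(SAMPLE))
```

An empty result means no timeouts were logged after the final restore; feeding it only the first two sample lines would flag the 23:55:10 timeout.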




[root@hpfs-fsl-oss00 ~]# cat /proc/fs/lustre/mgc/MGC192.52.98.30@tcp/import 
import:
    name: MGC192.52.98.30@tcp
    target: MGS
    state: FULL
    connect_flags: [ version, adaptive_timeouts, mds_mds_connection, full20, 
imp_recov, bulk_mbits ]
    connect_data:
       flags: 0x2000011005000020
       instance: 0
       target_version: 2.9.51.0
    import_flags: [ pingable, connect_tried ]
    connection:
       failover_nids: [ 192.52.98.31@tcp, 192.52.98.30@tcp ]
       current_connection: 192.52.98.31@tcp
       connection_attempts: 3
       generation: 2
       in-progress_invalidations: 0
[root@hpfs-fsl-oss00 ~]#
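
Since the directory name keeps the NID the MGC was created with, the only reliable indicator is the current_connection field inside the import file. A minimal Python sketch of that comparison follows, assuming the import file layout shown above. The function name and the returned keys are hypothetical, and SAMPLE is a trimmed copy of the output above.

```python
import re

def mgc_import_summary(text):
    """Compare the NID embedded in the MGC name with the NID actually in use."""
    name = re.search(r"^\s*name:\s*(\S+)", text, re.M).group(1)
    current = re.search(r"^\s*current_connection:\s*(\S+)", text, re.M).group(1)
    nids_raw = re.search(r"failover_nids:\s*\[([^\]]*)\]", text).group(1)
    nids = [n.strip() for n in nids_raw.split(",") if n.strip()]
    # The device name embeds the original NID, e.g. MGC192.52.98.30@tcp
    named_nid = name[len("MGC"):] if name.startswith("MGC") else name
    return {
        "name": name,
        "named_nid": named_nid,
        "current": current,
        "failover_nids": nids,
        "failed_over": current != named_nid,
    }

# Trimmed copy of the import file contents shown above.
SAMPLE = """import:
    name: MGC192.52.98.30@tcp
    target: MGS
    state: FULL
    connection:
       failover_nids: [ 192.52.98.31@tcp, 192.52.98.30@tcp ]
       current_connection: 192.52.98.31@tcp
       connection_attempts: 3
"""

if __name__ == "__main__":
    info = mgc_import_summary(SAMPLE)
    print(info["name"], "->", info["current"], "failed over:", info["failed_over"])
```

On a live node the text would come from reading /proc/fs/lustre/mgc/MGC192.52.98.30@tcp/import instead of SAMPLE.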


