Progress. I did another round of "tunefs.lustre --writeconf" to take out the IB, so we are on Ethernet only. I think the MDS/MGS failover worked properly – note the "Connection restored to MGC192.52.98.30@tcp_1 (at 192.52.98.31@tcp)" message in the OSS logs below – and that the ptlrpc_expire_one_request messages stop once this happens. The info in /proc/fs/lustre/mgc is still pointed at the original MGS IP, though. Ah, but I see that while the path /proc/fs/lustre/mgc/MGC192.52.98.30@tcp/import suggests this is still pointed at the primary, the *contents* show it is really pointed at the secondary (see the very bottom). I probably need to put the IB back in the mix and test this again...
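For the record, each writeconf pass looked roughly like the sketch below. The device path and mount point are placeholders, and the NID list is an assumption based on my config – this is not a verbatim command history:

```shell
# Sketch only: /dev/mapper/ost0 and /mnt/ost0 are placeholder names.
# Unmount the target everywhere first, then regenerate its config log
# listing only the TCP NIDs for the failover-paired MGS.
umount /mnt/ost0
tunefs.lustre --writeconf \
    --erase-params \
    --mgsnode=192.52.98.30@tcp \
    --mgsnode=192.52.98.31@tcp \
    /dev/mapper/ost0
mount -t lustre /dev/mapper/ost0 /mnt/ost0
```

Note that --erase-params clears all stored tunables (which is how the stale IB NIDs go away), so anything else the target needs – e.g. its own --servicenode pair – has to be re-specified on the same command line.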
It's still not clear to me whether this is a setup error on my part or a Lustre bug. I'm inclined to think it's a bug, since the setup is the same and all I've really done is remove the IB network; but maybe I'm doing something wrong with the multiple-fabric setup. Thoughts appreciated.

Details – all after the failover. OSS00 logs:

Jan 13 23:54:27 hpfs-fsl-oss00 kernel: Lustre: 27683:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373260/real 1484373260] req@ffff881e9e8a3c00 x1556477804282864/t0(0) o400->[email protected]@tcp:12/10 lens 224/224 e 0 to 1 dl 1484373267 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:54:27 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: Connection to hpfs-fsl-MDT0000 (at 192.52.98.30@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jan 13 23:54:33 hpfs-fsl-oss00 kernel: Lustre: 27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373267/real 1484373267] req@ffff881e9e8a3f00 x1556477804282880/t0(0) o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 dl 1484373273 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:54:58 hpfs-fsl-oss00 kernel: Lustre: 27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373292/real 1484373292] req@ffff881e9e8a4500 x1556477804282912/t0(0) o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 dl 1484373298 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:04 hpfs-fsl-oss00 kernel: Lustre: 27682:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373260/real 1484373260] req@ffff881e9e8a3900 x1556477804282848/t0(0) o400->MGC192.52.98.30@[email protected]@tcp:26/25 lens 224/224 e 0 to 1 dl 1484373304 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:04 hpfs-fsl-oss00 kernel: LustreError: 166-1: MGC192.52.98.30@tcp: Connection to MGS (at 192.52.98.30@tcp) was lost; in progress operations using this service will fail
Jan 13 23:55:07 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection restored to MGC192.52.98.30@tcp_1 (at 192.52.98.31@tcp)
Jan 13 23:55:10 hpfs-fsl-oss00 kernel: Lustre: 27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373304/real 1484373304] req@ffff881e9e8a4800 x1556477804282928/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484373310 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:29 hpfs-fsl-oss00 kernel: Lustre: Evicted from MGS (at MGC192.52.98.30@tcp_1) after server handle changed from 0x9027cb7bbd974ef1 to 0x1bd9753462f57d48
Jan 13 23:55:29 hpfs-fsl-oss00 kernel: Lustre: MGC192.52.98.30@tcp: Connection restored to MGC192.52.98.30@tcp_1 (at 192.52.98.31@tcp)
Jan 13 23:55:40 hpfs-fsl-oss00 kernel: Lustre: 27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373329/real 1484373329] req@ffff881e9e8a4e00 x1556477804282960/t0(0) o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 dl 1484373340 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:46 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: deleting orphan objects from 0x0:16904081 to 0x0:16904417
Jan 13 23:55:54 hpfs-fsl-oss00 kernel: LustreError: 167-0: hpfs-fsl-MDT0000-lwp-OST0000: This client was evicted by hpfs-fsl-MDT0000; in progress operations using this service will fail.
Jan 13 23:55:54 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: Connection restored to 192.52.98.31@tcp (at 192.52.98.31@tcp)
Jan 14 00:30:01 hpfs-fsl-oss00 systemd: Starting Session 644 of user root.
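As a quick cross-check on which NID the MGC is really talking to, the current_connection field can be pulled straight out of the import file; lctl get_param is the supported interface to the same data (the awk one-liner is just my sketch):

```shell
# The parameter/directory name keeps the NID the MGC was originally
# configured with; current_connection inside the import file shows the
# peer actually in use after a failover.
lctl get_param -n mgc.*.import | awk '/current_connection:/ {print $2}'
```

On this system that prints 192.52.98.31@tcp (the secondary), even though the parameter is still named MGC192.52.98.30@tcp – which is the mismatch described above.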
[root@hpfs-fsl-oss00 ~]# cat /proc/fs/lustre/mgc/MGC192.52.98.30@tcp/import
import:
    name: MGC192.52.98.30@tcp
    target: MGS
    state: FULL
    connect_flags: [ version, adaptive_timeouts, mds_mds_connection, full20, imp_recov, bulk_mbits ]
    connect_data:
        flags: 0x2000011005000020
        instance: 0
        target_version: 2.9.51.0
    import_flags: [ pingable, connect_tried ]
    connection:
        failover_nids: [ 192.52.98.31@tcp, 192.52.98.30@tcp ]
        current_connection: 192.52.98.31@tcp
        connection_attempts: 3
        generation: 2
        in-progress_invalidations: 0
[root@hpfs-fsl-oss00 ~]#

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
