I tried a failover, making sure Lustre, including LNet, was completely shut down on the primary MDS. This didn't work either; LNet hung just as I remembered. So I powered down the primary MDS to force it offline and then mounted Lustre on the secondary MDS. The services and a client recovered, but the OSTs still appear to be pointing to the primary MGS (same lctl output and /proc/fs/lustre/mgc), and the ptlrpc_expire_one_request messages start up on the OSSes. I then tried to remount an OST, thinking that it might contact the secondary MGS properly when mounting. That also did not work.
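(In case it's our configuration: my understanding is that a target will only try the secondary MGS if both MGS NIDs were recorded on the target itself. This is a sketch of how I'd check and, if needed, append the second NID with tunefs.lustre; I haven't run it against this system, and the ZFS dataset name is just our OST0000 example.)

```shell
# Dry run: print the parameters currently stored on the target without
# changing anything (dataset name is an example; ours is oss00-0/ost-fsl).
tunefs.lustre --dryrun oss00-0/ost-fsl

# If only the primary MGS NID (192.52.98.30@tcp) is recorded, append the
# secondary MDS NID. The target must be unmounted when this runs.
tunefs.lustre --mgsnode=192.52.98.31@tcp oss00-0/ost-fsl
```

This would need to be repeated on each OST (and the MDT) for the MGC on every server to know about the failover MGS.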
Any ideas why LNet is hanging when I try to stop it on the MDS? This works properly on the OSSes. It sure seems like either we don't have something configured properly, we aren't doing the failover properly, or there is a bug in Lustre. The details of what was described above follow.

On the primary MDS:

mds0# cd /etc/init.d ; ./lustre stop

This returns quickly:

Jan 11 09:15:53 hpfs-fsl-mds0 kernel: Lustre: Failing over hpfs-fsl-MDT0000
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: 137-5: hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.32@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: Skipped 1 previous similar message
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: 137-5: hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.35@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: Skipped 1 previous similar message
Jan 11 09:15:56 hpfs-fsl-mds0 kernel: LustreError: 137-5: hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.40@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 09:15:59 hpfs-fsl-mds0 kernel: Lustre: 21424:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484147753/real 1484147753] req@ffff881eccfb6900 x1556149769946448/t0(0) o251->MGC192.52.98.30@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1484147759 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:15:59 hpfs-fsl-mds0 kernel: Lustre: server umount hpfs-fsl-MDT0000 complete

Then stop LNet:

mds0# ./lnet stop

This hangs:

Jan 11 09:16:35 hpfs-fsl-mds0 kernel: LNetError: 7065:0:(lib-move.c:1990:lnet_parse()) 192.52.98.39@tcp, src 192.52.98.39@tcp: Dropping PUT (error -108 looking up sender)
Jan 11 09:16:36 hpfs-fsl-mds0 kernel: LNet: Removed LNI 10.148.0.30@o2ib
Jan 11 09:16:37 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:16:41 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:16:49 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:17:05 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:17:37 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:18:41 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:20:49 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:25:05 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect

mds0 was powered down at this point. I looked back through the logs and found the last time I tried this; eventually LNet dumps a stack trace.
Here's that info from the previous attempt:

Jan 9 16:26:13 hpfs-fsl-mds0 kernel: Lustre: Failing over hpfs-fsl-MDT0000
Jan 9 16:26:19 hpfs-fsl-mds0 kernel: Lustre: 25690:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484000773/real 1484000773] req@ffff88069d615400 x1556086544936704/t0(0) o251->MGC192.52.98.30@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1484000779 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 9 16:26:19 hpfs-fsl-mds0 kernel: Lustre: 25690:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Jan 9 16:26:20 hpfs-fsl-mds0 kernel: Lustre: server umount hpfs-fsl-MDT0000 complete
Jan 9 16:26:39 hpfs-fsl-mds0 kernel: LNetError: 25392:0:(lib-move.c:1990:lnet_parse()) 192.52.98.40@tcp, src 192.52.98.40@tcp: Dropping PUT (error -108 looking up sender)
Jan 9 16:26:40 hpfs-fsl-mds0 kernel: LNet: Removed LNI 10.148.0.30@o2ib
Jan 9 16:26:41 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 9 16:26:45 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 9 16:26:53 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 9 16:27:09 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 9 16:27:41 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 9 16:28:45 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 9 16:30:53 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 9 16:35:09 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: INFO: task lctl:25908 blocked for more than 120 seconds.
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: lctl D ffffffffa0d0b560 0 25908 25900 0x00000084
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: ffff881e9ffc7d20 0000000000000082 ffff880f77a7bec0 ffff881e9ffc7fd8
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: ffff881e9ffc7fd8 ffff881e9ffc7fd8 ffff880f77a7bec0 ffffffffa0d0b558
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: ffffffffa0d0b55c ffff880f77a7bec0 00000000ffffffff ffffffffa0d0b560
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: Call Trace:
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8168c989>] schedule_preempt_disabled+0x29/0x70
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8168a5e5>] __mutex_lock_slowpath+0xc5/0x1c0
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff81689a4f>] mutex_lock+0x1f/0x2f
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0cccf45>] LNetNIInit+0x45/0xa10 [lnet]
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff811806bb>] ? unlock_page+0x2b/0x30
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0ce6372>] lnet_configure+0x52/0x80 [lnet]
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0ce64eb>] lnet_ioctl+0x14b/0x180 [lnet]
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0bf2e5c>] libcfs_ioctl+0x2ac/0x4c0 [libcfs]
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0bef427>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff81212035>] do_vfs_ioctl+0x2d5/0x4b0
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8121ccd7>] ? __fd_install+0x47/0x60
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff812122b1>] SyS_ioctl+0xa1/0xc0
Jan 9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff816967c9>] system_call_fastpath+0x16/0x1b
Jan 9 16:43:41 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect

So with the primary MDS shut down I mounted on the secondary MDS:

mds1# cd /etc/init.d/ ; ./lustre start

Jan 11 09:29:48 hpfs-fsl-mds1 kernel: LNet: HW nodes: 2, HW CPU cores: 16, npartitions: 2
Jan 11 09:29:48 hpfs-fsl-mds1 kernel: alg: No test for adler32 (adler32-zlib)
Jan 11 09:29:48 hpfs-fsl-mds1 kernel: alg: No test for crc32 (crc32-table)
Jan 11 09:29:56 hpfs-fsl-mds1 kernel: LNet: Added LNI 192.52.98.31@tcp [8/256/0/180]
Jan 11 09:29:56 hpfs-fsl-mds1 kernel: LNet: Using FMR for registration
Jan 11 09:29:57 hpfs-fsl-mds1 kernel: LNet: Added LNI 10.148.0.31@o2ib [8/256/0/180]
Jan 11 09:29:57 hpfs-fsl-mds1 kernel: LNet: Accept secure, port 988
Jan 11 09:30:22 hpfs-fsl-mds1 kernel: Lustre: Lustre: Build Version: 2.9.51
Jan 11 09:30:22 hpfs-fsl-mds1 kernel: Lustre: MGS: Connection restored to d08a6361-1b98-2c42-a6c4-ec1317aa9351 (at 0@lo)
Jan 11 09:30:23 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Imperative Recovery not enabled, recovery window 300-900
Jan 11 09:30:28 hpfs-fsl-mds1 kernel: Lustre: 10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484148623/real 1484148626] req@ffff881010219e00 x1556242625462976/t0(0) o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 dl 1484148628 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:30:48 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to d08a6361-1b98-2c42-a6c4-ec1317aa9351 (at 0@lo)
Jan 11 09:31:01 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to 192.52.98.32@tcp (at 192.52.98.32@tcp)
Jan 11 09:31:03 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to 192.52.98.40@tcp (at 192.52.98.40@tcp)
Jan 11 09:31:03 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar message
Jan 11 09:31:08 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to 192.52.98.43@tcp (at 192.52.98.43@tcp)
Jan 11 09:31:08 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar message
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: MGS: Connection restored to 47b7f6ce-5d63-8eb1-59b6-4d26560019e9 (at 192.52.98.55@tcp)
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar message
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
Jan 11 09:31:50 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to 192.52.98.42@tcp (at 192.52.98.42@tcp)
Jan 11 09:31:50 hpfs-fsl-mds1 kernel: Lustre: Skipped 6 previous similar messages

And here is the OSS log while all this is happening. As mentioned above, note that the ptlrpc_expire_one_request messages to the primary MGS persist beyond the point when the MDT/MGC is mounted on the secondary MDS.
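(One thing I want to double-check about our own procedure: as I understand it, the MGC only learns the failover MGS NID from what was given at mount/format time. On clients that would be the device string passed to mount, something like the sketch below; the mount point is an example, and I haven't confirmed this is what our client automount actually does.)

```shell
# Client mount listing the primary and failover MGS NIDs, colon-separated,
# so the MGC can fail over from mds0 to mds1. /mnt/hpfs-fsl is an example
# mount point; the fsname hpfs-fsl matches our target names.
mount -t lustre 192.52.98.30@tcp:192.52.98.31@tcp:/hpfs-fsl /mnt/hpfs-fsl
```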
Jan 11 09:15:54 hpfs-fsl-oss00 kernel: LustreError: 11-0: hpfs-fsl-MDT0000-lwp-OST0000: operation obd_ping to node 192.52.98.30@tcp failed: rc = -107
Jan 11 09:15:54 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: Connection to hpfs-fsl-MDT0000 (at 192.52.98.30@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jan 11 09:16:01 hpfs-fsl-oss00 kernel: Lustre: 17097:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484147754/real 1484147754] req@ffff88101f2fbc00 x1556149818209744/t0(0) o400->MGC192.52.98.30@[email protected]@tcp:26/25 lens 224/224 e 0 to 1 dl 1484147761 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:16:01 hpfs-fsl-oss00 kernel: Lustre: 17097:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Jan 11 09:16:01 hpfs-fsl-oss00 kernel: LustreError: 166-1: MGC192.52.98.30@tcp: Connection to MGS (at 192.52.98.30@tcp) was lost; in progress operations using this service will fail
Jan 11 09:17:27 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484147836/real 1484147836] req@ffff88101f256000 x1556149818209888/t0(0) o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 dl 1484147847 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:17:27 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Jan 11 09:20:37 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484148011/real 1484148011] req@ffff88101f486f00 x1556149818210048/t0(0) o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 dl 1484148037 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:20:37 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
Jan 11 09:25:41 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484148286/real 1484148286] req@ffff88101f487b00 x1556149818210224/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484148341 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:25:41 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
Jan 11 09:30:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection restored to hpfs-fsl-MDT0000-mdtlov_UUID (at 192.52.98.31@tcp)
Jan 11 09:30:23 hpfs-fsl-oss00 kernel: Lustre: Skipped 1 previous similar message
Jan 11 09:31:01 hpfs-fsl-oss00 kernel: LustreError: 167-0: hpfs-fsl-MDT0000-lwp-OST0000: This client was evicted by hpfs-fsl-MDT0000; in progress operations using this service will fail.
Jan 11 09:31:01 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: Connection restored to 192.52.98.31@tcp (at 192.52.98.31@tcp)
Jan 11 09:31:26 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: deleting orphan objects from 0x0:16904081 to 0x0:16904321
Jan 11 09:36:06 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484148911/real 1484148914] req@ffff88101f35ad00 x1556149818210640/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484148966 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:36:06 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
Jan 11 09:47:21 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484149586/real 1484149586] req@ffff88101f35e600 x1556149818211216/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484149641 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:47:21 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Jan 11 09:58:36 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484150261/real 1484150261] req@ffff88101f2f9b00 x1556149818211792/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484150316 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:58:36 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Jan 11 10:09:51 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484150936/real 1484150936] req@ffff88101f483f00 x1556149818212368/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484150991 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:09:51 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous similar messages

And this:

[root@hpfs-fsl-oss00 ~]# date
Wed Jan 11 10:12:43 CST 2017
[root@hpfs-fsl-oss00 ~]# lctl dl
  0 UP osd-zfs hpfs-fsl-OST0000-osd hpfs-fsl-OST0000-osd_UUID 5
  1 UP mgc MGC192.52.98.30@tcp 75fa2ba9-749d-e00f-84d3-e4e9b8753be3 5
  2 UP ost OSS OSS_uuid 3
  3 UP obdfilter hpfs-fsl-OST0000 hpfs-fsl-OST0000_UUID 7
  4 UP lwp hpfs-fsl-MDT0000-lwp-OST0000 hpfs-fsl-MDT0000-lwp-OST0000_UUID 5
[root@hpfs-fsl-oss00 ~]# ls /proc/fs/lustre/mgc/
MGC192.52.98.30@tcp
[root@hpfs-fsl-oss00 ~]#

I was wondering if it might work better to remount an OST with the file system failed over to the secondary MDS.
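(A caveat on reading the lctl dl output above: if I understand correctly, the MGC device keeps the name of the NID it was originally configured with, so seeing MGC192.52.98.30@tcp doesn't by itself prove the OSS is still talking to mds0. The import state should show the connection actually in use; a sketch of the check, assuming this Lustre version exposes the import file for the MGC:)

```shell
# Show which connection the MGC import is actually using; after a
# successful failover, current_connection should point at the secondary
# MGS NID even though the device is still named after the primary.
lctl get_param mgc.*.import | grep -E 'current_connection|state'
```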
I tried that; all of the below is while the MDT is still mounted on mds1:

[root@hpfs-fsl-oss00 ~]# date
Wed Jan 11 10:14:23 CST 2017
[root@hpfs-fsl-oss00 ~]# cd /etc/init.d/
[root@hpfs-fsl-oss00 init.d]# ./lustre stop local
Unmounting /mnt/lustre/local/hpfs-fsl-OST0000
[root@hpfs-fsl-oss00 init.d]# ./lnet stop
[root@hpfs-fsl-oss00 init.d]# ./lnet start
LNET configured
[root@hpfs-fsl-oss00 init.d]# ./lustre start local
Mounting oss00-0/ost-fsl on /mnt/lustre/local/hpfs-fsl-OST0000
[root@hpfs-fsl-oss00 init.d]# lctl dl
  0 UP osd-zfs hpfs-fsl-OST0000-osd hpfs-fsl-OST0000-osd_UUID 5
  1 UP mgc MGC192.52.98.30@tcp 17af2f5d-ebd3-b57d-0c3d-9c7bc7654172 5
  2 UP ost OSS OSS_uuid 3
  3 UP obdfilter hpfs-fsl-OST0000 hpfs-fsl-OST0000_UUID 7
  4 UP lwp hpfs-fsl-MDT0000-lwp-OST0000 hpfs-fsl-MDT0000-lwp-OST0000_UUID 5
[root@hpfs-fsl-oss00 init.d]# ls /proc/fs/lustre/mgc/
MGC192.52.98.30@tcp
[root@hpfs-fsl-oss00 init.d]#

Same result. The mds1 and oss00 logs are below.

Jan 11 10:14:42 hpfs-fsl-mds1 kernel: Lustre: 10323:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151275/real 1484151275] req@ffff882036ef7800 x1556242625566576/t0(0) o13->[email protected]@tcp:7/4 lens 224/368 e 0 to 1 dl 1484151282 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jan 11 10:14:42 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-OST0000-osc-MDT0000: Connection to hpfs-fsl-OST0000 (at 192.52.98.32@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jan 11 10:14:48 hpfs-fsl-mds1 kernel: Lustre: 10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151282/real 1484151282] req@ffff88101a9ff800 x1556242625566928/t0(0) o8->[email protected]@tcp:28/4 lens 520/544 e 0 to 1 dl 1484151288 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:15:43 hpfs-fsl-mds1 kernel: Lustre: 10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151332/real 1484151332] req@ffff88101a9fce00 x1556242625568768/t0(0) o8->[email protected]@tcp:28/4 lens 520/544 e 0 to 1 dl 1484151343 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:17:12 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to 192.52.98.32@tcp (at 192.52.98.32@tcp)
Jan 11 10:17:23 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-OST0000-osc-MDT0000: Connection restored to 192.52.98.32@tcp (at 192.52.98.32@tcp)

Jan 11 10:14:30 hpfs-fsl-oss00 kernel: Lustre: Failing over hpfs-fsl-OST0000
Jan 11 10:14:30 hpfs-fsl-oss00 kernel: Lustre: server umount hpfs-fsl-OST0000 complete
Jan 11 10:15:18 hpfs-fsl-oss00 kernel: LNet: Removed LNI 10.148.0.32@o2ib
Jan 11 10:15:20 hpfs-fsl-oss00 kernel: LNet: Removed LNI 192.52.98.32@tcp
Jan 11 10:15:26 hpfs-fsl-oss00 kernel: LNet: HW nodes: 2, HW CPU cores: 16, npartitions: 2
Jan 11 10:15:26 hpfs-fsl-oss00 kernel: alg: No test for adler32 (adler32-zlib)
Jan 11 10:15:26 hpfs-fsl-oss00 kernel: alg: No test for crc32 (crc32-table)
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Added LNI 192.52.98.32@tcp [8/256/0/180]
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Using FMR for registration
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Added LNI 10.148.0.32@o2ib [8/256/0/180]
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Accept secure, port 988
Jan 11 10:15:41 hpfs-fsl-oss00 kernel: Lustre: Lustre: Build Version: 2.9.51
Jan 11 10:15:43 hpfs-fsl-oss00 kernel: LustreError: 137-5: hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.55@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 10:15:46 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151341/real 1484151341] req@ffff882020280000 x1556245476540432/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484151346 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:16:16 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151366/real 1484151366] req@ffff880fff120000 x1556245476540496/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484151376 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:16:22 hpfs-fsl-oss00 kernel: LustreError: 137-5: hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.31@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 10:16:29 hpfs-fsl-oss00 kernel: LustreError: 20547:0:(mgc_request.c:249:do_config_log_add()) MGC192.52.98.30@tcp: failed processing log, type 4: rc = -110
Jan 11 10:16:33 hpfs-fsl-oss00 kernel: LustreError: 137-5: hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.55@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 10:16:35 hpfs-fsl-oss00 kernel: LustreError: 20629:0:(sec_config.c:1107:sptlrpc_target_local_read_conf()) missing llog context
Jan 11 10:16:46 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151391/real 1484151391] req@ffff880fff120300 x1556245476540528/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484151406 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:16:52 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151407/real 1484151407] req@ffff880fff120600 x1556245476540576/t0(0) o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 dl 1484151412 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:17:04 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Imperative Recovery not enabled, recovery window 300-900
Jan 11 10:17:12 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Will be in recovery for at least 5:00, or until 2 clients reconnect
Jan 11 10:17:12 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection restored to (at 192.52.98.31@tcp)
Jan 11 10:17:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection restored to (at 192.52.98.55@tcp)
Jan 11 10:17:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Recovery over after 0:11, of 2 clients 2 recovered and 0 were evicted.
Jan 11 10:17:32 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151432/real 1484151432] req@ffff880fff090000 x1556245476540624/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484151452 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:18:02 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151457/real 1484151457] req@ffff880fff090600 x1556245476540656/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484151482 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:18:57 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151507/real 1484151507] req@ffff880fff090f00 x1556245476540704/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 520/544 e 0 to 1 dl 1484151537 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
