I tried a failover, making sure Lustre, including LNet, was completely shut 
down on the primary MDS.  This didn't work either; LNet hung like I remembered.  
So I powered down the primary MDS to force it offline and then mounted Lustre 
on the secondary MDS.  The services and a client recover, but the OSTs still 
appear to be pointing to the primary MGS (same lctl output and 
/proc/fs/lustre/mgc), and the ptlrpc_expire_one_request messages start up on the 
OSSs.  I then tried to remount an OST, thinking that it might contact the 
secondary MGS properly when mounting.  That also did not work.  

Any ideas why LNet hangs when I try to stop it on the MDS?  This works 
properly on the OSS.  

It sure seems like either we don't have something configured properly or we 
aren't doing the failover properly (or there is a bug in Lustre).  
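
Since the OSTs keep pointing at the primary MGS NID, one thing I'd check (an 
assumption on my part, not something the logs confirm) is whether each target 
was formatted with the secondary MGS NID as well.  `tunefs.lustre --dryrun <dev>` 
prints the stored parameters without changing anything; failover MGS NIDs show 
up as extra mgsnode= entries.  A rough sketch of the check, using a made-up 
Parameters line in place of real tunefs output from this cluster:

```shell
# Hypothetical check: count the mgsnode NIDs a target was formatted with.
# The Parameters line below is sample text standing in for real
# `tunefs.lustre --dryrun <dev>` output on this cluster.
params='Parameters: mgsnode=192.52.98.30@tcp mgsnode=192.52.98.31@tcp'
nids=$(echo "$params" | grep -o 'mgsnode=[^ ]*' | cut -d= -f2-)
count=$(echo "$nids" | wc -l)
echo "$count mgsnode NID(s) configured:"
echo "$nids"
```

If only the primary shows up, adding the failover NID with 
`tunefs.lustre --mgsnode=192.52.98.31@tcp <dev>` (with the target unmounted) 
would be the usual fix, though I'm hedging here since I can't see how the 
targets were formatted.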

The details of what was described above follow.  On the primary MDS:

mds0# cd /etc/init.d ; ./lustre stop

This returns quickly:


Jan 11 09:15:53 hpfs-fsl-mds0 kernel: Lustre: Failing over hpfs-fsl-MDT0000
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: 137-5: 
hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.32
@tcp (no target). If you are running an HA pair check that the target is 
mounted on the other server.
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: Skipped 1 previous similar 
message
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: 137-5: 
hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.35
@tcp (no target). If you are running an HA pair check that the target is 
mounted on the other server.
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: Skipped 1 previous similar 
message
Jan 11 09:15:56 hpfs-fsl-mds0 kernel: LustreError: 137-5: 
hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.40
@tcp (no target). If you are running an HA pair check that the target is 
mounted on the other server.
Jan 11 09:15:59 hpfs-fsl-mds0 kernel: Lustre: 
21424:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484147753/real 1484147753]  req@ffff881eccfb6900 
x1556149769946448/t0(0) o251->MGC192.52.98.30@tcp@0@lo:26/25 lens 224/224 e 0 
to 1 dl 1484147759 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:15:59 hpfs-fsl-mds0 kernel: Lustre: server umount hpfs-fsl-MDT0000 
complete

Then stop lnet:

mds0# ./lnet stop

This hangs:


Jan 11 09:16:35 hpfs-fsl-mds0 kernel: LNetError: 
7065:0:(lib-move.c:1990:lnet_parse()) 192.52.98.39@tcp, src 192.52.98.39@tcp: 
Dropping PUT (error -108 looking up sender)
Jan 11 09:16:36 hpfs-fsl-mds0 kernel: LNet: Removed LNI 10.148.0.30@o2ib
Jan 11 09:16:37 hpfs-fsl-mds0 kernel: LNet: 
21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:16:41 hpfs-fsl-mds0 kernel: LNet: 
21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:16:49 hpfs-fsl-mds0 kernel: LNet: 
21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:17:05 hpfs-fsl-mds0 kernel: LNet: 
21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:17:37 hpfs-fsl-mds0 kernel: LNet: 
21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:18:41 hpfs-fsl-mds0 kernel: LNet: 
21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:20:49 hpfs-fsl-mds0 kernel: LNet: 
21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:25:05 hpfs-fsl-mds0 kernel: LNet: 
21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect



mds0 was powered down at this point.  I looked back through the logs and found 
the last time I tried this; eventually LNet dumps a stack trace.  Here's 
that info from the previous attempt:



Jan  9 16:26:13 hpfs-fsl-mds0 kernel: Lustre: Failing over hpfs-fsl-MDT0000
Jan  9 16:26:19 hpfs-fsl-mds0 kernel: Lustre: 
25690:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484000773/real 1484000773]  req@ffff88069d615400 
x1556086544936704/t0(0) o251->MGC192.52.98.30@tcp@0@lo:26/25 lens 224/224 e 0 
to 1 dl 1484000779 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Jan  9 16:26:19 hpfs-fsl-mds0 kernel: Lustre: 
25690:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 11 previous similar 
messages
Jan  9 16:26:20 hpfs-fsl-mds0 kernel: Lustre: server umount hpfs-fsl-MDT0000 
complete
Jan  9 16:26:39 hpfs-fsl-mds0 kernel: LNetError: 
25392:0:(lib-move.c:1990:lnet_parse()) 192.52.98.40@tcp, src 192.52.98.40@tcp: 
Dropping PUT (error -108 looking up sender)
Jan  9 16:26:40 hpfs-fsl-mds0 kernel: LNet: Removed LNI 10.148.0.30@o2ib
Jan  9 16:26:41 hpfs-fsl-mds0 kernel: LNet: 
25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:26:45 hpfs-fsl-mds0 kernel: LNet: 
25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:26:53 hpfs-fsl-mds0 kernel: LNet: 
25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:27:09 hpfs-fsl-mds0 kernel: LNet: 
25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:27:41 hpfs-fsl-mds0 kernel: LNet: 
25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:28:45 hpfs-fsl-mds0 kernel: LNet: 
25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:30:53 hpfs-fsl-mds0 kernel: LNet: 
25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:35:09 hpfs-fsl-mds0 kernel: LNet: 
25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: INFO: task lctl:25908 blocked for more 
than 120 seconds.
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: lctl            D ffffffffa0d0b560     0 
25908  25900 0x00000084
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: ffff881e9ffc7d20 0000000000000082 
ffff880f77a7bec0 ffff881e9ffc7fd8
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: ffff881e9ffc7fd8 ffff881e9ffc7fd8 
ffff880f77a7bec0 ffffffffa0d0b558
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: ffffffffa0d0b55c ffff880f77a7bec0 
00000000ffffffff ffffffffa0d0b560
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: Call Trace:
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8168c989>] 
schedule_preempt_disabled+0x29/0x70
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8168a5e5>] 
__mutex_lock_slowpath+0xc5/0x1c0
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff81689a4f>] mutex_lock+0x1f/0x2f
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0cccf45>] 
LNetNIInit+0x45/0xa10 [lnet]
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff811806bb>] ? 
unlock_page+0x2b/0x30
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0ce6372>] 
lnet_configure+0x52/0x80 [lnet]
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0ce64eb>] 
lnet_ioctl+0x14b/0x180 [lnet]
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0bf2e5c>] 
libcfs_ioctl+0x2ac/0x4c0 [libcfs]
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0bef427>] 
libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff81212035>] 
do_vfs_ioctl+0x2d5/0x4b0
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8121ccd7>] ? 
__fd_install+0x47/0x60
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff812122b1>] SyS_ioctl+0xa1/0xc0
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff816967c9>] 
system_call_fastpath+0x16/0x1b
Jan  9 16:43:41 hpfs-fsl-mds0 kernel: LNet: 
25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect






So with the primary MDS shut down I mounted on the secondary MDS:

mds1# cd /etc/init.d/ ; ./lustre start




Jan 11 09:29:48 hpfs-fsl-mds1 kernel: LNet: HW nodes: 2, HW CPU cores: 16, 
npartitions: 2
Jan 11 09:29:48 hpfs-fsl-mds1 kernel: alg: No test for adler32 (adler32-zlib)
Jan 11 09:29:48 hpfs-fsl-mds1 kernel: alg: No test for crc32 (crc32-table)
Jan 11 09:29:56 hpfs-fsl-mds1 kernel: LNet: Added LNI 192.52.98.31@tcp 
[8/256/0/180]
Jan 11 09:29:56 hpfs-fsl-mds1 kernel: LNet: Using FMR for registration
Jan 11 09:29:57 hpfs-fsl-mds1 kernel: LNet: Added LNI 10.148.0.31@o2ib 
[8/256/0/180]
Jan 11 09:29:57 hpfs-fsl-mds1 kernel: LNet: Accept secure, port 988
Jan 11 09:30:22 hpfs-fsl-mds1 kernel: Lustre: Lustre: Build Version: 2.9.51
Jan 11 09:30:22 hpfs-fsl-mds1 kernel: Lustre: MGS: Connection restored to 
d08a6361-1b98-2c42-a6c4-ec1317aa9351 (at 0@lo)
Jan 11 09:30:23 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Imperative 
Recovery not enabled, recovery window 300-900
Jan 11 09:30:28 hpfs-fsl-mds1 kernel: Lustre: 
10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484148623/real 1484148626]  req@ffff881010219e00 
x1556242625462976/t0(0) 
o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 
dl 1484148628 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:30:48 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection 
restored to d08a6361-1b98-2c42-a6c4-ec1317aa9351 (at 0@lo)
Jan 11 09:31:01 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection 
restored to 192.52.98.32@tcp (at 192.52.98.32@tcp)
Jan 11 09:31:03 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection 
restored to 192.52.98.40@tcp (at 192.52.98.40@tcp)
Jan 11 09:31:03 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar message
Jan 11 09:31:08 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection 
restored to 192.52.98.43@tcp (at 192.52.98.43@tcp)
Jan 11 09:31:08 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar message
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Will be in 
recovery for at least 5:00, or until 1 client reconnects
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: MGS: Connection restored to 
47b7f6ce-5d63-8eb1-59b6-4d26560019e9 (at 192.52.98.55@tcp)
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar message
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Recovery over 
after 0:01, of 1 clients 1 recovered and 0 were evicted.
Jan 11 09:31:50 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection 
restored to 192.52.98.42@tcp (at 192.52.98.42@tcp)
Jan 11 09:31:50 hpfs-fsl-mds1 kernel: Lustre: Skipped 6 previous similar 
messages



And here is the OSS log while all this is happening.  As mentioned above, note 
that the ptlrpc_expire_one_request messages to the primary MGS persist beyond 
the point when the MDT/MGS is mounted on the secondary MDS.  




Jan 11 09:15:54 hpfs-fsl-oss00 kernel: LustreError: 11-0: 
hpfs-fsl-MDT0000-lwp-OST0000: operation obd_ping to node 192.52.98.30@tcp 
failed: rc = -107
Jan 11 09:15:54 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: 
Connection to hpfs-fsl-MDT0000 (at 192.52.98.30@tcp) was lost; in progress 
operations using this service will wait for recovery to complete
Jan 11 09:16:01 hpfs-fsl-oss00 kernel: Lustre: 
17097:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484147754/real 1484147754]  req@ffff88101f2fbc00 
x1556149818209744/t0(0) o400->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
224/224 e 0 to 1 dl 1484147761 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:16:01 hpfs-fsl-oss00 kernel: Lustre: 
17097:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar 
messages
Jan 11 09:16:01 hpfs-fsl-oss00 kernel: LustreError: 166-1: MGC192.52.98.30@tcp: 
Connection to MGS (at 192.52.98.30@tcp) was lost; in progress operations using 
this service will fail
Jan 11 09:17:27 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484147836/real 1484147836]  req@ffff88101f256000 
x1556149818209888/t0(0) 
o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 
dl 1484147847 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:17:27 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar 
messages
Jan 11 09:20:37 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484148011/real 1484148011]  req@ffff88101f486f00 
x1556149818210048/t0(0) 
o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 
dl 1484148037 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:20:37 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 9 previous similar 
messages
Jan 11 09:25:41 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484148286/real 1484148286]  req@ffff88101f487b00 
x1556149818210224/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484148341 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:25:41 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 12 previous similar 
messages
Jan 11 09:30:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection 
restored to hpfs-fsl-MDT0000-mdtlov_UUID (at 192.52.98.31@tcp)
Jan 11 09:30:23 hpfs-fsl-oss00 kernel: Lustre: Skipped 1 previous similar 
message
Jan 11 09:31:01 hpfs-fsl-oss00 kernel: LustreError: 167-0: 
hpfs-fsl-MDT0000-lwp-OST0000: This client was evicted by hpfs-fsl-MDT0000; in 
progress operations using this service will fail.
Jan 11 09:31:01 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: 
Connection restored to 192.52.98.31@tcp (at 192.52.98.31@tcp)
Jan 11 09:31:26 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: deleting 
orphan objects from 0x0:16904081 to 0x0:16904321
Jan 11 09:36:06 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484148911/real 1484148914]  req@ffff88101f35ad00 
x1556149818210640/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484148966 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:36:06 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 14 previous similar 
messages
Jan 11 09:47:21 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484149586/real 1484149586]  req@ffff88101f35e600 
x1556149818211216/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484149641 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:47:21 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous similar 
messages
Jan 11 09:58:36 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484150261/real 1484150261]  req@ffff88101f2f9b00 
x1556149818211792/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484150316 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:58:36 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous similar 
messages
Jan 11 10:09:51 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484150936/real 1484150936]  req@ffff88101f483f00 
x1556149818212368/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484150991 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:09:51 hpfs-fsl-oss00 kernel: Lustre: 
17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous similar 
messages


And this:

[root@hpfs-fsl-oss00 ~]# date
Wed Jan 11 10:12:43 CST 2017
[root@hpfs-fsl-oss00 ~]# lctl dl
  0 UP osd-zfs hpfs-fsl-OST0000-osd hpfs-fsl-OST0000-osd_UUID 5
  1 UP mgc MGC192.52.98.30@tcp 75fa2ba9-749d-e00f-84d3-e4e9b8753be3 5
  2 UP ost OSS OSS_uuid 3
  3 UP obdfilter hpfs-fsl-OST0000 hpfs-fsl-OST0000_UUID 7
  4 UP lwp hpfs-fsl-MDT0000-lwp-OST0000 hpfs-fsl-MDT0000-lwp-OST0000_UUID 5
[root@hpfs-fsl-oss00 ~]# ls /proc/fs/lustre/mgc/
MGC192.52.98.30@tcp
[root@hpfs-fsl-oss00 ~]#
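
One caveat about reading that output (my understanding, worth verifying): the 
mgc device name embeds the NID of the first configured MGS and is fixed when 
the target is set up, so `lctl dl` and /proc/fs/lustre/mgc will show 
MGC192.52.98.30@tcp even if the import had failed over to the secondary NID.  
Parsing the name, as in this small sketch using the line captured above, only 
tells you which NID the device was named after:

```shell
# Parse the mgc line from the `lctl dl` output captured above; the NID in the
# device name is the first-configured MGS, not necessarily the live connection.
line='  1 UP mgc MGC192.52.98.30@tcp 75fa2ba9-749d-e00f-84d3-e4e9b8753be3 5'
mgc_nid=$(echo "$line" | awk '$3 == "mgc" { sub(/^MGC/, "", $4); print $4 }')
echo "device named after: $mgc_nid"
```

If this Lustre version exposes it, `lctl get_param mgc.*.import` should report 
the NID the import is actually connected to, which would be the more telling 
check.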



I was wondering if it might work better to remount an OST with the file system 
failed over to the secondary MDS.  I tried that; all of the below is while the 
MDT is still mounted on mds1:



[root@hpfs-fsl-oss00 ~]# date
Wed Jan 11 10:14:23 CST 2017
[root@hpfs-fsl-oss00 ~]# cd /etc/init.d/
[root@hpfs-fsl-oss00 init.d]# ./lustre stop local
Unmounting /mnt/lustre/local/hpfs-fsl-OST0000
[root@hpfs-fsl-oss00 init.d]# ./lnet stop
[root@hpfs-fsl-oss00 init.d]# ./lnet start
LNET configured
[root@hpfs-fsl-oss00 init.d]# ./lustre start local
Mounting oss00-0/ost-fsl on /mnt/lustre/local/hpfs-fsl-OST0000
[root@hpfs-fsl-oss00 init.d]# lctl dl
  0 UP osd-zfs hpfs-fsl-OST0000-osd hpfs-fsl-OST0000-osd_UUID 5
  1 UP mgc MGC192.52.98.30@tcp 17af2f5d-ebd3-b57d-0c3d-9c7bc7654172 5
  2 UP ost OSS OSS_uuid 3
  3 UP obdfilter hpfs-fsl-OST0000 hpfs-fsl-OST0000_UUID 7
  4 UP lwp hpfs-fsl-MDT0000-lwp-OST0000 hpfs-fsl-MDT0000-lwp-OST0000_UUID 5
[root@hpfs-fsl-oss00 init.d]# ls /proc/fs/lustre/mgc/
MGC192.52.98.30@tcp
[root@hpfs-fsl-oss00 init.d]# 



Same result.  MDS1 and OSS00 logs are below.  



Jan 11 10:14:42 hpfs-fsl-mds1 kernel: Lustre: 
10323:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484151275/real 1484151275]  req@ffff882036ef7800 
x1556242625566576/t0(0) o13->[email protected]@tcp:7/4 
lens 224/368 e 0 to 1 dl 1484151282 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jan 11 10:14:42 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-OST0000-osc-MDT0000: 
Connection to hpfs-fsl-OST0000 (at 192.52.98.32@tcp) was lost; in progress 
operations using this service will wait for recovery to complete
Jan 11 10:14:48 hpfs-fsl-mds1 kernel: Lustre: 
10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484151282/real 1484151282]  req@ffff88101a9ff800 
x1556242625566928/t0(0) o8->[email protected]@tcp:28/4 
lens 520/544 e 0 to 1 dl 1484151288 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:15:43 hpfs-fsl-mds1 kernel: Lustre: 
10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484151332/real 1484151332]  req@ffff88101a9fce00 
x1556242625568768/t0(0) o8->[email protected]@tcp:28/4 
lens 520/544 e 0 to 1 dl 1484151343 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:17:12 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection 
restored to 192.52.98.32@tcp (at 192.52.98.32@tcp)
Jan 11 10:17:23 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-OST0000-osc-MDT0000: 
Connection restored to 192.52.98.32@tcp (at 192.52.98.32@tcp)





Jan 11 10:14:30 hpfs-fsl-oss00 kernel: Lustre: Failing over hpfs-fsl-OST0000
Jan 11 10:14:30 hpfs-fsl-oss00 kernel: Lustre: server umount hpfs-fsl-OST0000 
complete
Jan 11 10:15:18 hpfs-fsl-oss00 kernel: LNet: Removed LNI 10.148.0.32@o2ib
Jan 11 10:15:20 hpfs-fsl-oss00 kernel: LNet: Removed LNI 192.52.98.32@tcp
Jan 11 10:15:26 hpfs-fsl-oss00 kernel: LNet: HW nodes: 2, HW CPU cores: 16, 
npartitions: 2
Jan 11 10:15:26 hpfs-fsl-oss00 kernel: alg: No test for adler32 (adler32-zlib)
Jan 11 10:15:26 hpfs-fsl-oss00 kernel: alg: No test for crc32 (crc32-table)
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Added LNI 192.52.98.32@tcp 
[8/256/0/180]
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Using FMR for registration
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Added LNI 10.148.0.32@o2ib 
[8/256/0/180]
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Accept secure, port 988
Jan 11 10:15:41 hpfs-fsl-oss00 kernel: Lustre: Lustre: Build Version: 2.9.51
Jan 11 10:15:43 hpfs-fsl-oss00 kernel: LustreError: 137-5: 
hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.55@tcp (no 
target). If you are running an HA pair check that the target is mounted on the 
other server.
Jan 11 10:15:46 hpfs-fsl-oss00 kernel: Lustre: 
20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484151341/real 1484151341]  req@ffff882020280000 
x1556245476540432/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484151346 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:16:16 hpfs-fsl-oss00 kernel: Lustre: 
20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484151366/real 1484151366]  req@ffff880fff120000 
x1556245476540496/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484151376 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:16:22 hpfs-fsl-oss00 kernel: LustreError: 137-5: 
hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.31@tcp (no 
target). If you are running an HA pair check that the target is mounted on the 
other server.
Jan 11 10:16:29 hpfs-fsl-oss00 kernel: LustreError: 
20547:0:(mgc_request.c:249:do_config_log_add()) MGC192.52.98.30@tcp: failed 
processing log, type 4: rc = -110
Jan 11 10:16:33 hpfs-fsl-oss00 kernel: LustreError: 137-5: 
hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.55@tcp (no 
target). If you are running an HA pair check that the target is mounted on the 
other server.
Jan 11 10:16:35 hpfs-fsl-oss00 kernel: LustreError: 
20629:0:(sec_config.c:1107:sptlrpc_target_local_read_conf()) missing llog 
context
Jan 11 10:16:46 hpfs-fsl-oss00 kernel: Lustre: 
20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484151391/real 1484151391]  req@ffff880fff120300 
x1556245476540528/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484151406 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:16:52 hpfs-fsl-oss00 kernel: Lustre: 
20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484151407/real 1484151407]  req@ffff880fff120600 
x1556245476540576/t0(0) 
o38->[email protected]@tcp:12/10 lens 520/544 e 0 to 1 
dl 1484151412 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:17:04 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Imperative 
Recovery not enabled, recovery window 300-900
Jan 11 10:17:12 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Will be in 
recovery for at least 5:00, or until 2 clients reconnect
Jan 11 10:17:12 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection 
restored to  (at 192.52.98.31@tcp)
Jan 11 10:17:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection 
restored to  (at 192.52.98.55@tcp)
Jan 11 10:17:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Recovery over 
after 0:11, of 2 clients 2 recovered and 0 were evicted.
Jan 11 10:17:32 hpfs-fsl-oss00 kernel: Lustre: 
20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484151432/real 1484151432]  req@ffff880fff090000 
x1556245476540624/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484151452 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:18:02 hpfs-fsl-oss00 kernel: Lustre: 
20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484151457/real 1484151457]  req@ffff880fff090600 
x1556245476540656/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484151482 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:18:57 hpfs-fsl-oss00 kernel: Lustre: 
20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1484151507/real 1484151507]  req@ffff880fff090f00 
x1556245476540704/t0(0) o250->MGC192.52.98.30@[email protected]@tcp:26/25 lens 
520/544 e 0 to 1 dl 1484151537 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1


_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
