Hi all,
We hit a problem on one of our MDSes (ldiskfs) on Lustre 2.12.6 which we
think is a bug, but we haven't been able to identify it. Can anyone shed
any light? We unmounted and remounted the MDT at around 23:00.
Client logs:
May 16 22:15:41 m8011 kernel: LustreError: 11-0:
lustrefs8-MDT-mdc-956fb73c3800: operation ldlm_enqueue to node
172.18.185.1@o2ib failed: rc = -107
May 16 22:15:41 m8011 kernel: Lustre: lustrefs8-MDT-mdc-956fb73c3800:
Connection to lustrefs8-MDT (at 172.18.185.1@o2ib) was lost; in progress
operations using this service will wait for recovery to complete
May 16 22:15:41 m8011 kernel: LustreError: Skipped 5 previous similar messages
May 16 22:15:48 m8011 kernel: Lustre:
101710:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed
out for slow reply: [sent 1652735641/real 1652735641] req@949d8cb1de80
x1724290358528896/t0(0)
o101->lustrefs8-MDT-mdc-956fb73c3800@172.18.185.1@o2ib:12/10 lens
480/568 e 4 to 1 dl 1652735748 ref 2 fl Rpc:X/0/ rc 0/-1
May 16 22:15:48 m8011 kernel: Lustre:
101710:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 6 previous similar
messages
May 16 23:00:15 m8011 kernel: Lustre:
4784:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out
for slow reply: [sent 1652738408/real 1652738408] req@94ea07314380
x1724290358763776/t0(0) o400->MGC172.18.185.1@o2ib@172.18.185.1@o2ib:26/25 lens
224/224 e 0 to 1 dl 1652738415 ref 1 fl Rpc:XN/0/ rc 0/-1
May 16 23:00:15 m8011 kernel: LustreError: 166-1: MGC172.18.185.1@o2ib:
Connection to MGS (at 172.18.185.1@o2ib) was lost; in progress operations using
this service will fail
May 16 23:00:15 m8011 kernel: Lustre: Evicted from MGS (at
MGC172.18.185.1@o2ib_0) after server handle changed from 0xdb7c7c778c8908d6 to
0xdb7c7cbad3be9e79
May 16 23:00:15 m8011 kernel: Lustre: MGC172.18.185.1@o2ib: Connection restored
to MGC172.18.185.1@o2ib_0 (at 172.18.185.1@o2ib)
May 16 23:01:49 m8011 kernel: LustreError: 167-0:
lustrefs8-MDT-mdc-956fb73c3800: This client was evicted by
lustrefs8-MDT; in progress operations using this service will fail.
May 16 23:01:49 m8011 kernel: LustreError:
101719:0:(vvp_io.c:1562:vvp_io_init()) lustrefs8: refresh file layout
[0x28107:0x9b08:0x0] error -108.
May 16 23:01:49 m8011 kernel: LustreError:
101719:0:(vvp_io.c:1562:vvp_io_init()) Skipped 3 previous similar messages
May 16 23:01:49 m8011 kernel: Lustre: lustrefs8-MDT-mdc-956fb73c3800:
Connection restored to 172.18.185.1@o2ib (at 172.18.185.1@o2ib)
MDS server logs:
May 16 22:15:40 c8mds1 kernel: LustreError:
10686:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired
after 99s: evicting client at 172.18.181.11@o2ib ns:
mdt-lustrefs8-MDT_UUID lock: 97b3730d98c0/0xdb7c7cbad3be1c7b lrc: 3/0,0
mode: PW/PW res: [0x29119:0x327f:0x0].0x0 bits 0x40/0x0 rrc: 201 type: IBT
flags: 0x6020040020 nid: 172.18.181.11@o2ib remote: 0xe62e31610edfb808
expref: 90 pid: 10707 timeout: 8482830 lvb_type: 0
May 16 22:15:40 c8mds1 kernel: LustreError:
10712:0:(ldlm_lockd.c:1351:ldlm_handle_enqueue0()) ### lock on destroyed export
9769eaf46c00 ns: mdt-lustrefs8-MDT_UUID lock:
97d828635e80/0xdb7c7cbad3be1c90 lrc: 3/0,0 mode: PW/PW res:
[0x29119:0x327f:0x0].0x0 bits 0x40/0x0 rrc: 199 type: IBT flags:
0x5020040020 nid: 172.18.181.11@o2ib remote: 0xe62e31610edfb80f expref: 77
pid: 10712 timeout: 0 lvb_type: 0
May 16 22:15:40 c8mds1 kernel: LustreError:
10712:0:(ldlm_lockd.c:1351:ldlm_handle_enqueue0()) Skipped 27 previous similar
messages
May 16 22:17:22 c8mds1 kernel: LNet: Service thread pid 10783 was inactive for
200.73s. The thread might be hung, or it might only be slow and will resume
later. Dumping the stack trace for debugging purposes:
May 16 22:17:22 c8mds1 kernel: LNet: Skipped 3 previous similar messages
May 16 22:17:22 c8mds1 kernel: Pid: 10783, comm: mdt01_040
3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020
May 16 22:17:22 c8mds1 kernel: Call Trace:
May 16 22:17:22 c8mds1 kernel: []
ldlm_completion_ast+0x430/0x860 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: []
ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: []
mdt_object_local_lock+0x50b/0xb20 [mdt]
May 16 22:17:22 c8mds1 kernel: []
mdt_object_lock_internal+0x70/0x360 [mdt]
May 16 22:17:22 c8mds1 kernel: [] mdt_object_lock+0x20/0x30
[mdt]
May 16 22:17:22 c8mds1 kernel: [] mdt_brw_enqueue+0x44b/0x760
[mdt]
May 16 22:17:22 c8mds1 kernel: [] mdt_intent_brw+0x1f/0x30
[mdt]
May 16 22:17:22 c8mds1 kernel: []
mdt_intent_policy+0x435/0xd80 [mdt]
May 16 22:17:22 c8mds1 kernel: []
ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: []
ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [] tgt_enqueue+0x62/0x210
[ptlrpc]
May 16 22:17:22 c8mds1 kernel: []
tgt_request_handle+0xada/0x1570 [ptlrpc]
May 16