Hi Stephane,

Thanks.  We will perform the update during our next scheduled downtime in
February.  In the meantime, is there anything anyone can suggest to
prevent or reduce the lockups?
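
For example, would it make sense to cap the MDS service thread count
(the watchdog message suggests overload) or shrink the client lock LRUs
while we wait for the upgrade?  Something along these lines is what we
had in mind - parameter names are as we understand them from the 2.12
docs, and the values are just examples, so please correct us if this is
wrong:

  # on the MDS: cap the number of MDT service threads
  lctl set_param mds.MDS.mdt.threads_max=256

  # on the clients: shrink the lock LRU so fewer locks need callbacks
  lctl set_param ldlm.namespaces.*mdc*.lru_size=1024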

We're currently seeing about one per day, lasting anywhere from a few
hundred seconds to a few hours.
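
(Those are rough figures, taken from the "was inactive for Ns" /
"completed after Ns" watchdog messages in the MDS syslog, e.g. pulled
out with something along the lines of

  grep -E 'Service thread pid [0-9]+ (was inactive for|completed after)' \
      /var/log/messages | grep -oE '[0-9]+\.[0-9]+s'

where /var/log/messages is wherever your syslog writes to.)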

Are there many other 2.12.2 systems in production at the moment?

Thanks,
Alastair.

On Wed, 27 Nov 2019, Stephane Thiell wrote:

> Hi Alastair,
>
> The first thing to do is to upgrade your servers to 2.12.3, as many bugs have 
> been fixed.
>
> http://wiki.lustre.org/Lustre_2.12.3_Changelog
>
> Stephane
>
>> On Nov 20, 2019, at 7:29 AM, BASDEN, ALASTAIR G. <a.g.bas...@durham.ac.uk> 
>> wrote:
>>
>> Hi,
>>
>> We have a new 2.12.2 system, and are seeing fairly frequent lockups on the
>> primary mds.  We get messages such as:
>>
>> Nov 20 14:24:12 c6mds1 kernel: LustreError:
>> 38853:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
>> expired after 150s: evicting client at 172.18.122.165@o2ib  ns:
>> mdt-cos6-MDT0000_UUID lock: ffff92596372cec0/0x2efa065d0bb180f3 lrc: 3/0,0
>> mode: PW/PW res: [0x200007a26:0x14:0x0].0x0 bits 0x40/0x0 rrc: 50 type:
>> IBT flags: 0x60200400000020 nid: 172.18.122.165@o2ib remote:
>> 0x37bce663787828ed expref: 11 pid: 39074 timeout: 4429040 lvb_type: 0
>> Nov 20 14:25:03 c6mds1 kernel: LNet: Service thread pid 39057 was inactive
>> for 200.67s. The thread might be hung, or it might only be slow and will
>> resume later. Dumping the stack trace for debugging purposes:
>> Nov 20 14:25:03 c6mds1 kernel: Pid: 39057, comm: mdt00_045
>> 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Tue Apr 30 22:18:15 UTC 2019
>> Nov 20 14:25:03 c6mds1 kernel: Call Trace:
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc0eafd11>]
>> ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc0eb0aac>]
>> ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc144c52b>]
>> mdt_object_local_lock+0x50b/0xb20 [mdt]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc144cbb0>]
>> mdt_object_lock_internal+0x70/0x360 [mdt]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc144cec0>]
>> mdt_object_lock+0x20/0x30 [mdt]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc1489ccb>]
>> mdt_brw_enqueue+0x44b/0x760 [mdt]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc143a4bf>]
>> mdt_intent_brw+0x1f/0x30 [mdt]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc1452a18>]
>> mdt_intent_policy+0x2e8/0xd00 [mdt]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc0e96d26>]
>> ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc0ebf587>]
>> ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc0f47882>] tgt_enqueue+0x62/0x210
>> [ptlrpc]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc0f4c1da>]
>> tgt_request_handle+0xaea/0x1580 [ptlrpc]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc0ef180b>]
>> ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffc0ef513c>]
>> ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffa00c1c71>] kthread+0xd1/0xe0
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffa0775c1d>]
>> ret_from_fork_nospec_begin+0x7/0x21
>> Nov 20 14:25:03 c6mds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
>> Nov 20 14:25:03 c6mds1 kernel: LustreError: dumping log to
>> /tmp/lustre-log.1574259903.39057
>>
>> Nov 20 14:25:03 c6mds1 kernel: LNet: Service thread pid 39024 was inactive
>> for 201.36s. Watchdog stack traces are limited to 3 per 300 seconds,
>> skipping this one.
>> Nov 20 14:25:52 c6mds1 kernel: LustreError:
>> 38853:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
>> expired after 100s: evicting client at 172.18.122.167@o2ib  ns:
>> mdt-cos6-MDT0000_UUID lock: ffff922fb4238480/0x2efa065d0bb1817f lrc: 3/0,0
>> mode: PW/PW res: [0x200007a26:0x14:0x0].0x0 bits 0x40/0x0 rrc: 53 type:
>> IBT flags: 0x60200400000020 nid: 172.18.122.167@o2ib remote:
>> 0x1c35b518c55069d8 expref: 15 pid: 39076 timeout: 4429140 lvb_type: 0
>> Nov 20 14:25:52 c6mds1 kernel: LNet: Service thread pid 39054 completed
>> after 249.98s. This indicates the system was overloaded (too many service
>> threads, or there were not enough hardware resources).
>> Nov 20 14:25:52 c6mds1 kernel: LustreError:
>> 39074:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed
>> export ffff924c828ec000 ns: mdt-cos6-MDT0000_UUID lock:
>> ffff92596372c140/0x2efa065d0bb186db lrc: 3/0,0 mode: PR/PR res:
>> [0x200007a26:0x14:0x0].0x0 bits 0x20/0x0 rrc: 17 type: IBT flags:
>> 0x50200000000000 nid: 172.18.122.165@o2ib remote: 0x37bce663787829b8
>> expref: 2 pid: 39074 timeout: 0 lvb_type: 0
>> Nov 20 14:25:52 c6mds1 kernel: LustreError:
>> 39074:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) Skipped 1 previous
>> similar message
>> Nov 20 14:25:52 c6mds1 kernel: LNet: Skipped 7 previous similar messages
>>
>>
>>
>>
>> Any suggestions?  It's an ldiskfs backend for the MDS (the OSSs are ZFS).
>>
>> Thanks,
>> Alastair.
>
>
>
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
