Hi,

We've had a couple of MDT hangs on two of our Lustre filesystems after updating 
to 2.12.6 (though I'm sure I've seen this exact behaviour on previous versions).

The symptoms are a gradually increasing load on the affected MDS and processes 
doing I/O on the filesystem blocking indefinitely, with messages on the 
client similar to:

Mar  9 15:37:22 spectre09 kernel: Lustre: 25309:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1615303641/real 1615303641]  req@ffff972dbe51bf00 x1692620480891456/t0(0) o44->ahome3-MDT0001-mdc-ffff9718e3be0000@10.143.254.212@o2ib:12/10 lens 448/440 e 2 to 1 dl 1615304242 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Mar  9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-ffff9718e3be0000: Connection to ahome3-MDT0001 (at 10.143.254.212@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar  9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-ffff9718e3be0000: Connection restored to 10.143.254.212@o2ib (at 10.143.254.212@o2ib)

We also saw warnings of hung mdt_io tasks on the MDS, and Lustre debug logs were dumped to /tmp.

Rebooting the affected MDS cleared the problem and everything recovered.



Looking at the MDS system logs, the first sign of trouble appears to be:

Mar  9 15:24:11 amds01b kernel: VERIFY3(dr->dr_dbuf->db_level == level) failed (0 == 18446744073709551615)
Mar  9 15:24:11 amds01b kernel: PANIC at dbuf.c:3391:dbuf_sync_list()
Mar  9 15:24:11 amds01b kernel: Showing stack for process 18137
Mar  9 15:24:11 amds01b kernel: CPU: 3 PID: 18137 Comm: dp_sync_taskq Tainted: P           OE  ------------   3.10.0-1160.2.1.el7_lustre.x86_64 #1
Mar  9 15:24:11 amds01b kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 07/16/2020
Mar  9 15:24:11 amds01b kernel: Call Trace:
Mar  9 15:24:11 amds01b kernel: [<ffffffff9af813c0>] dump_stack+0x19/0x1b
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0979f24>] spl_dumpstack+0x44/0x50 [spl]
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0979ff9>] spl_panic+0xc9/0x110 [spl]
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a96b075>] ? tracing_is_on+0x15/0x30
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a96ed4d>] ? tracing_record_cmdline+0x1d/0x120
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0974fc5>] ? spl_kmem_free+0x35/0x40 [spl]
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8e43cc>] ? update_curr+0x14c/0x1e0
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8e111e>] ? account_entity_dequeue+0xae/0xd0
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0a7014b>] dbuf_sync_list+0x7b/0xd0 [zfs]
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0a8f4f0>] dnode_sync+0x370/0x890 [zfs]
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0a7b1d1>] sync_dnodes_task+0x61/0x150 [zfs]
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0977d7c>] taskq_thread+0x2ac/0x4f0 [spl]
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8daaf0>] ? wake_up_state+0x20/0x20
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0977ad0>] ? taskq_thread_spawn+0x60/0x60 [spl]
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8c5c21>] kthread+0xd1/0xe0
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8c5b50>] ? insert_kthread_work+0x40/0x40
Mar  9 15:24:11 amds01b kernel: [<ffffffff9af93ddd>] ret_from_fork_nospec_begin+0x7/0x21
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8c5b50>] ? insert_kthread_work+0x40/0x40




My read of this is that ZFS hit a failed assertion whilst syncing cached data 
out to disk and panicked. As I understand it, an SPL PANIC parks only the 
offending thread rather than panicking the kernel, which would explain why the 
system remained up and otherwise responsive, and also why the sync never 
completed and load gradually climbed as everything queued up behind it. Does 
this seem correct?
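
If that's right, the offending dp_sync_taskq thread should still be sitting in 
spl_panic() after the event. Next time I might check for that before rebooting, 
with something like this (a quick sketch only; it assumes /proc/<pid>/stack is 
available and readable as root on this kernel):

#!/usr/bin/env python3
# Scan kernel thread stacks for a parked spl_panic() frame.
# Assumption: an SPL PANIC leaves the offending thread looping
# inside spl_panic(), which then shows up in /proc/<pid>/stack.
import glob

for stack_file in glob.glob("/proc/[0-9]*/stack"):
    try:
        with open(stack_file) as f:
            stack = f.read()
    except OSError:
        continue  # threads come and go; skip unreadable entries
    if "spl_panic" in stack:
        pid = stack_file.split("/")[2]
        print("PID %s appears parked in spl_panic:" % pid)
        print(stack)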

The pacemaker ZFS resource agent did not pick up the failure; it relies on 
'zpool list -H -o health'. Can anyone think of a way to detect this sort of 
problem so we can trigger an automated reset of the affected server? 
Unfortunately I'd rebooted the server before I spotted the log entry. Next time 
I'll run some zfs commands to see what they return before rebooting.
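
What I have in mind is something along these lines (a rough sketch, not a 
tested resource agent; the journalctl follow and the forced reboot are 
assumptions to adapt, e.g. to a stonith/fence call instead):

#!/usr/bin/env python3
# Follow the kernel log and escalate on an SPL/ZFS PANIC line.
# Assumptions: systemd journal is in use, and a forced immediate
# reboot is an acceptable escalation so pacemaker fails over.
import subprocess

PATTERN = "PANIC at"  # e.g. "PANIC at dbuf.c:3391:dbuf_sync_list()"

# Follow kernel messages as they arrive, starting from now.
proc = subprocess.Popen(
    ["journalctl", "-k", "-f", "-n", "0"],
    stdout=subprocess.PIPE, text=True,
)
for line in proc.stdout:
    if PATTERN in line:
        # Double --force skips service shutdown and reboots
        # immediately, handing recovery to the HA stack.
        subprocess.run(["systemctl", "reboot", "--force", "--force"])
        break

Whether this is better done as a custom monitor inside the ZFS resource agent 
rather than a separate watcher, I'm not sure.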

Any advice on what additional steps to take? I guess this is probably more of a 
ZFS issue than a Lustre one.

The MDSes are HPE DL360s connected to D3700 JBODs, with the MDTs on ZFS. 
Software: CentOS 7.9, ZFS 0.7.13, Lustre 2.12.6, kernel 
3.10.0-1160.2.1.el7_lustre.x86_64.

Kind Regards,
Christopher.