Hi Colin,

Not yet. We last scrubbed the pool ~2 weeks ago, when we first saw this problem. I have a few additional tests to run now to see if we can trace the cause to a particular job/process, but kicking off a scrub is the next thing on my list (it should only take ~40 minutes; it's a fairly small SSD-based MDT).

Thanks,
Chris

________________________________________
From: Colin Faber <cfa...@gmail.com>
Sent: 15 March 2023 18:41
To: Mountford, Christopher J. (Dr.)
Cc: lustre-discuss
Subject: Re: [lustre-discuss] Repeated ZFS panics on MDT

Have you tried resilvering the pool?

On Wed, Mar 15, 2023, 11:57 AM Mountford, Christopher J. (Dr.) via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

I'm hoping someone can offer some suggestions.

We have a problem on our production Lustre/ZFS filesystem (CentOS 7, ZFS 0.7.13, Lustre 2.12.9). So far I've drawn a blank trying to track down the cause.

We see the following ZFS panic message in the logs (in every case the VERIFY3/PANIC lines are identical):

Mar 15 17:15:39 amds01a kernel: VERIFY3(sa.sa_magic == 0x2F505A) failed (8 == 3100762)
Mar 15 17:15:39 amds01a kernel: PANIC at zfs_vfsops.c:584:zfs_space_delta_cb()
Mar 15 17:15:39 amds01a kernel: Showing stack for process 15381
Mar 15 17:15:39 amds01a kernel: CPU: 31 PID: 15381 Comm: mdt00_020 Tainted: P OE ------------ 3.10.0-1160.49.1.el7_lustre.x86_64 #1
Mar 15 17:15:39 amds01a kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 02/09/2023
Mar 15 17:15:39 amds01a kernel: Call Trace:
Mar 15 17:15:39 amds01a kernel: [<ffffffff99d83539>] dump_stack+0x19/0x1b
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b76f24>] spl_dumpstack+0x44/0x50 [spl]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b76ff9>] spl_panic+0xc9/0x110 [spl]
Mar 15 17:15:39 amds01a kernel: [<ffffffff996e482c>] ? update_curr+0x14c/0x1e0
Mar 15 17:15:39 amds01a kernel: [<ffffffff99707cf4>] ? getrawmonotonic64+0x34/0xc0
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c87aa3>] ? dmu_zfetch+0x393/0x520 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c6a073>] ? dbuf_rele_and_unlock+0x283/0x4c0 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b78ff1>] ? __cv_init+0x41/0x60 [spl]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0d0f53c>] zfs_space_delta_cb+0x9c/0x200 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c7a944>] dmu_objset_userquota_get_ids+0x154/0x440 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c89e98>] dnode_setdirty+0x38/0xf0 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c8a21c>] dnode_allocate+0x18c/0x230 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c76d2b>] dmu_object_alloc_dnsize+0x34b/0x3e0 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d73052>] __osd_object_create+0x82/0x170 [osd_zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d7ce23>] ? osd_declare_xattr_set+0xb3/0x190 [osd_zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d733bd>] osd_mkreg+0x7d/0x210 [osd_zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffff99828f01>] ? __kmalloc_node+0x1d1/0x2b0
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d6f8f6>] osd_create+0x336/0xb10 [osd_zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc2016fb5>] lod_sub_create+0x1f5/0x480 [lod]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc2007729>] lod_create+0x69/0x340 [lod]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d65690>] ? osd_trans_create+0x410/0x410 [osd_zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc2081993>] mdd_create_object_internal+0xc3/0x300 [mdd]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc206aa4b>] mdd_create_object+0x7b/0x820 [mdd]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc2074fd8>] mdd_create+0xdd8/0x14a0 [mdd]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1f0e118>] mdt_reint_open+0x2588/0x3970 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc16f82b9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1eede52>] ? ucred_set_audit_enabled.isra.15+0x22/0x60 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1f00f23>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1edc413>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee8ec6>] ? mdt_intent_fixup_resent+0x36/0x220 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee9132>] mdt_intent_open+0x82/0x3a0 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1edf74a>] mdt_intent_opc+0x1ba/0xb50 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a0d6c0>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee90b0>] ? mdt_intent_fixup_resent+0x220/0x220 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee79e4>] mdt_intent_policy+0x1a4/0x360 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc19bc4e6>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc10a22b7>] ? cfs_hash_bd_add_locked+0x67/0x90 [libcfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc10a5a4e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc19e3aa6>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a0d740>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a6d092>] tgt_enqueue+0x62/0x210 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a73eea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a4d601>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1096bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a18bcb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a156e5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffff99d7dcf3>] ? queued_spin_lock_slowpath+0xb/0xf
Mar 15 17:15:39 amds01a kernel: [<ffffffff99d8baa0>] ? _raw_spin_lock+0x20/0x30
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a1c534>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a1ba00>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5e61>] kthread+0xd1/0xe0
Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5d90>] ? insert_kthread_work+0x40/0x40
Mar 15 17:15:39 amds01a kernel: [<ffffffff99d95ddd>] ret_from_fork_nospec_begin+0x7/0x21
Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5d90>] ? insert_kthread_work+0x40/0x40

At this point all ZFS I/O freezes completely and the MDS has to be fenced. This has happened ~4 times in the last hour.

I'm at a loss as to how to correct this. I'm currently thinking that we may have to rebuild and recover our entire filesystem from backups (thankfully this is our home file system, which is small and entirely SSD-based, so it should not take too long to recover).

It may be related to this bug seen on FreeBSD (with a much more recent ZFS version):
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216586

The problem was first seen 3 weeks ago, but went away after a couple of reboots. This time it seems to be more serious.

Kind Regards,
Christopher.

------------------------------------
Dr. Christopher Mountford,
System Specialist,
RCS,
Digital Services,
University Of Leicester.

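A note for anyone decoding the assertion line in the log above: the two numbers in parentheses are printed in decimal, and 3100762 is 0x2F505A, so the check saw the value 8 where it compared against that magic number. Below is a minimal Python sketch for pulling those values out of such a line, assuming the message layout shown in this log. The helper name `decode_verify3` and the regular expression are illustrative only and are not part of any ZFS or Lustre tooling.

```python
import re

# Assumed layout, taken from the log line quoted in this thread:
#   VERIFY3(<expression>) failed (<left value> <op> <right value>)
VERIFY3_RE = re.compile(
    r"VERIFY3\((?P<expr>.+?)\) failed "
    r"\((?P<left>-?\d+) (?P<op>\S+) (?P<right>-?\d+)\)"
)


def decode_verify3(line):
    """Return the asserted expression and both operand values, or None.

    Hypothetical helper for illustration; it only parses text and does
    not talk to ZFS in any way.
    """
    match = VERIFY3_RE.search(line)
    if match is None:
        return None
    left = int(match.group("left"))
    right = int(match.group("right"))
    return {
        "expr": match.group("expr"),
        "left": left,
        "op": match.group("op"),
        "right": right,
        # Hex form makes it easy to compare with the constant in the expression.
        "left_hex": hex(left),
        "right_hex": hex(right),
    }


line = (
    "Mar 15 17:15:39 amds01a kernel: "
    "VERIFY3(sa.sa_magic == 0x2F505A) failed (8 == 3100762)"
)
result = decode_verify3(line)
print(result)
```

Run over a saved copy of the system log, the same helper can confirm whether every panic reports the same pair of values, as described above.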
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org