[lustre-discuss] MDS crashes with ldlm_flock_deadlock error
Hi,

Starting from yesterday, we see frequent MDS crashes, all of them showing ldlm_flock_deadlock. Servers are running Lustre 2.15.4; MDT and MGT are on ldiskfs and OSTs are on ZFS, on AlmaLinux 8.9. Clients are mostly CentOS 7.9 with Lustre client 2.15.4. For one of these crashes we have a complete coredump, in case someone wants to check.

Thanks,
Lixin.

[15817.464501] LustreError: 22687:0:(ldlm_flock.c:230:ldlm_flock_deadlock()) ASSERTION( req != lock ) failed:
[15817.474247] LustreError: 22687:0:(ldlm_flock.c:230:ldlm_flock_deadlock()) LBUG
[15817.481497] Pid: 22687, comm: mdt01_003 4.18.0-513.9.1.el8_lustre.x86_64 #1 SMP Sat Dec 23 05:23:32 UTC 2023
[15817.491318] Call Trace TBD:
[15817.494137] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[15817.499297] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[15817.504097] [<0>] ldlm_flock_deadlock.isra.10+0x1fb/0x240 [ptlrpc]
[15817.510398] [<0>] ldlm_process_flock_lock+0x289/0x1f90 [ptlrpc]
[15817.516402] [<0>] ldlm_lock_enqueue+0x2a5/0xaa0 [ptlrpc]
[15817.521813] [<0>] ldlm_handle_enqueue0+0x634/0x1520 [ptlrpc]
[15817.527562] [<0>] tgt_enqueue+0xa4/0x220 [ptlrpc]
[15817.532368] [<0>] tgt_request_handle+0xccd/0x1a20 [ptlrpc]
[15817.537949] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[15817.544311] [<0>] ptlrpc_main+0xbec/0x1530 [ptlrpc]
[15817.549294] [<0>] kthread+0x134/0x150
[15817.552966] [<0>] ret_from_fork+0x1f/0x40
[15817.556980] Kernel panic - not syncing: LBUG
[15817.561248] CPU: 23 PID: 22687 Comm: mdt01_003 Kdump: loaded Tainted: G OE- - - 4.18.0-513.9.1.el8_lustre.x86_64 #1
[15817.573669] Hardware name: Dell Inc. PowerEdge R640/0CRT1G, BIOS 2.19.1 06/04/2023
[15817.581235] Call Trace:
[15817.583687] dump_stack+0x41/0x60
[15817.587007] panic+0xe7/0x2ac
[15817.589979] ? ret_from_fork+0x1f/0x40
[15817.593733] lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[15817.598714] ldlm_flock_deadlock.isra.10+0x1fb/0x240 [ptlrpc]
[15817.604557] ldlm_process_flock_lock+0x289/0x1f90 [ptlrpc]
[15817.610121] ? lustre_msg_get_flags+0x2a/0x90 [ptlrpc]
[15817.615346] ? lustre_msg_add_version+0x21/0xa0 [ptlrpc]
[15817.620745] ldlm_lock_enqueue+0x2a5/0xaa0 [ptlrpc]
[15817.625702] ldlm_handle_enqueue0+0x634/0x1520 [ptlrpc]
[15817.631007] tgt_enqueue+0xa4/0x220 [ptlrpc]
[15817.635365] tgt_request_handle+0xccd/0x1a20 [ptlrpc]
[15817.640503] ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 [ptlrpc]
[15817.646337] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[15817.652256] ptlrpc_main+0xbec/0x1530 [ptlrpc]
[15817.656791] ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
[15817.661928] kthread+0x134/0x150
[15817.665161] ? set_kthread_struct+0x50/0x50
[15817.669346] ret_from_fork+0x1f/0x40

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
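For triaging crashes like this before anyone opens the full coredump, it helps to pull the assertion and LBUG lines out of vmcore-dmesg.txt; the `(file.c:line:function())` tag in the assertion is what to search for on jira.whamcloud.com. A minimal sketch (the sample lines are taken from the trace above):

```shell
# Extract the LustreError assertion and LBUG markers from a saved
# vmcore-dmesg.txt; here a heredoc stands in for the real file.
dmesg=$(cat <<'EOF'
[15817.464501] LustreError: 22687:0:(ldlm_flock.c:230:ldlm_flock_deadlock()) ASSERTION( req != lock ) failed:
[15817.474247] LustreError: 22687:0:(ldlm_flock.c:230:ldlm_flock_deadlock()) LBUG
[15817.494137] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
EOF
)
# Keep only the assertion/LBUG lines; the file:line in them identifies the bug.
echo "$dmesg" | grep -E 'ASSERTION|LBUG'
```

On a real system the input would be `/var/crash/<timestamp>/vmcore-dmesg.txt` as written by kdump.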
Re: [lustre-discuss] MDS crashes, lustre version 2.15.3
Hi Aurelien,

Thanks, I guess we will have to rebuild our own 2.15.x server. I see other crashes have a different dump, usually like these:

[36664.403408] BUG: unable to handle kernel NULL pointer dereference at
[36664.411237] PGD 0 P4D 0
[36664.413776] Oops: [#1] SMP PTI
[36664.417268] CPU: 28 PID: 11101 Comm: qmt_reba_cedar_ Kdump: loaded Tainted: G IOE- - - 4.18.0-477.10.1.el8_lustre.x86_64 #1
[36664.430293] Hardware name: Dell Inc. PowerEdge R640/0CRT1G, BIOS 2.19.1 06/04/2023
[36664.437860] RIP: 0010:qmt_id_lock_cb+0x69/0x100 [lquota]
[36664.443199] Code: 48 8b 53 20 8b 4a 0c 85 c9 74 78 89 c1 48 8b 42 18 83 78 10 02 75 0a 83 e1 01 b8 01 00 00 00 74 17 48 63 44 24 04 48 c1 e0 04 <48> 03 45 00 f6 40 08 0c 0f 95 c0 0f b6 c0 48 8b 4c 24 08 65 48 33
[36664.461942] RSP: 0018:aa2e303f3df0 EFLAGS: 00010246
[36664.467169] RAX: RBX: 98722c74b700 RCX:
[36664.474301] RDX: 9880415ce660 RSI: 0010 RDI: 9881240b5c64
[36664.481435] RBP: R08: R09: 0004
[36664.488566] R10: 0010 R11: f000 R12: 98722c74b700
[36664.495697] R13: 9875fc07a320 R14: 9878444d3d10 R15: 9878444d3cc0
[36664.502832] FS: () GS:987f20f8() knlGS:
[36664.510917] CS: 0010 DS: ES: CR0: 80050033
[36664.516664] CR2: CR3: 002065a10004 CR4: 007706e0
[36664.523794] DR0: DR1: DR2:
[36664.530927] DR3: DR6: fffe0ff0 DR7: 0400
[36664.538058] PKRU: 5554
[36664.540772] Call Trace:
[36664.543231] ? cfs_cdebug_show.part.3.constprop.23+0x20/0x20 [lquota]
[36664.549699] qmt_glimpse_lock.isra.20+0x1e7/0xfa0 [lquota]
[36664.555204] qmt_reba_thread+0x5cd/0x9b0 [lquota]
[36664.559927] ? qmt_glimpse_lock.isra.20+0xfa0/0xfa0 [lquota]
[36664.565602] kthread+0x134/0x150
[36664.568834] ? set_kthread_struct+0x50/0x50
[36664.573021] ret_from_fork+0x1f/0x40
[36664.576603] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) mbcache jbd2 lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc dell_rbu vfat fat dm_round_robin dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi opa_vnic scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dell_smbios iTCO_wdt iTCO_vendor_support wmi_bmof dell_wmi_descriptor dcdbas kvm ipmi_ssif irqbypass crct10dif_pclmul hfi1 mgag200 crc32_pclmul drm_shmem_helper ghash_clmulni_intel rdmavt qla2xxx drm_kms_helper rapl ib_uverbs nvme_fc intel_cstate syscopyarea nvme_fabrics sysfillrect sysimgblt nvme_core intel_uncore fb_sys_fops pcspkr acpi_ipmi ib_core scsi_transport_fc igb
[36664.576699] drm ipmi_si i2c_algo_bit mei_me dca ipmi_devintf mei i2c_i801 lpc_ich wmi ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod t10_pi sg ahci libahci crc32c_intel libata megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[36664.684758] CR2:

Is this also related to the same bug?

Thanks,
Lixin.

From: Aurelien Degremont
Date: Wednesday, November 29, 2023 at 8:31 AM
To: lustre-discuss, Lixin Liu
Subject: RE: MDS crashes, lustre version 2.15.3

You are likely hitting that bug https://jira.whamcloud.com/browse/LU-15207 which is fixed in the (not yet released) 2.16.0.

Aurélien

From: lustre-discuss on behalf of Lixin Liu via lustre-discuss
Sent: Wednesday, November 29, 2023 at 17:18
To: lustre-discuss
Subject: [lustre-discuss] MDS crashes, lustre version 2.15.3

Hi,

We built our 2.15.3 environment a few months ago. MDT is using ldiskfs and OSTs are using ZFS. The system seemed to perform well at the beginning, but recently we see frequent MDS crashes. The vmcore-dmesg.txt shows the following:

[26056.031259] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) ASSERTION( !cfs_hash_is_rehashing(hs) ) failed:
[26056.043494] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) LBUG
[26056.051460] Pid: 69513, comm: lquota_wb_cedar 4.18.0-477.10.1.el8_lustre.x86_64 #1 SMP Tue Jun 20 00:12:13 UTC 2023
[26056.063099] Call Trace TBD:
[26056.066221] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[26056.071970] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[26056.077322] [<0>] cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.083839] [<0>] qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.090265] [<0>] qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.096008] [<0>] kthread+0x13
[lustre-discuss] Random OST drop-off from clients
Hi,

Recently, we frequently see OSTs being randomly dropped by some client nodes. We have 4 Lustre filesystems, 126 OSTs in total. All clients are running the 2.15.3 client on CentOS 7. Servers are CentOS 7 with Lustre 2.12.8 (3 filesystems) and 2.15.3 on AlmaLinux 8.8. Failures can happen with both server versions. LNet is using an OPA interface. One example of the failure:

# lctl dl | grep ' IN '
126 IN osc cedar_sc-OST000a-osc-980c76944800 52e66575-6443-4be9-a7ce-348b526a0836 4

In syslog, we see:

Oct 4 23:24:30 cedar5 kernel: LustreError: 11-0: cedar_sc-OST000a-osc-980c76944800: operation ldlm_enqueue to node 172.19.128.33@o2ib failed: rc = -107
Oct 4 23:24:30 cedar5 kernel: Lustre: cedar_sc-OST000a-osc-980c76944800: Connection to cedar_sc-OST000a (at 172.19.128.33@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 4 23:24:30 cedar5 kernel: LustreError: 5195:0:(osc_request.c:1037:osc_init_grant()) cedar_sc-OST000a-osc-980c76944800: granted 3407872 but already consumed 519700480
Oct 4 23:24:30 cedar5 kernel: LustreError: 167-0: cedar_sc-OST000a-osc-980c76944800: This client was evicted by cedar_sc-OST000a; in progress operations using this service will fail.
Oct 4 23:24:31 cedar5 kernel: LustreError: 62880:0:(ldlm_resource.c:1126:ldlm_resource_complain()) cedar_sc-OST000a-osc-980c76944800: namespace resource [0x73fbbe2:0x0:0x0].0x0 (97fe127e3080) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 4 23:24:31 cedar5 kernel: LustreError: 5218:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072
Oct 4 23:24:36 cedar5 kernel: LustreError: 5209:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072
Oct 4 23:24:47 cedar5 kernel: LustreError: 5220:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131072 > system dirty_max 131072
Oct 4 23:25:36 cedar5 kernel: LustreError: 5242:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072

This one in particular is a 2.15.3 server. Once this happens, the only fix appears to be rebooting the client, after which the issue goes away. Any ideas where we should check?

Thank you very much.

Lixin.
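A quick way to spot affected clients without eyeballing `lctl dl` by hand is to filter for devices in the "IN" (inactive) state. A minimal sketch; the sample line below is the one from the message above, and the `lctl --device ... activate` step is a suggestion to try before rebooting, not a guaranteed fix for the grant accounting that triggered the eviction:

```shell
# Find inactive ("IN") OSC devices.  On a live client you would pipe
# the real `lctl dl` output instead of this captured sample.
dl_output='126 IN osc cedar_sc-OST000a-osc-980c76944800 52e66575-6443-4be9-a7ce-348b526a0836 4'
echo "$dl_output" | awk '$2 == "IN" {printf "device %s (%s) is inactive\n", $1, $4}'
# Before rebooting, reactivating the device may be worth a try:
#   lctl --device 126 activate
```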
[lustre-discuss] Cannot move files to another directory
I tried to send this yesterday, but it did not seem to get through. Trying again now.

Hi,

We moved our MDT and MGT to a new storage device (DDN SFA200NV) this week. Everything appears to work, but there is a very strange problem: we cannot "mv" a file to another directory. This affects both old and new data. Here is an example:

$ mkdir testdir
$ dd if=/dev/zero of=testfile bs=1024k count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.292493 s, 35.8 MB/s
$ cp testfile testdir/
$ mv testfile testdir/testfile.new
mv: cannot move 'testfile' to 'testdir/testfile.new': No such file or directory
$ chmod 666 testfile
$ ls -l
total 10244
drwxr-x--- 2 liu liu     4096 Dec 10 08:12 testdir
-rw-rw-rw- 1 liu liu 10485760 Dec 10 08:11 testfile
$ rm testfile
$

"mv" of a file within the same directory works. I have an open case with DDN about this issue, but would like to know if anyone here has suggestions.

Thanks,

Lixin Liu
Simon Fraser University
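One way to narrow this down is to see which syscall actually returns ENOENT: run the same sequence on a scratch directory under strace. A minimal reproduction sketch (assumes strace is installed; the paths are examples, not from the report):

```shell
# Reproduce the failing sequence in a scratch directory; on a healthy
# filesystem the rename must succeed.  On the affected Lustre mount,
# prefixing the mv with strace shows which syscall fails, e.g.:
#   strace -f -e trace=rename,renameat,renameat2 mv testfile testdir/testfile.new
set -e
work=$(mktemp -d)
cd "$work"
mkdir testdir
dd if=/dev/zero of=testfile bs=1024 count=10 2>/dev/null
mv testfile testdir/testfile.new
ls testdir
cd /
rm -rf "$work"
```

If strace shows `renameat2(...) = -1 ENOENT` while both source and target directory clearly exist, that points at the MDT rather than the client, which is useful data for the DDN case.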
Re: [lustre-discuss] Robinhood changelog errors
After setting changelog_mask and restarting robinhood, the problem cleared.

Lixin.

On 2020-05-29, 11:06 PM, "lustre-discuss on behalf of Lixin Liu" wrote:

I am getting an error every second in the robinhood log:

2020/05/29 22:10:37 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory
2020/05/29 22:10:38 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory
2020/05/29 22:10:39 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory
2020/05/29 22:10:40 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory
2020/05/29 22:10:41 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory

These started in early April. Is there something I can do to determine the cause?

Thanks,

Lixin Liu
Simon Fraser University
[lustre-discuss] Robinhood changelog errors
I am getting an error every second in the robinhood log:

2020/05/29 22:10:37 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory
2020/05/29 22:10:38 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory
2020/05/29 22:10:39 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory
2020/05/29 22:10:40 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory
2020/05/29 22:10:41 [38025/3] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory

These started in early April. Is there something I can do to determine the cause?

Thanks,

Lixin Liu
Simon Fraser University
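llapi_changelog_recv() returning -2 (ENOENT) usually means the reader is asking for records that no longer exist, or its registered changelog user has been lost. A first check is to compare each registered reader's index with the MDT's current index. A sketch under stated assumptions: the `changelog_users` output format and the user id `cl1` below are made up for illustration, and a heredoc stands in for the real `lctl` output:

```shell
# Parse (sample) changelog_users output and report how far each
# registered reader lags behind the current record index.
users=$(cat <<'EOF'
current index: 1234567
ID    index (idle seconds)
cl1   1100000 (3600)
EOF
)
current=$(echo "$users" | awk '/current index:/ {print $3}')
echo "$users" | awk -v cur="$current" '/^cl/ {printf "%s lags by %d records\n", $1, cur - $2}'
# On a real MDS the input comes from:
#   lctl get_param mdd.<fsname>-MDT0000.changelog_users
# and a reader that is permanently stuck can be deregistered (careful --
# this discards its position):
#   lctl --device <fsname>-MDT0000 changelog_deregister cl1
```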
Re: [lustre-discuss] strange errors on Lustre servers
Hi Zeeshan,

Thanks for the hint. OPA works fine, but I then found that someone had brought up a misconfigured node with a conflicting IP address on its OPA interface. After fixing it, the problem was solved.

Lixin.

From: Zeeshan Ali Shah
Date: Saturday, August 11, 2018 at 11:20 PM
To: Lixin Liu
Cc: "lustre-discuss@lists.lustre.org"
Subject: Re: [lustre-discuss] strange errors on Lustre servers

What is the output of opainfo?

Sent from my iPhone

On 12 Aug 2018, at 04:04, Lixin Liu wrote:

Hi,

I am getting these errors on all our MDS and OSS servers (Lustre 2.10.1):

Aug 11 11:45:52 ndc-oss5b kernel: LNet: 24727:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 11:55:52 ndc-oss5b kernel: LNet: 105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:05:52 ndc-oss5b kernel: LNet: 105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:15:52 ndc-oss5b kernel: LNet: 105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:25:52 ndc-oss5b kernel: LNet: 105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752

This is a new node we brought online recently. Is it an indication that we have a problem with the OPA interface on that node? This machine has an 8160F CPU (OPA interface on chip).

Thanks,

Lixin Liu
High Performance Computing
Simon Fraser University
[lustre-discuss] strange errors on Lustre servers
Hi,

I am getting these errors on all our MDS and OSS servers (Lustre 2.10.1):

Aug 11 11:45:52 ndc-oss5b kernel: LNet: 24727:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 11:55:52 ndc-oss5b kernel: LNet: 105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:05:52 ndc-oss5b kernel: LNet: 105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:15:52 ndc-oss5b kernel: LNet: 105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 12:25:52 ndc-oss5b kernel: LNet: 105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752

This is a new node we brought online recently. Is it an indication that we have a problem with the OPA interface on that node? This machine has an 8160F CPU (OPA interface on chip).

Thanks,

Lixin Liu
High Performance Computing
Simon Fraser University
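As the resolution above shows, "Conn stale" with a flapping incarnation can mean two hosts share one NID/IP. A quick triage is to count which peer NIDs keep reporting stale connections, then probe that address for duplicates. A minimal sketch; the heredoc holds the sample lines from above, and the interface name `ib0` in the comment is an assumption:

```shell
# Tally the peers reporting "Conn stale" in syslog; a single NID that
# repeats with changing incarnations suggests a duplicate address.
log=$(cat <<'EOF'
Aug 11 11:45:52 ndc-oss5b kernel: LNet: 24727:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
Aug 11 11:55:52 ndc-oss5b kernel: LNet: 105990:0:(o2iblnd_cb.c:2410:kiblnd_passive_connect()) Conn stale 172.19.142.119@o2ib version 12/12 incarnation 1533927051163335/1533998625080752
EOF
)
echo "$log" | grep -o '[0-9.]*@o2ib' | sort | uniq -c
# To check for a duplicate IP on the fabric, from a third node whose
# IPoIB/OPA interface is ib0 (adjust to your setup):
#   arping -D -I ib0 -c 3 172.19.142.119
```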
Re: [lustre-discuss] BAD CHECKSUM
Hi Andreas,

I have seen very similar errors in our 2.10.1 environment: same errors from different clients to different OSS servers and OSTs. Our network is OPA and we are using the latest driver and firmware for all HFIs and switches (10.6).

Thanks,

Lixin Liu
Compute Canada

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of "Dilger, Andreas" <andreas.dil...@intel.com>
Date: Saturday, December 9, 2017 at 9:07 PM
To: Hans Henrik Happe <ha...@nbi.dk>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] BAD CHECKSUM

Based on the messages on the client, this isn't related to mmap() or writes done by the client, since the data has the same checksum from before it was sent and after it got the checksum error returned from the server. That means the pages did not change on the client. Possible causes include the client network card, server network card, memory, or possibly the OFED driver? It could of course be something in Lustre/LNet, though we haven't had any reports of anything similar.

When the checksum code was first written, it was motivated by a faulty Ethernet NIC that had TCP checksum offload but bad onboard cache: the data was corrupted when copied onto the NIC, the TCP checksum was computed on the bad data, and so the checksum was "correct" when received by the server and did not cause TCP resends.

Are you seeing this on multiple servers? The client log only shows one server, while the server log shows multiple clients. If it is only happening on one server, it might point to hardware.

Did you also upgrade the kernel and OFED at the same time as Lustre? You could try building Lustre 2.10.1 on the old 2.9.0 kernel and OFED to see if that works properly.

Cheers, Andreas

On Dec 9, 2017, at 11:09, Hans Henrik Happe <ha...@nbi.dk> wrote:

On 09-12-2017 18:57, Hans Henrik Happe wrote:

On 07-12-2017 21:36, Dilger, Andreas wrote:

On Dec 7, 2017, at 10:37, Hans Henrik Happe <ha...@nbi.dk> wrote:

Hi,

Can an application cause BAD CHECKSUM errors in Lustre logs by somehow overwriting memory while it is being DMA'ed to the network? After upgrading to 2.10.1 on the server side we started seeing this from a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these errors. We have not yet established whether the application is doing things correctly.

If applications are using mmap IO it is possible for the page to become inconsistent after the checksum has been computed. However, mmap IO is normally detected by the client and no message should be printed. There isn't anything that the application needs to do, since the client will resend the data if there is a checksum error, but the resends do slow down the IO. If the inconsistency is on the client, there is no cause for concern (though it would be good to figure out the root cause).

It would be interesting to see what the exact error message is, since that will say whether the data became inconsistent on the client, or over the network. If the inconsistency is over the network or on the server, then that may point to hardware issues.

I've attached logs from a server and a client. There was a cut-and-paste error in the first set of files; this should be better. Looks like something goes wrong over the network.

Cheers,
Hans Henrik
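For anyone debugging a similar case: the client-side checksum behaviour can be inspected and, for experimentation, the algorithm changed. A sketch of the relevant `lctl` parameters (names as in the 2.10.x osc layer; verify against your version before relying on them):

```shell
# Show whether wire checksums are enabled on each OSC device,
# and which algorithm is currently selected.
lctl get_param osc.*.checksums
lctl get_param osc.*.checksum_type
# Temporarily switch the algorithm while debugging; this does not fix
# corruption, it only changes how it is detected on the wire.
lctl set_param osc.*.checksum_type=crc32
```

Comparing whether the BAD CHECKSUM messages persist across algorithms can help distinguish a checksum-computation issue from genuine data corruption in flight.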