[lustre-discuss] Lustre errors asking for help
Dear Andreas, All,

Thanks for your response. Just to give you some more information:

Yes, we have run e2fsck on all OSTs, MDTs, and the MGS, and all come back completely clean. We have also removed and recreated the quota files on them, again without any issues.

The underlying storage upon which the OSTs and MDTs are built is fine. We have run verifies on all LUNs, and they come back clean. All the spinning disks are fine and the controllers report no errors or failures. Similarly, the InfiniBand fabric connecting all storage servers has also been checked, and no errors or issues are present.

You are correct that no files are being created on those OSTs. However, Lustre is behaving poorly, and sometimes goes into a hung state. Today, it was reporting the following:

pdsh -g storage uptime
mds2: 14:39:27 up 4 days, 21:17, 0 users, load average: 0.00, 0.00, 0.00
mds1: 14:39:27 up 4 days, 21:17, 0 users, load average: 0.00, 0.00, 0.00
oss2: 14:39:27 up 4 days, 21:07, 0 users, load average: 0.00, 0.00, 0.00
oss1: 14:39:27 up 4 days, 21:07, 0 users, load average: 0.00, 0.00, 0.00
oss4: 14:39:27 up 4 days, 21:06, 0 users, load average: 0.00, 0.00, 0.00
oss6: 14:39:27 up 4 days, 21:06, 0 users, load average: 0.00, 0.00, 0.00
oss5: 14:39:27 up 4 days, 21:06, 0 users, load average: 0.00, 0.00, 0.00
oss3: 14:39:27 up 4 days, 21:07, 0 users, load average: 0.06, 0.03, 0.00

The load on the storage servers is effectively zero, yet the following messages are being produced on mds2 (the MDS serving the problematic OSTs) and on the OSS servers, taken from our lustre.log:

Jan 19 14:36:57 oss5 kernel: : LustreError: 13751:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc = -116
Jan 19 14:36:57 oss5 kernel: : LustreError: 13751:0:(ofd_obd.c:1348:ofd_create()) Skipped 76 previous similar messages
Jan 19 14:39:35 mds2 kernel: : Lustre: 24647:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Jan 19 14:39:35 mds2 kernel: : Lustre: 24647:0:(service.c:1339:ptlrpc_at_send_early_reply()) Skipped 133 previous similar messages
Jan 19 14:40:57 oss3 kernel: : LustreError: 13903:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc = -116
Jan 19 14:40:57 oss3 kernel: : LustreError: 13903:0:(ofd_obd.c:1348:ofd_create()) Skipped 80 previous similar messages
Jan 19 14:43:31 mds2 kernel: : Lustre: scratch-MDT: Client e258850e-e603-8c7b-843f-66886cc67347 (at 192.168.113.1@o2ib) reconnecting
Jan 19 14:43:31 mds2 kernel: : Lustre: Skipped 4160 previous similar messages
Jan 19 14:43:31 mds2 kernel: : Lustre: scratch-MDT: Client e258850e-e603-8c7b-843f-66886cc67347 (at 192.168.113.1@o2ib) refused reconnection, still busy with 1 active RPCs
Jan 19 14:43:31 mds2 kernel: : Lustre: Skipped 4094 previous similar messages
Jan 19 14:44:30 mds2 kernel: : Lustre: lock timed out (enqueued at 1705703070, 1200s ago)
Jan 19 14:44:30 mds2 kernel: : LustreError: dumping log to /tmp/lustre-log.1705704270.4999
Jan 19 14:44:30 mds2 kernel: : Lustre: Skipped 10 previous similar messages
Jan 19 14:44:32 mds2 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 4307s: evicting client at 192.168.122.11@o2ib ns: mdt-scratch-MDT_UUID lock: 8804455dc480/0x280a58b66b7affa7 lrc: 3/0,0 mode: PR/PR res: [0x2000733da:0x14d:0x0].0 bits 0x13 rrc: 267 type: IBT flags: 0x20040020 nid: 192.168.122.11@o2ib remote: 0xa9053ebb6b8f23c1 expref: 7 pid: 28304 timeout: 4717246067 lvb_type: 0
Jan 19 14:44:32 mds2 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 1 previous similar message
Jan 19 14:44:32 mds2 kernel: : Lustre: 26575:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:4253s); client may timeout. req@880127936800 x1788463595089220/t609893172938(0) o101->03cbede6-68a6-36ae-866e-74e858cf47f1@192.168.114.2@o2ib:0/0 lens 584/600 e 0 to 0 dl 1705700019 ref 1 fl Complete:/0/0 rc 0/0
Jan 19 14:44:32 mds2 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 1 previous similar message
Jan 19 14:44:32 mds2 kernel: : Lustre: 26575:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:4253s); client may timeout. req@880127936800 x1788463595089220/t609893172938(0) o101->03cbede6-68a6-36ae-866e-74e858cf47f1@192.168.114.2@o2ib:0/0 lens 584/600 e 0 to 0 dl 1705700019 ref 1 fl Complete:/0/0 rc 0/0
Jan 19 14:44:32 mds2 kernel: : LustreError: 29019:0:(service.c:1999:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-192.168.128.8@o2ib: deadline 100:599s ago
Jan 19 14:44:32 mds2 kernel: : LustreError: 29019:0:(service.c:1999:ptlrpc_server_handle_request()) Skipped 73 previous similar messages
Jan 19 14:44:32 mds2 kernel: : LustreError: 24604:0:(ldlm_lockd.c:1309:ldlm_handle_enqueue0()) ### lock on disco

Re: [lustre-discuss] Lustre errors asking for help
Roman, have you tried running e2fsck on the underlying device ("-fn" to start)? It is usually best to run with the latest version of e2fsprogs, as it has the most fixes.

It is definitely strange that all OSTs are reporting errors at the same time, which makes me wonder how the underlying hardware is holding up. Can you log in to the controller and check the RAID status?

The error might be coming from the Object Index on those OSTs. However, this version is old enough that I'm not sure whether OI Scrub even existed in it. Otherwise, it would be possible to just remove the OI files and they would be recreated on the next mount.

The filesystem currently isn't able to create any new files on those OSTs, so that may also be why the performance is lower.

After 12+ years, it might be time to update to newer storage? In particular, such old HDDs often fail after a significant power failure, so you might be running on the last legs, and it's a good time to make a backup. Given the age of the storage, I expect a modern HDD or two would have enough capacity to back up the whole filesystem (even if not performing as well), in case you don't have a chance to upgrade before it finally gives out.

Cheers, Andreas

> On Jan 17, 2024, at 17:55, Baranowski, Roman wrote:
>
> Dear All,
>
> We have a legacy version of Lustre installed as part of a DDN storage solution:
>
> lustre: 2.4.3 (circa 2011)
> kernel: patchless_client
> Build Version: EXAScaler-ddn1.0--PRISTINE-2.6.32-358.23.2.el6_lustre.es279.devel.x86_64
>
> It has been running fine for years but after a particularly bad power failure, it started producing the following messages:
>
> Jan 15 10:03:07 mds2 kernel: : LustreError: 3394:0:(osp_precreate.c:989:osp_precreate_thread()) scratch-OST0014-osc-MDT: cannot precreate objects: rc = -116
> Jan 15 10:03:07 mds2 kernel: : LustreError: 3394:0:(osp_precreate.c:989:osp_precreate_thread()) Skipped 210 previous similar messages
> Jan 15 10:07:51 mds2 kernel: : Lustre: scratch-OST000f-osc-MDT: slow creates, last=[0x1000f:0x1217571a:0x0], next=[0x1000f:0x1217571a:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=0
> Jan 15 10:07:51 mds2 kernel: : Lustre: Skipped 3 previous similar messages
> Jan 15 10:08:32 oss5 kernel: : LustreError: 26943:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc = -116
> Jan 15 10:08:32 oss5 kernel: : LustreError: 26943:0:(ofd_obd.c:1348:ofd_create()) Skipped 66 previous similar messages
> Jan 15 10:09:26 oss4 kernel: : LustreError: 18223:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc = -116
> Jan 15 10:09:26 oss4 kernel: : LustreError: 18223:0:(ofd_obd.c:1348:ofd_create()) Skipped 70 previous similar messages
> Jan 15 10:09:37 oss3 kernel: : LustreError: 16621:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc = -116
> Jan 15 10:09:37 oss3 kernel: : LustreError: 16621:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
> Jan 15 10:09:38 mds2 kernel: : Lustre: scratch-OST0014-osc-MDT: slow creates, last=[0x10014:0x11dd257a:0x0], next=[0x10014:0x11dd257a:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=-116
> Jan 15 10:13:12 mds2 kernel: : LustreError: 3404:0:(osp_precreate.c:484:osp_precreate_send()) scratch-OST0004-osc-MDT: can't precreate: rc = -116
> Jan 15 10:13:12 mds2 kernel: : LustreError: 3404:0:(osp_precreate.c:484:osp_precreate_send()) Skipped 226 previous similar messages
> Jan 15 10:13:12 mds2 kernel: : LustreError: 3404:0:(osp_precreate.c:484:osp_precreate_send()) Skipped 226 previous similar messages
> Jan 15 10:13:12 mds2 kernel: : LustreError: 3404:0:(osp_precreate.c:989:osp_precreate_thread()) scratch-OST0004-osc-MDT: cannot precreate objects: rc = -116
> Jan 15 10:13:12 mds2 kernel: : LustreError: 3404:0:(osp_precreate.c:989:osp_precreate_thread()) Skipped 226 previous similar messages
> Jan 15 10:18:37 oss5 kernel: : LustreError: 1791:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc = -116
> Jan 15 10:18:37 oss5 kernel: : LustreError: 1791:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
> Jan 15 10:19:36 oss4 kernel: : LustreError: 1687:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc = -116
> Jan 15 10:19:36 oss4 kernel: : LustreError: 1687:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
> Jan 15 10:19:42 oss3 kernel: : LustreError: 1196:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc = -116
> Jan 15 10:19:42 oss3 kernel: : LustreError: 1196:0:(ofd_obd.c:1348:ofd_create()) Skipped 75 previous similar messages
> Jan 15 10:23:16 mds2 kernel: : LustreError: 3400:0:(osp_precreate.c:484:osp_precreate_send()) scratch-OST000f-osc-MDT: can't precreate:

[lustre-discuss] Lustre errors asking for help
Dear All,

We have a legacy version of Lustre installed as part of a DDN storage solution:

lustre: 2.4.3 (circa 2011)
kernel: patchless_client
Build Version: EXAScaler-ddn1.0--PRISTINE-2.6.32-358.23.2.el6_lustre.es279.devel.x86_64

It has been running fine for years, but after a particularly bad power failure, it started producing the following messages:

Jan 15 10:03:07 mds2 kernel: : LustreError: 3394:0:(osp_precreate.c:989:osp_precreate_thread()) scratch-OST0014-osc-MDT: cannot precreate objects: rc = -116
Jan 15 10:03:07 mds2 kernel: : LustreError: 3394:0:(osp_precreate.c:989:osp_precreate_thread()) Skipped 210 previous similar messages
Jan 15 10:07:51 mds2 kernel: : Lustre: scratch-OST000f-osc-MDT: slow creates, last=[0x1000f:0x1217571a:0x0], next=[0x1000f:0x1217571a:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=0
Jan 15 10:07:51 mds2 kernel: : Lustre: Skipped 3 previous similar messages
Jan 15 10:08:32 oss5 kernel: : LustreError: 26943:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc = -116
Jan 15 10:08:32 oss5 kernel: : LustreError: 26943:0:(ofd_obd.c:1348:ofd_create()) Skipped 66 previous similar messages
Jan 15 10:09:26 oss4 kernel: : LustreError: 18223:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc = -116
Jan 15 10:09:26 oss4 kernel: : LustreError: 18223:0:(ofd_obd.c:1348:ofd_create()) Skipped 70 previous similar messages
Jan 15 10:09:37 oss3 kernel: : LustreError: 16621:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc = -116
Jan 15 10:09:37 oss3 kernel: : LustreError: 16621:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
Jan 15 10:09:38 mds2 kernel: : Lustre: scratch-OST0014-osc-MDT: slow creates, last=[0x10014:0x11dd257a:0x0], next=[0x10014:0x11dd257a:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=-116
Jan 15 10:13:12 mds2 kernel: : LustreError: 3404:0:(osp_precreate.c:484:osp_precreate_send()) scratch-OST0004-osc-MDT: can't precreate: rc = -116
Jan 15 10:13:12 mds2 kernel: : LustreError: 3404:0:(osp_precreate.c:484:osp_precreate_send()) Skipped 226 previous similar messages
Jan 15 10:13:12 mds2 kernel: : LustreError: 3404:0:(osp_precreate.c:484:osp_precreate_send()) Skipped 226 previous similar messages
Jan 15 10:13:12 mds2 kernel: : LustreError: 3404:0:(osp_precreate.c:989:osp_precreate_thread()) scratch-OST0004-osc-MDT: cannot precreate objects: rc = -116
Jan 15 10:13:12 mds2 kernel: : LustreError: 3404:0:(osp_precreate.c:989:osp_precreate_thread()) Skipped 226 previous similar messages
Jan 15 10:18:37 oss5 kernel: : LustreError: 1791:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc = -116
Jan 15 10:18:37 oss5 kernel: : LustreError: 1791:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
Jan 15 10:19:36 oss4 kernel: : LustreError: 1687:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc = -116
Jan 15 10:19:36 oss4 kernel: : LustreError: 1687:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
Jan 15 10:19:42 oss3 kernel: : LustreError: 1196:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc = -116
Jan 15 10:19:42 oss3 kernel: : LustreError: 1196:0:(ofd_obd.c:1348:ofd_create()) Skipped 75 previous similar messages
Jan 15 10:23:16 mds2 kernel: : LustreError: 3400:0:(osp_precreate.c:484:osp_precreate_send()) scratch-OST000f-osc-MDT: can't precreate: rc = -116

The messages concern the same 3 OSTs and appear both on the OSS servers serving those OSTs and on the MDS server responsible for that filesystem (/global/scratch). They appear continuously, about every 4 minutes, starting as soon as the filesystem is mounted, even before any I/O occurs. In other words, the messages appear continuously even on an inactive filesystem.

While everything seems to work, the performance is terrible. Creating a directory on the filesystem can take 1-2 minutes to complete. The load on the MDS server climbs to incredibly high values (100-160) during normal I/O operations, and the filesystem overall is extremely slow. The MDS server complains about slow connections (see messages above).

We think the error messages above indicate the problem but, despite searching the web for many hours, we have not been able to find any documentation about what may be causing them or how to correct the issue.

Any help would be greatly appreciated.

Thanks a million for any suggestions and solutions.

All the best,
Roman

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
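
For anyone reading this thread later, here is a minimal Python sketch (not from the original posters) of how the OSS-side "unable to precreate" lines could be tallied per OST, to confirm that only the same three OSTs are involved and to see how far apart the printed messages are. The sample lines are copied from the messages in this thread. The helper names `summarize` and `describe_rc`, the one-entry errno table, and the assumed year (2024, taken from the reply date, since syslog lines carry no year) are illustrative assumptions. The rc value is a negated Linux errno, and 116 is ESTALE on Linux. Note that the console output is rate-limited (see the "Skipped N previous similar messages" lines), so the gaps show when a line was printed, not how often the failure occurred.

```python
import re
from collections import defaultdict
from datetime import datetime

# Sample lines copied from the log excerpts in this thread.
SAMPLE_LOG = """\
Jan 15 10:08:32 oss5 kernel: : LustreError: 26943:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc = -116
Jan 15 10:08:32 oss5 kernel: : LustreError: 26943:0:(ofd_obd.c:1348:ofd_create()) Skipped 66 previous similar messages
Jan 15 10:09:26 oss4 kernel: : LustreError: 18223:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc = -116
Jan 15 10:09:37 oss3 kernel: : LustreError: 16621:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc = -116
Jan 15 10:18:37 oss5 kernel: : LustreError: 1791:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc = -116
Jan 15 10:19:36 oss4 kernel: : LustreError: 1687:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc = -116
Jan 15 10:19:42 oss3 kernel: : LustreError: 1196:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc = -116
"""

# Assumption: syslog timestamps have no year; 2024 comes from the reply date.
ASSUMED_YEAR = 2024

# Assumption: a hand-written table with the Linux value only (ESTALE = 116).
LINUX_ERRNO_NAMES = {116: "ESTALE (Stale file handle)"}

PRECREATE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: : LustreError: "
    r".*?\) (?P<ost>\S+-OST[0-9a-f]{4}): unable to precreate: rc = (?P<rc>-?\d+)$"
)


def describe_rc(rc):
    """Map a negated errno such as -116 to a readable name, if known."""
    code = abs(rc)
    return LINUX_ERRNO_NAMES.get(code, f"errno {code} (not in this sketch's table)")


def summarize(log_text):
    """Group 'unable to precreate' lines by OST and compute gaps between them."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        m = PRECREATE_RE.match(line.strip())
        if not m:
            continue  # e.g. "Skipped N previous similar messages" lines
        when = datetime.strptime(f"{ASSUMED_YEAR} {m['ts']}", "%Y %b %d %H:%M:%S")
        events[m["ost"]].append((when, m["host"], int(m["rc"])))

    summary = {}
    for ost, rows in events.items():
        rows.sort()
        times = [row[0] for row in rows]
        summary[ost] = {
            "hosts": {row[1] for row in rows},
            "rcs": {row[2] for row in rows},
            "count": len(rows),
            "gaps_seconds": [
                int((later - earlier).total_seconds())
                for earlier, later in zip(times, times[1:])
            ],
        }
    return summary


if __name__ == "__main__":
    for ost, info in sorted(summarize(SAMPLE_LOG).items()):
        rcs = ", ".join(f"{rc} = {describe_rc(rc)}" for rc in sorted(info["rcs"]))
        print(ost, sorted(info["hosts"]), info["count"], info["gaps_seconds"], rcs)
```

To use it on a real file, the contents of the log could be read and passed to `summarize` in place of `SAMPLE_LOG`; the regular expression only matches the exact message format shown in this thread and would need adjusting for other Lustre versions.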