[lustre-discuss] Lustre errors asking for help

2024-01-19 Thread Baranowski, Roman via lustre-discuss
Dear Andreas, All,

Thanks for your response.  Just to give you some more information.  Yes, we have
run e2fsck on all OSTs, MDTs, and MGS and all come back completely clean.  We
have also removed and recreated the quota files on them, again without any
issues.  The underlying storage upon which the OSTs and MDTs are built is fine.
We have run verifies on all LUNs, and they come back clean.  All the spinning
disks are fine and the controllers report no errors or failures.  Similarly,
the Infiniband fabric connecting all storage servers has also been checked and
no errors or issues are present.

You are correct that no files are being created on those OSTs.  However, Lustre
is behaving poorly, and sometimes goes into a hung state.  Today, it was
reporting the following:

pdsh -g storage uptime
mds2:  14:39:27 up 4 days, 21:17,  0 users,  load average: 0.00, 0.00, 0.00
mds1:  14:39:27 up 4 days, 21:17,  0 users,  load average: 0.00, 0.00, 0.00
oss2:  14:39:27 up 4 days, 21:07,  0 users,  load average: 0.00, 0.00, 0.00
oss1:  14:39:27 up 4 days, 21:07,  0 users,  load average: 0.00, 0.00, 0.00
oss4:  14:39:27 up 4 days, 21:06,  0 users,  load average: 0.00, 0.00, 0.00
oss6:  14:39:27 up 4 days, 21:06,  0 users,  load average: 0.00, 0.00, 0.00
oss5:  14:39:27 up 4 days, 21:06,  0 users,  load average: 0.00, 0.00, 0.00
oss3:  14:39:27 up 4 days, 21:07,  0 users,  load average: 0.06, 0.03, 0.00

The load on the storage servers is effectively zero, yet the following messages
are being produced on mds2 (the MDS serving the problematic OSTs) and on the
OSS servers, as recorded in our lustre.log:

Jan 19 14:36:57 oss5 kernel: : LustreError: 
13751:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc 
= -116
Jan 19 14:36:57 oss5 kernel: : LustreError: 
13751:0:(ofd_obd.c:1348:ofd_create()) Skipped 76 previous similar messages
Jan 19 14:39:35 mds2 kernel: : Lustre: 
24647:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time 
(5/-150), not sending early reply
Jan 19 14:39:35 mds2 kernel: : Lustre: 
24647:0:(service.c:1339:ptlrpc_at_send_early_reply()) Skipped 133 previous 
similar messages
Jan 19 14:40:57 oss3 kernel: : LustreError: 
13903:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc 
= -116
Jan 19 14:40:57 oss3 kernel: : LustreError: 
13903:0:(ofd_obd.c:1348:ofd_create()) Skipped 80 previous similar messages
Jan 19 14:43:31 mds2 kernel: : Lustre: scratch-MDT: Client 
e258850e-e603-8c7b-843f-66886cc67347 (at 192.168.113.1@o2ib) reconnecting
Jan 19 14:43:31 mds2 kernel: : Lustre: Skipped 4160 previous similar 
messages
Jan 19 14:43:31 mds2 kernel: : Lustre: scratch-MDT: Client 
e258850e-e603-8c7b-843f-66886cc67347 (at 192.168.113.1@o2ib) refused 
reconnection, still busy with 1 active RPCs
Jan 19 14:43:31 mds2 kernel: : Lustre: Skipped 4094 previous similar 
messages
Jan 19 14:44:30 mds2 kernel: : Lustre: lock timed out (enqueued at 
1705703070, 1200s ago)
Jan 19 14:44:30 mds2 kernel: : LustreError: dumping log 
to /tmp/lustre-log.1705704270.4999
Jan 19 14:44:30 mds2 kernel: : Lustre: Skipped 10 previous similar messages
Jan 19 14:44:32 mds2 kernel: : LustreError: 
0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired 
after 4307s: evicting client at 192.168.122.11@o2ib  ns: 
mdt-scratch-MDT_UUID lock: 8804455dc480/0x280a58b66b7affa7 lrc: 3/0,0 
mode: PR/PR res: [0x2000733da:0x14d:0x0].0 bits 0x13 rrc: 267 type: IBT flags: 
0x20040020 nid: 192.168.122.11@o2ib remote: 0xa9053ebb6b8f23c1 expref: 7 
pid: 28304 timeout: 4717246067 lvb_type: 0
Jan 19 14:44:32 mds2 kernel: : LustreError: 
0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 1 previous similar 
message
Jan 19 14:44:32 mds2 kernel: : Lustre: 
26575:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer 
than estimated (755:4253s); client may timeout.  req@880127936800 
x1788463595089220/t609893172938(0) 
o101->03cbede6-68a6-36ae-866e-74e858cf47f1@192.168.114.2@o2ib:0/0 lens 584/600 
e 0 to 0 dl 1705700019 ref 1 fl Complete:/0/0 rc 0/0
Jan 19 14:44:32 mds2 kernel: : LustreError: 
0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 1 previous similar 
message
Jan 19 14:44:32 mds2 kernel: : Lustre: 
26575:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer 
than estimated (755:4253s); client may timeout.  req@880127936800 
x1788463595089220/t609893172938(0) 
o101->03cbede6-68a6-36ae-866e-74e858cf47f1@192.168.114.2@o2ib:0/0 lens 584/600 
e 0 to 0 dl 1705700019 ref 1 fl Complete:/0/0 rc 0/0
Jan 19 14:44:32 mds2 kernel: : LustreError: 
29019:0:(service.c:1999:ptlrpc_server_handle_request()) @@@ Dropping timed-out 
request from 12345-192.168.128.8@o2ib: deadline 100:599s ago
Jan 19 14:44:32 mds2 kernel: : LustreError: 
29019:0:(service.c:1999:ptlrpc_server_handle_request()) Skipped 73 previous 
similar messages
Jan 19 14:44:32 mds2 kernel: : LustreError: 
24604:0:(ldlm_lockd.c:1309:ldlm_handle_enqueue0()) ### lock on disco
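
The epoch values in the "lock timed out" and "dumping log" records above can be cross-checked against the syslog timestamps. The following is only a small sketch of that arithmetic in Python; the variable names are illustrative, and the numbers are copied from the log excerpt.

```python
from datetime import datetime, timezone

# Values copied from the mds2 log records above.
enqueued_at = 1705703070   # "lock timed out (enqueued at 1705703070, ..."
waited_s = 1200            # "... 1200s ago)"
dump_suffix = 1705704270   # from "/tmp/lustre-log.1705704270.4999"

# The moment the timeout was reported is the enqueue time plus the wait.
timeout_at = enqueued_at + waited_s
timeout_utc = datetime.fromtimestamp(timeout_at, tz=timezone.utc)

print("matches debug log suffix:", timeout_at == dump_suffix)
print("timeout reported at (UTC):", timeout_utc.isoformat())
```

The sum equals the number embedded in the debug log file name, so that dump appears to belong to this lock timeout. The UTC time is eight hours ahead of the "Jan 19 14:44:30" syslog stamp, which would be consistent with the servers logging in a UTC-8 local time zone; that is an inference from the numbers, not something stated in the logs.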

Re: [lustre-discuss] Lustre errors asking for help

2024-01-17 Thread Andreas Dilger via lustre-discuss
Roman,
have you tried running e2fsck on the underlying device ("-fn" to start)?  It is
usually best to run with the latest version of e2fsprogs, as it has the most
fixes.

It is definitely strange that all OSTs are reporting errors at the same time,
which makes me wonder how the underlying hardware is holding up.  Can you log
in to the controller and check the RAID status?

The error might be coming from the Object Index on those OSTs.  However, this
version is old enough that I'm not sure if OI Scrub even existed in it.
Otherwise, it would be possible to just remove the OI files and they would be
recreated on the next mount.

The filesystem currently isn't able to create any new files on those OSTs, so
that may also be why the performance is lower.

After 12+ years, it might be time to update to newer storage?  In particular,
such old HDDs often fail after a significant power failure, so you might be
running on the last legs, and it's a good time to make a backup.  Given the age
of the storage, I expect a modern HDD or two would have enough capacity to back
up the whole filesystem (even if not performing as well), in case you don't
have a chance to upgrade before it finally gives out.

Cheers, Andreas

> On Jan 17, 2024, at 17:55, Baranowski, Roman wrote:
> 
> 
> Dear All,
> 
> We have a legacy version of Lustre installed as part of a DDN storage 
> solution:
> 
> lustre: 2.4.3 (circa 2011)
> 
> kernel: patchless_client
> 
> Build Version: 
> EXAScaler-ddn1.0--PRISTINE-2.6.32-358.23.2.el6_lustre.es279.devel.x86_64
> 
> 
> 
> It has been running fine for years but after a particularly bad power 
> failure, it started producing the following messages:
> 
> Jan 15 10:03:07 mds2 kernel: : LustreError: 
> 3394:0:(osp_precreate.c:989:osp_precreate_thread()) 
> scratch-OST0014-osc-MDT: cannot precreate objects: rc = -116
> Jan 15 10:03:07 mds2 kernel: : LustreError: 
> 3394:0:(osp_precreate.c:989:osp_precreate_thread()) Skipped 210 previous 
> similar messages
> Jan 15 10:07:51 mds2 kernel: : Lustre: scratch-OST000f-osc-MDT: slow 
> creates, last=[0x1000f:0x1217571a:0x0], 
> next=[0x1000f:0x1217571a:0x0], reserved=0, syn_changes=0, 
> syn_rpc_in_progress=0, status=0
> Jan 15 10:07:51 mds2 kernel: : Lustre: Skipped 3 previous similar messages
> Jan 15 10:08:32 oss5 kernel: : LustreError: 
> 26943:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: 
> rc = -116
> Jan 15 10:08:32 oss5 kernel: : LustreError: 
> 26943:0:(ofd_obd.c:1348:ofd_create()) Skipped 66 previous similar messages
> Jan 15 10:09:26 oss4 kernel: : LustreError: 
> 18223:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: 
> rc = -116
> Jan 15 10:09:26 oss4 kernel: : LustreError: 
> 18223:0:(ofd_obd.c:1348:ofd_create()) Skipped 70 previous similar messages
> Jan 15 10:09:37 oss3 kernel: : LustreError: 
> 16621:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: 
> rc = -116
> Jan 15 10:09:37 oss3 kernel: : LustreError: 
> 16621:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
> Jan 15 10:09:38 mds2 kernel: : Lustre: scratch-OST0014-osc-MDT: slow 
> creates, last=[0x10014:0x11dd257a:0x0], 
> next=[0x10014:0x11dd257a:0x0], reserved=0, syn_changes=0, 
> syn_rpc_in_progress=0, status=-116
> Jan 15 10:13:12 mds2 kernel: : LustreError: 
> 3404:0:(osp_precreate.c:484:osp_precreate_send()) 
> scratch-OST0004-osc-MDT: can't precreate: rc = -116
> Jan 15 10:13:12 mds2 kernel: : LustreError: 
> 3404:0:(osp_precreate.c:484:osp_precreate_send()) Skipped 226 previous 
> similar messages
> Jan 15 10:13:12 mds2 kernel: : LustreError: 
> 3404:0:(osp_precreate.c:484:osp_precreate_send()) Skipped 226 previous 
> similar messages
> Jan 15 10:13:12 mds2 kernel: : LustreError: 
> 3404:0:(osp_precreate.c:989:osp_precreate_thread()) 
> scratch-OST0004-osc-MDT: cannot precreate objects: rc = -116
> Jan 15 10:13:12 mds2 kernel: : LustreError: 
> 3404:0:(osp_precreate.c:989:osp_precreate_thread()) Skipped 226 previous 
> similar messages
> Jan 15 10:18:37 oss5 kernel: : LustreError: 
> 1791:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc 
> = -116
> Jan 15 10:18:37 oss5 kernel: : LustreError: 
> 1791:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
> Jan 15 10:19:36 oss4 kernel: : LustreError: 
> 1687:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc 
> = -116
> Jan 15 10:19:36 oss4 kernel: : LustreError: 
> 1687:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
> Jan 15 10:19:42 oss3 kernel: : LustreError: 
> 1196:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc 
> = -116
> Jan 15 10:19:42 oss3 kernel: : LustreError: 
> 1196:0:(ofd_obd.c:1348:ofd_create()) Skipped 75 previous similar messages
> Jan 15 10:23:16 mds2 kernel: : LustreError: 
> 3400:0:(osp_precreate.c:484:osp_precreate_send()) 
> scratch-OST000f-osc-MDT: can't precreate:

[lustre-discuss] Lustre errors asking for help

2024-01-17 Thread Baranowski, Roman via lustre-discuss

Dear All,

We have a legacy version of Lustre installed as part of a DDN storage solution:

lustre: 2.4.3 (circa 2011)

kernel: patchless_client

Build Version: 
EXAScaler-ddn1.0--PRISTINE-2.6.32-358.23.2.el6_lustre.es279.devel.x86_64



It has been running fine for years but after a particularly bad power 
failure, it started producing the following messages:

Jan 15 10:03:07 mds2 kernel: : LustreError: 
3394:0:(osp_precreate.c:989:osp_precreate_thread()) 
scratch-OST0014-osc-MDT: cannot precreate objects: rc = -116
Jan 15 10:03:07 mds2 kernel: : LustreError: 
3394:0:(osp_precreate.c:989:osp_precreate_thread()) Skipped 210 previous 
similar messages
Jan 15 10:07:51 mds2 kernel: : Lustre: scratch-OST000f-osc-MDT: slow 
creates, last=[0x1000f:0x1217571a:0x0], next=[0x1000f:0x1217571a:0x0], 
reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=0
Jan 15 10:07:51 mds2 kernel: : Lustre: Skipped 3 previous similar messages
Jan 15 10:08:32 oss5 kernel: : LustreError: 
26943:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc 
= -116
Jan 15 10:08:32 oss5 kernel: : LustreError: 
26943:0:(ofd_obd.c:1348:ofd_create()) Skipped 66 previous similar messages
Jan 15 10:09:26 oss4 kernel: : LustreError: 
18223:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc 
= -116
Jan 15 10:09:26 oss4 kernel: : LustreError: 
18223:0:(ofd_obd.c:1348:ofd_create()) Skipped 70 previous similar messages
Jan 15 10:09:37 oss3 kernel: : LustreError: 
16621:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc 
= -116
Jan 15 10:09:37 oss3 kernel: : LustreError: 
16621:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
Jan 15 10:09:38 mds2 kernel: : Lustre: scratch-OST0014-osc-MDT: slow 
creates, last=[0x10014:0x11dd257a:0x0], next=[0x10014:0x11dd257a:0x0], 
reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=-116
Jan 15 10:13:12 mds2 kernel: : LustreError: 
3404:0:(osp_precreate.c:484:osp_precreate_send()) scratch-OST0004-osc-MDT: 
can't precreate: rc = -116
Jan 15 10:13:12 mds2 kernel: : LustreError: 
3404:0:(osp_precreate.c:484:osp_precreate_send()) Skipped 226 previous similar 
messages
Jan 15 10:13:12 mds2 kernel: : LustreError: 
3404:0:(osp_precreate.c:484:osp_precreate_send()) Skipped 226 previous similar 
messages
Jan 15 10:13:12 mds2 kernel: : LustreError: 
3404:0:(osp_precreate.c:989:osp_precreate_thread()) 
scratch-OST0004-osc-MDT: cannot precreate objects: rc = -116
Jan 15 10:13:12 mds2 kernel: : LustreError: 
3404:0:(osp_precreate.c:989:osp_precreate_thread()) Skipped 226 previous 
similar messages
Jan 15 10:18:37 oss5 kernel: : LustreError: 
1791:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc = 
-116
Jan 15 10:18:37 oss5 kernel: : LustreError: 
1791:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
Jan 15 10:19:36 oss4 kernel: : LustreError: 
1687:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc = 
-116
Jan 15 10:19:36 oss4 kernel: : LustreError: 
1687:0:(ofd_obd.c:1348:ofd_create()) Skipped 77 previous similar messages
Jan 15 10:19:42 oss3 kernel: : LustreError: 
1196:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc = 
-116
Jan 15 10:19:42 oss3 kernel: : LustreError: 
1196:0:(ofd_obd.c:1348:ofd_create()) Skipped 75 previous similar messages
Jan 15 10:23:16 mds2 kernel: : LustreError: 
3400:0:(osp_precreate.c:484:osp_precreate_send()) scratch-OST000f-osc-MDT: 
can't precreate: rc = -116

The messages concern the same 3 OSTs and appear both on the OSS servers serving 
those OSTs and the mds server responsible for that filesystem (/global/scratch).
They appear continuously, about every 4 minutes, and appear as soon as the 
filesystem is mounted even before any I/O occurs.  In other words, even on 
an inactive filesystem, the messages appear continuously.
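
To make the pattern described above easier to see, the OSS-side records can be grouped by OST. The following is only a sketch in Python: the regular expression, the function names, and the assumed year (2024, taken from the thread dates) are illustrative, and the six sample records are copied from the log above with the mail line wraps undone.

```python
import re
from collections import defaultdict
from datetime import datetime

# Six OSS records copied from the log above, re-joined onto single lines.
SAMPLE = """\
Jan 15 10:08:32 oss5 kernel: : LustreError: 26943:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc = -116
Jan 15 10:09:26 oss4 kernel: : LustreError: 18223:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc = -116
Jan 15 10:09:37 oss3 kernel: : LustreError: 16621:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc = -116
Jan 15 10:18:37 oss5 kernel: : LustreError: 1791:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc = -116
Jan 15 10:19:36 oss4 kernel: : LustreError: 1687:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc = -116
Jan 15 10:19:42 oss3 kernel: : LustreError: 1196:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc = -116
"""

# Matches only the "unable to precreate" records in the format shown above.
PRECREATE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: : LustreError: "
    r"\S+ (?P<ost>\S+-OST[0-9a-f]{4}): unable to precreate: rc = (?P<rc>-?\d+)$"
)


def summarize(text, year=2024):
    """Group precreate failures by OST: reporting hosts, return codes, times."""
    per_ost = defaultdict(lambda: {"hosts": set(), "rcs": set(), "times": []})
    for line in text.splitlines():
        m = PRECREATE_RE.match(line.strip())
        if m is None:
            continue  # not a precreate failure record
        entry = per_ost[m["ost"]]
        entry["hosts"].add(m["host"])
        entry["rcs"].add(int(m["rc"]))
        # syslog stamps carry no year, so one is supplied by the caller
        entry["times"].append(
            datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        )
    return dict(per_ost)


def gaps_seconds(times):
    """Seconds between consecutive records for one OST."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for ost, entry in sorted(summarize(SAMPLE).items()):
        print(ost, sorted(entry["hosts"]), sorted(entry["rcs"]),
              gaps_seconds(entry["times"]))
```

Run against the full lustre.log (after undoing any line wrapping), a summary like this should show whether the failures stay confined to the same three OSTs and how regular the interval is on each OSS.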

While everything seems to work, the performance is terrible.  Creating a 
directory on the filesystem can take 1-2 minutes to complete.  The load on the 
mds server climbs to incredibly high values (100-160) during normal I/O 
operations and the filesystem overall is extremely slow.  The mds server 
complains about slow connections (see messages above).

We think the error messages above indicate the problem but despite searching 
many hours on the web, have not been able to find any documentation about what 
may be causing them, or how to correct the issue.
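
As a starting point for searching, the "rc = -116" in these messages is a negated Linux errno value, and 116 on Linux is ESTALE ("Stale file handle"). The following is a minimal Python sketch of that lookup; the small table and the decode_rc helper are illustrative, and the table is hard-coded with Linux values because errno numbering differs between operating systems.

```python
# A few Linux errno values, hard-coded so the result does not depend on the
# operating system this snippet happens to run on.
LINUX_ERRNO = {
    2: ("ENOENT", "No such file or directory"),
    5: ("EIO", "Input/output error"),
    28: ("ENOSPC", "No space left on device"),
    116: ("ESTALE", "Stale file handle"),
}


def decode_rc(rc):
    """Translate a negative kernel-style return code into its errno name."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this small table"))
    return f"rc = {rc}: {name} ({text})"


print(decode_rc(-116))
```

This only names the error code; it does not by itself say why the OSTs return it during precreate.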

Any help would be greatly appreciated.  Thanks a million for any suggestions and 
solutions.

All the best
Roman


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org