Re: [lustre-discuss] Unresponsiveness of OSS and Directory Listing Hang-up

2023-05-22 Thread Jane Liu via lustre-discuss

Hi,

We found a solution to the directory-listing hang-ups and the 
unresponsive OSS. There appeared to be metadata inconsistency on the 
MDT. Since the online "lctl lfsck_start ..." did not work, we tried an 
offline, ext4-level (ldiskfs) fix: we unmounted the MDT and ran e2fsck 
to repair the metadata inconsistencies. The command we executed was:


e2fsck -fp /dev/mapper/mds01-mds01

After the MDT was remounted, everything worked again.
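
For anyone hitting the same problem, here is a minimal sketch of the 
offline repair sequence, assuming the MDT device is 
/dev/mapper/mds01-mds01 and is normally mounted at /mnt/mdt (the mount 
point is an assumption; adjust for your layout):

# unmount the MDT so e2fsck can run on the underlying ldiskfs device
umount /mnt/mdt

# optional read-only pass first, to preview what e2fsck would change
e2fsck -fn /dev/mapper/mds01-mds01

# forced check in preen mode; automatically repairs "safe" inconsistencies
e2fsck -fp /dev/mapper/mds01-mds01

# remount the MDT; clients reconnect and recovery completes
mount -t lustre /dev/mapper/mds01-mds01 /mnt/mdt

Note that the ldiskfs-aware e2fsprogs distributed by Whamcloud should be 
used on Lustre MDT/OST devices rather than the stock distribution 
e2fsprogs.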

Jane


[lustre-discuss] Unresponsiveness of OSS and Directory Listing Hang-up

2023-05-18 Thread Jane Liu via lustre-discuss

Hi,

We have recently upgraded our Lustre servers to RHEL 8.7 with Lustre 
2.15.2. After running smoothly for several weeks, we have encountered 
an issue that is the same as the one reported at 
https://jira.whamcloud.com/browse/LU-10697. Although the Lustre version 
described there differs from ours, the symptoms are identical.


“uname -a” on our system returns: 
4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023 
x86_64 x86_64 x86_64 GNU/Linux, and /etc/redhat-release contains: 
Red Hat Enterprise Linux release 8.7 (Ootpa).


Here are the details about the issue.

On 5/16, around 4:47 am, one OSS named oss29 began experiencing 
problems. The number of active requests rose rapidly from 1 to 123 
between roughly 4:47 am and 10:20 am. At around 10:20 am on 5/16, I/O 
on oss29 stopped entirely, and the number of active requests stayed at 
123. Over the same window, the system load climbed from a very low 
value to around 400. Interestingly, despite the extreme load, the CPUs 
remained mostly idle.
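
For reference, one way to watch the active request count and the load 
average on the OSS is something like the following; the ost_io service 
stats path is the standard one in Lustre 2.15, but treat the exact 
parameter name as an assumption for other versions:

# req_active in the ost_io service stats is the number of requests
# currently being processed
lctl get_param ost.OSS.ost_io.stats | grep req_active

# load average and (idle) CPU usage
uptime
top -b -n 1 | head -5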


Furthermore, when running the lfs df command on the MDS, none of the 
OSTs on oss29 were visible.
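
The check itself is trivial; with the OSS unresponsive, the affected 
OSTs simply drop out of the listing (OST index 0192 is taken from the 
log messages further down):

# list all OSTs and their usage as seen by this node
lfs df -h

# narrow it down to one of the OSTs served by oss29
lfs df -h | grep OST0192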


We noticed many errors of the form “This server is not able to keep up 
with request traffic (cpu-bound)” in the syslog on oss29 (a diagnostic 
sketch follows the excerpt):


May 16 05:44:52 oss29 kernel: Lustre: ost_io: This server is not able to 
keep up with request traffic (cpu-bound).
May 16 06:13:49 oss29 kernel: Lustre: ost_io: This server is not able to 
keep up with request traffic (cpu-bound).
May 16 06:23:39 oss29 kernel: Lustre: ost_io: This server is not able to 
keep up with request traffic (cpu-bound).
May 16 06:32:56 oss29 kernel: Lustre: ost_io: This server is not able to 
keep up with request traffic (cpu-bound).
May 16 06:42:46 oss29 kernel: Lustre: ost_io: This server is not able to 
keep up with request traffic (cpu-bound).

…
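
These messages come from the ost_io request service. A quick diagnostic 
sketch using the standard service-thread parameters (the set_param 
value is only an illustration):

# running thread count vs. the configured limits for the ost_io service
lctl get_param ost.OSS.ost_io.threads_started
lctl get_param ost.OSS.ost_io.threads_min ost.OSS.ost_io.threads_max

# raising the ceiling is the usual knob when the service is genuinely
# short of threads; it will not help when the threads are blocked, which
# the idle CPU alongside the high load suggested here
lctl set_param ost.OSS.ost_io.threads_max=512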

At the same time, users reported an issue with a specific directory, 
which became inaccessible. Running ls -l on this directory hung, while 
a plain ls worked (a plain ls only needs the directory entries from the 
MDT, whereas ls -l also has to fetch attributes such as file size from 
the OSTs). Users found they could read some files within the directory 
but not others.


In an attempt to fix the situation, I tried to unmount the OSTs on 
oss29, but this was unsuccessful. We then decided to reboot oss29 on 
5/17 around 3:30 pm. However, once the system came back up, oss29 
immediately returned to its previous unresponsive state with high load. 
Listing the directory still hung.
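
For reference, the unmount path on the OSS is just the usual one, 
assuming the OSTs are mounted under /mnt/ost* (the mount point is an 
assumption):

# regular unmount of one OST; this is the step that failed for us
umount /mnt/ost0192

# forced unmount as a fallback before resorting to a reboot
umount -f /mnt/ost0192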


I ran LFSCK on the MDT, but it just hung.
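
For the record, the LFSCK attempt was along the lines of the commands 
below; the MDT label sphnx01-MDT0000 is an assumption (the filesystem 
name sphnx01 appears in the logs, but the MDT index does not):

# start LFSCK on the MDT (run on the MDS); by default all check types run
lctl lfsck_start -M sphnx01-MDT0000

# report the progress of a running scan
lctl lfsck_query -M sphnx01-MDT0000

# stop a hung run
lctl lfsck_stop -M sphnx01-MDT0000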

Here are the related MDS syslog entries from the period:

May 16 05:09:13 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: 
Connection to sphnx01-OST0192 (at 10.42.73.42@tcp) was lost; in progress 
operations using this service will wait for recovery to complete
May 16 05:09:13 sphnxmds01 kernel: LustreError: 
384795:0:(osp_precreate.c:677:osp_precreate_send()) 
sphnx01-OST0192-osc-MDT: can't precreate: rc = -11
May 16 05:09:13 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: 
Connection restored to 10.42.73.42@tcp (at 10.42.73.42@tcp)
May 16 05:09:13 sphnxmds01 kernel: LustreError: 
384795:0:(osp_precreate.c:1340:osp_precreate_thread()) 
sphnx01-OST0192-osc-MDT: cannot precreate objects: rc = -11


…
May 16 05:22:17 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: 
Connection to sphnx01-OST0192 (at 10.42.73.42@tcp) was lost; in progress 
operations using this service will wait for recovery to complete
May 16 05:22:17 sphnxmds01 kernel: LustreError: 
384795:0:(osp_precreate.c:967:osp_precreate_cleanup_orphans()) 
sphnx01-OST0192-osc-MDT: cannot cleanup orphans: rc = -11
May 16 05:22:17 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: 
Connection restored to 10.42.73.42@tcp (at 10.42.73.42@tcp)


And here are the OSS syslog entries:

May 16 04:47:21 oss29 kernel: Lustre: sphnx01-OST0192: Export 
9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:26 oss29 kernel: Lustre: sphnx01-OST0192: Export 
9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:31 oss29 kernel: Lustre: sphnx01-OST0192: Export 
12e8b1d0 already connecting from 130.199.48.37@tcp
May 16 04:47:36 oss29 kernel: Lustre: sphnx01-OST0192: Export 
9a00357a already connecting from 130.199.206.80@tcp

May 16 04:47:36 oss29 kernel: Lustre: Skipped 1 previous similar message
May 16 04:47:41 oss29 kernel: Lustre: sphnx01-OST0192: Export 
9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:51 oss29 kernel: Lustre: sphnx01-OST0192: Export 
9a00357a already connecting from 130.199.206.80@tcp

May 16 04:47:51 oss29 kernel: Lustre: Skipped 1 previous similar message
May 16 04:48:11 oss29 kernel: Lustre: sphnx01-OST0192: Export 
9a00357a already connecting from 130.199.206.80@tcp
May 16 04:48:11 oss29 kernel: Lustre: Skipped 4 previous similar