[lustre-discuss] root squash config not working anymore
Hi,

We are facing an issue with root squash on our Lustre file system. It used to work, but recently it stopped taking effect. I have been trying to get root access to the Lustre file system from our MDS server, but none of the configurations I have tried seem to work anymore. Here are the settings I have tried so far:

(1) mdt.myfs01-MDT.root_squash=0:0
(2) mdt.myfs01-MDT.nosquash_nids=20.42.34.79@tcp
    mdt.myfs01-MDT.root_squash=99:99
(3) mdt.myfs01-MDT.root_squash=0:0
    mdt.myfs01-MDT.nosquash_nids=20.42.34.79@tcp

Despite all these attempts, I still get a 'Permission denied' error when trying to create a regular file in Lustre as root. Here is an example:

[root@myfsmds01 /]# mount -t lustre 20.42.34.79@tcp:/myfs01 /mnt/
[root@myfsmds01 /]# lctl get_param mdt.*.*squa*
mdt.myfs01-MDT.nosquash_nids=20.42.34.79@tcp
mdt.myfs01-MDT.root_squash=0:0
[root@myfsmds01 /]# cd /mnt
[root@myfsmds01 mnt]# mv /tmp/aaa /tmp/test.dat
[root@myfsmds01 mnt]# cp /tmp/test.dat ./test18
cp: cannot create regular file './test18': Permission denied
[root@myfsmds01 mnt]# sestatus
SELinux status: disabled

This issue started recently, and I haven't made any significant changes to the system. Are there any known problems with root squash that I might have missed?

Thanks in advance,
Jane
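For reference, the usual way to make these settings persistent is "lctl set_param -P" run on the MGS. A minimal sketch, assuming the MDT is myfs01-MDT0000 (the index is truncated in the output above, so the target name here is an assumption):

lctl set_param -P mdt.myfs01-MDT0000.root_squash="0:0"
lctl set_param -P mdt.myfs01-MDT0000.nosquash_nids="20.42.34.79@tcp"
# verify the live values on the MDS
lctl get_param mdt.*.root_squash mdt.*.nosquash_nids

With root_squash=0:0 squashing is effectively disabled, and nosquash_nids additionally exempts the listed NIDs from squashing, so either setting alone should have allowed root access in the transcript above.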
Re: [lustre-discuss] Unresponsiveness of OSS and Directory Listing Hang-up
Hi,

We found a solution to the directory-listing hang-ups and the unresponsive OSS. There appeared to be a metadata inconsistency on the MDT. While the online "lctl lfsck_start ..." approach would not work, the offline ext4-level fix did: we unmounted the MDT and ran e2fsck to repair the metadata inconsistencies. The command we executed was:

e2fsck -fp /dev/mapper/mds01-mds01

After the MDT was mounted again, everything just worked.

Jane

On 2023-05-18 16:44, Jane Liu via lustre-discuss wrote:

Hi,

We recently upgraded our Lustre servers to RHEL 8.7 with Lustre 2.15.2. After running smoothly for several weeks, we have encountered an issue that is the same as the one reported at https://jira.whamcloud.com/browse/LU-10697. Although the Lustre version described there differs from ours, the symptoms are identical.

"uname -a" on our system returns:
4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
And the content of /etc/redhat-release is:
Red Hat Enterprise Linux release 8.7 (Ootpa)

Here are the details of the issue. On 5/16, around 4:47 am, one OSS named oss29 began experiencing problems. The number of active requests rose rapidly from 1 to 123 between roughly 4:47 am and 10:20 am. At around 10:20 am on 5/16, I/O on oss29 stopped entirely, and the number of active requests stayed at 123. Over the same window the system load shot up from a very low number to as high as 400, yet the CPUs remained idle. Furthermore, running lfs df on the MDS showed no OSTs from oss29. We noticed a lot of errors about "This server is not able to keep up with request traffic (cpu-bound)" in syslog on oss29:

May 16 05:44:52 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:13:49 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:23:39 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:32:56 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:42:46 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
…

At the same time, users reported that a specific directory had become inaccessible. Running ls -l on it hung, while plain ls worked. Users could read some files within the directory, but not all of them. In an attempt to recover, I tried to unmount the OSTs on oss29, but this was unsuccessful. We then decided to reboot oss29 on 5/17 around 3:30 pm. However, once the system came back, oss29 immediately returned to its previous unresponsive state with high load. Listing the directory still hung. I ran lfsck on the MDT, but it just hung as well.

Here are the related MDS syslog entries during the period:

May 16 05:09:13 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection to sphnx01-OST0192 (at 10.42.73.42@tcp) was lost; in progress operations using this service will wait for recovery to complete
May 16 05:09:13 sphnxmds01 kernel: LustreError: 384795:0:(osp_precreate.c:677:osp_precreate_send()) sphnx01-OST0192-osc-MDT: can't precreate: rc = -11
May 16 05:09:13 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection restored to 10.42.73.42@tcp (at 10.42.73.42@tcp)
May 16 05:09:13 sphnxmds01 kernel: LustreError: 384795:0:(osp_precreate.c:1340:osp_precreate_thread()) sphnx01-OST0192-osc-MDT: cannot precreate objects: rc = -11
…
May 16 05:22:17 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection to sphnx01-OST0192 (at 10.42.73.42@tcp) was lost; in progress operations using this service will wait for recovery to complete
May 16 05:22:17 sphnxmds01 kernel: LustreError: 384795:0:(osp_precreate.c:967:osp_precreate_cleanup_orphans()) sphnx01-OST0192-osc-MDT: cannot cleanup orphans: rc = -11
May 16 05:22:17 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection restored to 10.42.73.42@tcp (at 10.42.73.42@tcp)

And this is the OSS syslog:

May 16 04:47:21 oss29 kernel: Lustre: sphnx01-OST0192: Export 9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:26 oss29 kernel: Lustre: sphnx01-OST0192: Export 9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:31 oss29 kernel: Lustre: sphnx01-OST0192: Export 12e8b1d0 already connecting from 130.199.48.37@tcp
May 16 04:47:36 oss29 kernel: Lustre: sphnx01-OST0192: Export 00
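For the archive, a minimal sketch of the offline repair sequence described above, assuming a hypothetical MDT mount point of /mnt/mdt, the Lustre-patched e2fsprogs, and an assumed MDT index of 0000; the MDT must be unmounted, and a backup first is prudent:

umount /mnt/mdt                                    # stop the MDT
e2fsck -fp /dev/mapper/mds01-mds01                 # forced preen-mode check at the ldiskfs level
mount -t lustre /dev/mapper/mds01-mds01 /mnt/mdt   # bring the MDT back online
lctl lfsck_start -M sphnx01-MDT0000 -t all         # optional: online consistency check afterwards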
[lustre-discuss] Unresponsiveness of OSS and Directory Listing Hang-up
Hi,

We recently upgraded our Lustre servers to RHEL 8.7 with Lustre 2.15.2. After running smoothly for several weeks, we have encountered an issue that is the same as the one reported at https://jira.whamcloud.com/browse/LU-10697. Although the Lustre version described there differs from ours, the symptoms are identical.

"uname -a" on our system returns:
4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
And the content of /etc/redhat-release is:
Red Hat Enterprise Linux release 8.7 (Ootpa)

Here are the details of the issue. On 5/16, around 4:47 am, one OSS named oss29 began experiencing problems. The number of active requests rose rapidly from 1 to 123 between roughly 4:47 am and 10:20 am. At around 10:20 am on 5/16, I/O on oss29 stopped entirely, and the number of active requests stayed at 123. Over the same window the system load shot up from a very low number to as high as 400, yet the CPUs remained idle. Furthermore, running lfs df on the MDS showed no OSTs from oss29. We noticed a lot of errors about "This server is not able to keep up with request traffic (cpu-bound)" in syslog on oss29:

May 16 05:44:52 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:13:49 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:23:39 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:32:56 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:42:46 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
…

At the same time, users reported that a specific directory had become inaccessible. Running ls -l on it hung, while plain ls worked. Users could read some files within the directory, but not all of them. In an attempt to recover, I tried to unmount the OSTs on oss29, but this was unsuccessful. We then decided to reboot oss29 on 5/17 around 3:30 pm. However, once the system came back, oss29 immediately returned to its previous unresponsive state with high load. Listing the directory still hung. I ran lfsck on the MDT, but it just hung as well.

Here are the related MDS syslog entries during the period:

May 16 05:09:13 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection to sphnx01-OST0192 (at 10.42.73.42@tcp) was lost; in progress operations using this service will wait for recovery to complete
May 16 05:09:13 sphnxmds01 kernel: LustreError: 384795:0:(osp_precreate.c:677:osp_precreate_send()) sphnx01-OST0192-osc-MDT: can't precreate: rc = -11
May 16 05:09:13 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection restored to 10.42.73.42@tcp (at 10.42.73.42@tcp)
May 16 05:09:13 sphnxmds01 kernel: LustreError: 384795:0:(osp_precreate.c:1340:osp_precreate_thread()) sphnx01-OST0192-osc-MDT: cannot precreate objects: rc = -11
…
May 16 05:22:17 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection to sphnx01-OST0192 (at 10.42.73.42@tcp) was lost; in progress operations using this service will wait for recovery to complete
May 16 05:22:17 sphnxmds01 kernel: LustreError: 384795:0:(osp_precreate.c:967:osp_precreate_cleanup_orphans()) sphnx01-OST0192-osc-MDT: cannot cleanup orphans: rc = -11
May 16 05:22:17 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection restored to 10.42.73.42@tcp (at 10.42.73.42@tcp)

And this is the OSS syslog:

May 16 04:47:21 oss29 kernel: Lustre: sphnx01-OST0192: Export 9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:26 oss29 kernel: Lustre: sphnx01-OST0192: Export 9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:31 oss29 kernel: Lustre: sphnx01-OST0192: Export 12e8b1d0 already connecting from 130.199.48.37@tcp
May 16 04:47:36 oss29 kernel: Lustre: sphnx01-OST0192: Export 9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:36 oss29 kernel: Lustre: Skipped 1 previous similar message
May 16 04:47:41 oss29 kernel: Lustre: sphnx01-OST0192: Export 9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:51 oss29 kernel: Lustre: sphnx01-OST0192: Export 9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:51 oss29 kernel: Lustre: Skipped 1 previous similar message
May 16 04:48:11 oss29 kernel: Lustre: sphnx01-OST0192: Export 9a00357a already connecting from 130.199.206.80@tcp
May 16 04:48:11 oss29 kernel: Lustre: Skipped 4 previous similar
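When the "cpu-bound" ost_io warnings appear, one quick check is whether the ost_io service thread pool is exhausted. A minimal sketch run on the OSS; the parameter names are standard in 2.15, though whether thread exhaustion explains this particular incident is not certain:

lctl get_param ost.OSS.ost_io.threads_started
lctl get_param ost.OSS.ost_io.threads_max
lctl get_param ost.OSS.ost_io.stats

If threads_started has reached threads_max while requests keep queuing, the service is saturated. Here, a load around 400 with idle CPUs suggests the service threads were blocked (for example, on storage) rather than genuinely CPU-bound.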
Re: [lustre-discuss] Disk failures triggered during OST creation and mounting on OSS Servers
During the installation of the RHEL 8.7 OS and Lustre 2.15.2, it turned out that the iDRAC cards were, by default, set to enable OS-to-iDRAC pass-through. As a result, Lustre mistakenly identified the iDRAC network interface as the primary network connection and attempted to use it instead of the actual network. To resolve this, we manually removed the iDRAC connection and put the correct NID into the modprobe file, which eliminated the disk-failure issue. Here are the steps we took to remove the iDRAC connection:

1. We ran nmcli connection show to list all connections (ensure that iDRAC is listed as "Wired connection 1" or under an assigned NAME).
2. We then ran nmcli connection delete "Wired connection 1" (or replaced "Wired connection 1" with the NAME assigned to the iDRAC) to delete the connection.

A sketch of the resulting LNet configuration follows at the end of this message.

Jane

On 2023-05-13 09:25, John Hearns wrote:

Can you say more about these networking issues? Good to make a note of them in case anyone sees similar in the future.

On Fri, 12 May 2023, 20:40 Jane Liu via lustre-discuss, wrote:

Hi Jeff,

Thanks for your response. We discovered later that the network issues originating from the iDRAC IP were causing the SAS driver to hang or time out when trying to access the drives, which resulted in the drives being kicked out. Once we resolved this issue, both the mkfs and mount operations started working fine.

Thanks,
Jane

On 2023-05-10 12:43, Jeff Johnson wrote:

Jane,

You're having hardware errors. The codes in those mpt3sas errors decode as "PL_LOGINFO_SUB_CODE_OPEN_FAILURE_ORR_TIMEOUT"; in other words, your SAS HBA cannot open a command dialogue with your disk. I'd suspect backplane or cabling issues, as an internal disk failure would be reported by the target disk with its own error code. In this case your HBA can't even talk to it properly. Is sdah the partner mpath device to sdef? Or is sdah a second failing disk interface? Looking at this, I don't think your hardware is deploy-ready.

--Jeff

On Wed, May 10, 2023 at 9:29 AM Jane Liu via lustre-discuss wrote:

Hi,

We recently attempted to add several new OSS servers (RHEL 8.7 and Lustre 2.15.2). While creating new OSTs, I noticed that mdstat reported some disk failures after the mkfs, even though the disks were functional before the mkfs command. Our hardware admins managed to resolve the mdstat issue and restore the disks to normal operation. However, when I ran the OST mount command (while the network had a problem and the mount command timed out), similar problems occurred, and several disks were kicked out. The relevant /var/log/messages are provided below. This problem was consistent across all our OSS servers. Any insights into the possible cause would be appreciated.

Jane

-

May 9 13:33:15 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro
May 9 13:33:15 sphnxoss47 systemd[1]: tmp-mntmirJ5z.mount: Succeeded.
May 9 13:33:16 sphnxoss47 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 72, npartitions: 2
May 9 13:33:16 sphnxoss47 kernel: alg: No test for adler32 (adler32-zlib)
May 9 13:33:16 sphnxoss47 kernel: Key type ._llcrypt registered
May 9 13:33:16 sphnxoss47 kernel: Key type .llcrypt registered
May 9 13:33:16 sphnxoss47 kernel: Lustre: Lustre: Build Version: 2.15.2
May 9 13:33:16 sphnxoss47 kernel: LNet: Added LNI 169.254.1.2@tcp [8/256/0/180]
May 9 13:33:16 sphnxoss47 kernel: LNet: Accept secure, port 988
May 9 13:33:17 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
May 9 13:33:17 sphnxoss47 kernel: Lustre: sphnx01-OST0244-osd: enabled 'large_dir' feature on device /dev/md0
May 9 13:33:25 sphnxoss47 systemd-logind[8609]: New session 7 of user root.
May 9 13:33:25 sphnxoss47 systemd[1]: Started Session 7 of user root.
May 9 13:34:36 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0244: cannot register this server with the MGS: rc = -110. Is the MGS running?
May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0244
May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0244 not registered
May 9 13:34:39 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0244 complete
May 9 13:34:39 sphnxoss47 kernel: LustreError: 45314:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount : rc = -110
May 9 13:34:40 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
May 9 13:34:40 sphnxoss47 systemd[1]: tmp-mntXT85fz.mount: Succeeded.
May 9 13:34:41 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors
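As referenced above, a minimal sketch of pinning LNet to the intended interface through the modprobe file; the interface name eno1 is a placeholder for the real data NIC:

# /etc/modprobe.d/lustre.conf
options lnet networks="tcp0(eno1)"

# with Lustre targets unmounted, reload LNet and confirm the NID
lustre_rmmod
modprobe lnet
lctl network up
lctl list_nids

The "LNet: Added LNI 169.254.1.2@tcp" line in the quoted logs is the giveaway: 169.254.x.x is a link-local address, which is what the OS-to-iDRAC pass-through interface receives.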
Re: [lustre-discuss] Disk failures triggered during OST creation and mounting on OSS Servers
Hi Jeff,

Thanks for your response. We discovered later that the network issues originating from the iDRAC IP were causing the SAS driver to hang or time out when trying to access the drives, which resulted in the drives being kicked out. Once we resolved this issue, both the mkfs and mount operations started working fine.

Thanks,
Jane

On 2023-05-10 12:43, Jeff Johnson wrote:

Jane,

You're having hardware errors. The codes in those mpt3sas errors decode as "PL_LOGINFO_SUB_CODE_OPEN_FAILURE_ORR_TIMEOUT"; in other words, your SAS HBA cannot open a command dialogue with your disk. I'd suspect backplane or cabling issues, as an internal disk failure would be reported by the target disk with its own error code. In this case your HBA can't even talk to it properly. Is sdah the partner mpath device to sdef? Or is sdah a second failing disk interface? Looking at this, I don't think your hardware is deploy-ready.

--Jeff

On Wed, May 10, 2023 at 9:29 AM Jane Liu via lustre-discuss wrote:

Hi,

We recently attempted to add several new OSS servers (RHEL 8.7 and Lustre 2.15.2). While creating new OSTs, I noticed that mdstat reported some disk failures after the mkfs, even though the disks were functional before the mkfs command. Our hardware admins managed to resolve the mdstat issue and restore the disks to normal operation. However, when I ran the OST mount command (while the network had a problem and the mount command timed out), similar problems occurred, and several disks were kicked out. The relevant /var/log/messages are provided below. This problem was consistent across all our OSS servers. Any insights into the possible cause would be appreciated.

Jane

-

May 9 13:33:15 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro
May 9 13:33:15 sphnxoss47 systemd[1]: tmp-mntmirJ5z.mount: Succeeded.
May 9 13:33:16 sphnxoss47 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 72, npartitions: 2
May 9 13:33:16 sphnxoss47 kernel: alg: No test for adler32 (adler32-zlib)
May 9 13:33:16 sphnxoss47 kernel: Key type ._llcrypt registered
May 9 13:33:16 sphnxoss47 kernel: Key type .llcrypt registered
May 9 13:33:16 sphnxoss47 kernel: Lustre: Lustre: Build Version: 2.15.2
May 9 13:33:16 sphnxoss47 kernel: LNet: Added LNI 169.254.1.2@tcp [8/256/0/180]
May 9 13:33:16 sphnxoss47 kernel: LNet: Accept secure, port 988
May 9 13:33:17 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
May 9 13:33:17 sphnxoss47 kernel: Lustre: sphnx01-OST0244-osd: enabled 'large_dir' feature on device /dev/md0
May 9 13:33:25 sphnxoss47 systemd-logind[8609]: New session 7 of user root.
May 9 13:33:25 sphnxoss47 systemd[1]: Started Session 7 of user root.
May 9 13:34:36 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0244: cannot register this server with the MGS: rc = -110. Is the MGS running?
May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0244
May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0244 not registered
May 9 13:34:39 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0244 complete
May 9 13:34:39 sphnxoss47 kernel: LustreError: 45314:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount : rc = -110
May 9 13:34:40 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
May 9 13:34:40 sphnxoss47 systemd[1]: tmp-mntXT85fz.mount: Succeeded.
May 9 13:34:41 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
May 9 13:34:41 sphnxoss47 kernel: Lustre: sphnx01-OST0245-osd: enabled 'large_dir' feature on device /dev/md1
May 9 13:36:00 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0245: cannot register this server with the MGS: rc = -110. Is the MGS running?
May 9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
May 9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0245
May 9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0245 not registered
May 9 13:36:08 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0245 complete
May 9 13:36:08 sphnxoss47 kernel: LustreError: 46127:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount : rc = -110
May 9 13:36:08 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
May 9 13:36:08 sphnxoss47 systemd[1]: tmp-mnt17IOaq.mount: Succeeded.
May 9 13:36:09
[lustre-discuss] Disk failures triggered during OST creation and mounting on OSS Servers
Hi,

We recently attempted to add several new OSS servers (RHEL 8.7 and Lustre 2.15.2). While creating new OSTs, I noticed that mdstat reported some disk failures after the mkfs, even though the disks were functional before the mkfs command. Our hardware admins managed to resolve the mdstat issue and restore the disks to normal operation. However, when I ran the OST mount command (while the network had a problem and the mount command timed out), similar problems occurred, and several disks were kicked out. The relevant /var/log/messages are provided below. This problem was consistent across all our OSS servers. Any insights into the possible cause would be appreciated.

Jane

-

May 9 13:33:15 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro
May 9 13:33:15 sphnxoss47 systemd[1]: tmp-mntmirJ5z.mount: Succeeded.
May 9 13:33:16 sphnxoss47 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 72, npartitions: 2
May 9 13:33:16 sphnxoss47 kernel: alg: No test for adler32 (adler32-zlib)
May 9 13:33:16 sphnxoss47 kernel: Key type ._llcrypt registered
May 9 13:33:16 sphnxoss47 kernel: Key type .llcrypt registered
May 9 13:33:16 sphnxoss47 kernel: Lustre: Lustre: Build Version: 2.15.2
May 9 13:33:16 sphnxoss47 kernel: LNet: Added LNI 169.254.1.2@tcp [8/256/0/180]
May 9 13:33:16 sphnxoss47 kernel: LNet: Accept secure, port 988
May 9 13:33:17 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
May 9 13:33:17 sphnxoss47 kernel: Lustre: sphnx01-OST0244-osd: enabled 'large_dir' feature on device /dev/md0
May 9 13:33:25 sphnxoss47 systemd-logind[8609]: New session 7 of user root.
May 9 13:33:25 sphnxoss47 systemd[1]: Started Session 7 of user root.
May 9 13:34:36 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0244: cannot register this server with the MGS: rc = -110. Is the MGS running?
May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0244
May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0244 not registered
May 9 13:34:39 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0244 complete
May 9 13:34:39 sphnxoss47 kernel: LustreError: 45314:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount : rc = -110
May 9 13:34:40 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
May 9 13:34:40 sphnxoss47 systemd[1]: tmp-mntXT85fz.mount: Succeeded.
May 9 13:34:41 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
May 9 13:34:41 sphnxoss47 kernel: Lustre: sphnx01-OST0245-osd: enabled 'large_dir' feature on device /dev/md1
May 9 13:36:00 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0245: cannot register this server with the MGS: rc = -110. Is the MGS running?
May 9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
May 9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0245
May 9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0245 not registered
May 9 13:36:08 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0245 complete
May 9 13:36:08 sphnxoss47 kernel: LustreError: 46127:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount : rc = -110
May 9 13:36:08 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
May 9 13:36:08 sphnxoss47 systemd[1]: tmp-mnt17IOaq.mount: Succeeded.
May 9 13:36:09 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc

It just repeats for all of the md raids, then the errors start and the drive fails and is disabled:

May 9 13:44:31 sphnxoss47 kernel: LustreError: 48069:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount : rc = -110
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12),
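For context, a minimal sketch of the format-and-mount step that was timing out; the fsname and index come from the logs (OST0244 is index 580 decimal), while the MGS NID, device, and mount point are placeholders:

mkfs.lustre --ost --fsname=sphnx01 --index=580 --mgsnode=<mgs-nid>@tcp /dev/md0
mount -t lustre /dev/md0 /mnt/ost0244

rc = -110 is ETIMEDOUT: the target could not reach the MGS to register, which is consistent with the NID being bound to the wrong (iDRAC) interface as described in the replies earlier in this digest.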
Re: [lustre-discuss] Missing Files in /proc/fs/lustre after Upgrading to Lustre 2.15.X
Hi Andreas,

Thanks. This helps.

Jane

On 2023-05-04 20:28, Andreas Dilger wrote:

On May 4, 2023, at 16:43, Jane Liu via lustre-discuss wrote:

Hi,

We previously had a monitoring tool in Lustre 2.12.X that relied on files located under /proc/fs/lustre for gathering metrics. However, after upgrading our system to version 2.15.2, we noticed that at least five files previously found under /proc/fs/lustre are no longer present. Here is a list of these files as an example:

/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/brw_stats
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/kbytestotal
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/kbytesfree
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/filestotal
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/filesfree

We have been unable to locate these files in the new version. We can still obtain size information using the following commands:

lctl get_param obdfilter.*.kbytestotal
lctl get_param obdfilter.*.kbytesfree
lctl get_param obdfilter.*.filestotal
lctl get_param obdfilter.*.filesfree

However, we are unsure how to access the information previously available in the brw_stats file. Any guidance or suggestions would be greatly appreciated.

You've already partially answered your own question - the parameters for "lctl get_param" are under "osd-ldiskfs.*.{brw_stats,kbytes*,files*}" and not "obdfilter.*.*", but they have (mostly) moved from /proc/fs/lustre/osd-ldiskfs to /sys/fs/lustre/osd-ldiskfs. In the case of brw_stats they are under /sys/kernel/debug/lustre/osd-ldiskfs.

These stats actually moved from obdfilter to osd-ldiskfs back in Lustre 2.4 when the ZFS backend was added, and a symlink has been kept until now for compatibility. That means your monitoring tool should still work with any modern Lustre version if you change the path. The move of brw_stats to /sys/kernel/debug/lustre was mandated by the upstream kernel and only happened in 2.15.0.

Cheers, Andreas

--
Andreas Dilger
Lustre Principal Architect
Whamcloud
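Putting the pointers above together, a minimal sketch of the updated accesses for an ldiskfs backend, reusing the fsname-OST0078 example from the original post:

lctl get_param osd-ldiskfs.*.kbytestotal osd-ldiskfs.*.kbytesfree
lctl get_param osd-ldiskfs.*.filestotal osd-ldiskfs.*.filesfree
lctl get_param osd-ldiskfs.*.brw_stats

# or read the new locations directly
cat /sys/fs/lustre/osd-ldiskfs/fsname-OST0078/kbytestotal
cat /sys/kernel/debug/lustre/osd-ldiskfs/fsname-OST0078/brw_stats   # debugfs paths require root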
[lustre-discuss] Missing Files in /proc/fs/lustre after Upgrading to Lustre 2.15.X
Hi,

We previously had a monitoring tool in Lustre 2.12.X that relied on files located under /proc/fs/lustre for gathering metrics. However, after upgrading our system to version 2.15.2, we noticed that at least five files previously found under /proc/fs/lustre are no longer present. Here is a list of these files as an example:

/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/brw_stats
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/kbytestotal
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/kbytesfree
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/filestotal
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/filesfree

We have been unable to locate these files in the new version. We can still obtain size information using the following commands:

lctl get_param obdfilter.*.kbytestotal
lctl get_param obdfilter.*.kbytesfree
lctl get_param obdfilter.*.filestotal
lctl get_param obdfilter.*.filesfree

However, we are unsure how to access the information previously available in the brw_stats file. Any guidance or suggestions would be greatly appreciated.

Thanks,
Jane
Re: [lustre-discuss] question mark when listing file after the upgrade
Hi Andreas,

Thank you so much for the information provided. Without it, we might have struggled to find a solution for an extended period. I attempted to rebuild the OI using LFSCK, but experienced hang-ups and Lustre crashes while executing "lctl lfsck_start". Later, we were able to address the OI issue by using the "-o resetoi" option when mounting the MDT:

mount -o resetoi /mdt00

After resolving the OI issue, I was able to run LFSCK to conduct a comprehensive scan of the MDT.

Thanks again,
Jane

On 2023-05-03 14:52, Andreas Dilger wrote:

This looks like https://jira.whamcloud.com/browse/LU-16655 causing problems after the upgrade from 2.12.x to 2.15.[012] breaking the Object Index files. A patch for this has already landed on b2_15 and will be included in 2.15.3.

If you've hit this issue, then you need to back up/delete the OI files (off of Lustre) and run OI Scrub to rebuild them. I believe the OI Scrub/rebuild is described in the Lustre Manual.

Cheers, Andreas

On May 3, 2023, at 09:30, Colin Faber via lustre-discuss wrote:

Hi, What does your client log indicate? (dmesg / syslog)

On Wed, May 3, 2023, 7:32 AM Jane Liu via lustre-discuss wrote:

Hello,

I'm writing to ask for your help with an issue we observed after a major upgrade of a large Lustre system from RHEL7 + 2.12.9 to RHEL8 + 2.15.2. Basically, we preserved the MDT disk (a vdisk on a VM) and all the OST disks (JBOD) from the RHEL7 systems, reinstalled the OS as RHEL8, and then attached the preserved disks to the RHEL8 systems. However, I met an issue after the OS upgrade and Lustre installation, which I believe is related to metadata.

The old MDS was a virtual machine, and the MDT vdisk was preserved during the upgrade. When a new VM was created with the same hostname and IP, the preserved MDT vdisk was attached to it. Everything seemed fine initially. However, after the client mount completed, the file listing displayed question marks, as shown below:

[root@experimds01 ~]# mount -t lustre 11.22.33.44@tcp:/experi01 /mntlustre/
[root@experimds01 ~]# cd /mntlustre/
[root@experimds01 mntlustre]# ls -l
ls: cannot access 'experipro': No such file or directory
ls: cannot access 'admin': No such file or directory
ls: cannot access 'test4': No such file or directory
ls: cannot access 'test3': No such file or directory
total 0
d? ? ? ? ?? admin
d? ? ? ? ?? experipro
-? ? ? ? ?? test3
-? ? ? ? ?? test4

I shut down the MDT and ran "e2fsck -p /dev/mapper/experimds01-experimds01". It reported "primary superblock features different from backup, check forced."

[root@experimds01 ~]# e2fsck -p /dev/mapper/experimds01-experimds01
experi01-MDT primary superblock features different from backup, check forced.
experi01-MDT: 9493348/429444224 files (0.5% non-contiguous), 109369520/268428864 blocks

Running e2fsck again showed that the filesystem was clean.

[root@experimds01 /]# e2fsck -p /dev/mapper/experimds01-experimds01
experi01-MDT: clean, 9493378/429444224 files, 109369610/268428864 blocks

However, the issue persisted: the file listing continued to display question marks. Do you have any idea what could be causing this problem and how to fix it? By the way, I have an e2image backup of the MDT from the RHEL7 system, just in case we need to fix it using the backup. Also, after the upgrade, the command "lfs df" shows that all OSTs and the MDT are fine.

Thank you in advance for your assistance.
Best regards,
Jane
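For the archive, a minimal sketch of the recovery sequence described in this thread; the device path is the one from the quoted output, while the mount point and the MDT index (0000) are assumptions, since both are truncated above:

mount -t lustre -o resetoi /dev/mapper/experimds01-experimds01 /mdt00   # rebuild the OI files at mount time
lctl lfsck_start -M experi01-MDT0000 -t all                             # full LFSCK scan afterwards
lctl get_param osd-ldiskfs.experi01-MDT0000.oi_scrub                    # monitor OI scrub status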
[lustre-discuss] question mark when listing file after the upgrade
Hello,

I'm writing to ask for your help with an issue we observed after a major upgrade of a large Lustre system from RHEL7 + 2.12.9 to RHEL8 + 2.15.2. Basically, we preserved the MDT disk (a vdisk on a VM) and all the OST disks (JBOD) from the RHEL7 systems, reinstalled the OS as RHEL8, and then attached the preserved disks to the RHEL8 systems. However, I met an issue after the OS upgrade and Lustre installation, which I believe is related to metadata.

The old MDS was a virtual machine, and the MDT vdisk was preserved during the upgrade. When a new VM was created with the same hostname and IP, the preserved MDT vdisk was attached to it. Everything seemed fine initially. However, after the client mount completed, the file listing displayed question marks, as shown below:

[root@experimds01 ~]# mount -t lustre 11.22.33.44@tcp:/experi01 /mntlustre/
[root@experimds01 ~]# cd /mntlustre/
[root@experimds01 mntlustre]# ls -l
ls: cannot access 'experipro': No such file or directory
ls: cannot access 'admin': No such file or directory
ls: cannot access 'test4': No such file or directory
ls: cannot access 'test3': No such file or directory
total 0
d? ? ? ? ?? admin
d? ? ? ? ?? experipro
-? ? ? ? ?? test3
-? ? ? ? ?? test4

I shut down the MDT and ran "e2fsck -p /dev/mapper/experimds01-experimds01". It reported "primary superblock features different from backup, check forced."

[root@experimds01 ~]# e2fsck -p /dev/mapper/experimds01-experimds01
experi01-MDT primary superblock features different from backup, check forced.
experi01-MDT: 9493348/429444224 files (0.5% non-contiguous), 109369520/268428864 blocks

Running e2fsck again showed that the filesystem was clean.

[root@experimds01 /]# e2fsck -p /dev/mapper/experimds01-experimds01
experi01-MDT: clean, 9493378/429444224 files, 109369610/268428864 blocks

However, the issue persisted: the file listing continued to display question marks. Do you have any idea what could be causing this problem and how to fix it? By the way, I have an e2image backup of the MDT from the RHEL7 system, just in case we need to fix it using the backup. Also, after the upgrade, the command "lfs df" shows that all OSTs and the MDT are fine.

Thank you in advance for your assistance.

Best regards,
Jane
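Since an e2image backup is mentioned above, a minimal sketch of how such a metadata-only backup is typically taken and inspected; the output file name is hypothetical, and the MDT device must be unmounted when the image is made:

e2image /dev/mapper/experimds01-experimds01 /root/experi01-mdt.e2i   # saves filesystem metadata only, no file data
debugfs -i /root/experi01-mdt.e2i                                    # examine the saved metadata from the image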