[lustre-discuss] Signing important git commits and files (RPMs, DEBs) distributed on the whamcloud repository ?
Hello,

It would be great if the important commits, especially those corresponding to tags, were signed using long-term keys (e.g. GPG, SSH or X.509; you have the choice, since git supports several formats), with the corresponding public keys published on the Lustre web site and their fingerprints posted on this mailing list, for example. This would give every user better confidence in the integrity of the associated code and comply more with the end-to-end principle, since the private keys would be kept carefully by the developers.

The same applies to the RPM and DEB packages distributed over the whamcloud repository (https://downloads.whamcloud.com/public/lustre/), except that the choice of key system is limited to GPG in this case. As you know, it is common practice to associate a public key with every remote repository so that the authenticity of every downloaded package can be verified before installation (but this is not yet done for this repository). Performing downloads or "git" access over "https" is better than nothing, but the guarantee of integrity is much stronger when it comes from signatures made closer to the original authors. Signing keys could even be held on hardware devices such as YubiKeys, which would be both very secure and convenient for developers.

Please consider this suggestion, I am sure it would satisfy many users.

Thanks,

Martin Audet

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
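For illustration, signing and verifying a release tag on the git side, and importing a repository signing key on the RPM side, could look roughly like the commands below. The tag name, key ID and key file path are placeholders for this sketch, not actual Lustre project values:

    # developer side: sign the release tag with a long-term GPG key
    git config user.signingkey 0xDEADBEEF          # placeholder key ID
    git tag -s v2.15.4 -m "Lustre 2.15.4 release"  # placeholder tag name

    # user side: verify the tag and inspect the signature
    git verify-tag v2.15.4
    git log --show-signature -1 v2.15.4
    gpg --fingerprint 0xDEADBEEF                   # compare against published fingerprint

    # user side, RPM repository: trust the published repository key so that
    # package signatures are checked before installation
    rpm --import /path/to/RPM-GPG-KEY-lustre       # placeholder key file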
[lustre-discuss] Which MOFED version is suggested for Lustre (especially are Lustre 2.15.x and MOFED 23.10-1.1.9.0-LTS compatible) ?
Hello Lustre community,

We are in the process of upgrading our (small) cluster. We would like to use RHEL/AlmaLinux 9.3 on our compute nodes (Lustre clients) and RHEL/AlmaLinux 8.9 on our file server node (we have a single server). Our Lustre RPMs (client, server and server patched kernel) are all compiled by ourselves from the Lustre git repository. We are currently testing version 2.15.4-RC1.

We were thinking of installing MOFED 5.8-3.0.7.0-LTS, but since it refuses to compile on AlmaLinux 9.3 (mlnxofedinstall --add-kernel-support fails), we decided to use the more recently released (last week) MOFED 23.10-1.1.9.0-LTS. The Lustre documentation says that this latter MOFED version is compatible with Lustre 2.12.9 and 2.15.2, but has NVIDIA really performed any tests? Anyway, which MOFED version is appropriate for Lustre 2.15.x? We didn't find any answer to this question in the documentation (or the source repository).

Any information or experience with that?

Thanks,

Martin Audet

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
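For reference, building the Lustre RPMs against an installed MOFED is typically done by pointing configure at the MOFED kernel sources. A rough sketch follows; the paths are illustrative (MOFED commonly installs its kernel sources under /usr/src/ofa_kernel/default, but this may differ between MOFED releases and distributions):

    # in the Lustre source tree, after MOFED has been installed
    # (e.g. with mlnxofedinstall --add-kernel-support)
    sh autogen.sh
    ./configure --with-linux=/usr/src/kernels/$(uname -r) \
                --with-o2ib=/usr/src/ofa_kernel/default
    make rpms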
Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1
Thanks Andreas and Aurélien for your answers. They make us confident that we are on the right track for our cluster update! Also, I have noticed that 2.15.4-RC1 was released two weeks ago; can we expect 2.15.4 to be ready by the end of the year?

Regards,

Martin

From: Andreas Dilger
Sent: December 7, 2023 6:02 AM
To: Aurelien Degremont
Cc: Audet, Martin; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

Aurelien, there have been a number of questions about this message.

> Lustre: lustrevm-OST0001: deleting orphan objects from 0x0:227 to 0x0:513

This is not marked LustreError, so it is just an advisory message. This can sometimes be useful for debugging issues related to MDT->OST connections. It is already printed with D_INFO level, so the lowest printk level available. Would rewording the message make it more clear that this is a normal situation when the MDT and OST are establishing connections?

Cheers, Andreas

On Dec 5, 2023, at 02:13, Aurelien Degremont wrote:
>
> > Now what are the messages about "deleting orphaned objects"? Is it normal
> > also?
>
> Yeah, this is kind of normal, and I'm even thinking we should lower the
> message verbosity...
> Andreas, do you agree that could become a simple CDEBUG(D_HA, ...) instead of
> LCONSOLE(D_INFO, ...)?
>
> Aurélien
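To make the suggestion above concrete, the change being discussed would look roughly like the sketch below. This is not the actual Lustre source; the variable names and format string are placeholders, and only the LCONSOLE(D_INFO, ...) and CDEBUG(D_HA, ...) macros come from the discussion itself:

    /* today (sketch): the orphan-cleanup message reaches the console */
    LCONSOLE(D_INFO, "%s: deleting orphan objects from %llu to %llu\n",
             obd_name, first_id, last_id);

    /* proposed (sketch): record it only in the Lustre debug log under D_HA */
    CDEBUG(D_HA, "%s: deleting orphan objects from %llu to %llu\n",
           obd_name, first_id, last_id);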
Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1
Hello Andreas,

Thanks for your response. Happy to learn that the "errors" I was reporting aren't really errors.

I now understand that the 3 messages about LDISKFS were only normal messages resulting from mounting the file systems (I was fooled by vim showing these messages in red, like important error messages, but this is simply a false positive of its syntax highlighting rules, probably triggered by the "errors=" string, which is only a mount option...).

Now, what about the messages about "deleting orphaned objects"? Are they normal as well? We always boot the client VMs after the server is ready, and we shut down the clients cleanly well before the vlfs Lustre server is (also cleanly) shut down. Is it a sign of corruption? How can this happen if shutdowns are clean?

Thanks (and sorry for the beginner's questions),

Martin

From: Andreas Dilger
Sent: December 4, 2023 5:25 AM
To: Audet, Martin
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

It wasn't clear from your mail which message(s) you are concerned about. These look like normal mount message(s) to me.

The "error" is pretty normal, it just means there were multiple services starting at once and one wasn't yet ready for the other.

LustreError: 137-5: lustrevm-MDT_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.

It probably makes sense to quiet this message right at mount time to avoid this.

Cheers, Andreas

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
[lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1
Hello Lustre community,

Has anyone ever seen messages like these in "/var/log/messages" on a Lustre server?

Dec 1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery not enabled, recovery window 300-900
Dec 1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan objects from 0x0:227 to 0x0:513

This happens on every boot of a Lustre server named vlfs (an AlmaLinux 8.9 VM hosted on VMware) playing the role of both MGS and OSS (it hosts an MDT and two OSTs using "virtual" disks). We chose LDISKFS and not ZFS. Note that this happens at every boot, well before the clients (AlmaLinux 9.3 or 8.9 VMs) connect, and even when the clients are powered off. The network connecting the clients and the server is a "virtual" 10GbE network (of course there is no virtual IB). We also had the same messages previously with Lustre 2.15.3 using an AlmaLinux 8.8 server and AlmaLinux 8.8 / 9.2 clients (also using VMs).

Note also that we compile the Lustre RPMs ourselves from the sources in the git repository. We also chose to use a patched kernel. Our build procedure for RPMs seems to work well because our real cluster runs fine on CentOS 7.9 with Lustre 2.12.9 and IB (MOFED) networking.

So has anyone seen these messages? Are they problematic? If yes, how do we avoid them? We would like to make sure our small test system using VMs works well before we upgrade our real cluster.

Thanks in advance!

Martin Audet

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Re: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 2.12.6 to 2.15.3
Hello all,

I would appreciate it if the community would give more attention to this issue, because upgrading from 2.12.x to 2.15.x, two LTS versions, is something we can expect many cluster admins will try to do in the next few months... We ourselves plan to upgrade a small Lustre (production) system from 2.12.9 to 2.15.3 in the next couple of weeks... After seeing problem reports like this we are starting to feel a bit nervous... The documentation for this major update does not appear very specific to me... In this document for example, https://doc.lustre.org/lustre_manual.xhtml#upgradinglustre , the update process appears not so difficult and there is no mention of using "tunefs.lustre --writeconf" for this kind of update. Or am I missing something?

Thanks in advance for providing more tips for this kind of update.

Martin Audet

From: lustre-discuss on behalf of Tung-Han Hsieh via lustre-discuss
Sent: September 23, 2023 2:20 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 2.12.6 to 2.15.3

Dear All,

Today we tried to upgrade the Lustre file system from version 2.12.6 to 2.15.3. But after the work, we cannot mount the MDT successfully. Our MDT has an ldiskfs backend. The upgrade procedure was:

1. Install the new version of e2fsprogs-1.47.0
2. Install Lustre-2.15.3
3. After reboot, run: tunefs.lustre --writeconf /dev/md0

Then when mounting the MDT, we got the following error messages in dmesg:

===
[11662.434724] LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[11662.584593] Lustre: 3440:0:(scrub.c:189:scrub_file_load()) chome-MDT: reset scrub OI count for format change (LU-16655)
[11666.036253] Lustre: MGS: Logs for fs chome were removed by user request. All servers must be restarted in order to regenerate the logs: rc = 0
[11666.523144] Lustre: chome-MDT: Imperative Recovery not enabled, recovery window 300-900
[11666.594098] LustreError: 3440:0:(mdd_device.c:1355:mdd_prepare()) chome-MDD: get default LMV of root failed: rc = -2
[11666.594291] LustreError: 3440:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -2
[11666.594951] Lustre: Failing over chome-MDT
[11672.868438] Lustre: 3440:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1695492248/real 1695492248] req@5dfd9b53 x1777852464760768/t0(0) o251->MGC192.168.32.240@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1695492254 ref 2 fl Rpc:XNQr/0/ rc 0/-1 job:''
[11672.925905] Lustre: server umount chome-MDT complete
[11672.926036] LustreError: 3440:0:(super25.c:183:lustre_fill_super()) llite: Unable to mount : rc = -2
[11872.893970] LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: (null)

Could anyone help to solve this problem? Sorry, it is really urgent. Thank you very much.

T.H.Hsieh

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
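For readers following this thread: the "tunefs.lustre --writeconf" step mentioned above corresponds to the procedure the Lustre manual describes for regenerating configuration logs, which touches all targets, not just the MDT. A rough sketch of that procedure is below; the device names and mount points are examples only, not taken from this thread:

    # 1. Unmount everything: clients first, then OSTs, then the MDT/MGS.
    umount /mnt/lustre          # on every client
    umount /mnt/ost0            # on every OSS, for every OST
    umount /mnt/mdt             # on the MDS/MGS

    # 2. Run writeconf on every target, MDT/MGS first, then each OST.
    tunefs.lustre --writeconf /dev/md0    # MDT/MGS
    tunefs.lustre --writeconf /dev/sdb    # each OST

    # 3. Remount in order: MDT/MGS first, then the OSTs, then the clients.
    mount -t lustre /dev/md0 /mnt/mdt
    mount -t lustre /dev/sdb /mnt/ost0
    mount -t lustre mgsnode@tcp:/fsname /mnt/lustre   # on the clients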