[lustre-discuss] Signing important git commits and files (RPMs, DEBs) distributed on the whamcloud repository ?

2024-01-26 Thread Audet, Martin via lustre-discuss
Hello,


It would be great if important commits, especially those corresponding to
release tags, were signed with long-term keys (e.g. GPG, SSH or X.509; git
supports all of these formats), with the corresponding public keys published
on the Lustre web site and their fingerprints announced on this mailing list,
for example. This would give every user better confidence in the integrity of
the associated code and would follow the end-to-end principle more closely,
since the private keys would remain in the developers' hands.
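
For example, a release tag could be signed and verified roughly like this
(only a sketch; the tag name, key ID and key file name are placeholders):

    # maintainer side: create a GPG-signed annotated tag with a long-term key
    git tag -s 2.15.4 -u <KEYID> -m "Lustre 2.15.4"
    git push origin 2.15.4

    # user side: import the published public key, check its fingerprint, verify
    gpg --import lustre-signing-key.asc
    gpg --fingerprint <KEYID>
    git verify-tag 2.15.4        # or: git tag -v 2.15.4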


The same applies to the RPM and DEB packages distributed on the whamcloud
repository (https://downloads.whamcloud.com/public/lustre/), except that the
choice of key system is limited to GPG in that case. As you know, it is common
practice to associate a public key with every remote repository so that the
authenticity of every downloaded package can be verified before installation
(but this is not yet done for this repository).
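
On the user side this would follow the usual pattern, something like the
following (a sketch only; the gpgkey location is hypothetical since no key is
published yet, and the baseurl is just an example path):

    # /etc/yum.repos.d/lustre.repo (sketch)
    [lustre-server]
    name=Lustre server packages
    baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el8/server/
    enabled=1
    gpgcheck=1
    gpgkey=https://downloads.whamcloud.com/public/RPM-GPG-KEY-whamcloud   # hypothetical

    # manual check of a downloaded package
    rpm --import RPM-GPG-KEY-whamcloud      # hypothetical key file
    rpm -K lustre-2.15.4-1.el8.x86_64.rpm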


Performing downloads or "git" access over "https" is better than nothing, but
the integrity guarantee is much stronger when signatures are made closer to
the original authors.

Signing keys could even be held on hardware devices such as Yubikeys, as this
would be both very secure and convenient for developers.
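
With the private key kept on such a token, the developer-side setup would be
essentially the following (a sketch; the key ID is a placeholder):

    gpg --card-status                           # confirm the signing key is on the token
    git config --global user.signingkey <KEYID>
    git config --global commit.gpgsign true     # sign every commit
    git config --global tag.gpgSign true        # sign annotated tags by default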


Please consider this suggestion; I am sure it would satisfy many users.


Thanks,


Martin Audet


[lustre-discuss] Which MOFED version is suggested for Lustre (especially are Lustre 2.15.x and MOFED 23.10-1.1.9.0-LTS compatible) ?

2023-12-19 Thread Audet, Martin via lustre-discuss
Hello Lustre community,


We are in the process of upgrading our (small) cluster. We would like to use 
RHEL/AlmaLinux 9.3 on our compute nodes (Lustre clients) and RHEL/AlmaLinux 8.9 
on our file server node (we have a single server).


We compile all of our Lustre RPMs (client, server, and the patched server
kernel) ourselves from the Lustre git repository. We are currently testing
version 2.15.4-RC1.
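
For reference, our build is essentially the standard sequence from the
lustre-release tree (a rough sketch; the tag name is an example and may not
match the repository's exact tag naming):

    git clone git://git.whamcloud.com/fs/lustre-release.git
    cd lustre-release
    git checkout 2.15.4-RC1              # example tag name
    sh autogen.sh
    ./configure --enable-server          # or --disable-server for client-only RPMs
    make rpms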


We were planning to install MOFED 5.8-3.0.7.0-LTS, but since it refuses to
compile on AlmaLinux 9.3 (mlnxofedinstall --add-kernel-support fails), we
decided to use the more recently released (last week) MOFED 23.10-1.1.9.0-LTS.
The Lustre documentation says that this newer MOFED version is compatible with
Lustre 2.12.9 and 2.15.2, but did NVIDIA actually perform any tests?
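
For what it's worth, what we are attempting is roughly the following (a
sketch; the ofa_kernel path may differ between MOFED releases):

    # rebuild the MOFED kernel modules for the installed kernel
    ./mlnxofedinstall --add-kernel-support

    # then build Lustre against the MOFED (o2ib) stack instead of in-kernel OFED
    ./configure --enable-server --with-o2ib=/usr/src/ofa_kernel/default
    make rpms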


In any case, which MOFED version is appropriate for Lustre 2.15.x? We didn't
find an answer to this question in the documentation (or in the source
repository). Any information or experience with this?

Thanks,

Martin Audet


Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

2023-12-07 Thread Audet, Martin via lustre-discuss
Thanks Andreas and Aurélien for your answers. They make us confident that we
are on the right track for our cluster update!


Also, I noticed that 2.15.4-RC1 was released two weeks ago; can we expect
2.15.4 to be ready by the end of the year?


Regards,


Martin


From: Andreas Dilger 
Sent: December 7, 2023 6:02 AM
To: Aurelien Degremont
Cc: Audet, Martin; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Error messages (ex: not available for connect 
from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1


Aurelien,
there have been a number of questions about this message.

> Lustre: lustrevm-OST0001: deleting orphan objects from 0x0:227 to 0x0:513

This is not marked LustreError, so it is just an advisory message.

This can sometimes be useful for debugging issues related to MDT->OST
connections. It is already printed at the D_INFO level, the lowest printk
level available. Would rewording the message make it clearer that this is a
normal situation when the MDT and OST are establishing connections?

Cheers, Andreas

On Dec 5, 2023, at 02:13, Aurelien Degremont  wrote:
>
> > Now, what about the messages about "deleting orphaned objects"? Is that
> > normal too?
>
> Yeah, this is kind of normal, and I'm even thinking we should lower the
> message verbosity...
> Andreas, do you agree that this could become a simple CDEBUG(D_HA, ...)
> instead of LCONSOLE(D_INFO, ...)?
>
>
> Aurélien
>
>> Audet, Martin wrote on Monday, December 4, 2023 at 20:26:
>> Hello Andreas,
>>
>> Thanks for your response. Happy to learn that the "errors" I was reporting 
>> aren't really errors.
>>
>> I now understand that the 3 messages about LDISKFS were only normal
>> messages resulting from mounting the file systems (I was fooled by vim
>> showing these messages in red, like important error messages, but this is
>> simply a false positive of its syntax highlighting rules, probably
>> triggered by the "errors=" string, which is only a mount option...).
>>
>> Now, what about the messages about "deleting orphaned objects"? Are they
>> normal too? We always boot the client VMs after the server is ready, and we
>> shut down the clients cleanly well before the vlfs Lustre server is (also
>> cleanly) shut down. Is it a sign of corruption? How can this happen if the
>> shutdowns are clean?
>>
>> Thanks (and sorry for the beginner questions),
>>
>> Martin
>>
>> Andreas Dilger  wrote on December 4, 2023 5:25 AM:
>>> It wasn't clear from your mail which message(s) you are concerned about.
>>> These look like normal mount messages to me.
>>>
>>> The "error" is pretty normal, it just means there were multiple services 
>>> starting at once and one wasn't yet ready for the other.
>>>
>>>     LustreError: 137-5: lustrevm-MDT_UUID: not available for connect
>>>     from 0@lo (no target). If you are running an HA pair check that the
>>>     target is mounted on the other server.
>>>
>>> It probably makes sense to quiet this message right at mount time to avoid 
>>> this.
>>>
>>> Cheers, Andreas
>>>
>>>> On Dec 1, 2023, at 10:24, Audet, Martin via lustre-discuss 
>>>>  wrote:
>>>>
>>>> 
>>>> Hello Lustre community,
>>>>
>>>> Has anyone ever seen messages like these in "/var/log/messages" on a
>>>> Lustre server?
>>>>
>>>> Dec  1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
>>>> Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with 
>>>> ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
>>>> Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with 
>>>> ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
>>>> Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with 
>>>> ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
>>>> Dec  1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: 
>>>> not available for connect from 0@lo (no target). If you are running an HA 
>>>> pair check that the target is mounted on the other server.
>>>> Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery 
>>>> not enabled, recovery window 300

Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

2023-12-04 Thread Audet, Martin via lustre-discuss
Hello Andreas,


Thanks for your response. Happy to learn that the "errors" I was reporting 
aren't really errors.


I now understand that the 3 messages about LDISKFS were only normal messages
resulting from mounting the file systems (I was fooled by vim showing these
messages in red, like important error messages, but this is simply a false
positive of its syntax highlighting rules, probably triggered by the "errors="
string, which is only a mount option...).

Now, what about the messages about "deleting orphaned objects"? Are they
normal too? We always boot the client VMs after the server is ready, and we
shut down the clients cleanly well before the vlfs Lustre server is (also
cleanly) shut down. Is it a sign of corruption? How can this happen if the
shutdowns are clean?

Thanks (and sorry for the beginner questions),

Martin


From: Andreas Dilger 
Sent: December 4, 2023 5:25 AM
To: Audet, Martin
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Error messages (ex: not available for connect 
from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1



It wasn't clear from your mail which message(s) you are concerned about.
These look like normal mount messages to me.

The "error" is pretty normal, it just means there were multiple services 
starting at once and one wasn't yet ready for the other.

    LustreError: 137-5: lustrevm-MDT_UUID: not available for connect
    from 0@lo (no target). If you are running an HA pair check that the
    target is mounted on the other server.

It probably makes sense to quiet this message right at mount time to avoid this.

Cheers, Andreas

On Dec 1, 2023, at 10:24, Audet, Martin via lustre-discuss 
 wrote:



Hello Lustre community,


Has anyone ever seen messages like these in "/var/log/messages" on a
Lustre server?


Dec  1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with ordered 
data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with ordered 
data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with ordered 
data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: not 
available for connect from 0@lo (no target). If you are running an HA pair 
check that the target is mounted on the other server.
Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery not 
enabled, recovery window 300-900
Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan objects 
from 0x0:227 to 0x0:513


This happens on every boot of a Lustre server named vlfs (an AlmaLinux 8.9 VM
hosted on VMware) playing the role of both MGS and OSS (it hosts an MDT and
two OSTs on "virtual" disks). We chose LDISKFS and not ZFS. Note that this
happens at every boot, well before the clients (AlmaLinux 9.3 or 8.9 VMs)
connect, and even when the clients are powered off. The network connecting the
clients and the server is a "virtual" 10GbE network (of course there is no
virtual IB). We also had the same messages previously with Lustre 2.15.3 using
an AlmaLinux 8.8 server and AlmaLinux 8.8 / 9.2 clients (also VMs). Note also
that we compile the Lustre RPMs ourselves from the sources in the git
repository, and we chose to use a patched kernel. Our RPM build procedure
seems to work well, because our real cluster runs fine on CentOS 7.9 with
Lustre 2.12.9 and IB (MOFED) networking.

So, has anyone seen these messages?

Are they problematic? If so, how do we avoid them?

We would like to make sure our small test system using VMs works well before we
upgrade our real cluster.

Thanks in advance!

Martin Audet



[lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

2023-12-01 Thread Audet, Martin via lustre-discuss
Hello Lustre community,


Has anyone ever seen messages like these in "/var/log/messages" on a
Lustre server?


Dec  1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with ordered 
data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with ordered 
data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with ordered 
data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: not 
available for connect from 0@lo (no target). If you are running an HA pair 
check that the target is mounted on the other server.
Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery not 
enabled, recovery window 300-900
Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan objects 
from 0x0:227 to 0x0:513


This happens on every boot of a Lustre server named vlfs (an AlmaLinux 8.9 VM
hosted on VMware) playing the role of both MGS and OSS (it hosts an MDT and
two OSTs on "virtual" disks). We chose LDISKFS and not ZFS. Note that this
happens at every boot, well before the clients (AlmaLinux 9.3 or 8.9 VMs)
connect, and even when the clients are powered off. The network connecting the
clients and the server is a "virtual" 10GbE network (of course there is no
virtual IB). We also had the same messages previously with Lustre 2.15.3 using
an AlmaLinux 8.8 server and AlmaLinux 8.8 / 9.2 clients (also VMs). Note also
that we compile the Lustre RPMs ourselves from the sources in the git
repository, and we chose to use a patched kernel. Our RPM build procedure
seems to work well, because our real cluster runs fine on CentOS 7.9 with
Lustre 2.12.9 and IB (MOFED) networking.

So, has anyone seen these messages?

Are they problematic? If so, how do we avoid them?

We would like to make sure our small test system using VMs works well before we
upgrade our real cluster.

Thanks in advance!

Martin Audet



Re: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 2.12.6 to 2.15.3

2023-09-26 Thread Audet, Martin via lustre-discuss
Hello all,


I would appreciate it if the community would give more attention to this
issue, because upgrading from 2.12.x to 2.15.x, two LTS versions, is something
we can expect many cluster admins to attempt in the next few months...


We ourselves plan to upgrade a small Lustre (production) system from 2.12.9 to 
2.15.3 in the next couple of weeks...

After seeing problem reports like this, we are starting to feel a bit nervous...


The documentation for this kind of major update does not seem very specific
to me...


In this document, for example,
https://doc.lustre.org/lustre_manual.xhtml#upgradinglustre , the update
process does not appear very difficult, and there is no mention of using
"tunefs.lustre --writeconf" for this kind of update.


Or am I missing something?


Thanks in advance for providing more tips for this kind of update.


Martin Audet


From: lustre-discuss  on behalf of 
Tung-Han Hsieh via lustre-discuss 
Sent: September 23, 2023 2:20 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 2.12.6 
to 2.15.3



Dear All,

Today we tried to upgrade our Lustre file system from version 2.12.6 to
2.15.3, but after the work we cannot mount the MDT successfully. Our MDT uses
an ldiskfs backend. The upgrade procedure was:

1. Install the new version of e2fsprogs-1.47.0
2. Install Lustre-2.15.3
3. After reboot, run: tunefs.lustre --writeconf /dev/md0

Then, when mounting the MDT, we got the following error messages in dmesg:

===
[11662.434724] LDISKFS-fs (md0): mounted filesystem with ordered data mode. 
Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[11662.584593] Lustre: 3440:0:(scrub.c:189:scrub_file_load()) chome-MDT: 
reset scrub OI count for format change (LU-16655)
[11666.036253] Lustre: MGS: Logs for fs chome were removed by user request.  
All servers must be restarted in order to regenerate the logs: rc = 0
[11666.523144] Lustre: chome-MDT: Imperative Recovery not enabled, recovery 
window 300-900
[11666.594098] LustreError: 3440:0:(mdd_device.c:1355:mdd_prepare()) 
chome-MDD: get default LMV of root failed: rc = -2
[11666.594291] LustreError: 
3440:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -2
[11666.594951] Lustre: Failing over chome-MDT
[11672.868438] Lustre: 3440:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ 
Request sent has timed out for slow reply: [sent 1695492248/real 1695492248]  
req@5dfd9b53 x1777852464760768/t0(0) 
o251->MGC192.168.32.240@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1695492254 ref 
2 fl Rpc:XNQr/0/ rc 0/-1 job:''
[11672.925905] Lustre: server umount chome-MDT complete
[11672.926036] LustreError: 3440:0:(super25.c:183:lustre_fill_super()) llite: 
Unable to mount : rc = -2
[11872.893970] LDISKFS-fs (md0): mounted filesystem with ordered data mode. 
Opts: (null)


Could anyone help solve this problem? Sorry, but it is really urgent.

Thank you very much.

T.H.Hsieh