Re: [lustre-discuss] lustre-discuss Digest, Vol 213, Issue 10

2023-12-07 Thread John Bauer
/log/messages:Dec  6 13:50:17 mds2 kernel: LNetError: 
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
similar message

/var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
(10.67.178.25@tcp) recovery failed with -110

/var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
similar message

/var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
(10.67.178.25@tcp) recovery failed with -110

/var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 
11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
similar message

/var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 
3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
(10.67.176.25@tcp) recovery failed with -111

/var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 
3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 3 previous 
similar messages

/var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 
3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
(10.67.176.25@tcp) recovery failed with -111

/var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 
3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 8 previous 
similar messages

/var/log/messages:Dec  6 15:02:14 mds2 kernel: LNetError: 
3817248:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
(10.67.176.25@tcp) recovery failed with -111


Regards,
Qiulan
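
(For reference, -110 and -111 are the kernel errno values ETIMEDOUT and
ECONNREFUSED, i.e. the recovery pings to those peer NIs are timing out or
being refused. A minimal sketch, assuming the NIDs shown really do belong to
decommissioned clients, of how one might inspect and then remove the stale
peer entries with lnetctl; option names should be verified against the
lnetctl version on your release:)

    # show what LNet currently knows about the failing peer NI
    lnetctl peer show --nid 10.67.178.25@tcp -v

    # if the client is permanently gone, drop the peer so recovery
    # stops retrying it
    lnetctl peer del --prim_nid 10.67.178.25@tcp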

[Attachment: pfe27_allOSC_cached.png (image/png, 12394 bytes) -
<http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20231207/ce8919f0/attachment.png>]


[lustre-discuss] Lustre caching and NUMA nodes

2023-12-07 Thread John Bauer

Peter,

A delayed reply to one more of your questions, "What makes you think
"lustre" is doing that?", as I had to make another run and gather OSC
stats on all the Lustre file systems mounted on the host I run dd on.


This host has 12 Lustre file systems comprising 507 OSTs. While dd was
running, I instrumented the amount of cached data associated with all
507 OSCs. That is reflected in the bottom frame of the image below.
Note that in the top frame there was always about 5GB of free memory
and 50GB of cached data. I believe it has to be a Lustre issue, as the
Linux buffer cache has no knowledge that a page is a Lustre page. How
is it that every OSC, on all 12 file systems on the host, has its
memory dropped to 0, yet all the other 50GB of cached data on the host
remains? It's as though drop_caches is being run on only the Lustre
file systems. My googling finds no such feature in drop_caches that
would allow file-system-specific dropping. Is there some tunable that
gives Lustre pages a higher potential for eviction than other cached data?
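
(One way to see which layer is dropping the pages is to sample the per-OSC
and per-mount cache counters while dd runs, alongside the generic VM
writeback knobs. This is only a sketch; the lctl parameter names below are
what I believe recent 2.15-era clients expose, so check them on your build:)

    # per-mount client cache limit and usage
    lctl get_param llite.*.max_cached_mb

    # per-OSC cached/busy page counts (one entry per OST connection)
    lctl get_param osc.*.osc_cached_mb

    # generic VM writeback thresholds that also govern Lustre dirty pages
    sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs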


Another subtle point of interest: note that dd writing resumes, as
reflected in the growth of the cached data for its 8 OSTs, before all
the other OSCs have finished dumping. This is most visible around 2.1
seconds into the run. Also different is that this dumping phenomenon
happened 3 times in the course of a 10-second run, instead of just once
as in the previous run I was referencing, costing this dd run 1.2 seconds.


John


On 12/6/23 14:24, lustre-discuss-requ...@lists.lustre.org wrote:


Today's Topics:

1. Coordinating cluster start and shutdown? (Jan Andersen)
2. Re: Lustre caching and NUMA nodes (Peter Grandi)
3. Re: Coordinating cluster start and shutdown?
   (Bertschinger, Thomas Andrew Hjorth)
4. Lustre server still try to recover the lnet reply to the
   depreciated clients (Huang, Qiulan)


--

Message: 1
Date: Wed, 6 Dec 2023 10:27:11 +
From: Jan Andersen
To: lustre
Subject: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID:<696fac02-df18-4fe1-967c-02c3bca42...@comind.io>
Content-Type: text/plain; charset=UTF-8; format=flowed

Are there any tools for coordinating the start and shutdown of a Lustre
filesystem, so that the OSS systems don't attempt to mount disks before the
MGT and MDT are online?


--

Message: 2
Date: Wed, 6 Dec 2023 12:40:54 +
From:p...@lustre.list.sabi.co.uk  (Peter Grandi)
To: list Lustre discussion
Subject: Re: [lustre-discuss] Lustre caching and NUMA nodes
Message-ID:<25968.27606.536270.208...@petal.ty.sabi.co.uk>
Content-Type: text/plain; charset=iso-8859-1


I have an OSC caching question. I am running a dd process
which writes an 8GB file. The file is on Lustre, striped
8x1M.

How the Lustre instance servers store the data may not have a
huge influence on what happens in the client's system buffer
cache.
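
(For concreteness, a layout and write like the one described above could be
set up with something like the following; the file path is made up:)

    lfs setstripe -c 8 -S 1M /mnt/lustre/ddtest          # 8 stripes, 1 MiB stripe size
    dd if=/dev/zero of=/mnt/lustre/ddtest bs=1M count=8192   # 8 GiB write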


This is run on a system that has 2 NUMA nodes (i.e. CPU sockets).
[...] Why does Lustre go to the trouble of dumping node1 and
then not use node1's memory, when there was always plenty of
free memory on node0?

What makes you think "lustre" is doing that?

Are you aware of the values of the flusher settings such as
'dirty_bytes', 'dirty_ratio', 'dirty_expire_centisecs'?

Have you considered looking at NUMA policies e.g. as described
in 'man numactl'?
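
(A quick way to check both, as a sketch; the dd target path is hypothetical:)

    # current flusher/writeback settings
    sysctl vm.dirty_bytes vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs

    # NUMA topology and the policy the current shell runs under
    numactl --hardware
    numactl --show

    # example: pin dd's CPUs and allocations to node 0
    numactl --cpunodebind=0 --membind=0 dd if=/dev/zero of=/mnt/lustre/ddtest bs=1M count=8192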

Also, while you surely know better, I usually try to avoid
buffering large amounts of to-be-written data in RAM (whether on
the OSC or the OSS), and to my taste 8GiB "in-flight" is large.


--

Message: 3
Date: Wed, 6 Dec 2023 16:00:38 +
From: "Bertschinger, Thomas Andrew Hjorth"
To: Jan Andersen, lustre

Subject: Re: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID:
Content-Type: text/plain; charset="iso-8859-1"

Hello Jan,

You can use the Pacemaker / Corosync high-availability software stack for this: 
specifically, ordering constraints [1] can be used.
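
(As a sketch, an ordering chain in pcs might look like the following, assuming
Filesystem resources named mgt, mdt0 and ost0 already exist in the cluster;
check pcs(8) for the exact syntax of your release:)

    # start the MGT before the MDT, and the MDT before the OSTs
    pcs constraint order start mgt then start mdt0
    pcs constraint order start mdt0 then start ost0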

Unfortunately, Pacemaker is probably over-the-top if you don't need HA -- its 
configuration is complex and difficult to get right, and it significantly 
complicates system administration. One downside of Pacemaker is that it is not 
easy to decouple the Pacemaker service from the Lustre services, meaning if you 
stop the Pacemaker service, it 

Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

2023-12-07 Thread Audet, Martin via lustre-discuss
Thanks Andreas and Aurélien for your answers. They make us confident that we
are on the right track for our cluster update!


Also, I have noticed that 2.15.4-RC1 was released two weeks ago; can we expect
2.15.4 to be ready by the end of the year?


Regards,


Martin


Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

2023-12-07 Thread Andreas Dilger via lustre-discuss
Aurelien,
there have been a number of questions about this message.

> Lustre: lustrevm-OST0001: deleting orphan objects from 0x0:227 to 0x0:513

This is not marked LustreError, so it is just an advisory message.

This can sometimes be useful for debugging issues related to MDT->OST
connections. It is already printed at the D_INFO level, which is the lowest
printk level available. Would rewording the message make it clearer that this
is a normal situation that occurs while the MDT and OST are establishing
connections?

Cheers, Andreas

On Dec 5, 2023, at 02:13, Aurelien Degremont  wrote:
> 
> > Now what is the messages about "deleting orphaned objects" ? Is it normal 
> > also ?
> 
> Yeah, this is kind of normal, and I'm even thinking we should lower the 
> message verbosity...
> Andreas, do you agree that could become a simple CDEBUG(D_HA, ...) instead of 
> LCONSOLE(D_INFO, ...)?
> 
> 
> Aurélien
> 
> Audet, Martin wrote on lundi 4 décembre 2023 20:26:
>> Hello Andreas,
>> 
>> Thanks for your response. Happy to learn that the "errors" I was reporting 
>> aren't really errors.
>> 
>> I now understand that the 3 messages about LDISKFS were only normal messages 
>> resulting from mounting the file systems (I was fooled by vim showing this 
>> message in red, like important error messages, but this is simply a false 
>> positive result of its syntax highlight rules probably triggered by the 
>> "errors=" string which is only a mount option...).
>> 
>> Now, what are the messages about "deleting orphan objects"? Are they normal
>> too? We always boot the client VMs after the server is ready, and we shut
>> down the clients cleanly well before the vlfs Lustre server is (also
>> cleanly) shut down. Is it a sign of corruption? How can this happen if
>> shutdowns are clean?
>> 
>> Thanks (and sorry for the beginner's questions),
>> 
>> Martin
>> 
>> Andreas Dilger  wrote on December 4, 2023 5:25 AM:
>>> It wasn't clear from your mail which message(s) you are concerned about.
>>> These look like normal mount messages to me.
>>> 
>>> The "error" is pretty normal, it just means there were multiple services 
>>> starting at once and one wasn't yet ready for the other. 
>>> 
>>>  LustreError: 137-5: lustrevm-MDT_UUID: not available for 
>>> connect
>>>  from 0@lo (no target). If you are running an HA pair check that 
>>> the target
>>> is mounted on the other server.
>>> 
>>> It probably makes sense to quiet this message right at mount time to avoid 
>>> this. 
>>> 
>>> Cheers, Andreas
>>> 
 On Dec 1, 2023, at 10:24, Audet, Martin via lustre-discuss 
  wrote:
 
 
 Hello Lustre community,
 
 Has anyone ever seen messages like these in "/var/log/messages" on a
 Lustre server?
 
 Dec  1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
 Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with 
 ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
 Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with 
 ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
 Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with 
 ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
 Dec  1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: 
 not available for connect from 0@lo (no target). If you are running an HA 
 pair check that the target is mounted on the other server.
 Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery 
 not enabled, recovery window 300-900
 Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan 
 objects from 0x0:227 to 0x0:513
 
 This happens on every boot of a Lustre server named vlfs (an AlmaLinux 8.9
 VM hosted on VMware) playing the role of both MGS and OSS (it hosts an
 MDT and two OSTs using "virtual" disks). We chose LDISKFS and not ZFS. Note
 that this happens at every boot, well before the clients (AlmaLinux 9.3 or
 8.9 VMs) connect, and even when the clients are powered off. The network
 connecting the clients and the server is a "virtual" 10GbE network (of
 course there is no virtual IB). Also, we had the same messages previously
 with Lustre 2.15.3 using an AlmaLinux 8.8 server and AlmaLinux 8.8 / 9.2
 clients (also VMs). Note also that we compile the Lustre RPMs ourselves
 from the sources in the git repository. We also chose to use a patched
 kernel. Our build procedure for the RPMs seems to work well because our
 real cluster runs fine on CentOS 7.9 with Lustre 2.12.9 and IB (MOFED)
 networking.
 
 So has anyone seen these messages?
 
 Are they problematic? If so, how do we avoid them?
 
 We would like to make sure our small test system using VMs works well
 before we upgrade our real cluster.
 
 Thanks in advance!