Re: [lustre-discuss] Lustre caching and NUMA nodes

2023-12-05 Thread John Bauer

Andreas,

Thanks for the reply.

Client version is 2.14.0_ddn98. Here is the *write_RPCs_in_flight* plot, 
snapshotted every 50 ms.  The maximum for any of the samples for any of 
the OSCs was 1.  No RPCs were in flight while the OSCs were dumping 
memory.  The number following the OSC name in the legend is the sum of 
the *write_RPCs_in_flight* values over all the intervals.  To be honest, 
I have never really looked at the RPCs-in-flight numbers.  I'm running 
as a lowly user, so I don't have access to any of the server data, and 
I have nothing on osd-ldiskfs.*.brw_stats.
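
For anyone wanting to collect the same numbers, here is a minimal sketch 
of a 50 ms sampler.  It assumes "lctl get_param osc.*.rpc_stats" is 
readable by an unprivileged user and that its output contains 
"write RPCs in flight" and "pending write pages" lines; it is an 
illustration rather than the exact script behind the plot:

#!/usr/bin/env python3
# Sketch: sample per-OSC "write RPCs in flight" and "pending write pages"
# every 50 ms from "lctl get_param osc.*.rpc_stats".
import re
import subprocess
import time

INTERVAL = 0.05  # 50 ms

def sample():
    """Return {osc_name: [write_rpcs_in_flight, pending_write_pages]}."""
    out = subprocess.run(["lctl", "get_param", "osc.*.rpc_stats"],
                         capture_output=True, text=True, check=True).stdout
    stats, osc = {}, None
    for line in out.splitlines():
        m = re.match(r"osc\.(\S+)\.rpc_stats=", line)
        if m:
            osc = m.group(1)
            stats[osc] = [0, 0]
        elif osc and line.startswith("write RPCs in flight:"):
            stats[osc][0] = int(line.split(":")[1])
        elif osc and line.startswith("pending write pages:"):
            stats[osc][1] = int(line.split(":")[1])
    return stats

if __name__ == "__main__":
    totals = {}
    try:
        while True:
            for osc, (in_flight, pending) in sample().items():
                totals[osc] = totals.get(osc, 0) + in_flight
                print(f"{time.time():.3f} {osc} write_in_flight={in_flight} "
                      f"pending_write_pages={pending}", flush=True)
            time.sleep(INTERVAL)
    except KeyboardInterrupt:
        for osc in sorted(totals):
            print(f"{osc}: sum of write RPCs in flight = {totals[osc]}")

Run while the dd is active; the per-OSC sums printed on exit correspond 
to the numbers shown in the legend.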


I should also point out that the backing storage on the servers is SSD, 
so I would think that committing to storage on the server side should be 
pretty quick.


I'm trying to get a handle on how Linux buffer cache works. Everything I 
find on the web is pretty old.  Here's one from 2012. 
https://lwn.net/Articles/495543/


Can someone point me to something more current, and perhaps Lustre related?

As for images, I think the list server strips them.  In previous 
postings, when I included images, the copy that came back when the list 
server broadcast it had the images stripped.  I'll include the images 
and also a link to the image on Dropbox.


Thanks again,

John

https://www.dropbox.com/scl/fi/fgmz4wazr6it9q2aeo0mb/write_RPCs_in_flight.png?rlkey=d3ri2w2n7isggvn05se4j3a6b&dl=0


On 12/5/23 22:33, Andreas Dilger wrote:


On Dec 4, 2023, at 15:06, John Bauer  wrote:


I have an OSC caching question.  I am running a dd process which 
writes an 8GB file.  The file is on Lustre, striped 8x1M.  This is run 
on a system that has 2 NUMA nodes (cpu sockets).  All the data is 
apparently stored on one NUMA node (node1 in the plot below) until 
node1 runs out of free memory.  Then it appears that dd comes to a 
stop (no more writes complete) until Lustre dumps the data from 
node1.  Then dd continues writing, but now the data is stored on the 
second NUMA node, node0.  Why does Lustre go to the trouble of 
dumping node1 and then not use node1's memory, when there was always 
plenty of free memory on node0?
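
(For context, the workload amounts to something like the sketch below, 
with a hypothetical file path; "lfs setstripe -c 8 -S 1M" creates the 
8x1M layout before dd writes 8 GiB.)

#!/usr/bin/env python3
# Sketch of the workload described above: create an 8-stripe, 1 MiB-stripe
# file on Lustre and write 8 GiB to it with dd.  The path is hypothetical.
import subprocess

TARGET = "/mnt/lustre/testdir/ddfile"   # hypothetical Lustre path

# Create the file with an 8 x 1 MiB stripe layout, then write 8 GiB.
subprocess.run(["lfs", "setstripe", "-c", "8", "-S", "1M", TARGET], check=True)
subprocess.run(["dd", "if=/dev/zero", f"of={TARGET}", "bs=1M", "count=8192"],
               check=True)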


I'll forego the explanation of the plot.  Hopefully it is clear 
enough.  If someone has questions about what the plot is depicting, 
please ask.


https://www.dropbox.com/scl/fi/pijgnnlb8iilkptbeekaz/dd.png?rlkey=3abonv5tx8w5w5m08bn24qb7x&dl=0 



Hi John,
thanks for your detailed analysis.  It would be good to include the 
client kernel and Lustre version in this case, as the page cache 
behaviour can vary dramatically between different versions.


The allocation of the page cache pages may actually be out of the 
control of Lustre, since they are typically being allocated by the 
kernel VM affine to the core where the process that is doing the IO is 
running.  It may be that the "dd" is rescheduled to run on node0 
during the IO, since the ptlrpcd threads will be busy processing all 
of the RPCs during this time, and then dd will start allocating pages 
from node0.
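
One way to check whether that rescheduling is happening would be to 
poll which CPU, and therefore which NUMA node, the dd task last ran on. 
A rough sketch, assuming the usual Linux /proc and sysfs layout and 
that the dd PID is known (both are assumptions here):

#!/usr/bin/env python3
# Sketch: poll which CPU (and hence NUMA node) a process last ran on,
# using /proc/<pid>/stat field 39 ("processor") and the per-node cpulist
# files under /sys/devices/system/node.
import glob
import sys
import time

def cpu_to_node():
    """Map each CPU number to its NUMA node from sysfs cpulist files."""
    mapping = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpulist"):
        node = int(path.split("node")[-1].split("/")[0])
        with open(path) as f:
            for part in f.read().strip().split(","):
                lo, _, hi = part.partition("-")
                for cpu in range(int(lo), int(hi or lo) + 1):
                    mapping[cpu] = node
    return mapping

def last_cpu(pid):
    """Field 39 of /proc/<pid>/stat is the CPU the task last executed on."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # comm (field 2) may contain spaces, so split after the closing ')'
    fields = stat[stat.rindex(")") + 2:].split()
    return int(fields[36])  # field 39 overall, index 36 after the state field

if __name__ == "__main__":
    pid = int(sys.argv[1])          # e.g. the PID of the running dd
    node_of = cpu_to_node()
    while True:
        cpu = last_cpu(pid)
        print(f"{time.time():.3f} pid={pid} cpu={cpu} node={node_of.get(cpu, '?')}")
        time.sleep(0.05)

Correlating that trace with the per-node memory plot would show whether 
the switch to node0 coincides with dd being migrated off node1.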


That said, it isn't clear why the client doesn't start flushing the 
dirty data from cache earlier?  Is it actually sending the data to the 
OSTs, but then waiting for the OSTs to reply that the data has been 
committed to the storage before dropping the cache?


It would be interesting to plot the 
osc.*.rpc_stats::write_rpcs_in_flight and ::pending_write_pages to see 
if the data is already in flight.  The osd-ldiskfs.*.brw_stats on the 
server would also be useful to graph over the same period, if possible.


It *does* look like the "node1 dirty" is kept at a low value for the 
entire run, so it at least appears that RPCs are being sent, but there 
is no page reclaim triggered until memory is getting low.  Doing page 
reclaim is really the kernel's job, but it seems possible that the 
Lustre client may not be suitably notifying the kernel about the dirty 
pages and kicking it in the butt earlier to clean up the pages.
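
For correlating this with the client's view of memory, the per-NUMA-node 
counters behind a "node1 dirty" style plot can be sampled from sysfs. 
A minimal sketch, assuming the usual 
/sys/devices/system/node/node<N>/meminfo layout:

#!/usr/bin/env python3
# Sketch: sample per-NUMA-node MemFree, FilePages, Dirty and Writeback
# from /sys/devices/system/node/node<N>/meminfo.
import glob
import time

FIELDS = ("MemFree", "FilePages", "Dirty", "Writeback")

def node_meminfo():
    """Return {node: {field: kB}} for the fields of interest."""
    nodes = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*/meminfo")):
        node = int(path.split("node")[-1].split("/")[0])
        values = {}
        with open(path) as f:
            for line in f:
                # Lines look like: "Node 1 Dirty:            1234 kB"
                parts = line.split()
                name = parts[2].rstrip(":")
                if name in FIELDS:
                    values[name] = int(parts[3])
        nodes[node] = values
    return nodes

if __name__ == "__main__":
    while True:
        stamp = time.time()
        for node, values in node_meminfo().items():
            cols = " ".join(f"{k}={v}kB" for k, v in values.items())
            print(f"{stamp:.3f} node{node} {cols}", flush=True)
        time.sleep(0.05)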


PS: my preference would be to just attach the image to the email 
instead of hosting it externally, since it is only 55 KB.  Is this 
blocked by the list server?


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre caching and NUMA nodes

2023-12-05 Thread Andreas Dilger via lustre-discuss

On Dec 4, 2023, at 15:06, John Bauer <bau...@iodoctors.com> wrote:

I have an OSC caching question.  I am running a dd process which writes an 
8GB file.  The file is on Lustre, striped 8x1M.  This is run on a system that 
has 2 NUMA nodes (cpu sockets).  All the data is apparently stored on one NUMA 
node (node1 in the plot below) until node1 runs out of free memory.  Then it 
appears that dd comes to a stop (no more writes complete) until Lustre dumps 
the data from node1.  Then dd continues writing, but now the data is stored 
on the second NUMA node, node0.  Why does Lustre go to the trouble of dumping 
node1 and then not use node1's memory, when there was always plenty of free 
memory on node0?

I'll forego the explanation of the plot.  Hopefully it is clear enough.  If 
someone has questions about what the plot is depicting, please ask.

https://www.dropbox.com/scl/fi/pijgnnlb8iilkptbeekaz/dd.png?rlkey=3abonv5tx8w5w5m08bn24qb7x&dl=0

Hi John,
thanks for your detailed analysis.  It would be good to include the client 
kernel and Lustre version in this case, as the page cache behaviour can vary 
dramatically between different versions.

The allocation of the page cache pages may actually be out of the control of 
Lustre, since they are typically being allocated by the kernel VM affine to the 
core where the process that is doing the IO is running.  It may be that the 
"dd" is rescheduled to run on node0 during the IO, since the ptlrpcd threads 
will be busy processing all of the RPCs during this time, and then dd will 
start allocating pages from node0.

That said, it isn't clear why the client doesn't start flushing the dirty data 
from cache earlier?  Is it actually sending the data to the OSTs, but then 
waiting for the OSTs to reply that the data has been committed to the storage 
before dropping the cache?

It would be interesting to plot the osc.*.rpc_stats::write_rpcs_in_flight and 
::pending_write_pages to see if the data is already in flight.  The 
osd-ldiskfs.*.brw_stats on the server would also be useful to graph over the 
same period, if possible.

It *does* look like the "node1 dirty" is kept at a low value for the entire 
run, so it at least appears that RPCs are being sent, but there is no page 
reclaim triggered until memory is getting low.  Doing page reclaim is really 
the kernel's job, but it seems possible that the Lustre client may not be 
suitably notifying the kernel about the dirty pages and kicking it in the butt 
earlier to clean up the pages.

PS: my preference would be to just attach the image to the email instead of 
hosting it externally, since it is only 55 KB.  Is this blocked by the list 
server?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] What is the meaning of these messages?

2023-12-05 Thread Backer via lustre-discuss
Hi All,

From time to time, I see the following messages on multiple OSS about a
particular client IP. What do they mean? All the OSS and OSTs are online
and have been online in the past.

Dec  4 18:05:27 oss010 kernel: LustreError: 137-5: fs-OST00b0_UUID: not
available for connect from @tcp1 (no target). If you are running
an HA pair check that the target is mounted on the other server.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

2023-12-05 Thread Aurelien Degremont via lustre-discuss
> Now, what are the messages about "deleting orphaned objects"? Are they
> normal as well?

Yeah, this is kind of normal, and I'm even thinking we should lower the message 
verbosity...
Andreas, do you agree that could become a simple CDEBUG(D_HA, ...) instead of 
LCONSOLE(D_INFO, ...)?


Aurélien

From: lustre-discuss  on behalf of 
Audet, Martin via lustre-discuss 
Sent: Monday, December 4, 2023 20:26
To: Andreas Dilger 
Cc: lustre-discuss@lists.lustre.org 
Subject: Re: [lustre-discuss] Error messages (ex: not available for connect from 
0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1



Hello Andreas,


Thanks for your response. Happy to learn that the "errors" I was reporting 
aren't really errors.


I now understand that the 3 messages about LDISKFS were only normal messages 
resulting from mounting the file systems (I was fooled by vim showing these 
messages in red, like important error messages, but this is simply a false 
positive from its syntax highlighting rules, probably triggered by the 
"errors=" string, which is only a mount option...).

Now, what are the messages about "deleting orphaned objects"? Are they normal 
as well? We always boot the client VMs after the server is ready, and we shut 
down the clients cleanly well before the vlfs Lustre server is (also cleanly) 
shut down. Is it a sign of corruption? How can this happen if shutdowns are 
clean?

Thanks (and sorry for the beginner's questions),

Martin


From: Andreas Dilger 
Sent: December 4, 2023 5:25 AM
To: Audet, Martin
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Error messages (ex: not available for connect 
from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1



It wasn't clear from your mail which message(s) you are concerned about.  These 
look like normal mount messages to me.

The "error" is pretty normal, it just means there were multiple services 
starting at once and one wasn't yet ready for the other.

 LustreError: 137-5: lustrevm-MDT_UUID: not available for connect
 from 0@lo (no target). If you are running an HA pair check that the
 target is mounted on the other server.

It probably makes sense to quiet this message right at mount time to avoid this.

Cheers, Andreas

On Dec 1, 2023, at 10:24, Audet, Martin via lustre-discuss 
 wrote:



Hello Lustre community,


Has anyone ever seen messages like these in "/var/log/messages" on a 
Lustre server?


Dec  1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with ordered 
data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with ordered 
data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with ordered 
data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: not 
available for connect from 0@lo (no target). If you are running an HA pair 
check that the target is mounted on the other server.
Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery not 
enabled, recovery window 300-900
Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan objects 
from 0x0:227 to 0x0:513


This happens on every boot of a Lustre server named vlfs (an AlmaLinux 8.9 VM 
hosted on VMware) playing the roles of MGS, MDS, and OSS (it hosts an MDT and 
two OSTs using "virtual" disks). We chose LDISKFS and not ZFS. Note that this 
happens at every boot, well before the clients (AlmaLinux 9.3 or 8.9 VMs) 
connect, and even when the clients are powered off. The network connecting the 
clients and the server is a "virtual" 10GbE network (of course there is no 
virtual IB). Also, we had the same messages previously with Lustre 2.15.3 using 
an AlmaLinux 8.8 server and AlmaLinux 8.8 / 9.2 clients (also using VMs). Note 
also that we compile the Lustre RPMs ourselves from the sources in the git 
repository. We also chose to use a patched kernel. Our build procedure for RPMs 
seems to work well because our real cluster runs fine on CentOS 7.9 with Lustre 
2.12.9 and IB (MOFED) networking.

So has anyone seen these messages?

Are they problematic? If so, how do we avoid them?

We would like to make sure our small test system using VMs works well before we 
upgrade our real cluster.

Thanks in advance!

Martin Audet

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org