Re: [lustre-discuss] lustre-discuss Digest, Vol 213, Issue 10
/var/log/messages:Dec 6 13:50:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
/var/log/messages:Dec 6 14:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 14:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
/var/log/messages:Dec 6 14:20:16 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 14:20:16 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
/var/log/messages:Dec 6 14:30:17 mds2 kernel: LNetError: 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111
/var/log/messages:Dec 6 14:30:17 mds2 kernel: LNetError: 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 3 previous similar messages
/var/log/messages:Dec 6 14:47:14 mds2 kernel: LNetError: 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111
/var/log/messages:Dec 6 14:47:14 mds2 kernel: LNetError: 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 8 previous similar messages
/var/log/messages:Dec 6 15:02:14 mds2 kernel: LNetError: 3817248:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111

Regards,
Qiulan
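The -110 and -111 in these LNetError lines are negative Linux errno values: -110 is ETIMEDOUT (the peer NI did not answer the recovery ping in time) and -111 is ECONNREFUSED (the peer actively refused the connection). A quick way to decode such codes outside of Lustre, using plain Python and the Linux errno numbering that the mds2 kernel uses:

```python
import errno
import os

# LNet reports failures as negative errno values; strip the sign to decode.
for rc in (-110, -111):
    code = -rc
    # errno.errorcode maps the number to its symbolic name (e.g. ETIMEDOUT),
    # os.strerror gives the human-readable description.
    print(f"{rc}: {errno.errorcode[code]} ({os.strerror(code)})")
```

The distinction matters for diagnosis: ETIMEDOUT (10.67.178.25@tcp) suggests a host that is unreachable or silently dropping traffic, while ECONNREFUSED (10.67.176.25@tcp) suggests a host that is up but no longer running an LNet listener, consistent with a decommissioned client.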
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

[Attachment: pfe27_allOSC_cached.png (image/png, 12394 bytes): <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20231207/ce8919f0/attachment.png>]
[lustre-discuss] Lustre caching and NUMA nodes
Peter,

A delayed reply to one more of your questions, "What makes you think 'lustre' is doing that?", as I had to make another run and gather OSC stats on all the Lustre file systems mounted on the host where I run dd. This host mounts 12 Lustre file systems comprising 507 OSTs. While dd was running I instrumented the amount of cached data associated with all 507 OSCs; that is reflected in the bottom frame of the attached image. Note that in the top frame there was always about 5GB of free memory and 50GB of cached data.

I believe it has to be a Lustre issue, as the Linux buffer cache has no knowledge that a page is a Lustre page. How is it that every OSC, on all 12 file systems on the host, has its memory dropped to 0, yet the other 50GB of cached data on the host remains? It's as though drop_caches were being run on only the Lustre file systems. My googling finds no feature of drop_caches that would allow file-system-specific dropping. Is there some tunable that gives Lustre pages a higher potential for eviction than other cached data?

Another subtle point of interest: dd writing resumes, as reflected in the growth of the cached data for its 8 OSTs, before all the other OSCs have finished dumping. This is most visible around 2.1 seconds into the run. Also different is that this dumping phenomenon happened 3 times in the course of a 10-second run, instead of just once as in the previous run I was referencing, costing this dd run 1.2 seconds.
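One way to instrument per-OSC cached data like this is to sample `lctl get_param osc.*.osc_cached_mb` periodically. The parameter name and the exact output shape below are assumptions (they vary across Lustre versions; check `lctl list_param osc.*` on your client), so this sketch parses a captured sample rather than calling lctl directly:

```python
import re

# Sample text in the shape of `lctl get_param osc.*.osc_cached_mb` output.
# The parameter name and the used_mb/busy_cnt fields are assumptions --
# verify against your Lustre version before relying on this.
sample = """\
osc.fs1-OST0000-osc-ffff9a01.osc_cached_mb=
used_mb: 512
busy_cnt: 0
osc.fs1-OST0001-osc-ffff9a01.osc_cached_mb=
used_mb: 1024
busy_cnt: 0
"""

def total_cached_mb(text: str) -> int:
    """Sum the used_mb values across all OSC entries in the captured output."""
    return sum(int(m) for m in re.findall(r"used_mb:\s*(\d+)", text))

print(total_cached_mb(sample))  # total MB cached across the sampled OSCs
```

Sampling this once a second alongside `MemFree`/`Cached` from /proc/meminfo would reproduce the two frames of the plot: total host cache versus Lustre-only cache, making a synchronized drop across all OSCs easy to spot.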
John

On 12/6/23 14:24, lustre-discuss-requ...@lists.lustre.org wrote:

Today's Topics:

   1. Coordinating cluster start and shutdown? (Jan Andersen)
   2. Re: Lustre caching and NUMA nodes (Peter Grandi)
   3. Re: Coordinating cluster start and shutdown? (Bertschinger, Thomas Andrew Hjorth)
   4. Lustre server still tries to recover the LNet reply to the deprecated clients (Huang, Qiulan)

--
Message: 1
Date: Wed, 6 Dec 2023 10:27:11 +
From: Jan Andersen
To: lustre
Subject: [lustre-discuss] Coordinating cluster start and shutdown?

Are there any tools for coordinating the start and shutdown of a Lustre filesystem, so that the OSS systems don't attempt to mount disks before the MGT and MDT are online?

--
Message: 2
Date: Wed, 6 Dec 2023 12:40:54 +
From: p...@lustre.list.sabi.co.uk (Peter Grandi)
To: list Lustre discussion
Subject: Re: [lustre-discuss] Lustre caching and NUMA nodes

> I have an OSC caching question. I am running a dd process which writes an 8GB file. The file is on lustre, striped 8x1M.

How the Lustre instance's servers store the data may not have a huge influence on what happens in the client's system buffer cache.

> This is run on a system that has 2 NUMA nodes (cpu sockets). [...]
> Why does lustre go to the trouble of dumping node1 and then not use node1's memory, when there was always plenty of free memory on node0?

What makes you think "lustre" is doing that? Are you aware of the values of the flusher settings such as 'dirty_bytes', 'dirty_ratio', 'dirty_expire_centisecs'? Have you considered looking at NUMA policies, e.g. as described in 'man numactl'?

Also, while you surely know better, I usually try to avoid buffering large amounts of to-be-written data in RAM (whether on the OSC or the OSS), and to my taste 8GiB "in-flight" is large.

--
Message: 3
Date: Wed, 6 Dec 2023 16:00:38 +
From: "Bertschinger, Thomas Andrew Hjorth"
To: Jan Andersen, lustre
Subject: Re: [lustre-discuss] Coordinating cluster start and shutdown?

Hello Jan,

You can use the Pacemaker / Corosync high-availability software stack for this: specifically, ordering constraints [1] can be used. Unfortunately, Pacemaker is probably over-the-top if you don't need HA -- its configuration is complex and difficult to get right, and it significantly complicates system administration.

One downside of Pacemaker is that it is not easy to decouple the Pacemaker service from the Lustre services, meaning if you stop the Pacemaker service, it
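For readers unfamiliar with the ordering constraints mentioned above, a minimal sketch of the idea with pcs follows. The resource names and device/mount paths are hypothetical, and managing Lustre targets as `ocf:heartbeat:Filesystem` resources is one common approach rather than the only one; adapt to your own cluster layout:

```shell
# Hypothetical example: MGT/MDT and one OST managed as Filesystem resources.
# Device paths, mount points, and resource names are placeholders.
pcs resource create lustre-mdt ocf:heartbeat:Filesystem \
    device=/dev/mapper/mdt0 directory=/lustre/mdt0 fstype=lustre
pcs resource create lustre-ost0 ocf:heartbeat:Filesystem \
    device=/dev/mapper/ost0 directory=/lustre/ost0 fstype=lustre

# Ordering constraint: do not start the OST until the MGT/MDT is mounted.
pcs constraint order start lustre-mdt then start lustre-ost0
```

With such a constraint in place, Pacemaker serializes startup (and reverses the order on shutdown), which addresses exactly the "OSS mounts before the MGT/MDT is online" problem from Message 1.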
Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1
Thanks Andreas and Aurélien for your answers. They make us confident that we are on the right track for our cluster update!

Also, I have noticed that 2.15.4-RC1 was released two weeks ago; can we expect 2.15.4 to be ready by the end of the year?

Regards,
Martin

From: Andreas Dilger
Sent: December 7, 2023 6:02 AM
To: Aurelien Degremont
Cc: Audet, Martin; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

***Attention*** This email originated from outside of the NRC. ***Attention*** Ce courriel provient de l'extérieur du CNRC.

Aurelien, there have been a number of questions about this message.

> Lustre: lustrevm-OST0001: deleting orphan objects from 0x0:227 to 0x0:513

This is not marked LustreError, so it is just an advisory message. This can sometimes be useful for debugging issues related to MDT->OST connections. It is already printed with D_INFO level, the lowest printk level available. Would rewording the message make it more clear that this is a normal situation when the MDT and OST are establishing connections?

Cheers, Andreas

On Dec 5, 2023, at 02:13, Aurelien Degremont wrote:
>
> > > Now what are the messages about "deleting orphaned objects"? Are they normal also?
>
> Yeah, this is kind of normal, and I'm even thinking we should lower the message verbosity...
> Andreas, do you agree that could become a simple CDEBUG(D_HA, ...) instead of LCONSOLE(D_INFO, ...)?
>
> Aurélien
>
> Audet, Martin wrote on Monday, December 4, 2023 20:26:
>> Hello Andreas,
>>
>> Thanks for your response. Happy to learn that the "errors" I was reporting aren't really errors.
>>
>> I now understand that the 3 messages about LDISKFS were only normal messages resulting from mounting the file systems (I was fooled by vim showing this message in red, like important error messages, but this is simply a false positive of its syntax-highlighting rules, probably triggered by the "errors=" string, which is only a mount option...).
>>
>> Now what are the messages about "deleting orphaned objects"? Are they normal also? We always boot the client VMs after the server is ready, and we shut down clients cleanly well before the vlmf Lustre server is (also cleanly) shut down. Is it a sign of corruption? How can this happen if shutdowns are clean?
>>
>> Thanks (and sorry for the beginner questions),
>>
>> Martin
>>
>> Andreas Dilger wrote on December 4, 2023 5:25 AM:
>>> It wasn't clear from your mail which message(s) you are concerned about. These look like normal mount messages to me.
>>>
>>> The "error" is pretty normal; it just means there were multiple services starting at once and one wasn't yet ready for the other.
>>>
>>> LustreError: 137-5: lustrevm-MDT_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
>>>
>>> It probably makes sense to quiet this message right at mount time to avoid this.
>>>
>>> Cheers, Andreas
>>>
>>> On Dec 1, 2023, at 10:24, Audet, Martin via lustre-discuss wrote:

Hello Lustre community,

Has anyone ever seen messages like these in "/var/log/messages" on a Lustre server?

Dec 1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery not enabled, recovery window 300-900
Dec 1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan objects from 0x0:227 to 0x0:513

This happens on every boot of a Lustre server named vlfs (an AlmaLinux 8.9 VM hosted on VMware) playing the role of both MGS and OSS (it hosts an MDT and two OSTs using "virtual" disks). We chose LDISKFS and not ZFS. Note that this happens at every boot, well before the clients (AlmaLinux 9.3 or 8.9 VMs) connect, and even when the clients are powered off. The network connecting the clients and the server is a "virtual" 10GbE network (of course there is no virtual IB). Also we had the same messages previously
Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1
Aurelien, there have been a number of questions about this message.

> Lustre: lustrevm-OST0001: deleting orphan objects from 0x0:227 to 0x0:513

This is not marked LustreError, so it is just an advisory message. This can sometimes be useful for debugging issues related to MDT->OST connections. It is already printed with D_INFO level, the lowest printk level available. Would rewording the message make it more clear that this is a normal situation when the MDT and OST are establishing connections?

Cheers, Andreas

On Dec 5, 2023, at 02:13, Aurelien Degremont wrote:
>
> > > Now what are the messages about "deleting orphaned objects"? Are they normal also?
>
> Yeah, this is kind of normal, and I'm even thinking we should lower the message verbosity...
> Andreas, do you agree that could become a simple CDEBUG(D_HA, ...) instead of LCONSOLE(D_INFO, ...)?
>
> Aurélien
>
> Audet, Martin wrote on Monday, December 4, 2023 20:26:
>> Hello Andreas,
>>
>> Thanks for your response. Happy to learn that the "errors" I was reporting aren't really errors.
>>
>> I now understand that the 3 messages about LDISKFS were only normal messages resulting from mounting the file systems (I was fooled by vim showing this message in red, like important error messages, but this is simply a false positive of its syntax-highlighting rules, probably triggered by the "errors=" string, which is only a mount option...).
>>
>> Now what are the messages about "deleting orphaned objects"? Are they normal also? We always boot the client VMs after the server is ready, and we shut down clients cleanly well before the vlmf Lustre server is (also cleanly) shut down. Is it a sign of corruption? How can this happen if shutdowns are clean?
>>
>> Thanks (and sorry for the beginner questions),
>>
>> Martin
>>
>> Andreas Dilger wrote on December 4, 2023 5:25 AM:
>>> It wasn't clear from your mail which message(s) you are concerned about. These look like normal mount messages to me.
>>>
>>> The "error" is pretty normal; it just means there were multiple services starting at once and one wasn't yet ready for the other.
>>>
>>> LustreError: 137-5: lustrevm-MDT_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
>>>
>>> It probably makes sense to quiet this message right at mount time to avoid this.
>>>
>>> Cheers, Andreas
>>>
>>> On Dec 1, 2023, at 10:24, Audet, Martin via lustre-discuss wrote:

Hello Lustre community,

Has anyone ever seen messages like these in "/var/log/messages" on a Lustre server?

Dec 1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery not enabled, recovery window 300-900
Dec 1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan objects from 0x0:227 to 0x0:513

This happens on every boot of a Lustre server named vlfs (an AlmaLinux 8.9 VM hosted on VMware) playing the role of both MGS and OSS (it hosts an MDT and two OSTs using "virtual" disks). We chose LDISKFS and not ZFS. Note that this happens at every boot, well before the clients (AlmaLinux 9.3 or 8.9 VMs) connect, and even when the clients are powered off.
The network connecting the clients and the server is a "virtual" 10GbE network (of course there is no virtual IB). Also we had the same messages previously with Lustre 2.15.3 using an AlmaLinux 8.8 server and AlmaLinux 8.8 / 9.2 clients (also using VMs).

Note also that we compile the Lustre RPMs ourselves from the sources in the git repository, and we chose to use a patched kernel. Our build procedure for RPMs seems to work well, because our real cluster runs fine on CentOS 7.9 with Lustre 2.12.9 and IB (MOFED) networking.

So has anyone seen these messages? Are they problematic? If yes, how do we avoid them? We would like to make sure our small test system using VMs works well before we upgrade our real cluster.

Thanks in advance!