Re: [lustre-discuss] 9.4 client release date
2.15.5 will have that support and it is "coming soon"; see slide 13 of https://www.depts.ttu.edu/hpcc/events/LUG24/slides/Day1/LUG_2024_Talk_01-Community_Release_Update.pdf

Aurélien

From: lustre-discuss on behalf of Michael DiDomenico via lustre-discuss
Sent: Wednesday, May 8, 2024, 21:25
To: lustre-discuss
Subject: [lustre-discuss] 9.4 client release date

External email: Use caution opening links or attachments

Does anyone have an idea of when the 2.15 client for Red Hat 9.4 will get released? Just curious, trying to plan some maintenance.

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Re: [lustre-discuss] kernel threads for rpcs in flight
> This is a module parameter, since it cannot be changed at runtime. This is
> visible at /sys/module/libcfs/parameters/cpu_npartitions and the default
> value depends on the number of CPU cores and NUMA configuration. It can be
> specified with "options libcfs cpu_npartitions=" in /etc/modprobe.d/lustre.conf.
>
> The "cpu_npartitions" module parameter controls how many groups the cores are
> split into. The "cpu_pattern" parameter can control the specific cores in
> each of the CPTs, which would affect the default per-CPT ptlrpcd thread
> location. It is possible to further use the "ptlrpcd_cpts" and
> "ptlrpcd_per_cpt_max" parameters to control specifically which cores are used
> for the threads.

Just a comment: these tuning parameters can be tricky. "cpu_npartitions" is ignored in favor of "cpu_pattern", unless cpu_pattern is the empty string. cpu_pattern can achieve the same results as cpu_npartitions, but at the cost of a more complex declaration. If you just want to split your cores into multiple subgroups, you can use cpu_npartitions:

options libcfs cpu_pattern="" cpu_npartitions=8
# cpu_pattern must be set to the empty string, otherwise cpu_npartitions is ignored

or

options libcfs cpu_pattern="N"
# the default, split on the NUMA groups

or

options libcfs cpu_pattern="0[0-3] 1[4-7] 2[8-11] 3[12-15] 4[16-19] 5[20-23] 6[24-27] 7[28-31]"
# same as cpu_npartitions=8

or an even more complex distribution; see the Lustre Manual for details.

Also check "lctl get_param cpu_partition_table" to see your current partition table.
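If you want to script a quick sanity check of what libcfs actually applied, here is a small sketch (not from the thread; the sysfs paths are the standard module-parameter locations, and the files are simply absent when the module is not loaded):

```python
# Sketch: report the CPT parameters libcfs actually applied on a client.
# The sysfs paths below are the standard locations for module parameters;
# if the libcfs module is not loaded, the files won't exist and we report None.
from pathlib import Path

PARAMS_DIR = Path("/sys/module/libcfs/parameters")

def read_cpt_params():
    """Return the current cpu_npartitions and cpu_pattern values, or None
    for each parameter whose sysfs file is not present."""
    result = {}
    for name in ("cpu_npartitions", "cpu_pattern"):
        p = PARAMS_DIR / name
        result[name] = p.read_text().strip() if p.exists() else None
    return result

if __name__ == "__main__":
    for name, value in read_cpt_params().items():
        print(f"{name} = {value if value is not None else '(libcfs not loaded)'}")
```

Comparing the reported cpu_npartitions against "lctl get_param cpu_partition_table" is a quick way to confirm whether your cpu_pattern setting silently overrode the partition count.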
Aurélien

From: lustre-discuss on behalf of Andreas Dilger via lustre-discuss
Sent: Friday, May 3, 2024, 07:25
To: Anna Fuchs
Cc: lustre
Subject: Re: [lustre-discuss] kernel threads for rpcs in flight

External email: Use caution opening links or attachments

On May 2, 2024, at 18:10, Anna Fuchs <anna.fu...@uni-hamburg.de> wrote:

The number of ptlrpcd threads per CPT is set by the "ptlrpcd_partner_group_size" module parameter, and defaults to 2 threads per CPT, IIRC. I don't think that clients dynamically start/stop ptlrpcd threads at runtime. When there are RPCs in the queue for any ptlrpcd, it will be woken up and scheduled by the kernel, so it will compete with the application threads. IIRC, if a ptlrpcd thread is woken up and there are no RPCs in the local CPT queue, it will try to steal RPCs from another CPT, on the assumption that the local CPU is not generating any RPCs, so it would be beneficial to offload threads on another CPU that *is* generating RPCs. If the application thread is extremely CPU hungry, then the kernel will not schedule the ptlrpcd threads on those cores very often, and the "idle" core ptlrpcd threads will be able to run more frequently.

> Sorry, maybe I am confusing things. I am still not sure how many threads I
> get. For example, I have a 32-core AMD Epyc machine as a client and I am
> running a serial stream IO application with a single stripesize, 1 OST. I am
> struggling to find out how many CPU partitions I have - is it something on
> the hardware side or something configurable? There is no file
> /proc/sys/lnet/cpu_partitions on my client.

This is a module parameter, since it cannot be changed at runtime. It is visible at /sys/module/libcfs/parameters/cpu_npartitions and the default value depends on the number of CPU cores and NUMA configuration. It can be specified with "options libcfs cpu_npartitions=" in /etc/modprobe.d/lustre.conf.

> Assuming I had 2 CPU partitions, that would result in 4 ptlrpcd threads at
> system start, right?
Correct.

> Now I set rpcs_in_flight to 1 or to 8, what effect does that have on the
> number and the activity of the threads?

Setting rpcs_in_flight has no effect on the number of ptlrpcd threads. The ptlrpcd threads process RPCs asynchronously (unlike server threads), so they can keep many RPCs in progress.

> Serial stream, 1 rpcs_in_flight is waking up only one ptlrpcd thread, 3
> remain inactive/sleep/do nothing?

This depends. There are two ptlrpcd threads for the CPT that can process the RPCs from the one user thread. If they can send the RPCs quickly enough, then the other ptlrpcd threads may not steal the RPCs from that CPT. That said, even a single-threaded userspace writer may have up to 8 RPCs in flight *per OST* (depending on the file striping and whether the IO submission allows it - buffered or AIO+DIO), so if there are a lot of outstanding RPCs and RPC generation takes a long time (e.g. compression), then it may be that all ptlrpcd threads will be busy.

> Does not seem to be the case, as I've applied the rpc tracing (thanks a lot
> for the hint!!), and with rpcs_in_flight being 1 it still shows at least 3
> different threads from at least 2 different partitions for writing a 1 MB
> file with ten blocks. I don't get the relationship between these values.

What are the opcodes from the different RPCs? The ptlrpcd threads are only handling
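To make the arithmetic in this exchange concrete, here is a toy model (plain Python, not Lustre code) of the default thread count as described above: "ptlrpcd_partner_group_size" threads (2 by default) are started per CPT, and rpcs_in_flight plays no role in that number:

```python
# Toy model of the default ptlrpcd thread count described in the thread:
# "ptlrpcd_partner_group_size" (default 2) threads are started per CPT.
# Note that rpcs_in_flight has no influence on this number; it only
# bounds how many RPCs those threads may keep outstanding per target.
def default_ptlrpcd_threads(cpu_npartitions: int, partner_group_size: int = 2) -> int:
    return cpu_npartitions * partner_group_size

# 2 CPU partitions -> 4 ptlrpcd threads at system start, as confirmed above.
print(default_ptlrpcd_threads(2))
```

On the 32-core Epyc client from the thread, the actual CPT count (and hence the thread count) depends on the NUMA layout unless cpu_npartitions or cpu_pattern is set explicitly.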
Re: [lustre-discuss] question on behavior of supplementary group permissions
If I remember correctly, the Lustre client sends at most 2 GIDs in its RPC, in addition to its effective GID; those are the file GIDs related to the operation you're trying to do (2 GIDs if you're doing an operation on 2 files, e.g. a rename). In your case, this is just an open, so I think the client will send only its effective FS GID and the file GID. However, since this is an open, it is likely that the intent RPC does not fetch the file metadata prior to doing the open, so the client is not telling the MDT that this user is already a member of that file's group. The MDT has no way to know it (identity_upcall = NONE).

This is a bug/limitation of identity_upcall = NONE (which is not the standard deployment). The "fix" would be for the client to either send all the supplementary groups the user is a member of (this used to be done prior to 2.6, but I am not sure it is a good thing), or for the client to retrieve the file metadata before sending the open RPC, which would impact open performance a lot.

If you want a bit more context: https://review.whamcloud.com/c/fs/lustre-release/+/49539

Aurélien

From: lustre-discuss on behalf of Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Sent: Wednesday, January 24, 2024, 09:23
To: Kira Duwe via lustre-discuss
Subject: [lustre-discuss] question on behavior of supplementary group permissions

External email: Use caution opening links or attachments

Hello,

We have a curious issue with supplemental group permissions. There is a set of files where a user has group read permission to the files via a supplemental group. If the user tries to open() one of these files, they get EACCES. Then, if the user stat()s the file (or seemingly does any operation that caches the inode on the client), the next open() attempt succeeds.
Interactively, this looks like:

$ cat /lustre/problem_file
cat: /lustre/problem_file: Permission denied
$ stat /lustre/problem_file
(succeeds)
$ cat /lustre/problem_file
(succeeds)

We've only observed this on a particular client/server pair:

client kernel: 3.10.0-1160.95.1
client lustre: 2.15.3
server kernel: 4.18.0-477.21.1
server lustre: 2.15.3

We have mdt.*.identity_upcall=NONE set on every server. Also, we cannot reproduce the issue with newly created files; it only appears to affect a set of existing files.

I have 2 questions about this. The big one is, section 41.1.2.1 of the Lustre manual claims:

> If there is no upcall or if there is an upcall and it fails, one
> supplementary group at most will be added as supplied by the client.

To my reading, this suggests that the "bug" we observe above is actually the correct behavior. Well, the manual is not precise about which single supplementary group will be supplied by the client, but the relevant group in this case is not the first supplemental group in the user's group list; it's in the middle of the list. So my question is: is the Lustre manual accurate (and then my follow-up question is, if so, why do supplemental group permissions appear to work for us in most cases...)? Or is the manual wrong/out of date here?

My second question is: assuming the behavior described above is a bug, are there any known issues here that we could be running into?

Let me know if I can provide any more information.

Thanks,
Thomas Bertschinger

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
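For reference, the check that ultimately has to succeed is plain POSIX group-permission semantics. Here is a minimal local sketch of the group branch of that check (the function name is made up for illustration; this is not Lustre code):

```python
import os
import stat

def posix_group_readable(path: str) -> bool:
    """Sketch of the POSIX group-permission branch: the file is readable
    via its group bit if its GID matches the process's effective GID or
    any of the process's supplementary groups (os.getgroups())."""
    st = os.stat(path)
    in_group = st.st_gid == os.getgid() or st.st_gid in os.getgroups()
    return in_group and bool(st.st_mode & stat.S_IRGRP)
```

On the affected files this check passes locally (the user really is in the group), yet the first open() returns EACCES, because with identity_upcall=NONE the MDT never learns the user's full supplementary group list and so cannot evaluate the same check server-side.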
Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1
> Now what about the messages on "deleting orphaned objects"? Is that normal also?

Yeah, this is kind of normal, and I'm even thinking we should lower the message verbosity... Andreas, do you agree that this could become a simple CDEBUG(D_HA, ...) instead of LCONSOLE(D_INFO, ...)?

Aurélien

From: lustre-discuss on behalf of Audet, Martin via lustre-discuss
Sent: Monday, December 4, 2023, 20:26
To: Andreas Dilger
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

External email: Use caution opening links or attachments

Hello Andreas,

Thanks for your response. Happy to learn that the "errors" I was reporting aren't really errors. I now understand that the 3 messages about LDISKFS were only normal messages resulting from mounting the file systems (I was fooled by vim showing these messages in red, like important error messages, but this is simply a false positive of its syntax-highlighting rules, probably triggered by the "errors=" string, which is only a mount option...).

Now what about the messages on "deleting orphaned objects"? Are they normal also? We always boot the client VMs after the server is ready, and we shut down the clients cleanly, well before the vlfs Lustre server is (also cleanly) shut down. Is it a sign of corruption? How could this happen if the shutdowns are clean?

Thanks (and sorry for the beginner questions),

Martin

From: Andreas Dilger
Sent: December 4, 2023, 5:25 AM
To: Audet, Martin
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

***Attention*** This email originated from outside of the NRC. ***Attention*** Ce courriel provient de l'extérieur du CNRC.

It wasn't clear from your mail which message(s) you are concerned about? These look like normal mount messages to me.
The "error" is pretty normal; it just means there were multiple services starting at once and one wasn't yet ready for the other:

LustreError: 137-5: lustrevm-MDT_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.

It probably makes sense to quiet this message right at mount time to avoid this.

Cheers, Andreas

On Dec 1, 2023, at 10:24, Audet, Martin via lustre-discuss wrote:

Hello Lustre community,

Has anyone ever seen messages like these in "/var/log/messages" on a Lustre server?

Dec 1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Dec 1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery not enabled, recovery window 300-900
Dec 1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan objects from 0x0:227 to 0x0:513

This happens on every boot of a Lustre server named vlfs (an AlmaLinux 8.9 VM hosted on VMware) playing the role of both MGS and OSS (it hosts an MDT and two OSTs using "virtual" disks). We chose LDISKFS and not ZFS. Note that this happens at every boot, well before the clients (AlmaLinux 9.3 or 8.9 VMs) connect, and even when the clients are powered off. The network connecting the clients and the server is a "virtual" 10GbE network (of course there is no virtual IB).
Also, we had the same messages previously with Lustre 2.15.3 using an AlmaLinux 8.8 server and AlmaLinux 8.8 / 9.2 clients (also using VMs). Note also that we compile the Lustre RPMs ourselves from the sources in the git repository. We also chose to use a patched kernel. Our build procedure for the RPMs seems to work well, because our real cluster runs fine on CentOS 7.9 with Lustre 2.12.9 and IB (MOFED) networking.

So has anyone seen these messages? Are they problematic? If yes, how do we avoid them? We would like to make sure our small test system using VMs works well before we upgrade our real cluster.

Thanks in advance!

Martin Audet

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org