Re: [lustre-discuss] kernel threads for rpcs in flight
> This is a module parameter, since it cannot be changed at runtime. This is
> visible at /sys/module/libcfs/parameters/cpu_npartitions and the default
> value depends on the number of CPU cores and NUMA configuration. It can be
> specified with "options libcfs cpu_npartitions=" in /etc/modprobe.d/lustre.conf.
>
> The "cpu_npartitions" module parameter controls how many groups the cores are
> split into. The "cpu_pattern" parameter can control the specific cores in
> each of the CPTs, which would affect the default per-CPT ptlrpcd thread
> locations. It is possible to further use the "ptlrpcd_cpts" and
> "ptlrpcd_per_cpt_max" parameters to control specifically which cores are used
> for the threads.

Just a comment: these tuning parameters can be tricky. "cpu_npartitions" is ignored in favor of "cpu_pattern", unless cpu_pattern is the empty string. cpu_pattern can achieve the same results as cpu_npartitions, but at the cost of a more complex declaration. If you just want to split your cores into multiple subgroups, you can use cpu_npartitions:

options libcfs cpu_pattern="" cpu_npartitions=8
# cpu_pattern must be set to the empty string, otherwise cpu_npartitions is ignored

or

options libcfs cpu_pattern="N"
# the default: split on the NUMA groups

or

options libcfs cpu_pattern="0[0-3] 1[4-7] 2[8-11] 3[12-15] 4[16-19] 5[20-23] 6[24-27] 7[28-31]"
# same as cpu_npartitions=8

or an even more complex distribution; see the Lustre Manual for details.

Also check "lctl get_param cpu_partition_table" to see your current partition table.
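As an illustration of how an explicit cpu_pattern corresponds to an even cpu_npartitions split, here is a small sketch (the helper function is hypothetical, not part of Lustre) that generates the equivalent pattern string for a flat topology; the real default CPT layout also takes NUMA distances into account:

```python
def cpu_pattern_for(ncores: int, nparts: int) -> str:
    """Build a cpu_pattern string splitting ncores evenly into nparts CPTs,
    mimicking cpu_npartitions=nparts on a flat (non-NUMA) topology."""
    per_cpt = ncores // nparts
    groups = []
    for cpt in range(nparts):
        lo = cpt * per_cpt
        hi = lo + per_cpt - 1
        groups.append(f"{cpt}[{lo}-{hi}]")   # "<cpt index>[<first core>-<last core>]"
    return " ".join(groups)

print(cpu_pattern_for(32, 8))
# 0[0-3] 1[4-7] 2[8-11] 3[12-15] 4[16-19] 5[20-23] 6[24-27] 7[28-31]
```

The printed string is exactly what you would put in cpu_pattern to match cpu_npartitions=8 on a 32-core client.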
Aurélien

From: lustre-discuss on behalf of Andreas Dilger via lustre-discuss
Sent: Friday, May 3, 2024, 07:25
To: Anna Fuchs
Cc: lustre
Subject: Re: [lustre-discuss] kernel threads for rpcs in flight

On May 2, 2024, at 18:10, Anna Fuchs <anna.fu...@uni-hamburg.de> wrote:

>> The number of ptlrpc threads per CPT is set by the "ptlrpcd_partner_group_size" module parameter, and defaults to 2 threads per CPT, IIRC. I don't think that clients dynamically start/stop ptlrpcd threads at runtime. When there are RPCs in the queue for any ptlrpcd it will be woken up and scheduled by the kernel, so it will compete with the application threads. IIRC, if a ptlrpcd thread is woken up and there are no RPCs in the local CPT queue, it will try to steal RPCs from another CPT, on the assumption that the local CPU is not generating any RPCs, so it would be beneficial to offload threads on another CPU that *is* generating RPCs. If the application thread is extremely CPU-hungry, then the kernel will not schedule the ptlrpcd threads on those cores very often, and the "idle"-core ptlrpcd threads will be able to run more frequently.

> Sorry, maybe I am confusing things. I am still not sure how many threads I get. For example, I have a 32-core AMD Epyc machine as a client and I am running a serial-stream IO application with a single stripe, 1 OST. I am struggling to find out how many CPU partitions I have - is it something on the hardware side or something configurable? There is no file /proc/sys/lnet/cpu_partitions on my client.

This is a module parameter, since it cannot be changed at runtime. It is visible at /sys/module/libcfs/parameters/cpu_npartitions and the default value depends on the number of CPU cores and NUMA configuration. It can be specified with "options libcfs cpu_npartitions=" in /etc/modprobe.d/lustre.conf.

> Assuming I had 2 CPU partitions, that would result in 4 ptlrpc threads at system start, right?

Correct.

> Now I set rpcs_in_flight to 1 or to 8, what effect does that have on the number and the activity of the threads?

Setting rpcs_in_flight has no effect on the number of ptlrpcd threads. The ptlrpcd threads process RPCs asynchronously (unlike server threads), so they can keep many RPCs in progress.

> Serial stream, 1 rpcs_in_flight is waking up only one ptlrpc thread; 3 remain inactive/sleep/do nothing?

This depends. There are two ptlrpcd threads for the CPT that can process the RPCs from the one user thread. If they can send the RPCs quickly enough, then the other ptlrpcd threads may not steal the RPCs from that CPT. That said, even a single-threaded userspace writer may have up to 8 RPCs in flight *per OST* (depending on the file striping and whether IO submission allows it - buffered or AIO+DIO), so if there are a lot of outstanding RPCs and RPC generation takes a long time (e.g. compression), then it may be that all ptlrpcd threads will be busy.

> Does not seem to be the case, as I've applied the rpctracing (thanks a lot for the hint!!), and rpcs_in_flight being 1 still shows at least 3 different threads from at least 2 different partitions
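The arithmetic behind the "2 CPU partitions → 4 threads at start" answer above is just the partner-group size times the partition count; a trivial sketch (the parameter names are Lustre's, the function itself is illustrative):

```python
def ptlrpcd_thread_count(cpu_npartitions: int,
                         ptlrpcd_partner_group_size: int = 2) -> int:
    """Expected number of ptlrpcd threads started at module load:
    one partner group of threads per CPT (default group size 2)."""
    return cpu_npartitions * ptlrpcd_partner_group_size

print(ptlrpcd_thread_count(2))   # 4
print(ptlrpcd_thread_count(8))   # 16
```

Note this counts threads, not concurrency: as stated above, rpcs_in_flight changes neither term of this product.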
Re: [lustre-discuss] kernel threads for rpcs in flight
> Does not seem to be the case, as I've applied the rpctracing (thanks a lot for the hint!!), and rpcs_in_flight being 1 still shows at least 3 different threads from at least 2 different partitions for writing a 1MB file with ten blocks. I don't get the relationship between these values.

What are the opcodes from the different RPCs? The ptlrpcd threads are only handling asynchronous RPCs like buffered writes, statfs, and a few others. Many RPCs are processed in the context of the application thread, not by ptlrpcd.

> And, if I had compression or any other heavy load, which settings could clearly control how many resources I want to give Lustre for this load? I can see clear scaling with higher rpcs_in_flight, but I am struggling to understand the numbers and attribute them to specific settings. The uncompressed case already benefits a bit from a higher RPC number due to multiple "substreaming", but there must be much more happening in parallel behind the scenes in the compressed case, even with rpcs_in_flight=1.

The "cpu_npartitions" module parameter controls how many groups the cores are split into.
The "cpu_pattern" parameter can control the specific cores in each of the CPTs, which would affect the default per-CPT ptlrpcd thread locations. It is possible to further use the "ptlrpcd_cpts" and "ptlrpcd_per_cpt_max" parameters to control specifically which cores are used for the threads.

It is entirely possible that the number of ptlrpcd threads and the CPT configuration are becoming sub-optimal as the number of multi-chip-package CPUs with many cores grows dramatically. It is a balance between having enough threads to maximize performance without having so many that performance goes downhill again. Ideally this should all happen without the need to hand-tune the CPT and thread count for every CPU on the market.

Cheers, Andreas

> Thank you!
> Anna
Re: [lustre-discuss] kernel threads for rpcs in flight
> The number of ptlrpc threads per CPT is set by the "ptlrpcd_partner_group_size" module parameter, and defaults to 2 threads per CPT, IIRC. I don't think that clients dynamically start/stop ptlrpcd threads at runtime. When there are RPCs in the queue for any ptlrpcd it will be woken up and scheduled by the kernel, so it will compete with the application threads. IIRC, if a ptlrpcd thread is woken up and there are no RPCs in the local CPT queue, it will try to steal RPCs from another CPT, on the assumption that the local CPU is not generating any RPCs, so it would be beneficial to offload threads on another CPU that *is* generating RPCs. If the application thread is extremely CPU-hungry, then the kernel will not schedule the ptlrpcd threads on those cores very often, and the "idle"-core ptlrpcd threads will be able to run more frequently.

Sorry, maybe I am confusing things. I am still not sure how many threads I get. For example, I have a 32-core AMD Epyc machine as a client and I am running a serial-stream IO application with a single stripe, 1 OST. I am struggling to find out how many CPU partitions I have - is it something on the hardware side or something configurable? There is no file /proc/sys/lnet/cpu_partitions on my client.

Assuming I had 2 CPU partitions, that would result in 4 ptlrpc threads at system start, right?

Now I set rpcs_in_flight to 1 or to 8, what effect does that have on the number and the activity of the threads? Serial stream, 1 rpcs_in_flight is waking up only one ptlrpc thread; 3 remain inactive/sleep/do nothing? Does not seem to be the case, as I've applied the rpctracing (thanks a lot for the hint!!), and rpcs_in_flight being 1 still shows at least 3 different threads from at least 2 different partitions for writing a 1MB file with ten blocks. I don't get the relationship between these values.

And, if I had compression or any other heavy load, which settings could clearly control how many resources I want to give Lustre for this load? I can see clear scaling with higher rpcs_in_flight, but I am struggling to understand the numbers and attribute them to specific settings. The uncompressed case already benefits a bit from a higher RPC number due to multiple "substreaming", but there must be much more happening in parallel behind the scenes in the compressed case, even with rpcs_in_flight=1.

Thank you!
Anna

> Whether this behavior is optimal or not is subject to debate, and investigation/improvements are of course welcome. Definitely, data checksums have some overhead (a few percent), and client-side data compression (which is done by ptlrpcd threads) would use a significant number of CPU cycles, but given the large number of CPU cores on client nodes these days this may still provide a net performance benefit if the IO bottleneck is on the server.

>> With max_rpcs_in_flight = 1, multiple cores are loaded, presumably alternately, but the statistics are too inaccurate to capture this. The distribution of threads to cores is regulated by the Linux kernel, right? Does anyone have experience with what happens when all CPUs are under full load with the application or something else?

> Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, so a single client can still have tens or hundreds of RPCs in flight to different servers. The client will send many RPC types directly from the process context, since they are waiting on the result anyway. For asynchronous bulk RPCs, the ptlrpcd thread will try to process the bulk IO on the same CPT (= Lustre CPU Partition Table, roughly aligned to NUMA nodes) as the userspace application was running on when the request was created. This minimizes the cross-NUMA traffic when accessing pages for bulk RPCs, so long as those cores are not busy with userspace tasks. Otherwise, the ptlrpcd thread on another CPT will steal RPCs from the queues.

>> Do the Lustre threads suffer? Is there a prioritization of the Lustre threads over other tasks?

> Are you asking about the client or the server? Many of the client RPCs are generated by the client threads, but the running ptlrpcd threads do not have a higher priority than client application threads. If the application threads are running on some cores, but other cores are idle, then the ptlrpcd threads on the other cores will try to process the RPCs, to allow the application threads to continue running there. Otherwise, if all cores are busy (as is typical for HPC applications), then they will be scheduled by the kernel as needed.

>> Are there readily available statistics or tools for this scenario?

> What statistics are you looking for? There are "{osc,mdc}.*.stats" and "{osc,mdc}.*.rpc_stats" that have aggregate information about RPC counts and latency.

Oh, right, these tell a lot. Isn't there also something to log the utilization and location of these threads? Otherwise, I'll continue trying with perf, which seems to be more complex with kernel threads.
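For watching those counters from a script, the text returned by "lctl get_param osc.*.rpc_stats" can simply be parsed; a small sketch, where the sample text is an assumed, abbreviated rpc_stats layout for illustration rather than verbatim Lustre output:

```python
sample = """snapshot_time:        1714000000.000000 secs.usecs
read RPCs in flight:  0
write RPCs in flight: 3
pending write pages:  256
pending read pages:   0
"""

def rpcs_in_flight(text: str) -> dict:
    """Pull the current read/write RPCs-in-flight counters out of an
    rpc_stats dump (line format assumed for this sketch)."""
    out = {}
    for line in text.splitlines():
        if "RPCs in flight" in line:
            kind = line.split()[0]              # "read" or "write"
            out[kind] = int(line.rsplit(":", 1)[1])
    return out

print(rpcs_in_flight(sample))   # {'read': 0, 'write': 3}
```

Per-thread utilization and placement, as asked above, is not in these files; that is kernel-scheduler territory (perf, /proc/<pid>/stat of the ptlrpcd tasks).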
Re: [lustre-discuss] kernel threads for rpcs in flight
On Apr 29, 2024, at 02:36, Anna Fuchs <anna.fu...@uni-hamburg.de> wrote:

> Hi Andreas. Thank you very much, that helps a lot. Sorry for the confusion, I primarily meant the client. The servers rarely have to compete with anything else for CPU resources, I guess.
>
>> The mechanism to start new threads is relatively simple. Before a server thread processes a new request, if it is the last thread available, and not the maximum number of threads are running, then it will try to launch a new thread; repeat as needed. So the thread count will depend on the client RPC load, the RPC processing rate, and lock contention on whatever resources those RPCs are accessing.
>
> And what conditions are on the client? Are the threads then driven by the workload of the application somehow?

The number of ptlrpc threads per CPT is set by the "ptlrpcd_partner_group_size" module parameter, and defaults to 2 threads per CPT, IIRC. I don't think that clients dynamically start/stop ptlrpcd threads at runtime.

> Imagine an edge case where all but one core are pinned and at 100% constant load, and one is dumping RAM to Lustre. Presumably, the available core will be taken. But will Lustre or the kernel then spawn additional threads and try to somehow interleave them with those of the application, or will it simply handle it with 1-2 threads on the available core (assume a single stream to a single OST)? In any case, I suppose the I/O transfer would suffer under the resource shortage, but my question would be to what extent it would (try to) hinder the application. For latency-critical applications, such small delays can already lead to idle waves. And surely, the Lustre threads are usually not CPU-hungry, but they will be when it comes to encryption and compression.

When there are RPCs in the queue for any ptlrpcd it will be woken up and scheduled by the kernel, so it will compete with the application threads. IIRC, if a ptlrpcd thread is woken up and there are no RPCs in the local CPT queue, it will try to steal RPCs from another CPT, on the assumption that the local CPU is not generating any RPCs, so it would be beneficial to offload threads on another CPU that *is* generating RPCs. If the application thread is extremely CPU-hungry, then the kernel will not schedule the ptlrpcd threads on those cores very often, and the "idle"-core ptlrpcd threads will be able to run more frequently.

Whether this behavior is optimal or not is subject to debate, and investigation/improvements are of course welcome. Definitely, data checksums have some overhead (a few percent), and client-side data compression (which is done by ptlrpcd threads) would use a significant number of CPU cycles, but given the large number of CPU cores on client nodes these days this may still provide a net performance benefit if the IO bottleneck is on the server.

> With max_rpcs_in_flight = 1, multiple cores are loaded, presumably alternately, but the statistics are too inaccurate to capture this. The distribution of threads to cores is regulated by the Linux kernel, right? Does anyone have experience with what happens when all CPUs are under full load with the application or something else?

Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, so a single client can still have tens or hundreds of RPCs in flight to different servers. The client will send many RPC types directly from the process context, since they are waiting on the result anyway. For asynchronous bulk RPCs, the ptlrpcd thread will try to process the bulk IO on the same CPT (= Lustre CPU Partition Table, roughly aligned to NUMA nodes) as the userspace application was running on when the request was created. This minimizes the cross-NUMA traffic when accessing pages for bulk RPCs, so long as those cores are not busy with userspace tasks. Otherwise, the ptlrpcd thread on another CPT will steal RPCs from the queues.

> Do the Lustre threads suffer? Is there a prioritization of the Lustre threads over other tasks?

Are you asking about the client or the server? Many of the client RPCs are generated by the client threads, but the running ptlrpcd threads do not have a higher priority than client application threads. If the application threads are running on some cores, but other cores are idle, then the ptlrpcd threads on the other cores will try to process the RPCs, to allow the application threads to continue running there. Otherwise, if all cores are busy (as is typical for HPC applications), then they will be scheduled by the kernel as needed.

> Are there readily available statistics or tools for this scenario?

What statistics are you looking for? There are "{osc,mdc}.*.stats" and "{osc,mdc}.*.rpc_stats" that have aggregate information about RPC counts and latency.
Re: [lustre-discuss] kernel threads for rpcs in flight
> Hi Andreas. Thank you very much, that helps a lot. Sorry for the confusion, I primarily meant the client. The servers rarely have to compete with anything else for CPU resources, I guess.
>
>> The mechanism to start new threads is relatively simple. Before a server thread processes a new request, if it is the last thread available, and not the maximum number of threads are running, then it will try to launch a new thread; repeat as needed. So the thread count will depend on the client RPC load, the RPC processing rate, and lock contention on whatever resources those RPCs are accessing.
>
> And what conditions are on the client? Are the threads then driven by the workload of the application somehow? Imagine an edge case where all but one core are pinned and at 100% constant load, and one is dumping RAM to Lustre. Presumably, the available core will be taken. But will Lustre or the kernel then spawn additional threads and try to somehow interleave them with those of the application, or will it simply handle it with 1-2 threads on the available core (assume a single stream to a single OST)? In any case, I suppose the I/O transfer would suffer under the resource shortage, but my question would be to what extent it would (try to) hinder the application. For latency-critical applications, such small delays can already lead to idle waves. And surely, the Lustre threads are usually not CPU-hungry, but they will be when it comes to encryption and compression.
>
> With max_rpcs_in_flight = 1, multiple cores are loaded, presumably alternately, but the statistics are too inaccurate to capture this. The distribution of threads to cores is regulated by the Linux kernel, right? Does anyone have experience with what happens when all CPUs are under full load with the application or something else?

Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, so a single client can still have tens or hundreds of RPCs in flight to different servers. The client will send many RPC types directly from the process context, since they are waiting on the result anyway. For asynchronous bulk RPCs, the ptlrpcd thread will try to process the bulk IO on the same CPT (= Lustre CPU Partition Table, roughly aligned to NUMA nodes) as the userspace application was running on when the request was created. This minimizes the cross-NUMA traffic when accessing pages for bulk RPCs, so long as those cores are not busy with userspace tasks. Otherwise, the ptlrpcd thread on another CPT will steal RPCs from the queues.

> Do the Lustre threads suffer? Is there a prioritization of the Lustre threads over other tasks?

Are you asking about the client or the server? Many of the client RPCs are generated by the client threads, but the running ptlrpcd threads do not have a higher priority than client application threads. If the application threads are running on some cores, but other cores are idle, then the ptlrpcd threads on the other cores will try to process the RPCs, to allow the application threads to continue running there. Otherwise, if all cores are busy (as is typical for HPC applications), then they will be scheduled by the kernel as needed.

> Are there readily available statistics or tools for this scenario?

What statistics are you looking for? There are "{osc,mdc}.*.stats" and "{osc,mdc}.*.rpc_stats" that have aggregate information about RPC counts and latency.

> Oh, right, these tell a lot. Isn't there also something to log the utilization and location of these threads? Otherwise, I'll continue trying with perf, which seems to be more complex with kernel threads. Thanks for the explanations!
> Anna

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
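The server-side spawning rule quoted in this exchange ("if the thread picking up a request is the last one available, and the pool is below its maximum, start another") can be sketched as follows; the function and argument names are illustrative, not Lustre's actual internals:

```python
def threads_after_request(running: int, busy: int, threads_max: int) -> int:
    """Return the service-thread pool size after one more request is picked
    up, applying the rule described above: spawn a new thread only when the
    picking thread is the last idle one and the pool is below threads_max."""
    idle = running - busy           # threads not currently serving a request
    if idle <= 1 and running < threads_max:
        return running + 1          # launch one more service thread
    return running                  # pool unchanged

print(threads_after_request(running=4, busy=3, threads_max=8))   # 5
print(threads_after_request(running=8, busy=7, threads_max=8))   # 8 (capped)
print(threads_after_request(running=4, busy=1, threads_max=8))   # 4 (spare idle threads)
```

This shows why the pool grows only under sustained load: with any spare idle threads the count stays put, and at threads_max it stops growing regardless of load.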
Re: [lustre-discuss] kernel threads for rpcs in flight
On Apr 28, 2024, at 16:54, Anna Fuchs via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

> The setting max_rpcs_in_flight affects, among other things, how many threads can be spawned simultaneously for processing the RPCs, right?

The {osc,mdc}.*.max_rpcs_in_flight parameters are actually controlling the maximum number of RPCs a *client* will have in flight to any MDT or OST, while the number of MDS and OSS threads is controlled on the server with mds.MDS.mdt*.threads_{min,max} and ost.OSS.ost*.threads_{min,max} for each of the various service portals (which are selected by the client based on the RPC type). The max_rpcs_in_flight setting allows concurrent operations on the client for multiple threads, to hide network latency and to improve server utilization, without allowing a single client to overwhelm the server.

> In tests where the network is clearly a bottleneck, this setting has almost no effect - the network cannot keep up with processing the data, so there is not much to do in parallel. With a faster network, the stats show higher CPU utilization on different cores (at least on the client). What is the exact mechanism by which it is decided that a kernel thread is spawned for processing a bulk? Is there an RPC queue with timings or something similar? Is it in any way predictable or calculable how many threads a specific workload will require (spawn if possible), given the data rates of the network and storage devices?

The mechanism to start new threads is relatively simple. Before a server thread processes a new request, if it is the last thread available, and not the maximum number of threads are running, then it will try to launch a new thread; repeat as needed. So the thread count will depend on the client RPC load, the RPC processing rate, and lock contention on whatever resources those RPCs are accessing.

> With max_rpcs_in_flight = 1, multiple cores are loaded, presumably alternately, but the statistics are too inaccurate to capture this. The distribution of threads to cores is regulated by the Linux kernel, right? Does anyone have experience with what happens when all CPUs are under full load with the application or something else?

Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, so a single client can still have tens or hundreds of RPCs in flight to different servers. The client will send many RPC types directly from the process context, since they are waiting on the result anyway. For asynchronous bulk RPCs, the ptlrpcd thread will try to process the bulk IO on the same CPT (= Lustre CPU Partition Table, roughly aligned to NUMA nodes) as the userspace application was running on when the request was created. This minimizes the cross-NUMA traffic when accessing pages for bulk RPCs, so long as those cores are not busy with userspace tasks. Otherwise, the ptlrpcd thread on another CPT will steal RPCs from the queues.

> Do the Lustre threads suffer? Is there a prioritization of the Lustre threads over other tasks?

Are you asking about the client or the server? Many of the client RPCs are generated by the client threads, but the running ptlrpcd threads do not have a higher priority than client application threads. If the application threads are running on some cores, but other cores are idle, then the ptlrpcd threads on the other cores will try to process the RPCs, to allow the application threads to continue running there. Otherwise, if all cores are busy (as is typical for HPC applications), then they will be scheduled by the kernel as needed.

> Are there readily available statistics or tools for this scenario?

What statistics are you looking for? There are "{osc,mdc}.*.stats" and "{osc,mdc}.*.rpc_stats" that have aggregate information about RPC counts and latency.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
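Since max_rpcs_in_flight is per target, a rough upper bound on a single client's concurrently in-flight RPCs scales with the number of targets it talks to; a back-of-the-envelope sketch (the function and its default window sizes are illustrative, not a Lustre API):

```python
def client_rpc_ceiling(num_osts: int, num_mdts: int = 1,
                       osc_max_rpcs_in_flight: int = 8,
                       mdc_max_rpcs_in_flight: int = 8) -> int:
    """Rough upper bound on concurrently in-flight RPCs from one client:
    each OST and each MDT has its own independent in-flight window."""
    return (num_osts * osc_max_rpcs_in_flight
            + num_mdts * mdc_max_rpcs_in_flight)

print(client_rpc_ceiling(num_osts=32))   # 264 - "tens or hundreds" in flight
```

This is why even modest per-target settings add up to hundreds of outstanding RPCs on a filesystem with many OSTs.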