Re: [lustre-discuss] kernel threads for rpcs in flight

2024-05-03 Thread Aurelien Degremont via lustre-discuss
> This is a module parameter, since it cannot be changed at runtime.  This is 
> visible at /sys/module/libcfs/parameters/cpu_npartitions and the default 
> value depends on the number of CPU cores and NUMA configuration.  It can be 
> specified with "options libcfs cpu_npartitions=" in 
> /etc/modprobe.d/lustre.conf.
> The "cpu_npartitions" module parameter controls how many groups the cores are 
> split into.  The "cpu_pattern" parameter can control the specific cores in 
> each of the CPTs, which would affect the default per-CPT ptlrpcd threads 
> location. It is possible to further use the "ptlrpcd_cpts" and 
> "ptlrpcd_per_cpt_max" parameters to control specifically which cores are used 
> for the threads.

Just a comment: these tuning parameters can be tricky.
"cpu_npartitions" is ignored in favor of "cpu_pattern", unless cpu_pattern is 
the empty string. cpu_pattern can achieve the same results as cpu_npartitions, 
but at the cost of a more complex declaration. If you just want to split your 
cores into multiple subgroups, you can use cpu_npartitions.

options libcfs cpu_pattern="" cpu_npartitions=8  # cpu_pattern must be set to 
the empty string, otherwise cpu_npartitions is ignored
or
options libcfs cpu_pattern="N" # the default, split on the NUMA groups
or
options libcfs cpu_pattern="0[0-3] 1[4-7] 2[8-11] 3[12-15] 4[16-19] 5[20-23] 
6[24-27] 7[28-31]"  # same effect as cpu_npartitions=8

or an even more complex distribution; see the Lustre Operations Manual for details.

Also check  "lctl get_param cpu_partition_table" to see your current partition 
table.
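
For example, a quick way to inspect the current layout on a client (a rough 
sketch; I'm assuming a release where libcfs exposes these as module parameters 
under /sys/module/libcfs/parameters, as mentioned above):

# current CPT-to-core mapping
lctl get_param cpu_partition_table
# module parameters the mapping was built from
cat /sys/module/libcfs/parameters/cpu_npartitions
cat /sys/module/libcfs/parameters/cpu_pattern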

Aurélien

From: lustre-discuss on behalf of 
Andreas Dilger via lustre-discuss 
Sent: Friday, 3 May 2024 07:25
To: Anna Fuchs 
Cc: lustre 
Subject: Re: [lustre-discuss] kernel threads for rpcs in flight


On May 2, 2024, at 18:10, Anna Fuchs wrote:
The number of ptlrpc threads per CPT is set by the "ptlrpcd_partner_group_size" 
module parameter, and defaults to 2 threads per CPT, IIRC.  I don't think that 
clients dynamically start/stop ptlrpcd threads at runtime.
When there are RPCs in the queue for any ptlrpcd it will be woken up and 
scheduled by the kernel, so it will compete with the application threads.  
IIRC, if a ptlrpcd thread is woken up and there are no RPCs in the local CPT 
queue it will try to steal RPCs from another CPT on the assumption that the 
local CPU is not generating any RPCs so it would be beneficial to offload 
threads on another CPU that *is* generating RPCs.  If the application thread is 
extremely CPU hungry, then the kernel will not schedule the ptlrpcd threads on 
those cores very often, and the "idle" core ptlrpcd threads will be able to 
run more frequently.

Sorry, maybe I am confusing things. I am still not sure how many threads I get.
For example, I have a 32-core AMD EPYC machine as a client and I am running a 
serial-stream I/O application with a single stripe, 1 OST.
I am struggling to find out how many CPU partitions I have - is it something 
on the hardware side or something configurable?
There is no file /proc/sys/lnet/cpu_partitions on my client.

This is a module parameter, since it cannot be changed at runtime.  This is 
visible at /sys/module/libcfs/parameters/cpu_npartitions and the default value 
depends on the number of CPU cores and NUMA configuration.  It can be specified 
with "options libcfs cpu_npartitions=" in /etc/modprobe.d/lustre.conf.

Assuming I had 2 CPU partitions, that would result in 4 ptlrpc threads at 
system start, right?

Correct.

Now I set  rpcs_in_flight to 1 or to 8, what effect does that have on the 
number and the activity of the threads?

Setting rpcs_in_flight has no effect on the number of ptlrpcd threads.  The 
ptlrpcd threads process RPCs asynchronously (unlike server threads) so they can 
keep many RPCs in progress.

Serial stream, 1 rpcs_in_flight is waking up only one ptlrpc thread, 3 remain 
inactive/sleep/do nothing?

This depends.  There are two ptlrpcd threads for the CPT that can process the 
RPCs from the one user thread.  If they can send the RPCs quickly enough then 
the other ptlrpcd threads may not steal the RPCs from that CPT.

That said, even a single threaded userspace writer may have up to 8 RPCs in 
flight *per OST* (depending on the file striping and if IO submission allows it 
- buffered or AIO+DIO) so if there are a lot of outstanding RPCs and RPC 
generation takes a long time (e.g. compression) then it may be that all ptlrpcd 
threads will be busy.

Does not seem to be the case, as I've applied the rpctracing (thanks a lot for 
the hint!!), and rpcs_in_flight being 1 still shows at least 3 different threads 
from at least 2 different partitions for writing a 1MB file with ten blocks.

Re: [lustre-discuss] kernel threads for rpcs in flight

2024-05-02 Thread Andreas Dilger via lustre-discuss
On May 2, 2024, at 18:10, Anna Fuchs wrote:
The number of ptlrpc threads per CPT is set by the "ptlrpcd_partner_group_size" 
module parameter, and defaults to 2 threads per CPT, IIRC.  I don't think that 
clients dynamically start/stop ptlrpcd threads at runtime.
When there are RPCs in the queue for any ptlrpcd it will be woken up and 
scheduled by the kernel, so it will compete with the application threads.  
IIRC, if a ptlrpcd thread is woken up and there are no RPCs in the local CPT 
queue it will try to steal RPCs from another CPT on the assumption that the 
local CPU is not generating any RPCs so it would be beneficial to offload 
threads on another CPU that *is* generating RPCs.  If the application thread is 
extremely CPU hungry, then the kernel will not schedule the ptlrpcd threads on 
those cores very often, and the "idle" core ptlrpcd threads will be able to 
run more frequently.

Sorry, maybe I am confusing things. I am still not sure how many threads I get.
For example, I have a 32-core AMD EPYC machine as a client and I am running a 
serial-stream I/O application with a single stripe, 1 OST.
I am struggling to find out how many CPU partitions I have - is it something 
on the hardware side or something configurable?
There is no file /proc/sys/lnet/cpu_partitions on my client.

This is a module parameter, since it cannot be changed at runtime.  This is 
visible at /sys/module/libcfs/parameters/cpu_npartitions and the default value 
depends on the number of CPU cores and NUMA configuration.  It can be specified 
with "options libcfs cpu_npartitions=" in /etc/modprobe.d/lustre.conf.

Assuming I had 2 CPU partitions, that would result in 4 ptlrpc threads at 
system start, right?

Correct.

Now I set  rpcs_in_flight to 1 or to 8, what effect does that have on the 
number and the activity of the threads?

Setting rpcs_in_flight has no effect on the number of ptlrpcd threads.  The 
ptlrpcd threads process RPCs asynchronously (unlike server threads) so they can 
keep many RPCs in progress.

Serial stream, 1 rpcs_in_flight is waking up only one ptlrpc thread, 3 remain 
inactive/sleep/do nothing?

This depends.  There are two ptlrpcd threads for the CPT that can process the 
RPCs from the one user thread.  If they can send the RPCs quickly enough then 
the other ptlrpcd threads may not steal the RPCs from that CPT.

That said, even a single threaded userspace writer may have up to 8 RPCs in 
flight *per OST* (depending on the file striping and if IO submission allows it 
- buffered or AIO+DIO) so if there are a lot of outstanding RPCs and RPC 
generation takes a long time (e.g. compression) then it may be that all ptlrpcd 
threads will be busy.
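
(For reference, the per-file striping that multiplies the in-flight limit can 
be checked and set with lfs - a sketch with illustrative values and a 
hypothetical mount point and file names:)

# show stripe_count/stripe_size of an existing file
lfs getstripe /mnt/lustre/testfile
# create a new file striped over 4 OSTs with a 1MiB stripe size
lfs setstripe -c 4 -S 1M /mnt/lustre/newfile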

Does not seem to be the case, as I've applied the rpctracing (thanks a lot for 
the hint!!), and rpcs_in_flight being 1 still shows at least 3 different threads 
from at least 2 different partitions for writing a 1MB file with ten blocks.
I don't get the relationship between these values.

What are the opcodes from the different RPCs?  The ptlrpcd threads are only 
handling asynchronous RPCs like buffered writes, statfs, and a few others.  
Many RPCs are processed in the context of the application thread, not by 
ptlrpcd.
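
(One way to see the opcodes and which thread sent each RPC is the rpctrace 
debug log - a sketch; the exact log message format differs between versions, 
so treat the grep pattern as an example only:)

# add RPC tracing to the debug mask, run the I/O, then dump the log
lctl set_param debug="+rpctrace"
# ... run the workload ...
lctl dk /tmp/lustre-rpctrace.log
grep -i "Sending RPC" /tmp/lustre-rpctrace.log | head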

And, if I had compression or any other heavy load, which settings could clearly 
control how many resources I want to give Lustre for this load? I can see a 
clear scaling with higher rpcs in flight, but I am struggling to understand 
the numbers and attribute them to a specific setting. The uncompressed case 
already benefits a bit from a higher RPC number due to multiple "substreaming", 
but there must be much more happening in parallel behind the scenes for the 
compressed case even with rpcs_in_flight=1.

The "cpu_npartitions" module parameter controls how many groups the cores are 
split into.  The "cpu_pattern" parameter can control the specific cores in each 
of the CPTs, which would affect the default per-CPT ptlrpcd threads location. 
It is possible to further use the "ptlrpcd_cpts" and "ptlrpcd_per_cpt_max" 
parameters to control specifically which cores are used for the threads.
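
(A rough /etc/modprobe.d/lustre.conf sketch with illustrative values only - the 
exact CPT expression syntax for ptlrpcd_cpts is described in the Lustre 
Operations Manual and should be checked against your release:)

# two CPTs of 8 cores each, ptlrpcd threads confined to CPT 0,
# at most 2 ptlrpcd threads per CPT
options libcfs cpu_pattern="0[0-7] 1[8-15]"
options ptlrpc ptlrpcd_cpts="[0]" ptlrpcd_per_cpt_max=2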

It is entirely possible that the number of ptlrpcd threads and CPT 
configuration is becoming sub-optimal as the number of multi-chip package CPUs 
with many cores grows dramatically.  It is a balance between having enough 
threads to maximize performance without having so many that it goes downhill 
again.  Ideally this should all happen without the need to hand-tune the CPT 
and thread count for every CPU on the market.

Cheers, Andreas

Thank you!

Anna


Whether this behavior is optimal or not is subject to debate, and 
investigation/improvements are of course welcome.  Definitely, data checksums 
have some overhead (a few percent), and client-side data compression (which is 
done by ptlrpcd threads) would have a significant usage of CPU cycles, but 
given the large number of CPU cores on client nodes these days this may still 
provide a net performance benefit if the IO bottleneck is on the server.

Re: [lustre-discuss] kernel threads for rpcs in flight

2024-05-02 Thread Anna Fuchs via lustre-discuss


The number of ptlrpc threads per CPT is set by the 
"ptlrpcd_partner_group_size" module parameter, and defaults to 2 
threads per CPT, IIRC.  I don't think that clients dynamically 
start/stop ptlrpcd threads at runtime.
When there are RPCs in the queue for any ptlrpcd it will be woken up 
and scheduled by the kernel, so it will compete with the application 
threads.  IIRC, if a ptlrpcd thread is woken up and there are no RPCs 
in the local CPT queue it will try to steal RPCs from another CPT on 
the assumption that the local CPU is not generating any RPCs so it 
would be beneficial to offload threads on another CPU that *is* 
generating RPCs.  If the application thread is extremely CPU hungry, 
then the kernel will not schedule the ptlrpcd threads on those cores 
very often, and the "idle" core ptlrpcd threads will be able to run 
more frequently.


Sorry, maybe I am confusing things. I am still not sure how many threads 
I get.
For example, I have a 32-core AMD EPYC machine as a client and I am 
running a serial-stream I/O application with a single stripe, 1 OST.
I am struggling to find out how many CPU partitions I have - is it 
something on the hardware side or something configurable?

There is no file /proc/sys/lnet/cpu_partitions on my client.

Assuming I had 2 CPU partitions, that would result in 4 ptlrpc threads 
at system start, right? Now I set  rpcs_in_flight to 1 or to 8, what 
effect does that have on the number and the activity of the threads?
Serial stream, 1 rpcs_in_flight is waking up only one ptlrpc thread, 3 
remain inactive/sleep/do nothing?


Does not seem to be the case, as I've applied the rpctracing (thanks a 
lot for the hint!!), and rpcs_in_flight being 1 still shows at least 3 
different threads from at least 2 different partitions for writing a 1MB 
file with ten blocks.

I don't get the relationship between these values.

And, if I had compression or any other heavy load, which settings could 
clearly control how many resources I want to give Lustre for this load? 
I can see a clear scaling with higher rpcs in flight, but I am 
struggling to understand the numbers and attribute them to a specific 
setting. The uncompressed case already benefits a bit from a higher RPC number 
due to multiple "substreaming", but there must be much more happening in 
parallel behind the scenes for the compressed case even with rpcs_in_flight=1.


Thank you!

Anna



Whether this behavior is optimal or not is subject to debate, and 
investigation/improvements are of course welcome.  Definitely, data 
checksums have some overhead (a few percent), and client-side data 
compression (which is done by ptlrpcd threads) would have a 
significant usage of CPU cycles, but given the large number of CPU 
cores on client nodes these days this may still provide a net 
performance benefit if the IO bottleneck is on the server.


With max_rpcs_in_flight = 1, multiple cores are loaded, 
presumably alternately, but the statistics are too inaccurate to 
capture this.  The distribution of threads to cores is regulated by 
the Linux kernel, right? Does anyone have experience with what 
happens when all CPUs are under full load with the application or 
something else?


Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* 
parameter, so a single client can still have tens or hundreds of 
RPCs in flight to different servers.  The client will send many RPC 
types directly from the process context, since they are waiting on 
the result anyway.  For asynchronous bulk RPCs, the ptlrpcd thread 
will try to process the bulk IO on the same CPT (= Lustre CPU 
Partition Table, roughly aligned to NUMA nodes) as the userspace 
application was running when the request was created.  This 
minimizes the cross-NUMA traffic when accessing pages for bulk RPCs, 
so long as those cores are not busy with userspace tasks. 
 Otherwise, the ptlrpcd thread on another CPT will steal RPCs from 
the queues.


Do the Lustre threads suffer? Is there a prioritization of the 
Lustre threads over other tasks?


Are you asking about the client or the server?  Many of the client 
RPCs are generated by the client threads, but the running 
ptlrpcd threads do not have a higher priority than client 
application threads.  If the application threads are running on some 
cores, but other cores are idle, then the ptlrpcd threads on other 
cores will try to process the RPCs to allow the application threads 
to continue running there.  Otherwise, if all cores are busy (as is 
typical for HPC applications) then they will be scheduled by the 
kernel as needed.



Are there readily available statistics or tools for this scenario?


What statistics are you looking for?  There are "{osc,mdc}.*.stats" 
and "{osc,mdc}.*rpc_stats" that have aggregate information about RPC 
counts and latency.


Oh, right, these tell a lot. Isn't there also something to log the 
utilization and location of these threads? Otherwise, I'll continue 
trying with perf, which seems to be more complex with kernel threads.

Re: [lustre-discuss] kernel threads for rpcs in flight

2024-04-30 Thread Andreas Dilger via lustre-discuss
On Apr 29, 2024, at 02:36, Anna Fuchs wrote:

Hi Andreas.

Thank you very much, that helps a lot.
Sorry for the confusion, I primarily meant the client. The servers rarely have 
to compete with anything else for CPU resources I guess.

The mechanism to start new threads is relatively simple.  Before a server 
thread is processing a new request, if it is the last thread available, and not 
the maximum number of threads are running, then it will try to launch a new 
thread; repeat as needed.  So the thread  count will depend on the client RPC 
load and the RPC processing rate and lock contention on whatever resources 
those RPCs are accessing.

And what conditions are on the client? Are the threads then driven by the 
workload of the application somehow?

The number of ptlrpc threads per CPT is set by the "ptlrpcd_partner_group_size" 
module parameter, and defaults to 2 threads per CPT, IIRC.  I don't think that 
clients dynamically start/stop ptlrpcd threads at runtime.
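
(A quick way to see what is actually running - a sketch, assuming the ptlrpcd 
parameters live under the ptlrpc module and the kernel threads are named 
ptlrpcd_<cpt>_<index> plus ptlrpcd_rcv, which may vary by release:)

# partner group size, i.e. ptlrpcd threads per CPT
cat /sys/module/ptlrpc/parameters/ptlrpcd_partner_group_size
# list the ptlrpcd kernel threads and the CPU each last ran on (PSR column)
ps -eo pid,psr,comm | grep ptlrpcd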

Imagine an edge case where all but one core are pinned and at 100% constant 
load and one is dumping RAM to Lustre. Presumably, the available core will be 
taken. But will Lustre or the kernel then spawn additional threads and try to 
somehow interleave them with those of the application, or will it simply handle 
it with 1-2 threads on the available core (assume single stream to single OST)? 
In any case, I suppose the I/O transfer would suffer under the resource 
shortage, but my question would be to what extent it would (try to) hinder the 
application. For latency-critical applications, such small delays can already 
lead to idle waves. And surely, the Lustre threads are usually not CPU-hungry, 
but they will be when it comes to encryption and compression.

When there are RPCs in the queue for any ptlrpcd it will be woken up and 
scheduled by the kernel, so it will compete with the application threads.  
IIRC, if a ptlrpcd thread is woken up and there are no RPCs in the local CPT 
queue it will try to steal RPCs from another CPT on the assumption that the 
local CPU is not generating any RPCs so it would be beneficial to offload 
threads on another CPU that *is* generating RPCs.  If the application thread is 
extremely CPU hungry, then the kernel will not schedule the ptlrpcd threads on 
those cores very often, and the "idle" core ptlrpcd threads will be able to 
run more frequently.

Whether this behavior is optimal or not is subject to debate, and 
investigation/improvements are of course welcome.  Definitely, data checksums 
have some overhead (a few percent), and client-side data compression (which is 
done by ptlrpcd threads) would have a significant usage of CPU cycles, but 
given the large number of CPU cores on client nodes these days this may still 
provide a net performance benefit if the IO bottleneck is on the server.

With max_rpcs_in_flight = 1, multiple cores are loaded, presumably alternately, 
but the statistics are too inaccurate to capture this.  The distribution of 
threads to cores is regulated by the Linux kernel, right? Does anyone have 
experience with what happens when all CPUs are under full load with the 
application or something else?

Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, so a 
single client can still have tens or hundreds of RPCs in flight to different 
servers.  The client will send many RPC types directly from the process 
context, since they are waiting on the result anyway.  For asynchronous bulk 
RPCs, the ptlrpcd thread will try to process the bulk IO on the same CPT (= 
Lustre CPU Partition Table, roughly aligned to NUMA nodes) as the userspace 
application was running when the request was created.  This minimizes the 
cross-NUMA traffic when accessing pages for bulk RPCs, so long as those cores 
are not busy with userspace tasks.  Otherwise, the ptlrpcd thread on another 
CPT will steal RPCs from the queues.

Do the Lustre threads suffer? Is there a prioritization of the Lustre threads 
over other tasks?

Are you asking about the client or the server?  Many of the client RPCs are 
generated by the client threads, but the running ptlrpcd threads do not 
have a higher priority than client application threads.  If the application 
threads are running on some cores, but other cores are idle, then the ptlrpcd 
threads on other cores will try to process the RPCs to allow the application 
threads to continue running there.  Otherwise, if all cores are busy (as is 
typical for HPC applications) then they will be scheduled by the kernel as 
needed.

Are there readily available statistics or tools for this scenario?

What statistics are you looking for?  There are "{osc,mdc}.*.stats" and 
"{osc,mdc}.*rpc_stats" that have aggregate information about RPC counts and 
latency.
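
(For example - a sketch; writing to the file should reset the counters, but 
check the manual for your version:)

lctl get_param osc.*.rpc_stats      # pages-per-RPC and RPCs-in-flight histograms
lctl set_param osc.*.rpc_stats=0    # clear before a measurement run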

Oh, right, these tell a lot. Isn't there also something to log the utilization 
and location of these threads? Otherwise, I'll continue trying with perf, which 
seems to be more complex with kernel threads.

Re: [lustre-discuss] kernel threads for rpcs in flight

2024-04-29 Thread Anna Fuchs via lustre-discuss

Hi Andreas.

Thank you very much, that helps a lot.
Sorry for the confusion, I primarily meant the client. The servers 
rarely have to compete with anything else for CPU resources I guess.


The mechanism to start new threads is relatively simple.  Before a 
server thread is processing a new request, if it is the last thread 
available, and not the maximum number of threads are running, then it 
will try to launch a new thread; repeat as needed.  So the thread 
 count will depend on the client RPC load and the RPC processing rate 
and lock contention on whatever resources those RPCs are accessing.
And what conditions are on the client? Are the threads then driven by 
the workload of the application somehow?


Imagine an edge case where all but one core are pinned and at 100% 
constant load and one is dumping RAM to Lustre. Presumably, the 
available core will be taken. But will Lustre or the kernel then spawn 
additional threads and try to somehow interleave them with those of the 
application, or will it simply handle it with 1-2 threads on the 
available core (assume single stream to single OST)? In any case, I 
suppose the I/O transfer would suffer under the resource shortage, but 
my question would be to what extent it would (try to) hinder the 
application. For latency-critical applications, such small delays can 
already lead to idle waves. And surely, the Lustre threads are usually 
not CPU-hungry, but they will be when it comes to encryption and compression.


With max_rpcs_in_flight = 1, multiple cores are loaded, 
presumably alternately, but the statistics are too inaccurate to 
capture this.  The distribution of threads to cores is regulated by 
the Linux kernel, right? Does anyone have experience with what 
happens when all CPUs are under full load with the application or 
something else?


Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, 
so a single client can still have tens or hundreds of RPCs in flight 
to different servers.  The client will send many RPC types directly 
from the process context, since they are waiting on the result anyway. 
 For asynchronous bulk RPCs, the ptlrpcd thread will try to process 
the bulk IO on the same CPT (= Lustre CPU Partition Table, roughly 
aligned to NUMA nodes) as the userspace application was running when 
the request was created.  This minimizes the cross-NUMA traffic when 
accessing pages for bulk RPCs, so long as those cores are not busy 
with userspace tasks.  Otherwise, the ptlrpcd thread on another CPT 
will steal RPCs from the queues.


Do the Lustre threads suffer? Is there a prioritization of the Lustre 
threads over other tasks?


Are you asking about the client or the server?  Many of the client 
RPCs are generated by the client threads, but the running ptlrpcd 
threads do not have a higher priority than client application threads. 
 If the application threads are running on some cores, but other cores 
are idle, then the ptlrpcd threads on other cores will try to process 
the RPCs to allow the application threads to continue running there. 
 Otherwise, if all cores are busy (as is typical for HPC applications) 
then they will be scheduled by the kernel as needed.



Are there readily available statistics or tools for this scenario?


What statistics are you looking for?  There are "{osc,mdc}.*.stats" 
and "{osc,mdc}.*rpc_stats" that have aggregate information about RPC 
counts and latency.
Oh, right, these tell a lot. Isn't there also something to log the 
utilization and location of these threads? Otherwise, I'll continue 
trying with perf, which seems to be more complex with kernel threads.


Thanks for the explanations!

Anna


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud









Re: [lustre-discuss] kernel threads for rpcs in flight

2024-04-28 Thread Andreas Dilger via lustre-discuss
On Apr 28, 2024, at 16:54, Anna Fuchs via lustre-discuss wrote:

The setting max_rpcs_in_flight affects, among other things, how many threads 
can be spawned simultaneously for processing the RPCs, right?

The {osc,mdc}.*.max_rpcs_in_flight are actually controlling the maximum number 
of RPCs a *client* will have in flight to any MDT or OST, while the number of 
MDS and OSS threads is controlled on the server with 
mds.MDS.mdt*.threads_{min,max} and ost.OSS.ost*.threads_{min,max} for each of 
the various service portals (which are selected by the client based on the RPC 
type).  The max_rpcs_in_flight allows concurrent operations on the client for 
multiple threads to hide network latency and to improve server utilization, 
without allowing a single client to overwhelm the server.
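
(A sketch of how to query and tune these - the parameter names follow the 
patterns above, and the value 8 is only an example:)

# client side, per OST target
lctl get_param osc.*.max_rpcs_in_flight
lctl set_param osc.*.max_rpcs_in_flight=8
# server side (run on the OSS/MDS)
lctl get_param ost.OSS.ost*.threads_min ost.OSS.ost*.threads_max
lctl get_param mds.MDS.mdt*.threads_min mds.MDS.mdt*.threads_max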

In tests where the network is clearly a bottleneck, this setting has almost no 
effect - the network cannot keep up with processing the data, there is not so 
much to do in parallel.
With a faster network, the stats show higher CPU utilization on different cores 
(at least on the client).

What is the exact mechanism by which it is decided that a kernel thread is 
spawned for processing a bulk? Is there an RPC queue with timings or something 
similar?
Is it in any way predictable or calculable how many threads a specific workload 
will require (spawn if possible) given the data rates from the network and 
storage devices?

The mechanism to start new threads is relatively simple.  Before a server 
thread is processing a new request, if it is the last thread available, and not 
the maximum number of threads are running, then it will try to launch a new 
thread; repeat as needed.  So the thread  count will depend on the client RPC 
load and the RPC processing rate and lock contention on whatever resources 
those RPCs are accessing.

With max_rpcs_in_flight = 1, multiple cores are loaded, presumably alternately, 
but the statistics are too inaccurate to capture this.  The distribution of 
threads to cores is regulated by the Linux kernel, right? Does anyone have 
experience with what happens when all CPUs are under full load with the 
application or something else?


Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, so a 
single client can still have tens or hundreds of RPCs in flight to different 
servers.  The client will send many RPC types directly from the process 
context, since they are waiting on the result anyway.  For asynchronous bulk 
RPCs, the ptlrpcd thread will try to process the bulk IO on the same CPT (= 
Lustre CPU Partition Table, roughly aligned to NUMA nodes) as the userspace 
application was running when the request was created.  This minimizes the 
cross-NUMA traffic when accessing pages for bulk RPCs, so long as those cores 
are not busy with userspace tasks.  Otherwise, the ptlrpcd thread on another 
CPT will steal RPCs from the queues.

Do the Lustre threads suffer? Is there a prioritization of the Lustre threads 
over other tasks?

Are you asking about the client or the server?  Many of the client RPCs are 
generated by the client threads, but the running ptlrpcd threads do not 
have a higher priority than client application threads.  If the application 
threads are running on some cores, but other cores are idle, then the ptlrpcd 
threads on other cores will try to process the RPCs to allow the application 
threads to continue running there.  Otherwise, if all cores are busy (as is 
typical for HPC applications) then they will be scheduled by the kernel as 
needed.

Are there readily available statistics or tools for this scenario?

What statistics are you looking for?  There are "{osc,mdc}.*.stats" and 
"{osc,mdc}.*rpc_stats" that have aggregate information about RPC counts and 
latency.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud






