The number of ptlrpcd threads per CPT is set by the
"ptlrpcd_partner_group_size" module parameter, and defaults to 2
threads per CPT, IIRC. I don't think that clients dynamically
start/stop ptlrpcd threads at runtime.
When there are RPCs in the queue for any ptlrpcd it will be woken up
and scheduled by the kernel, so it will compete with the application
threads. IIRC, if a ptlrpcd thread is woken up and there are no RPCs
in the local CPT queue it will try to steal RPCs from another CPT on
the assumption that the local CPU is not generating any RPCs so it
would be beneficial to offload threads on another CPU that *is*
generating RPCs. If the application thread is extremely CPU hungry,
then the kernel will not schedule the ptlrpcd threads on those cores
very often, and the ptlrpcd threads on "idle" cores will be able to run
more frequently.
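You can check the configured values directly via the module parameters;
a sketch, assuming the Lustre 2.x parameter names and sysfs paths
(cpu_npartitions in libcfs, ptlrpcd_per_cpt_max and
ptlrpcd_partner_group_size in ptlrpc) - verify against your version:

```shell
# Sketch only: parameter names/paths assumed from Lustre 2.x and may
# differ by version; on a node without the modules loaded the values
# simply read as unavailable.
out=$(for p in /sys/module/libcfs/parameters/cpu_npartitions \
               /sys/module/ptlrpc/parameters/ptlrpcd_per_cpt_max \
               /sys/module/ptlrpc/parameters/ptlrpcd_partner_group_size; do
    printf '%s = %s\n' "${p##*/}" "$(cat "$p" 2>/dev/null || echo '<unavailable>')"
done)
printf '%s\n' "$out"
```

cpu_npartitions = 0 means the CPT layout is derived automatically from
the NUMA topology.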
Sorry, maybe I am confusing things. I am still not sure how many threads
I get.
For example, I have a 32-core AMD Epyc machine as a client and I am
running a serial-stream IO application with a single stripe, 1 OST.
I am struggling to find out how many CPU partitions I have - is it
something on the hardware side or something configurable?
There is no file /proc/sys/lnet/cpu_partitions on my client.
Assuming I had 2 CPU partitions, that would result in 4 ptlrpcd threads
at system start, right? Now if I set rpcs_in_flight to 1 or to 8, what
effect does that have on the number and the activity of the threads?
With a serial stream and rpcs_in_flight = 1, is only one ptlrpcd thread
woken up, while the other 3 remain inactive/sleep/do nothing?
That does not seem to be the case, as I've applied the rpctracing (thanks
a lot for the hint!!), and with rpcs_in_flight = 1 I still see at least 3
different threads from at least 2 different partitions when writing a 1MB
file with ten blocks.
I don't get the relationship between these values.
And, if I had compression or any other heavy load, which settings could
clearly control how many resources I want to give Lustre for this load?
I can see clear scaling with higher rpcs_in_flight, but I am
struggling to understand the numbers and attribute them to specific
settings. The uncompressed case already benefits a bit from a higher RPC
count due to multiple "substreaming", but there must be much more happening
in parallel behind the scenes in the compressed case, even with
rpcs_in_flight = 1.
Thank you!
Anna
Whether this behavior is optimal or not is subject to debate, and
investigation/improvements are of course welcome. Definitely, data
checksums have some overhead (a few percent), and client-side data
compression (which is done by ptlrpcd threads) would have a
significant usage of CPU cycles, but given the large number of CPU
cores on client nodes these days this may still provide a net
performance benefit if the IO bottleneck is on the server.
With max_rpcs_in_flight = 1, multiple cores are loaded,
presumably alternately, but the statistics are too inaccurate to
capture this. The distribution of threads to cores is regulated by
the Linux kernel, right? Does anyone have experience with what
happens when all CPUs are under full load with the application or
something else?
Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target*
parameter, so a single client can still have tens or hundreds of
RPCs in flight to different servers. The client will send many RPC
types directly from the process context, since they are waiting on
the result anyway. For asynchronous bulk RPCs, the ptlrpcd thread
will try to process the bulk IO on the same CPT (= Lustre CPU
Partition Table, roughly aligned to NUMA nodes) as the userspace
application was running when the request was created. This
minimizes the cross-NUMA traffic when accessing pages for bulk RPCs,
so long as those cores are not busy with userspace tasks.
Otherwise, the ptlrpcd thread on another CPT will steal RPCs from
the queues.
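To make the per-target point concrete, the aggregate cap is the sum of
max_rpcs_in_flight over all targets. A sketch, with simulated input
standing in for "lctl get_param -n osc.*.max_rpcs_in_flight" (the real
output depends on your mounts):

```shell
# Simulated stand-in for: lctl get_param -n osc.*.max_rpcs_in_flight
# (here: a hypothetical client with 4 OSTs, each allowing 8 RPCs in flight)
total=$(printf '8\n8\n8\n8\n' | awk '{sum += $1} END {print sum}')
echo "aggregate cap on OSC RPCs in flight: $total"
```

With these sample values the client could have up to 32 bulk RPCs in
flight in total, even though each individual OST is limited to 8.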
Do the Lustre threads suffer? Is there a prioritization of the
Lustre threads over other tasks?
Are you asking about the client or the server? Many of the client
RPCs are generated by the client threads, but the running
ptlrpcd threads do not have a higher priority than client
application threads. If the application threads are running on some
cores, but other cores are idle, then the ptlrpcd threads on other
cores will try to process the RPCs to allow the application threads
to continue running there. Otherwise, if all cores are busy (as is
typical for HPC applications) then they will be scheduled by the
kernel as needed.
Are there readily available statistics or tools for this scenario?
What statistics are you looking for? There are "{osc,mdc}.*.stats"
and "{osc,mdc}.*.rpc_stats" that have aggregate information about RPC
counts and latency.
Oh, right, these tell a lot. Isn't there also something to log the
utilization and location of these threads? Otherwise, I'll continue
trying with perf, which seems to be more complex with kernel threads.
There are kernel debug logs available when "lctl set_param
debug=+rpctrace" is enabled, that will show which ptlrpcd or
application thread is handling each RPC, and on which core it ran.
These can be found on the client by searching for "Sending
RPC|Completed RPC" in the debug logs, for example:
# lctl set_param debug=+rpctrace
# lctl set_param jobid_var=procname_uid
# cp -a /etc /mnt/testfs
# lctl dk /tmp/debug
# grep -E "Sending RPC|Completed RPC" /tmp/debug
:
:
00000100:00100000:2.0:1714502851.435000:0:23892:0:(client.c:1758:ptlrpc_send_new_req())
Sending RPC req@ffff90c9b2948640 pname:cluuid:pid:xid:nid:opc:job
ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634353438336:0@lo:2:cp.0
00000100:00100000:2.0:1714502851.436117:0:23892:0:(client.c:2239:ptlrpc_check_set())
Completed RPC req@ffff90c9b2948640 pname:cluuid:pid:xid:nid:opc:job
ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634353438336:0@lo:2:cp.0
Shows that thread "ptlrpcd_01_00" (CPT 01, thread 00, pid 23892) was
run on core 2.0 (no hyperthread) and sent an OST_SETATTR (opc = 2)
RPC on behalf of "cp" for root (uid = 0), which completed in 1117 usec.
Similarly, with a "dd" sync write workload it shows write RPCs by the
ptlrpcd threads, and sync RPCs in the "dd" process context:
# dd if=/dev/zero of=/mnt/testfs/file bs=4k count=10000 oflag=dsync
# lctl dk /tmp/debug
# grep -E "Sending RPC|Completed RPC" /tmp/debug
:
:
00000100:00100000:2.0:1714503761.136971:0:23892:0:(client.c:1758:ptlrpc_send_new_req())
Sending RPC req@ffff90c9a6ad6640 pname:cluuid:pid:xid:nid:opc:job
ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634358961024:0@lo:4:dd.0
00000100:00100000:2.0:1714503761.140288:0:23892:0:(client.c:2239:ptlrpc_check_set())
Completed RPC req@ffff90c9a6ad6640 pname:cluuid:pid:xid:nid:opc:job
ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634358961024:0@lo:4:dd.0
00000100:00100000:2.0:1714503761.140518:0:17993:0:(client.c:1758:ptlrpc_send_new_req())
Sending RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
00000100:00100000:2.0:1714503761.141556:0:17993:0:(client.c:2239:ptlrpc_check_set())
Completed RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
00000100:00100000:2.0:1714503761.141885:0:23893:0:(client.c:1758:ptlrpc_send_new_req())
Sending RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0
00000100:00100000:2.0:1714503761.144172:0:23893:0:(client.c:2239:ptlrpc_check_set())
Completed RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0
There are no stats files that aggregate information about ptlrpcd
thread utilization.
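Lacking such stats files, the rpctrace log itself can be aggregated.
A sketch that counts "Sending RPC" events per sending thread, assuming
the record layout shown in the examples above (real logs may be wrapped
differently, so adjust the parsing accordingly):

```shell
# Sample rpctrace records, taken verbatim from the output above;
# normally this text would come from: lctl dk /tmp/debug
sample=$(cat <<'EOF'
00000100:00100000:2.0:1714502851.435000:0:23892:0:(client.c:1758:ptlrpc_send_new_req())
Sending RPC req@ffff90c9b2948640 pname:cluuid:pid:xid:nid:opc:job
ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634353438336:0@lo:2:cp.0
00000100:00100000:2.0:1714503761.140518:0:17993:0:(client.c:1758:ptlrpc_send_new_req())
Sending RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
EOF
)
# Count "Sending RPC" events per thread: the line after "Sending RPC"
# starts with the sending thread name (pname) as its first ':'-field.
counts=$(printf '%s\n' "$sample" | awk -F: '
    /Sending RPC/ { grab = 1; next }
    grab          { print $1; grab = 0 }
' | sort | uniq -c | sort -rn)
echo "$counts"
```

For the sample above this yields one RPC sent by ptlrpcd_01_00 and one
sent directly from the dd process context, which is a cheap way to see
how work is split between application and ptlrpcd threads.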
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org