On 6/2/26 5:24 PM, Gaetan Rivet wrote:
> On Fri May 29, 2026 at 6:26 PM CEST, Kevin Traynor wrote:
>> On 5/28/26 10:29 AM, Eelco Chaudron wrote:
>>>
>>>
>>> On 27 May 2026, at 16:37, Gaetan Rivet wrote:
>>>
>>>> On Thu Apr 2, 2026 at 12:41 PM CEST, Kevin Traynor via dev wrote:
>>>>> On 4/1/26 1:03 PM, Eelco Chaudron via dev wrote:
>>>>>>
>>>>>>
>>>>>> On 1 Apr 2026, at 13:57, Eelco Chaudron via dev wrote:
>>>>>>
>>>>>>> This patch adds support for specific PMD thread initialization,
>>>>>>> deinitialization, and a callback execution to perform work as
>>>>>>> part of the PMD thread loop. This allows hardware offload
>>>>>>> providers to handle any specific asynchronous or batching work.
>>>>>>>
>>>>>>> This patch also adds cycle statistics for the provider-specific
>>>>>>> callbacks to the 'ovs-appctl dpif-netdev/pmd-perf-show' command.
>>>>>>
>>>>>> Bringing back the discussion on the earlier patch between Ilya and
>>>>>> Gaetan to this revision :)
>>>>>>
>>>>>> Ilya:
>>>>>> Hi, Eelco. As we talked before, this infrastructure resembles the
>>>>>> async work infra that was proposed in the past for the use case of
>>>>>> async vhost processing. And I don't see any real use case proposed
>>>>>> for it here nor in the RFC, where the question was asked, but not
>>>>>> replied.
>>>>>>
>>>>>> Gaetan:
>>>>>>
>>>>>
>>>>> Hi Gaetan,
>>>>>
>>>>> A few questions below. I'm not so clear on the DOCA threading
>>>>> requirements, so questions may be broad.
>>>>>
>>>>>> Hi Ilya, Eelco,
>>>>>>
>>>>>> Thanks for the patch and for the review.
>>>>>>
>>>>>> The use-case on our side is distributed data-structures in DOCA that
>>>>>> requires each participating threads to do maintenance work
>>>>>> periodically.
>>>>>>
>>>>>> Specifically, offload threads will insert offload objects.
>>>>>> Those will reserve entries in a map that can be resized.
>>>>>> The DOCA implementation requires any thread that owns an entry to
>>>>>> perform the work of moving it to the new bucket / space after resize
>>>>>> is initiated.
>>>>>>
>>>>>> This is a pervasive design choice in DOCA, they write most of their
>>>>>> APIs assuming participating threads are periodically calling into
>>>>>> these maintenance functions.
>>>>>>
>>>>>
>>>>> What is a "participating thread" ? IIUC, the pmd thread passes down the
>>>>> flow pattern/action and the offload thread inserts the offload into the
>>>>> NIC.
>>>>>
>>>>> In that case, is it the offload thread that owns the entry ?
>>>>>
>>>>
>>>> Participating threads are any threads that registered to DOCA-flow as
>>>> offloading threads. In our case, it means:
>>>>
>>>> * The main thread
>>>>   --> When probing a port, starting it requires installing
>>>>       DOCA offloads to execute RSS in particular, and a few other
>>>>       'admin' offloads (optional rate-limiting on VF to avoid
>>>>       noisy-neighbors, etc).
>>>>
>>>> * The offload thread(s) (in the OVS sense)
>>>>   A thread in OVS managing dp-flow offloads asynchronously.
>>>>
>>>> * The polling thread(s)
>>>>   CT-offload is much simpler and faster than dp-flow offload.
>>>>   Executing offload insertion synchronously from the fastpath
>>>>   is beneficial.
>>>>
>>>> In our case, 'participating threads' are any thread owning an offload
>>>> queue in DOCA-flow.
>>>>
>>>> We have a few exceptions for the main thread, mainly that we force all
>>>> offload operations to be fully synchronous there: we do not want to
>>>> publish a new netdev if its 'admin' offloads have not yet been received
>>>> and successfully acknowledged by the hardware, so we force waiting
>>>> operations for it: it does not need to do regular upkeep etc.
>>>>
>>>>>> Some of such work is also time-sensitive, for example the current
>>>>>> implementation requires a CT offload thread to receive completions
>>>>>> after some hardware initialization.
>>>>>> Until this completion is done, the CT offload entry is not fully
>>>>>> usable (cannot be queried for activity / counters). We cannot leave
>>>>>> batches of CT offload entry waiting for completion, assuming that at
>>>>>> some later point, we will eventually re-execute something in our
>>>>>> offload provider: it leaves a few stranded connection objects
>>>>>> incomplete.
>>>>>>
>>>>>> This has the result of having hardware execution of a flow with CT
>>>>>> actions, but no activity counters: the software datapath then deletes
>>>>>> the connection and/or flow due to inactivity.
>>>>>>
>>>>>
>>>>> Can this periodic work be done by the offload thread ? If it is fast
>>>>> enough for inserting the offload, then maybe it is fast enough for this.
>>>>>
>>>>
>>>> The PMD thread owns the offload queue. If another thread has to execute
>>>> its upkeep work, it means sharing the queue between threads.
>>>>
>>>>> Some DPDK PMDs use alarms for periodic maintenance work, could they be
>>>>> used inside DOCA for this?
>>>>>
>>>>
>>>> Those upkeep functions are exposed by DOCA and part of the DOCA-flow
>>>> API. DOCA does not expose an event framework to schedule this kind of
>>>> work, it requires DOCA applications to explicitly call those functions.
>>>>
>>>>> If it needs to be on the PMD thread, is the work significant (i.e. more
>>>>> than a few % cpu) and how variable is it ? Could it be added inside the
>>>>> call to rte_eth_rx_burst polling ?
>>>>>
>>>>
>>>> It can be significant.
>>>> The work is anything requiring the use of the offload queue owned by
>>>> this thread. The principle is that the owning thread must execute it.
>>>>
>>>> Currently, with CT offloads we have:
>>>>
>>>> * offload queue polling for HW completion (requests have been
>>>>   executed: add / mod / del were executed)
>>>>
>>>> * CT-del: A conn was offloaded by PMD 1.
>>>>   The connection either expired or another PMD 2 closed it:
>>>>   ct-clean or PMD-2 send a CT-del request to PMD-1: PMD-1 must
>>>>   poll for CT-del requests and execute them locally.
>>>>
>>>> * Offload flush: when a port is deleted, all owning threads must
>>>>   process a blocking flush request from the main thread. The main
>>>>   thread only proceeds once all participating threads have completed
>>>>   their flush.
>>>>
>>>> Completion is a very lightweight work, but we must execute it.
>>>> Generally we do only completion polling as needed: we only clear enough
>>>> room in the offload queue for the current batch of requests we want to
>>>> enqueue, but we have an issue on idle: some stray completion can
>>>> be left in the queue and won't be processed if we rely only on activity.
>>>> Currently DOCA-flow does not support leaving the completions until the
>>>> port is deleted: they need to be processed.
>>>>
>>>> CT-del can be significant in some cases. We have a 'rolling-window' case
>>>> of constant open + close of short connections, and in this worst case,
>>>> CT-del takes ~30% (both local and distant). Some portion of it comes from
>>>> CT-del messages, in particular in case of multiple PMDs.
>>>>
>>>> Offload flush is generally quick, but we must answer the flush message
>>>> quickly to block the main thread as little as possible.
>>>>
>>>> Some of the messages must be handled even if there is no RX-burst: a PMD
>>>> that is waiting for reload will need to execute a flush message that it
>>>> has received.
>>>
>>> Hi Gaetan,
>>>
>>> I guess Kevin is suggesting to hide this work in netdev_doca_rxq_recv(),
>>> as it will always be called as long as DOCA ports are present on the
>>> PMD. Or are there cases where this is not the case?
>>>
>>> dp_netdev_process_rxq_port()
>>>   netdev_rxq_recv()
>>>     netdev_doca_rxq_recv()
>>>
>>> Kevin, please confirm.
>>
>> Yes, that's what I was suggesting.
>> The work is rxq specific and we already have an rxq specific call that
>> is called in a loop so why not do it there and include the cycles needed
>> for the maintenance work in the measured cycles needed for that rxq.
>>
>>>
>>>> I think completions and flushes would be the main issues with the
>>>> rx-burst approach.
>
> Hi,
Hi Gaetan,

Thanks for explaining further.

>
> We had an issue with this kind of approach with flush commands.
> A PMD can be registered as a DOCA offload thread, in which case it
> will receive a blocking flush request on port deletion.
> This happens even if that port is not scheduled on that PMD.
>
> The issue arises when the PMD has no netdev-doca rxq scheduled: it
> is registered as a DOCA offload thread, but will never process its flush
> requests. A typical example might be on multi-NUMA, where by default 1
> PMD is created per NUMA, and ports are configured with 1 rxq. With a
> single NIC, its rxq is configured on the closest PMD, leaving the other
> one idle. The idle PMD is still registered as a DOCA offload thread, as
> nothing forbids the user from adding a port on its NUMA at a future
> time.
>

IIUC, the same issue will be present with the approach in this patch, as
the PMD thread will block if there are no rxqs to poll.

Another issue is that even when rxqs are being polled, sleep settings
could delay the flush, which means blocking.

> In this case, the idle PMD would never enter the right rxq-burst command
> to process its offload messages.
>
> All other cases would seem fine however, I think it almost works.
> I just don't have a solid approach for this flush issue.
>

Waiting for a flush from PMD threads that aren't doing anything meaningful
related to the offload or to rxqs from the device is not ideal and creates
a few headaches. Maybe you could dynamically register/unregister them as
needed, or find a way to not require a flush from the ones that aren't
actively involved, but I'm just thinking out loud.

thanks,
Kevin.
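
To make the last idea a bit more concrete (not requiring a flush from
threads that aren't actively involved), here is a minimal C sketch. It is
only an illustration: the struct and the function names are hypothetical
and are not OVS or DOCA APIs. It also assumes that a registered thread with
no DOCA rxq owns no outstanding offload entries, which may not hold in
practice (for example if rxqs were moved away from it earlier).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread bookkeeping; not an OVS or DOCA structure. */
struct offload_thread {
    bool registered;      /* Thread owns an offload queue. */
    unsigned int n_rxqs;  /* DOCA rxqs currently polled by this thread. */
    bool flush_pending;   /* A flush request is waiting to be handled. */
};

/* Assumption: only a registered thread that polls at least one DOCA rxq
 * is "actively involved" and needs to answer a flush. */
static bool
offload_thread_is_active(const struct offload_thread *t)
{
    return t->registered && t->n_rxqs > 0;
}

/* Posts a flush request to every active thread and returns the number of
 * acknowledgements the requester has to wait for.  Idle threads are
 * skipped, so they can never stall the requester. */
static size_t
flush_request_post(struct offload_thread *threads, size_t n)
{
    size_t expected = 0;

    assert(threads != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (offload_thread_is_active(&threads[i])) {
            threads[i].flush_pending = true;
            expected++;
        }
    }
    return expected;
}

/* Called by a thread from its polling loop.  Returns true if a pending
 * flush was handled and should be acknowledged to the requester. */
static bool
flush_request_poll(struct offload_thread *t)
{
    if (!t->flush_pending) {
        return false;
    }
    t->flush_pending = false;
    return true;
}
```

In the multi-NUMA example quoted above (two PMDs, a single rxq on one of
them), flush_request_post() would expect one acknowledgement instead of
two, so the idle PMD would not hold up the port deletion. A real
implementation would also need proper synchronization between the
requester and the PMD threads, which this single-threaded sketch leaves
out.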
