>-----Original Message-----
>From: Kevin Traynor <[email protected]>
>Sent: Wednesday, 29 October 2025 18:28
>To: Eli Britstein <[email protected]>; Ilya Maximets <[email protected]>;
>Eelco Chaudron <[email protected]>; [email protected]
>Cc: Simon Horman <[email protected]>; Maor Dickman
><[email protected]>; Gaetan Rivet <[email protected]>
>Subject: Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from Related Ports
>
>External email: Use caution opening links or attachments
>
>
>On 26/10/2025 08:24, Eli Britstein wrote:
>>
>>
>>> -----Original Message-----
>>> From: Kevin Traynor <[email protected]>
>>> Sent: Friday, 24 October 2025 18:40
>>> To: Eli Britstein <[email protected]>; Ilya Maximets <[email protected]>;
>>> Eelco Chaudron <[email protected]>; [email protected]
>>> Cc: Simon Horman <[email protected]>; Maor Dickman <[email protected]>
>>> Subject: Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from Related Ports
>>>
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On 23/10/2025 15:08, Eli Britstein wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Ilya Maximets <[email protected]>
>>>>> Sent: Thursday, 23 October 2025 15:30
>>>>> To: Eelco Chaudron <[email protected]>; [email protected];
>>>>> Kevin Traynor <[email protected]>
>>>>> Cc: Eli Britstein <[email protected]>; Simon Horman <[email protected]>;
>>>>> Maor Dickman <[email protected]>; [email protected]
>>>>> Subject: Re: [ovs-dev] PMD Scheduling: Grouping RX Queues from Related Ports
>>>>>
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> On 10/23/25 1:48 PM, Eelco Chaudron via dev wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> We’d like to bring a design discussion to the community regarding a
>>>>>> requirement for RX queues from different ports to be grouped on the
>>>>>> same PMD.
>>>>>> We’ve had some initial talks with the NVIDIA team (who are CC’d), and I
>>>>>> think this discussion will benefit from upstream feedback and
>>>>>> involvement.
>>>>>>
>>>>>> Here is the background and context:
>>>>>>
>>>>>> The goal is to automatically (i.e., without user configuration) group
>>>>>> together the same queue IDs from different, but related, ports. A key
>>>>>> use case is an E-Switch manager (e.g., p0) and its VF representors
>>>>>> (e.g., pf0vf0, pf0vf1).
>>>>>
>>>>> Could you explain why it is a requirement to poll the same queue ID of
>>>>> different, though related, ports from the same thread? It's not obvious.
>>>>> I suspect that in a typical setup with hardware offload, most of the
>>>>> ports will be related this way.
>>>> [Eli Britstein]
>>>>
>>>> With DOCA ports, we call rx-burst only for the ESW manager port (for a
>>>> specific queue). In the same burst we get packets from this port (e.g. p0)
>>>> as well as from all its representors (pf0vf0, pf0vf1, etc.).
>>>> The HW is configured to set the mark field as metadata carrying the
>>>> port-id of that packet.
>>>> Then we go over this burst and classify the packets into a per-port
>>>> (for that queue #) data structure.
>>>> The OVS model calls "input" per port; we then return the burst from that
>>>> data structure.
>>>> Since this data structure is not thread safe, it works for us if we force
>>>> a specific queue of all those ports to be processed by the same PMD
>>>> thread.
>>>
>>> Hi All. IIUC, the lack of thread safety in netdev_rxq_recv calls to
>>> these related rxqs is the root of the issue.
>>>
>>> You mentioned the packets are already on a per-port/per-queue data
>>> structure which seems good, but it's not thread safe. Can it be made
>>> thread safe?
>>>
>>>> That PMD thread will loop over all of them (by its poll_list). For each
>>>> it calls netdev_rxq_recv().
>>>> Under the hood, we do the above (reading a burst from HW only for the
>>>> ESW manager, classifying and returning the classified burst).
>>>>
>>>> For the scheduling (just for reference, in our downstream code) we did
>>>> the scheduling in two phases (a change in sched_numa_list_schedule):
>>>> The first iteration skips representor ports. Only ESW manager ports are
>>>> scheduled (summing up the cycles, if needed, for the port itself and its
>>>> representors). The scheduled RXQs are kept in a list.
>>>> The 2nd iteration schedules the representor ports. They are not scheduled
>>>> according to any algorithm but simply get the scheduled PMD of their ESW
>>>> manager (with the help of the list from the first iteration).
>>>>
>>>> This is tailored for DOCA mode. As part of the effort to upstream DOCA
>>>> support, we wanted a more generic approach.
>>>>
>>>
>>> rxq_scheduling() seems like the wrong layer to be trying to consider
>>> netdev-specific thread safety requirements. In terms of the actual
>>> grouping scheme, I don't think it's something really generic that other
>>> netdevs would likely use. So baking a specific scheme into
>>> rxq_scheduling() feels a bit dubious.
>>>
>>> Another issue worth mentioning is that the 'cycles' and 'roundrobin'
>>> algorithms spread the rxqs as evenly as possible across the available
>>> cores, so that would conflict with this type of grouping from a user
>>> perspective.
>> [Eli Britstein]
>> Making the data structure thread safe still has gaps.
>> Yes, it will allow scheduling each rxq independently (at the cost of
>> complexity, memory consumption and perhaps performance).
>> However, bad scheduling can occur.
>> For example, consider a scenario in which the ESW port itself doesn't get
>> packets, but its representors do. An ESW rxq can be scheduled on a PMD with
>> another very busy port.
>> Since the representors depend on RX of the ESW port to get packets to
>> process, the result is their starvation. In turn, they will then consume
>> fewer cycles and could be scheduled even worse.
>>
>
>Unless an rxq is pinned and the core is isolated, this can happen for an rxq
>on any PMD core at any time. Even if the representor queues are on the same
>PMD core, there is nothing to stop rxqs from the same or different ports
>delaying processing of the ESW rxq. That is just the nature of multiple rxqs
>sharing a PMD core.

[Eli Britstein] If they are scheduled as a group, it means the same thread is
doing their RX and processing. All packets read from the HW are handled in the
same thread loop. There won't be any new packets read from HW until all
already-read packets (from all related ports) are processed.
The point is that the thread that does the RX from HW and classifies the batch
into the data structure is the bottleneck for the other ports. If that RXQ gets
fewer cycles, the related representor RXQs are affected, even if they are
scheduled on a less busy thread. I can't see how such starvation can happen
this way. Am I missing something?
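Just to make the model above more concrete, here is a rough sketch of the
classification step. All names (hw_rx_burst(), struct port_batch,
esw_queue_fill(), port_rxq_recv()) are made up for illustration and this is
not the actual OVS-DOCA code; it only shows why the per-port buffers of a
given queue id tie the ESW manager and its representors to a single PMD
thread:

/* Illustration only: made-up names, not the actual OVS-DOCA code.  One
 * rx-burst on the ESW manager's queue is demuxed by the HW-set mark
 * (port id) into per-port buffers for that queue id; each port's
 * rxq_recv then drains its own buffer.  Nothing is locked, so all
 * ports sharing a queue id must be polled by the same PMD thread. */

#include <stdint.h>

#define MAX_PORTS 64
#define MAX_BURST 32

struct pkt {
    uint32_t mark;                   /* Source port id, set by HW steering. */
    /* ... packet data ... */
};

struct port_batch {
    struct pkt *pkts[MAX_BURST];
    int n;
};

struct esw_queue {
    struct port_batch batch[MAX_PORTS];   /* Indexed by port id. */
};

/* Hypothetical HW receive call, made only for the ESW manager port. */
int hw_rx_burst(struct pkt **pkts, int max);

/* Polling the ESW manager rxq: read one burst from HW and classify it
 * into the per-port buffers of this queue. */
static void
esw_queue_fill(struct esw_queue *q)
{
    struct pkt *burst[MAX_BURST];
    int n = hw_rx_burst(burst, MAX_BURST);

    for (int i = 0; i < n; i++) {
        uint32_t port = burst[i]->mark;      /* Assumed < MAX_PORTS. */
        struct port_batch *b = &q->batch[port];

        if (b->n < MAX_BURST) {
            b->pkts[b->n++] = burst[i];      /* No locking here. */
        }
    }
}

/* Polling any port (ESW manager or representor) for this queue id:
 * drain the batch classified above.  'out' must have room for
 * MAX_BURST packets. */
static int
port_rxq_recv(struct esw_queue *q, uint32_t port_id, struct pkt **out)
{
    struct port_batch *b = &q->batch[port_id];
    int n = b->n;

    for (int i = 0; i < n; i++) {
        out[i] = b->pkts[i];
    }
    b->n = 0;                                /* No locking here either. */
    return n;
}

Both the fill path and the drain path touch the same per-port buffers without
any locking, which is the thread-safety constraint discussed above.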
>
>If that is the case, the best way to try and avoid an rxq that is too heavily
>loaded is to add more rxqs to spread its load (assuming that RSS will be
>effective).
>
>> If they are grouped together, and the cycles (for example) used are the sum
>> for the group, this is avoided.
>>
>
>Grouping rxqs, summing their cycles and assigning them to PMD cores as a
>group means there is less granularity in the unit that is being assigned.
>That makes balancing the load across available PMDs less effective because
>where OVS could assign a single rxq to a PMD core based on its load, now it
>can only assign in groups.

[Eli Britstein] Indeed, but that's how this type of netdev works.

>
>> There is another note: n_rxqs is meaningful to configure only on the ESW
>> port. Representors cannot be independently configured but will get their
>> ESW port's n_rxqs.
>> However, this policy can be enforced in the netdev layer, so no DPIF
>> support is required here.
>>
>>
>>> thanks,
>>> Kevin.
>>>
>>>>>
>>>>>>
>>>>>> This new grouping logic must also respect existing scheduling
>>>>>> algorithms like ‘cycles’. For example, if ‘cycles’ is used, the
>>>>>> scheduler would need to base its decision on the sum of cycles for
>>>>>> all RX queues within that group.
>>>>>>
>>>>>> For this, we think we need some kind of netdev API that tells the
>>>>>> rxq_scheduling() function which port-queues belong to a group.
>>>>>> Once this group is known, the algorithm can perform the proper
>>>>>> calculation on the aggregated group.
>>>>>>
>>>>>> Does this approach sound reasonable? We are very open to other
>>>>>> ideas on how to discover these related queues.
>>>>>>
>>>>>> Kevin, I’ve copied you in, as you did most of the existing
>>>>>> implementation, so any feedback is appreciated.
>>>>>>
>>>>>> Cheers,
>>>>>> Eelco
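To keep the discussion concrete, here is a rough sketch of how a grouping hook
plus a group-aware pass in rxq_scheduling() could look. Everything below is
hypothetical: netdev_get_rxq_group(), struct sched_rxq and pick_pmd() are
made-up names, not existing OVS APIs. It assumes each group has exactly one
leader rxq (e.g. the ESW manager's rxq for that queue id) and only illustrates
the two-pass idea: leaders are placed using the summed cycles of their whole
group, and members then inherit the leader's PMD:

/* Sketch for discussion only, not existing OVS code. */

#include <stdbool.h>
#include <stdint.h>

struct netdev;                       /* Stand-ins for the OVS structs. */
struct sched_pmd;

/* Hypothetical hook: rxqs of related ports report the same non-negative
 * group id for a given queue index; unrelated rxqs report -1. */
int netdev_get_rxq_group(const struct netdev *netdev, int qid);

struct sched_rxq {
    int group_id;                    /* -1 if the rxq is not grouped. */
    bool is_group_leader;            /* e.g. the ESW manager's rxq. */
    uint64_t cycles;                 /* Measured busy cycles of this rxq. */
    struct sched_pmd *pmd;           /* Assignment result. */
};

static void
schedule_grouped(struct sched_rxq *rxqs, int n,
                 struct sched_pmd *(*pick_pmd)(uint64_t cycles))
{
    /* Pass 1: ungrouped rxqs and group leaders. */
    for (int i = 0; i < n; i++) {
        if (rxqs[i].group_id >= 0 && !rxqs[i].is_group_leader) {
            continue;
        }

        uint64_t total = rxqs[i].cycles;
        if (rxqs[i].group_id >= 0) {
            for (int j = 0; j < n; j++) {
                if (j != i && rxqs[j].group_id == rxqs[i].group_id) {
                    total += rxqs[j].cycles;   /* Weight by the group. */
                }
            }
        }
        rxqs[i].pmd = pick_pmd(total);
    }

    /* Pass 2: group members simply inherit their leader's PMD. */
    for (int i = 0; i < n; i++) {
        if (rxqs[i].group_id < 0 || rxqs[i].is_group_leader) {
            continue;
        }
        for (int j = 0; j < n; j++) {
            if (rxqs[j].is_group_leader
                && rxqs[j].group_id == rxqs[i].group_id) {
                rxqs[i].pmd = rxqs[j].pmd;
                break;
            }
        }
    }
}

As Kevin points out above, assigning whole groups instead of individual rxqs
reduces the granularity the balancer can work with, so any such scheme trades
some load-balancing precision for the grouping constraint.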
