Hi Anurag,

On 16/03/2022 12:29, Anurag Agarwal wrote:
Hello Kevin,
        Thanks for your inputs.

In this scenario we have one VM each on NUMA0 and NUMA1 (VM1 is on NUMA0, VM2 
is on NUMA1), and the dpdk port is on NUMA1.

Without cross-numa-polling, VM/VHU queue traffic is evenly distributed based on 
load on their respective NUMA sockets.

However, DPDK traffic is load balanced only across NUMA1 PMDs, which leads to 
an aggregate load imbalance in the system (i.e. NUMA1 PMDs carry more load 
than NUMA0 PMDs).

Please refer to the example below (cross-numa-polling is not enabled):

pmd thread numa_id 0 core_id 2:
   isolated : false
   port: vhu-vm1p1         queue-id:  2 (enabled)   pmd usage: 11 %
   port: vhu-vm1p1         queue-id:  4 (enabled)   pmd usage:  0 %
   overhead:  0 %
pmd thread numa_id 1 core_id 3:
   isolated : false
   port: dpdk0             queue-id:  0 (enabled)   pmd usage: 13 %
   port: dpdk0             queue-id:  2 (enabled)   pmd usage: 15 %
   port: vhu-vm2p1         queue-id:  3 (enabled)   pmd usage:  9 %
   port: vhu-vm2p1         queue-id:  4 (enabled)   pmd usage:  0 %
   overhead:  0 %

With cross-numa-polling enabled, the rxqs from the DPDK port are distributed to 
both NUMAs, and the 'group' scheduling algorithm then assigns the rxqs to PMDs 
based on load.

Please refer to the example below, after cross-numa-polling is enabled on the dpdk0 port:

pmd thread numa_id 0 core_id 2:
   isolated : false
   port: dpdk0             queue-id:  5 (enabled)   pmd usage: 11 %
   port: vhu-vm1p1         queue-id:  3 (enabled)   pmd usage:  4 %
   port: vhu-vm1p1         queue-id:  5 (enabled)   pmd usage:  4 %
   overhead:  2 %
pmd thread numa_id 1 core_id 3:
   isolated : false
   port: dpdk0             queue-id:  2 (enabled)   pmd usage: 10 %
   port: vhu-vm2p1         queue-id:  0 (enabled)   pmd usage:  4 %
   port: vhu-vm2p1         queue-id:  6 (enabled)   pmd usage:  4 %
   overhead:  3 %
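
For reference, the configuration behind this example looks roughly as follows 
(pmd-rxq-assign and pmd-auto-lb are existing OVS options; the per-Interface 
cross-numa-polling knob comes from the proposed patch, so the option name here 
is only illustrative):

  ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
  ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb=true
  ovs-vsctl set Interface dpdk0 other_config:cross-numa-polling=true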


Yes, this illustrates the operation well. We can infer that, if the traffic rate were increased, the cross-numa-disabled case would hit a bottleneck first, by virtue of having 2 dpdk rxqs + 2 VM rxqs on one core.

Kevin.

Regards,
Anurag

-----Original Message-----
From: Kevin Traynor <ktray...@redhat.com>
Sent: Thursday, March 10, 2022 11:02 PM
To: Anurag Agarwal <anurag.agar...@ericsson.com>; Jan Scheurich 
<jan.scheur...@ericsson.com>; Wan Junjie <wanjun...@bytedance.com>
Cc: d...@openvswitch.org
Subject: Re: [External] Re: [PATCH] dpif-netdev: add an option to assign pmd 
rxq to all numas

On 04/03/2022 17:57, Anurag Agarwal wrote:
Hello Kevin,
             I have prepared a patch for "per port cross-numa-polling" and 
attached it herewith.

The results are captured in 'cross-numa-results.txt'. We see PMD to RxQ 
assignment evenly balanced across all PMDs with this patch.

Please take a look and let us know your inputs.

Hi Anurag,

I think what this is showing is more related to the txqs used for sending to the 
VM. As you are allowing the rxqs from the phy port to be handled by more pmds, 
and all those rxqs have traffic, more txqs are in turn used for sending to 
the VM. The result of using more txqs when sending to the VM in this case is 
that the traffic is returned on more rxqs.

Allowing cross-numa does not guarantee that the different pmd cores will poll 
rxqs from an interface. At least with the group algorithm, the pmds will be 
selected purely on load. The right way to ensure that all VM
txqs(/rxqs) are used is to enable the Tx-steering feature [0].
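
For example, something along these lines on the vhost port (tx-steering is an 
existing Interface option; 'thread' is the default, 'hash' spreads packets 
across all enabled txqs; the interface name is only illustrative):

  ovs-vsctl set Interface vhu0 other_config:tx-steering=hash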

So you might be seeing some benefit in this case, but to me it's not the core 
use case of cross-numa polling. That is more about allowing the pmds on every 
numa to be used when the traffic load is primarily coming from one numa.

Kevin.

[0] 
https://docs.openvswitch.org/en/latest/topics/userspace-tx-steering/

Please find some of my inputs inline, in response to your comments.

Regards,
Anurag

-----Original Message-----
From: Kevin Traynor <ktray...@redhat.com>
Sent: Thursday, February 24, 2022 7:54 PM
To: Jan Scheurich <jan.scheur...@ericsson.com>; Wan Junjie
<wanjun...@bytedance.com>
Cc: d...@openvswitch.org; Anurag Agarwal <anurag.agar...@ericsson.com>
Subject: Re: [External] Re: [PATCH] dpif-netdev: add an option to
assign pmd rxq to all numas

Hi Jan,

On 17/02/2022 14:21, Jan Scheurich wrote:
Hi Kevin,

We have done extensive benchmarking and found that we get better
overall
PMD load balance and resulting OVS performance when we do not
statically pin any rx queues and instead let the auto-load-balancing
find the optimal distribution of phy rx queues over both NUMA nodes
to balance an asymmetric load of vhu rx queues (polled only on the local NUMA 
node).

Cross-NUMA polling of vhu rx queues comes with a very high latency
cost due
to cross-NUMA access to volatile virtio ring pointers in every
iteration (not only when actually copying packets). Cross-NUMA
polling of phy rx queues doesn't have a similar issue.


I agree that for vhost rxq polling, it always causes a performance
penalty when there is cross-numa polling.

For polling phy rxq, when phy and vhost are in different numas, I
don't see any additional penalty for cross-numa polling the phy rxq.

For the case where phy and vhost are both in the same numa, if I
change to poll the phy rxq cross-numa, then I see about a >20% tput
drop for traffic from phy -> vhost. Are you seeing that too?

Yes, but the performance drop is mostly due to the extra cost of copying the 
packets across the UPI bus to the virtio buffers on the other NUMA, not because 
of polling the phy rxq on the other NUMA.


Just to be clear, phy and vhost are on the same numa in my test. I see the drop 
when polling the phy rxq with a pmd from a different numa.


Also, the fact that a different numa can poll the phy rxq after
every rebalance means that the ability of the auto-load-balancer to
estimate and trigger a rebalance is impacted.

Agree, there is some inaccuracy in the estimation of the load a phy rx queue 
creates when it is moved to another NUMA node. So far we have not seen that as 
a practical problem.


It seems like simply pinning some phy rxqs cross-numa would avoid
all the issues above and give most of the benefit of cross-numa polling for phy 
rxqs.
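
e.g. a rough sketch using the existing pmd-rxq-affinity Interface option (the 
queue ids and core ids are only illustrative):

  ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:3,1:2"
  (queue 0 stays on a local-numa core, queue 1 is pinned cross-numa)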

That is what we have done in the past (for lack of alternatives). But any 
static pinning reduces the ability of the auto-load balancer to do its job. 
Consider the following scenarios:

1. The phy ingress traffic is not evenly distributed by RSS due to lack of 
entropy (Examples for this are IP-IP encapsulated traffic, e.g. Calico, or 
MPLSoGRE encapsulated traffic).

2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu 
ports are all on NUMA 0.

In all such scenarios, static pinning of phy rxqs may lead to unnecessarily 
uneven PMD load and loss of overall capacity.


I agree that static pinning may cause a bottleneck if you have more than one rxq 
pinned on a core. On the flip side, pinning removes uncertainty about the 
ability of OVS to make good assignments and about ALB.
[Anurag] Echoing what Jan said, static pinning wouldn't allow rebalancing in 
case the traffic across DPDK and VHU queues is asymmetric. With the 
introduction of per port cross-numa-polling, the user has one more option in 
their toolbox to allow full auto load balancing without worrying at all about 
the rxq to PMD assignments. This also makes the deployment of OVS much simpler. 
The user now only needs to provide the list of CPUs, enable auto load balancing 
and cross-numa-polling (if necessary). All of the rest is handled in software, 
as sketched below.
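
A minimal sketch of that deployment (the CPU mask value is only illustrative; 
cross-numa-polling refers to the proposed per-port option from the attached 
patch, so the option name may change):

  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xc
  ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb=true
  ovs-vsctl set Interface dpdk0 other_config:cross-numa-polling=true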


With the pmd-rxq-assign=group and pmd-rxq-isolate=false options, OVS
could still assign other rxqs to those cores that have pinned
phy rxqs and properly adjust the assignments based on the load from the pinned 
rxqs.
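
i.e. with something like the following (both are existing options; 
pmd-rxq-isolate only applies with the group algorithm):

  ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
  ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false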

Yes, sometimes the vhu rxq load is distributed such that it can be used to 
balance the PMDs, but not always. Sometimes the balance is just better when phy 
rxqs are not pinned.


New assignments or auto-load-balance would not change the numa
polling those rxqs, so it would have no impact on ALB or on the ability
to assign based on load.

In our practical experience the new "group" algorithm for load-based rxq 
distribution is able to balance the PMD load best when none of the rxqs are pinned and 
cross-NUMA polling of phy rxqs is enabled. So the effect of the prediction error when 
doing auto-lb dry-runs cannot be significant.


It could definitely be significant in some cases, but that depends on a lot of 
factors.

In our experience we consistently get the best PMD balance and OVS throughput 
when we give the auto-lb a free hand (no cross-NUMA polling of vhu rxqs, 
though).

BR, Jan

Thanks for sharing your experience with it. My fear with the proposal is that 
someone turns this on and then tells us performance is worse and/or OVS 
assignments/ALB are broken, because it has an impact on their case.
[Anurag] We have run tests with the per port cross-numa patch; please find the 
results attached. We have more detailed results available for 2-core and 
4-core OVS/PMD resource allocations (i.e. 4 PMDs and 8 PMDs available for OVS 
respectively). The ALB algorithm was able to load balance and distribute rxqs to 
PMDs evenly for both UDP over VLAN and UDP over VxLAN traffic, and also when 
combined with other features such as security groups.

In terms of limiting possible negative effects,
- it can be opt-in and recommended only for phy ports
[Anurag] I believe this might be a reasonable approach. A patch for this is 
attached for your reference.

- could print a warning when it is enabled
[Anurag] Might be a reasonable thing to do. There already seems to be some 
warning logging when an rxq is polled by a non-local NUMA PMD.
- ALB is currently disabled with cross-numa polling (except a limited
case) but it's clear you want to remove that restriction too
[Anurag] Yes. We exercise cross-numa-polling with 'group' scheduling and PMD 
auto-lb enabled today in our solution, and it would be nice to support this 
with OVS master as well.
- for ALB, a user could increase the improvement threshold to account
for any reassignments triggered by inaccuracies
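
e.g. raising the existing threshold above its default (the value here is only 
illustrative):

  ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb-improvement-threshold=50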


There are also some improvements that can be made to the proposed
method when used with group assignment:
- we can prefer the local numa where there is no difference between pmd
cores (e.g. with two unused cores available, pick the local numa one)
- we can flatten the list of pmds, so the best pmd can be selected. This will 
remove issues with RR numa when there are different numbers of pmd cores or 
loads per numa.
- I wrote an RFC that does these two items, I can post when(/if!)
consensus is reached on the broader topic

In summary, it's a trade-off,

With no cross-numa polling (current):
- won't have any impact to OVS assignment or ALB accuracy
- there could be a bottleneck on one numa's pmds while the other numa's pmd
cores are idle and unused

With cross-numa rx pinning (current):
- will have access to pmd cores on all numas
- may require more cycles for some traffic paths
- won't have any impact to OVS assignment or ALB accuracy
- >1 pinned rxqs per core may cause a bottleneck depending on traffic

With cross-numa interface setting (proposed):
- will have access to all pmd cores on all numas (i.e. no unused pmd
cores during highest load)
- will require more cycles for some traffic paths
- will impact on OVS assignment and ALB accuracy

Anything missing above, or is it a reasonable summary?

[Anurag] Seems like a good summary to me. Thanks Kevin.

thanks,
Kevin.

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
