Please find some of my inputs inline, in response to your comments.
Regards,
Anurag
-----Original Message-----
From: Kevin Traynor <ktray...@redhat.com>
Sent: Thursday, February 24, 2022 7:54 PM
To: Jan Scheurich <jan.scheur...@ericsson.com>; Wan Junjie
<wanjun...@bytedance.com>
Cc: d...@openvswitch.org; Anurag Agarwal <anurag.agar...@ericsson.com>
Subject: Re: [External] Re: [PATCH] dpif-netdev: add an option to
assign pmd rxq to all numas
Hi Jan,
On 17/02/2022 14:21, Jan Scheurich wrote:
Hi Kevin,
We have done extensive benchmarking and found that we get better overall PMD load balance and resulting OVS performance when we do not statically pin any rx queues and instead let the auto-load-balancing find the optimal distribution of phy rx queues over both NUMA nodes to balance an asymmetric load of vhu rx queues (polled only on the local NUMA node).
Cross-NUMA polling of vhu rx queues comes with a very high latency cost due to cross-NUMA access to volatile virtio ring pointers in every iteration (not only when actually copying packets). Cross-NUMA polling of phy rx queues doesn't have a similar issue.
I agree that for vhost rxq polling, cross-numa polling always causes a performance penalty.
For polling a phy rxq, when phy and vhost are in different numas, I don't see any additional penalty for cross-numa polling the phy rxq. For the case where phy and vhost are both in the same numa, if I change to poll the phy rxq cross-numa, then I see a >20% tput drop for traffic from phy to vhost. Are you seeing that too?
Yes, but the performance drop is mostly due to the extra cost of copying the packets across the UPI bus to the virtio buffers on the other NUMA, not because of polling the phy rxq on the other NUMA.
Just to be clear, phy and vhost are on the same numa in my test. I see the drop
when polling the phy rxq with a pmd from a different numa.
Also, the fact that a different numa can poll the phy rxq after
every rebalance means that the ability of the auto-load-balancer to
estimate and trigger a rebalance is impacted.
Agree, there is some inaccuracy in the estimation of the load a phy rx queue
creates when it is moved to another NUMA node. So far we have not seen that as
a practical problem.
It seems like simply pinning some phy rxqs cross-numa would avoid all the issues above and give most of the benefit of cross-numa polling for phy rxqs.
That is what we have done in the past (for lack of alternatives). But any static pinning reduces the ability of the auto-load balancer to do its job.
Consider the following scenarios:
1. The phy ingress traffic is not evenly distributed by RSS due to lack of
entropy (Examples for this are IP-IP encapsulated traffic, e.g. Calico, or
MPLSoGRE encapsulated traffic).
2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu
ports are all on NUMA 0.
In all such scenarios, static pinning of phy rxqs may lead to unnecessarily
uneven PMD load and loss of overall capacity.
I agree that static pinning may cause a bottleneck if you have more than one rxq pinned on a core. On the flip side, pinning removes uncertainty about the ability of OVS to make good assignments and about the ALB.
[Anurag] Echoing what Jan said, static pinning wouldn't allow rebalancing in case the traffic across DPDK and VHU queues is asymmetric. With the introduction of per-port cross-numa-polling, the user has one more option in their toolbox to allow full auto load balancing without worrying at all about the rxq-to-PMD assignments. This also makes the deployment of OVS much simpler: the user now only needs to provide the list of CPUs and enable the auto load balancer and, if necessary, cross-numa-polling. All the rest is handled in software.
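For reference, a minimal sketch of such a configuration. The global options below are documented OVS knobs; the per-interface cross-numa option is the one proposed in this patch, so its final name and placement are assumptions and may differ:

```shell
# Give OVS PMD cores on both NUMA nodes, use the load-based "group"
# rxq assignment, and enable the PMD auto load balancer.
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x0f000f
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb=true

# Proposed per-port option from this patch (name is illustrative):
# allow this phy port's rxqs to be polled by PMDs on any NUMA node.
ovs-vsctl set Interface dpdk0 other_config:cross-numa-polling=true
```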
With the pmd-rxq-assign=group and pmd-rxq-isolate=false options, OVS could still assign other rxqs to those cores with pinned phy rxqs and properly adjust the assignments based on the load from the pinned rxqs.
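As a sketch, the combination described above would look something like the following (option names as documented; the core numbers are placeholders chosen for illustration):

```shell
# Pin one phy rxq to a core on each NUMA node (rxq 0 -> core 3,
# rxq 1 -> core 27, cores chosen here purely as an example)...
ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:3,1:27"

# ...but leave those cores non-isolated, so the load-based "group"
# algorithm may still place vhu rxqs on them according to measured load.
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
```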
Yes, sometimes the vhu rxq load is distributed such that it can be used to balance the PMDs, but not always. Sometimes the balance is just better when phy rxqs are not pinned.
New assignments or auto-load-balance would not change the numa polling those rxqs, so it would have no impact on the ALB or the ability to assign based on load.
In our practical experience the new "group" algorithm for load-based rxq distribution balances the PMD load best when none of the rxqs are pinned and cross-NUMA polling of phy rxqs is enabled. So the effect of the prediction error when doing auto-lb dry-runs cannot be significant.
It could definitely be significant in some cases but it depends on a lot of
factors to know that.
In our experience we consistently get the best PMD balance and OVS throughput when we give the auto-lb free hands (no cross-NUMA polling of vhu rxqs, though).
BR, Jan
Thanks for sharing your experience with it. My fear with the proposal is that
someone turns this on and then tells us performance is worse and/or OVS
assignments/ALB are broken, because it has an impact on their case.
[Anurag] We have run tests with the per-port cross-numa patch; please find the results attached. We have more detailed results available for 2-core and 4-core OVS/PMD resource allocations (i.e. 4 PMDs and 8 PMDs available to OVS, respectively). The ALB algorithm was able to load balance and distribute rxqs to PMDs evenly for both UDP over VLAN and UDP over VxLAN traffic, and also when combined with other features such as security groups.
In terms of limiting possible negative effects,
- it can be opt-in and recommended only for phy ports
[Anurag] I believe this might be a reasonable approach. A patch for this is attached for your reference.
- could print a warning when it is enabled
[Anurag] Might be a reasonable thing to do. There already seems to be some logging that warns when an rxq is polled by a non-local NUMA PMD.
- ALB is currently disabled with cross-numa polling (except in a limited case) but it's clear you want to remove that restriction too
[Anurag] Yes. We exercise cross-numa-polling with 'group' scheduling and PMD auto-lb enabled today in our solution, and it would be nice to support this with OVS master as well.
- for ALB, a user could increase the improvement threshold to account
for any reassignments triggered by inaccuracies
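For example, raising the documented improvement threshold (default 25%) makes the ALB more conservative, so small predicted gains caused by cross-NUMA estimation error won't trigger a reassignment (the value 50 here is illustrative):

```shell
# Require a larger predicted variance improvement before the PMD
# auto load balancer performs a reassignment.
ovs-vsctl set Open_vSwitch . \
    other_config:pmd-auto-lb-improvement-threshold=50
```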
There are also some improvements that can be made to the proposed method when used with group assignment,
- we can prefer local numa where there is no difference between pmd
cores. (e.g. two unused cores available, pick the local numa one)
- we can flatten the list of pmds, so the best pmd can be selected. This will remove issues with RR numa when there are different numbers of pmd cores or loads per numa.
- I wrote an RFC that does these two items, I can post when(/if!)
consensus is reached on the broader topic
In summary, it's a trade-off,
With no cross-numa polling (current):
- won't have any impact to OVS assignment or ALB accuracy
- there could be a bottleneck on one numa's pmds while the other numa's pmd cores are idle and unused
With cross-numa rx pinning (current):
- will have access to pmd cores on all numas
- may require more cycles for some traffic paths
- won't have any impact to OVS assignment or ALB accuracy
- >1 pinned rxqs per core may cause a bottleneck depending on traffic
With cross-numa interface setting (proposed):
- will have access to all pmd cores on all numas (i.e. no unused pmd
cores during highest load)
- will require more cycles for some traffic paths
- will impact on OVS assignment and ALB accuracy
Anything missing above, or is it a reasonable summary?
[Anurag] Seems like a good summary to me. Thanks Kevin.
thanks,
Kevin.