On 27/06/2022 05:21, Anurag Agarwal wrote:
Hello Kevin,
-----Original Message-----
From: Kevin Traynor <ktray...@redhat.com>
Sent: Thursday, June 23, 2022 9:07 PM
To: Anurag Agarwal <anura...@gmail.com>; ovs-dev@openvswitch.org
Cc: lic...@chinatelecom.cn
Subject: Re: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on
selected ports
Hi Anurag,
On 23/06/2022 11:18, Anurag Agarwal wrote:
From: Jan Scheurich <jan.scheur...@ericsson.com>
Today dpif-netdev considers PMD threads on a non-local NUMA node for
automatic assignment of the rxqs of a port only if there are no local,
non-isolated PMDs.
On typical servers with both physical ports on one NUMA node, this
often leaves the PMDs on the other NUMA node under-utilized, wasting CPU
resources.
The alternative, to manually pin the rxqs to PMDs on remote NUMA
nodes, also has drawbacks as it limits OVS' ability to auto load-balance the
rxqs.
This patch introduces a new interface configuration option to allow
ports to be automatically polled by PMDs on any NUMA node:
ovs-vsctl set interface <Name> other_config:cross-numa-polling=true
The group assignment algorithm can now select the lowest-loaded PMD on
any NUMA node, not just the local NUMA node on which the rxq of the port
resides.
If this option is not present or set to false, legacy behaviour applies.
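For illustration, here is a rough sketch of the selection idea (not the
actual patch code; the struct and function names are made up for this
example):

#include <stdbool.h>
#include <stddef.h>

struct pmd {
    int numa_id;    /* NUMA node the pmd thread runs on */
    int load;       /* current estimated load in % */
    bool isolated;  /* isolated pmds are never candidates */
};

/* Return the lowest-loaded candidate pmd for an rxq whose port is local
 * to 'port_numa'. With cross_numa true, non-isolated pmds on every NUMA
 * node are candidates; otherwise only local pmds are considered. */
struct pmd *
lowest_loaded_pmd(struct pmd *pmds, int n_pmds, int port_numa,
                  bool cross_numa)
{
    struct pmd *best = NULL;

    for (int i = 0; i < n_pmds; i++) {
        struct pmd *p = &pmds[i];

        if (p->isolated) {
            continue;
        }
        if (!cross_numa && p->numa_id != port_numa) {
            continue;   /* legacy behaviour: local NUMA only */
        }
        if (!best || p->load < best->load) {
            best = p;
        }
    }
    return best;
}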
Co-authored-by: Anurag Agarwal <anurag.agar...@ericsson.com>
Signed-off-by: Jan Scheurich <jan.scheur...@ericsson.com>
Signed-off-by: Anurag Agarwal <anurag.agar...@ericsson.com>
---
Changes in v5:
- Addressed comments from <lic...@chinatelecom.cn>
- First schedule rxqs that are not enabled for cross-numa scheduling
- Follow this with rxqs that are enabled for cross-numa scheduling
I don't think this is a correct fix for the issue reported. The root problem
reported is not really that rxqs with cross-numa=true are assigned first, but
that the pool of pmd resources is changing/overlapping during the
assignments, i.e. in the reported case from a full pool to a fixed per-NUMA
pool.
With the change you have now, you could have something like:
3 rxqs (1 cross-numa) and 2 pmds.
cross-numa=true rxq load: 80%
per-numa rxq loads: 45%, 40%
rxq assignment rounds
1.
pmd0 = 45
pmd1 = 0
2.
pmd0 = 45
pmd1 = 40
3.
pmd0 = 45%
pmd1 = 40 + 80 = 120%
when clearly the best way is:
pmd0 = 80
pmd1 = 45 + 40
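To make the arithmetic above easy to reproduce, here is a throwaway
sketch (not OVS code) that greedily assigns those example loads in the
v5 order, always to the lowest-loaded pmd:

#include <stdio.h>

int main(void)
{
    int pmd[2] = {0, 0};            /* two pmds on NUMA0 */
    int loads[] = {45, 40, 80};     /* per-numa rxqs first, cross-numa last */

    for (int i = 0; i < 3; i++) {
        int target = pmd[0] <= pmd[1] ? 0 : 1;  /* lowest-loaded pmd */

        pmd[target] += loads[i];
        printf("round %d: pmd0=%d%% pmd1=%d%%\n", i + 1, pmd[0], pmd[1]);
    }
    /* Ends at pmd0=45%, pmd1=120%, while pmd0=80%, pmd1=85% was possible. */
    return 0;
}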
Could you help elaborate on this a bit more? Is PMD0 on NUMA0 and PMD1 on
NUMA1? Which NUMA do the two per-numa rxqs (45%, 40%) belong to?
I need some more details to understand this scenario.
pmd0 and pmd1 belong to NUMA0. rxq 45% and rxq 40% belong to NUMA0.
To simplify the above, I've shown 2 rxqs and 2 pmds on NUMA0, but that
could be replicated on other NUMAs too.
The 80% cross-numa rxq could belong to any NUMA.
So it's not about which one gets assigned first; as shown, an issue can
arise whichever one is assigned first. The problem is that the assignment
algorithm is not designed for changing and overlapping ranges of pmds to
assign rxqs to.
Here is the comment from lic...@chinatelecom.cn:
It may be better to schedule non-cross-numa rxqs first, and then
cross-numa rxqs.
Otherwise, one numa may hold much more load because of later scheduled
non-cross-numa rxqs.
And here is the example shared:
Consider the following situation:
We have 2 numa nodes; each numa node has 1 pmd.
And we have 2 ports (p0, p1), each with 2 queues.
p0 is configured as cross_numa, p1 is not.
Each queue's workload:
rxq    p0q0  p0q1  p1q0  p1q1
load     30    30    20    20
Based on your current implementation, the assignment will be:
p0q0 -> numa0 (30)
p0q1 -> numa1 (30)
p1q0 -> numa0 (30+20=50)
p1q1 -> numa0 (50+20=70)
As a result, numa0 holds 70% workload but numa1 holds only 30%, because
the later-assigned queues have NUMA affinity.
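(My own back-of-the-envelope arithmetic, following the lowest-loaded
selection described above rather than any measured result: with the v5
ordering, the non-cross-numa rxqs go first, so numa0 = 20 + 20 = 40, and
the cross-numa rxqs then land on the less loaded node, numa1 = 30 + 30 = 60,
i.e. roughly 40/60 instead of 70/30.)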
Yes, I understand the comment and it highlights a potential issue. My
concern is that your solution only works around that issue, but causes
other potential issues (like in the example I gave) because it does not
fix the root cause.
To fix this, you would probably need to do all the assignments first as per
v4 and then do another round of checking and possibly moving some
cross-numa=true rxqs. But that relies further on estimates which you are
potentially making inaccurate. If you are writing something to "move"
individual rxqs after the initial assignment, maybe it's better to rethink
doing it in the ALB with the real loads and not estimates.
It is worth noting again that, while less flexible, if the rxq load is
distributed on an interface with RSS etc., some pinning of phy rxqs can
allow cross-numa polling and remove any inaccurate estimates caused by
changing NUMA.
Do you mean that pinning of phy rxqs can alternatively be used to achieve
equivalent cross-numa functionality, although this is less flexible?
Yes, the user can pin rxqs to any NUMA, but OVS will not then reassign
them. It will, however, consider the load from them when placing other
rxqs it can assign. I think the pros and cons were discussed in previous
threads related to these patches.
If scheduling cross-numa queues along with per-numa queues leads to
inaccurate assignment, should we revisit and think about enabling/supporting
cross-numa polling at a global level to begin with?
It would cover the problems that have been highlighted most recently,
yes, but it would also mean every rxq is open to moving NUMA, and that
would mean more inaccuracies in estimates, leading to inaccurate
assignments as a result. It also might not be what a user wants.
Even in your own case, I think you said you only want this to apply to
some interfaces.
thanks,
Kevin.
Changes in v4:
- Addressed comments from Kevin Traynor <ktray...@redhat.com>
Please refer to this thread for an earlier discussion on this topic:
https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392310.html
Documentation/topics/dpdk/pmd.rst |  23 +++++
lib/dpif-netdev.c                 | 156 +++++++++++++++++++++++-------
tests/pmd.at                      |  38 ++++++++
vswitchd/vswitch.xml              |  20 ++++
4 files changed, 201 insertions(+), 36 deletions(-)