Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-03-24 Thread Kevin Traynor

Hi Anurag,

On 16/03/2022 12:29, Anurag Agarwal wrote:

Hello Kevin,
Thanks for your inputs.

In this scenario we have one VM each on NUMA0 and NUMA1 (VM1 is on NUMA0, VM2 
is on NUMA1), dpdk port is on NUMA1.

Without cross-numa-polling, VM/VHU queue traffic is evenly distributed based on 
load on their respective NUMA sockets.

However, DPDK traffic is only load balanced on NUMA1 PMDs, thereby exhibiting 
aggregate load imbalance in the system (i.e. NUMA1 PMDs have more load than 
NUMA0 PMDs).

Please refer to the example below (cross-numa-polling is not enabled):

pmd thread numa_id 0 core_id 2:
   isolated : false
   port: vhu-vm1p1 queue-id:  2 (enabled)   pmd usage: 11 %
   port: vhu-vm1p1 queue-id:  4 (enabled)   pmd usage:  0 %
   overhead:  0 %
pmd thread numa_id 1 core_id 3:
   isolated : false
   port: dpdk0 queue-id:  0 (enabled)   pmd usage: 13 %
   port: dpdk0 queue-id:  2 (enabled)   pmd usage: 15 %
   port: vhu-vm2p1 queue-id:  3 (enabled)   pmd usage:  9 %
   port: vhu-vm2p1 queue-id:  4 (enabled)   pmd usage:  0 %
   overhead:  0 %

With cross-numa-polling enabled,  the rxqs from DPDK port are distributed to 
both NUMAs, and then the 'group' scheduling algorithm assigns the rxqs to PMDs 
based on load.

Please refer to the example below, after cross-numa-polling is enabled on the dpdk0 port:

pmd thread numa_id 0 core_id 2:
   isolated : false
   port: dpdk0 queue-id:  5 (enabled)   pmd usage: 11 %
   port: vhu-vm1p1 queue-id:  3 (enabled)   pmd usage:  4 %
   port: vhu-vm1p1 queue-id:  5 (enabled)   pmd usage:  4 %
   overhead:  2 %
pmd thread numa_id 1 core_id 3:
   isolated : false
   port: dpdk0 queue-id:  2 (enabled)   pmd usage: 10 %
   port: vhu-vm2p1 queue-id:  0 (enabled)   pmd usage:  4 %
   port: vhu-vm2p1 queue-id:  6 (enabled)   pmd usage:  4 %
   overhead:  3 %



Yes, this illustrates the operation well. We can infer that if the 
traffic rate were increased, the cross-numa-disabled case would hit a 
bottleneck first, by virtue of having 2 dpdk rxqs + 2 VM rxqs on one core.
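
Kevin's inference can be checked arithmetically against the `pmd usage` figures quoted above. A small illustrative script (not part of the thread) summing the per-rxq load on each core:

```python
# Per-rxq "pmd usage" percentages, taken from the pmd-rxq-show output above.
without_cross_numa = {
    "numa0_core2": [11, 0],          # vhu-vm1p1 rxqs only
    "numa1_core3": [13, 15, 9, 0],   # dpdk0 + vhu-vm2p1 rxqs
}
with_cross_numa = {
    "numa0_core2": [11, 4, 4],       # one dpdk0 rxq now polled on NUMA 0
    "numa1_core3": [10, 4, 4],
}

def per_core_load(cases):
    # Aggregate load per pmd core, in percent of a core.
    return {core: sum(loads) for core, loads in cases.items()}

print(per_core_load(without_cross_numa))  # {'numa0_core2': 11, 'numa1_core3': 37}
print(per_core_load(with_cross_numa))     # {'numa0_core2': 19, 'numa1_core3': 18}
```

Without cross-numa polling, core 3 carries 37% while core 2 sits at 11%; with it, the two cores are nearly even, which is the imbalance Anurag describes.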


Kevin.


Regards,
Anurag

-Original Message-
From: Kevin Traynor 
Sent: Thursday, March 10, 2022 11:02 PM
To: Anurag Agarwal ; Jan Scheurich 
; Wan Junjie 
Cc: d...@openvswitch.org
Subject: Re: [External] Re: [PATCH] dpif-netdev: add an option to assign pmd 
rxq to all numas

On 04/03/2022 17:57, Anurag Agarwal wrote:

Hello Kevin,
 I have prepared a patch for "per port cross-numa-polling" and 
attached herewith.

The results are captured in 'cross-numa-results.txt'. We see PMD to RxQ 
assignment evenly balanced across all PMDs with this patch.

Please take a look and let us know your inputs.


Hi Anurag,

I think what this is showing is more related to the txqs used for sending to the 
VM. As you are allowing the rxqs from the phy port to be handled by more pmds, 
and all those rxqs have traffic, in turn more txqs are used for sending to 
the VM. The result of using more txqs when sending to the VM in this case is 
that the traffic is returned on more rxqs.

Allowing cross-numa does not guarantee that the different pmd cores will poll 
rxqs from an interface. At least with group algorithm, the pmds will be 
selected purely on load. The right way to ensure that all VM
txqs(/rxqs) are used is to enable the Tx-steering feature [0].

So you might be seeing some benefit in this case, but to me it's not the core 
use case of cross-numa polling. That is more about allowing the pmds on every 
numa to be used when the traffic load is primarily coming from one numa.

Kevin.

[0] https://docs.openvswitch.org/en/latest/topics/userspace-tx-steering/
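
The distinction Kevin draws can be sketched as a toy model: in "thread" mode each sending pmd thread maps to one txq, while hash-based steering spreads flows over all enabled txqs. The function names here are illustrative, not OVS code:

```python
# Toy model of the Tx-steering idea: thread mode vs. hash mode.
def txq_thread_mode(pmd_id, n_txq):
    # One txq per sending pmd thread.
    return pmd_id % n_txq

def txq_hash_mode(flow_hash, n_txq):
    # txq chosen from the packet's flow hash instead.
    return flow_hash % n_txq

n_txq = 4
# Two pmd threads can only ever reach two of the four txqs in thread mode...
thread_txqs = {txq_thread_mode(p, n_txq) for p in (2, 3)}
# ...while flows with distinct hashes reach all four txqs.
hash_txqs = {txq_hash_mode(h, n_txq) for h in (17, 42, 7, 100)}
print(sorted(thread_txqs), sorted(hash_txqs))
```

This is why simply polling phy rxqs with more pmds also happens to exercise more VM txqs, but hash-based Tx-steering is the mechanism that actually guarantees it.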


Please find some of my inputs inline, in response to your comments.

Regards,
Anurag

-Original Message-
From: Kevin Traynor 
Sent: Thursday, February 24, 2022 7:54 PM
To: Jan Scheurich ; Wan Junjie

Cc: d...@openvswitch.org; Anurag Agarwal 
Subject: Re: [External] Re: [PATCH] dpif-netdev: add an option to
assign pmd rxq to all numas

Hi Jan,

On 17/02/2022 14:21, Jan Scheurich wrote:

Hi Kevin,


We have done extensive benchmarking and found that we get better overall 
PMD load balance and resulting OVS performance when we do not 
statically pin any rx queues and instead let the auto-load-balancing 
find the optimal distribution of phy rx queues over both NUMA nodes 
to balance an asymmetric load of vhu rx queues (polled only on the local NUMA 
node).


Cross-NUMA polling of vhu rx queues comes with a very high latency 
cost due to cross-NUMA access to volatile virtio ring pointers in every 
iteration (not only when actually copying packets). Cross-NUMA 
polling of phy rx queues doesn't have a similar issue.




I agree that for vhost rxq polling, it always causes a performance
penalty w

Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-03-10 Thread Kevin Traynor

On 04/03/2022 17:57, Anurag Agarwal wrote:

Hello Kevin,
I have prepared a patch for "per port cross-numa-polling" and 
attached herewith.

The results are captured in 'cross-numa-results.txt'. We see PMD to RxQ 
assignment evenly balanced across all PMDs with this patch.

Please take a look and let us know your inputs.


Hi Anurag,

I think what this is showing is more related to the txqs used for sending to 
the VM. As you are allowing the rxqs from the phy port to be handled by 
more pmds, and all those rxqs have traffic, in turn more txqs are 
used for sending to the VM. The result of using more txqs when sending 
to the VM in this case is that the traffic is returned on more rxqs.


Allowing cross-numa does not guarantee that the different pmd cores will 
poll rxqs from an interface. At least with group algorithm, the pmds 
will be selected purely on load. The right way to ensure that all VM 
txqs(/rxqs) are used is to enable the Tx-steering feature [0].


So you might be seeing some benefit in this case, but to me it's not the 
core use case of cross-numa polling. That is more about allowing the 
pmds on every numa to be used when the traffic load is primarily coming 
from one numa.


Kevin.

[0] https://docs.openvswitch.org/en/latest/topics/userspace-tx-steering/


Please find some of my inputs inline, in response to your comments.

Regards,
Anurag

-Original Message-
From: Kevin Traynor 
Sent: Thursday, February 24, 2022 7:54 PM
To: Jan Scheurich ; Wan Junjie 

Cc: d...@openvswitch.org; Anurag Agarwal 
Subject: Re: [External] Re: [PATCH] dpif-netdev: add an option to assign pmd 
rxq to all numas

Hi Jan,

On 17/02/2022 14:21, Jan Scheurich wrote:

Hi Kevin,


We have done extensive benchmarking and found that we get better overall 
PMD load balance and resulting OVS performance when we do not 
statically pin any rx queues and instead let the auto-load-balancing 
find the optimal distribution of phy rx queues over both NUMA nodes 
to balance an asymmetric load of vhu rx queues (polled only on the local NUMA 
node).


Cross-NUMA polling of vhu rx queues comes with a very high latency 
cost due to cross-NUMA access to volatile virtio ring pointers in every 
iteration (not only when actually copying packets). Cross-NUMA 
polling of phy rx queues doesn't have a similar issue.




I agree that for vhost rxq polling, it always causes a performance
penalty when there is cross-numa polling.

For polling phy rxq, when phy and vhost are in different numas, I
don't see any additional penalty for cross-numa polling the phy rxq.

For the case where phy and vhost are both in the same numa, if I
change to poll the phy rxq cross-numa, then I see about a >20% tput
drop for traffic from phy -> vhost. Are you seeing that too?


Yes, but the performance drop is mostly due to the extra cost of copying the 
packets across the UPI bus to the virtio buffers on the other NUMA, not because 
of polling the phy rxq on the other NUMA.



Just to be clear, phy and vhost are on the same numa in my test. I see the drop 
when polling the phy rxq with a pmd from a different numa.



Also, the fact that a different numa can poll the phy rxq after every
rebalance means that the ability of the auto-load-balancer to
estimate and trigger a rebalance is impacted.


Agree, there is some inaccuracy in the estimation of the load a phy rx queue 
creates when it is moved to another NUMA node. So far we have not seen that as 
a practical problem.



It seems like simply pinning some phy rxqs cross-numa would avoid all
the issues above and give most of the benefit of cross-numa polling for phy 
rxqs.


That is what we have done in the past (for lack of alternatives). But any 
static pinning reduces the ability of the auto-load balancer to do its job. 
Consider the following scenarios:

1. The phy ingress traffic is not evenly distributed by RSS due to lack of 
entropy (Examples for this are IP-IP encapsulated traffic, e.g. Calico, or 
MPLSoGRE encapsulated traffic).

2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu 
ports are all on NUMA 0.

In all such scenarios, static pinning of phy rxqs may lead to unnecessarily 
uneven PMD load and loss of overall capacity.



I agree that static pinning may cause a bottleneck if you have more than one rx 
pinned on a core. On the flip side, pinning removes uncertainty about the 
ability of OVS to make good assignments and ALB.
[Anurag] Echoing what Jan said, static pinning wouldn't allow rebalancing in 
case the traffic across DPDK and VHU queues is asymmetric. With the 
introduction of per-port cross-numa-polling, the user has one more option in 
the toolbox: allow full auto load balancing without worrying at all about 
the rxq-to-PMD assignments. This also makes the deployment of OVS much simpler. 
The user now only needs to provide the list of CPUs, enable auto load balancing, 
and enable cross-numa-polling (where necessary). The rest is handled in software.
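
The 'group' assignment the thread keeps referring to is, roughly, a greedy bin-packing: sort rxqs by measured load and give each in turn to the currently least-loaded pmd. A simplified sketch of that documented behaviour (ignoring numa constraints, and not OVS source code), using the rxq loads from the earlier example:

```python
# Simplified model of pmd-rxq-assign=group.
def group_assign(rxq_loads, pmd_ids):
    load = {p: 0 for p in pmd_ids}
    assignment = {}
    # Heaviest rxqs are placed first, each on the least-loaded pmd so far.
    for rxq, cycles in sorted(rxq_loads.items(), key=lambda kv: -kv[1]):
        best = min(load, key=load.get)
        assignment[rxq] = best
        load[best] += cycles
    return assignment, load

rxqs = {"dpdk0-q0": 13, "dpdk0-q2": 15, "vhu2-q3": 9, "vhu1-q2": 11}
assignment, load = group_assign(rxqs, ["core2", "core3"])
print(load)  # {'core2': 24, 'core3': 24}
```

With cross-numa polling enabled, all four rxqs are candidates for both cores and the greedy pass lands on an even 24/24 split, mirroring the balanced `pmd-rxq-show` output quoted earlier.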



Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-03-10 Thread Kevin Traynor

Hi Jan,

On 09/03/2022 15:48, Jan Scheurich wrote:

Thanks for sharing your experience with it. My fear with the proposal is that
someone turns this on and then tells us performance is worse and/or OVS
assignments/ALB are broken, because it has an impact on their case.

In terms of limiting possible negative effects,
- it can be opt-in and recommended only for phy ports
- could print a warning when it is enabled
- ALB is currently disabled with cross-numa polling (except a limited
case) but it's clear you want to remove that restriction too
- for ALB, a user could increase the improvement threshold to account for any
reassignments triggered by inaccuracies


[Jan] Yes, we want to enable cross-NUMA polling of selected (typically phy) ports in ALB 
"group" mode as an opt-in config option (default off). Based on our 
observations we are not too much concerned with the loss of ALB prediction accuracy but 
increasing the threshold may be a way of taking that into account, if wanted.



There are also some improvements that can be made to the proposed method
when used with group assignment,
- we can prefer local numa where there is no difference between pmd cores.
(e.g. two unused cores available, pick the local numa one)
- we can flatten the list of pmds, so the best pmd can be selected. This will remove
issues with RR numa when there are different numbers of pmd cores or loads per
numa.
- I wrote an RFC that does these two items, I can post when(/if!) consensus is
reached on the broader topic


[Jan] In our alternative version of the current upstream "group" ALB [1] we 
already maintained a flat list of PMDs. So we would support that feature. Using 
NUMA-locality as a tie-breaker makes sense.

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384546.html
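
The flat-list-with-locality-tiebreaker idea can be sketched in a few lines (a sketch of the concept only, not Kevin's RFC code): pick the lowest-loaded pmd across all numas, and use numa-locality only to break ties.

```python
def pick_pmd(pmds, port_numa):
    # pmds: list of (pmd_id, numa_id, current_load).
    # Primary key: load across ALL numas (flattened list).
    # Secondary key: prefer the port-local numa (False sorts before True).
    return min(pmds, key=lambda p: (p[2], p[1] != port_numa))[0]

pmds = [("c2", 0, 20), ("c4", 0, 10), ("c3", 1, 10)]
# c4 (numa 0) and c3 (numa 1) tie on load; the port is on numa 1, so c3 wins.
print(pick_pmd(pmds, port_numa=1))
# With no tie, pure load decides regardless of numa.
print(pick_pmd([("c2", 0, 5), ("c3", 1, 10)], port_numa=1))
```

This removes the round-robin-over-numas artifact when the numas have different pmd counts or loads, while still favouring local polling when it costs nothing.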



In summary, it's a trade-off,

With no cross-numa polling (current):
- won't have any impact to OVS assignment or ALB accuracy
- there could be a bottleneck on one numa pmds while other numa pmd cores
are idle and unused

With cross-numa rx pinning (current):
- will have access to pmd cores on all numas
- may require more cycles for some traffic paths
- won't have any impact to OVS assignment or ALB accuracy
- >1 pinned rxqs per core may cause a bottleneck depending on traffic

With cross-numa interface setting (proposed):
- will have access to all pmd cores on all numas (i.e. no unused pmd cores
during highest load)
- will require more cycles for some traffic paths
- will impact on OVS assignment and ALB accuracy

Anything missing above, or is it a reasonable summary?


I think that is a reasonable summary, albeit I would have characterized the 
third option a bit more positively:
- Gives ALB maximum freedom to balance load of PMDs on all NUMA nodes (in the 
likely scenario of uneven VM load on the NUMAs)
- Accepts an increase of cycles on cross-NUMA paths for better utilization of 
free PMD cycles
- Mostly suitable for phy ports due to limited cycle increase for cross-NUMA 
polling of phy rx queues
- Could negatively impact the ALB prediction accuracy in certain scenarios



It's the estimation accuracy during the assignments. That might be 
seen as part of the ALB dry-run, but it could also be during an actual 
reconfigure/reassignment itself. E.g. we think pmdx is the lowest loaded 
pmd and assign another rxq, only to find out a previously assigned rxq 
now requires 30% more cycles after changing numa, and it was a bad 
selection to add more rxqs. Now there's overload on that pmdx, etc.
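
The failure mode Kevin describes can be made concrete with illustrative numbers (the 30% figure comes from his example; everything else is assumed):

```python
# ALB's dry-run assumes an rxq costs the same cycles after reassignment.
# If moving it across numa inflates its cost, the predicted load of the
# "best" pmd understates the real outcome.
est_rxq_cost = 20          # cycles measured on the old (local) numa, in % of a core
cross_numa_penalty = 1.3   # assumed ~30% inflation when polled cross-numa

pmd_load = {"local_pmd": 35, "remote_pmd": 30}
predicted = pmd_load["remote_pmd"] + est_rxq_cost                     # 50
actual = pmd_load["remote_pmd"] + est_rxq_cost * cross_numa_penalty   # 56.0
print(predicted, actual)
```

Here the remote pmd looked like the safer choice at a predicted 50%, but ends up at 56%: not far off in this toy case, yet enough to flip a close dry-run decision, which is exactly the inaccuracy being debated.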


OTOH, in some cases you are now also getting to utilize more pmds from 
other numas, so there is more net compute power. That means less risk of 
overload on any pmd, but I'm sure people will still try and push it to 
the limits.


I think we're both on the same page regarding functionality, pros and 
cons etc. It's just the inaccuracy and likelihood of problems occurring 
where we are viewing differently. You are saying you think it's low risk 
as that is your experience, while I am a bit more cautious about it.



We will post a new version of our patch [2] for cross-numa polling on selected 
ports adapted to the current OVS master shortly.

[2] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html



I think we should give a bit more time for more eyes and any discussion 
at the high level before progressing too much on the code, but as you 
are talking about it...I mentioned an RFC so I put it up here [0].


I didn't add the user enabling part, that's straightforward. I was just 
working on how to adapt the current rxq scheduling so it does not always 
select a numa first (as it was structured that way), flattening the pmd 
list across numas and adding a local-numa tiebreaker.


[0] https://github.com/kevintraynor/ovs/commits/crossnuma

thanks,
Kevin.


Thanks, Jan




___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev



Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-24 Thread Kevin Traynor

Hi Jan,

On 17/02/2022 14:21, Jan Scheurich wrote:

Hi Kevin,


We have done extensive benchmarking and found that we get better overall 
PMD load balance and resulting OVS performance when we do not statically
pin any rx queues and instead let the auto-load-balancing find the optimal
distribution of phy rx queues over both NUMA nodes to balance an asymmetric
load of vhu rx queues (polled only on the local NUMA node).


Cross-NUMA polling of vhu rx queues comes with a very high latency cost due 
to cross-NUMA access to volatile virtio ring pointers in every iteration (not only
when actually copying packets). Cross-NUMA polling of phy rx queues doesn't
have a similar issue.




I agree that for vhost rxq polling, it always causes a performance penalty when
there is cross-numa polling.

For polling phy rxq, when phy and vhost are in different numas, I don't see any
additional penalty for cross-numa polling the phy rxq.

For the case where phy and vhost are both in the same numa, if I change to poll
the phy rxq cross-numa, then I see about a >20% tput drop for traffic from
phy -> vhost. Are you seeing that too?


Yes, but the performance drop is mostly due to the extra cost of copying the 
packets across the UPI bus to the virtio buffers on the other NUMA, not because 
of polling the phy rxq on the other NUMA.



Just to be clear, phy and vhost are on the same numa in my test. I see 
the drop when polling the phy rxq with a pmd from a different numa.




Also, the fact that a different numa can poll the phy rxq after every rebalance
means that the ability of the auto-load-balancer to estimate and trigger a
rebalance is impacted.


Agree, there is some inaccuracy in the estimation of the load a phy rx queue 
creates when it is moved to another NUMA node. So far we have not seen that as 
a practical problem.



It seems like simply pinning some phy rxqs cross-numa would avoid all the
issues above and give most of the benefit of cross-numa polling for phy rxqs.


That is what we have done in the past (for lack of alternatives). But any 
static pinning reduces the ability of the auto-load balancer to do its job. 
Consider the following scenarios:

1. The phy ingress traffic is not evenly distributed by RSS due to lack of 
entropy (Examples for this are IP-IP encapsulated traffic, e.g. Calico, or 
MPLSoGRE encapsulated traffic).

2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu 
ports are all on NUMA 0.

In all such scenarios, static pinning of phy rxqs may lead to unnecessarily 
uneven PMD load and loss of overall capacity.



I agree that static pinning may cause a bottleneck if you have more than 
one rx pinned on a core. On the flip side, pinning removes uncertainty 
about the ability of OVS to make good assignments and ALB.




With the pmd-rxq-assign=group and pmd-rxq-isolate=false options, OVS could
still assign other rxqs to those cores which have pinned phy rxqs and
properly adjust the assignments based on the load from the pinned rxqs.


Yes, sometimes the vhu rxq load is distributed such that it can be used to 
balance the PMDs, but not always. Sometimes the balance is just better when phy 
rxqs are not pinned.



New assignments or auto-load-balance would not change the numa polling
those rxqs, so it would have no impact on ALB or on the ability to assign
based on load.
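As a sketch of that suggestion (the port name, queue id and core id here are
illustrative; pick a core from the actual pmd-cpu-mask on the remote NUMA):

```shell
# Load-based rxq scheduling, and keep pinned cores non-isolated so the
# scheduler can still place other (e.g. vhu) rxqs on them.
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false

# Pin one of the phy port's rxqs to a pmd on the remote NUMA
# (affinity syntax: <queue-id>:<core-id>,...).
ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:2"
```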


In our practical experience the new "group" algorithm for load-based rxq 
distribution is able to balance the PMD load best when none of the rxqs are pinned and 
cross-NUMA polling of phy rxqs is enabled. So the effect of the prediction error when 
doing auto-lb dry-runs cannot be significant.



It could definitely be significant in some cases but it depends on a lot 
of factors to know that.



In our experience we consistently get the best PMD balance and OVS throughput 
when we give the auto-lb free hands (no cross-NUMA polling of vhu rxqs, 
though).

BR, Jan


Thanks for sharing your experience with it. My fear with the proposal is 
that someone turns this on and then tells us performance is worse and/or 
OVS assignments/ALB are broken, because it has an impact on their case.


In terms of limiting possible negative effects,
- it can be opt-in and recommended only for phy ports
- could print a warning when it is enabled
- ALB is currently disabled with cross-numa polling (except in a limited 
case), but it's clear you want to remove that restriction too
- for ALB, a user could increase the improvement threshold to account 
for any reassignments triggered by inaccuracies
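For the last point, a hedged example, assuming a recent OVS with the
configurable ALB thresholds (the threshold value of 50 is arbitrary; the
default minimum predicted improvement is 25%):

```shell
# Enable the auto-load-balancer and raise the minimum predicted variance
# improvement (in percent) needed to trigger a reassignment, leaving
# headroom for cross-NUMA load-estimation error.
ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb=true
ovs-vsctl set Open_vSwitch . other_config:pmd-auto-lb-improvement-threshold=50
```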



There are also some improvements that can be made to the proposed method 
when used with group assignment:
- we can prefer the local numa where there is no difference between pmd 
cores (e.g. two unused cores available, pick the local-numa one)
- we can flatten the list of pmds, so the best pmd can be selected. This 
will remove issues with round-robin numa selection when there are different 
numbers of pmd cores or loads per numa.
- I wrote an RFC that does these

Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-17 Thread Jan Scheurich via dev
Hi Kevin,

> > We have done extensive benchmarking and found that we get better overall
> PMD load balance and resulting OVS performance when we do not statically
> pin any rx queues and instead let the auto-load-balancing find the optimal
> distribution of phy rx queues over both NUMA nodes to balance an asymmetric
> load of vhu rx queues (polled only on the local NUMA node).
> >
> > Cross-NUMA polling of vhu rx queues comes with a very high latency cost due
> to cross-NUMA access to volatile virtio ring pointers in every iteration (not 
> only
> when actually copying packets). Cross-NUMA polling of phy rx queues doesn't
> have a similar issue.
> >
> 
> I agree that for vhost rxq polling, it always causes a performance penalty 
> when
> there is cross-numa polling.
> 
> For polling phy rxq, when phy and vhost are in different numas, I don't see 
> any
> additional penalty for cross-numa polling the phy rxq.
> 
> For the case where phy and vhost are both in the same numa, if I change to 
> poll
> the phy rxq cross-numa, then I see about a >20% tput drop for traffic from 
> phy -
> > vhost. Are you seeing that too?

Yes, but the performance drop is mostly due to the extra cost of copying the 
packets across the UPI bus to the virtio buffers on the other NUMA, not because 
of polling the phy rxq on the other NUMA.

> 
> Also, the fact that a different numa can poll the phy rxq after every 
> rebalance
> means that the ability of the auto-load-balancer to estimate and trigger a
> rebalance is impacted.

Agree, there is some inaccuracy in the estimation of the load a phy rx queue 
creates when it is moved to another NUMA node. So far we have not seen that as 
a practical problem.

> 
> It seems like simple pinning some phy rxqs cross-numa would avoid all the
> issues above and give most of the benefit of cross-numa polling for phy rxqs.

That is what we have done in the past (for lack of alternatives). But any 
static pinning reduces the ability of the auto-load balancer to do its job. 
Consider the following scenarios:

1. The phy ingress traffic is not evenly distributed by RSS due to a lack of 
entropy (examples are IP-in-IP encapsulated traffic, e.g. Calico, or 
MPLSoGRE encapsulated traffic).

2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu 
ports are all on NUMA 0.

In all such scenarios, static pinning of phy rxqs may lead to unnecessarily 
uneven PMD load and loss of overall capacity.

> 
> With the pmd-rxq-assign=group and pmd-rxq-isolate=false options, OVS could
> still assign other rxqs to those cores which have with pinned phy rxqs and
> properly adjust the assignments based on the load from the pinned rxqs.

Yes, sometimes the vhu rxq load is distributed such that it can be used to 
balance the PMDs, but not always. Sometimes the balance is simply better when 
phy rxqs are not pinned.

> 
> New assignments or auto-load-balance would not change the numa polling
> those rxqs, so it it would have no impact to ALB or ability to assign based on
> load.

In our practical experience the new "group" algorithm for load-based rxq 
distribution is able to balance the PMD load best when none of the rxqs are 
pinned and cross-NUMA polling of phy rxqs is enabled. So the effect of the 
prediction error when doing auto-lb dry-runs cannot be significant.

In our experience we consistently get the best PMD balance and OVS throughput 
when we give the auto-lb free hands (no cross-NUMA polling of vhu rxqs, 
though).

BR, Jan
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-17 Thread Kevin Traynor

Hi Jan,

On 14/02/2022 10:54, Jan Scheurich wrote:

We do acknowledge the benefit of non-pinned polling of phy rx queues by

PMD threads on all NUMA nodes. It gives the auto-load balancer much better
options to utilize spare capacity on PMDs on all NUMA nodes.


Our patch proposed in
https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
indeed covers the difference between phy and vhu ports.
One has to explicitly enable cross-NUMA-polling for individual interfaces

with:


ovs-vsctl set interface  other_config:cross-numa-polling=true

This would typically only be done by static configuration for the fixed set of
physical ports. There is no code in OpenStack's os-vif handler to apply such
configuration for dynamically created vhu ports.


I would strongly suggest that cross-numa-polling be introduced as a
per-interface option as in our patch rather than as a per-datapath option as
in your patch. Why not adapt our original patch to the latest OVS code base?
We can help you with that.


BR, Jan



Hi, Jan Scheurich

We can achieve the static setting of pinning a phy port by combining
pmd-rxq-isolate and pmd-rxq-affinity. This setting can get the same result,
and we have seen the benefits.
The new issue is the polling of vhu on one numa. Under heavy traffic, polling
vhu + phy will make those pmds reach 100% usage, while other pmds on the
other numa with only the phy port reach 70% usage. Enabling cross-numa
polling for a vhu port would give us more benefit in this case: the load of
pmds on both numas would be balanced.
As you have mentioned, there is no code to apply this config for vhu ports
while creating them. A global setting would save us from dynamically
detecting the vhu name or any new creation.
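The combination described above might look like this (the p0 rxq-to-core
mapping is copied from the example in the patch and is only illustrative):

```shell
# Non-isolating affinity: pinned cores can still take vhu rxqs.
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-isolate=false
# Spread p0's four rxqs over pmd cores 20/21 (numa 0) and 40/41 (numa 1).
ovs-vsctl set Interface p0 other_config:pmd-rxq-affinity="0:20,1:40,2:21,3:41"
```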


Hi Wan Junjie,

We have done extensive benchmarking and found that we get better overall PMD 
load balance and resulting OVS performance when we do not statically pin any rx 
queues and instead let the auto-load-balancing find the optimal distribution of 
phy rx queues over both NUMA nodes to balance an asymmetric load of vhu rx 
queues (polled only on the local NUMA node).

Cross-NUMA polling of vhu rx queues comes with a very high latency cost due to 
cross-NUMA access to volatile virtio ring pointers in every iteration (not only 
when actually copying packets). Cross-NUMA polling of phy rx queues doesn't 
have a similar issue.



I agree that for vhost rxq polling, it always causes a performance 
penalty when there is cross-numa polling.


For polling phy rxq, when phy and vhost are in different numas, I don't 
see any additional penalty for cross-numa polling the phy rxq.


For the case where phy and vhost are both in the same numa, if I change 
to poll the phy rxq cross-numa, then I see about a >20% tput drop for 
traffic from phy -> vhost. Are you seeing that too?


Also, the fact that a different numa can poll the phy rxq after every 
rebalance means that the ability of the auto-load-balancer to estimate 
and trigger a rebalance is impacted.


It seems like simply pinning some phy rxqs cross-numa would avoid all 
the issues above and give most of the benefit of cross-numa polling for 
phy rxqs.


With the pmd-rxq-assign=group and pmd-rxq-isolate=false options, OVS 
could still assign other rxqs to the cores with pinned phy rxqs and 
properly adjust the assignments based on the load from the 
pinned rxqs.


New assignments or auto-load-balance would not change the numa polling 
those rxqs, so it would have no impact on ALB or on the ability to assign 
based on load.


thanks,
Kevin.


BR, Jan





Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-14 Thread Wan Junjie
On Mon, Feb 14, 2022 at 6:54 PM Jan Scheurich
 wrote:
>
> > > We do acknowledge the benefit of non-pinned polling of phy rx queues by
> > PMD threads on all NUMA nodes. It gives the auto-load balancer much better
> > options to utilize spare capacity on PMDs on all NUMA nodes.
> > >
> > > Our patch proposed in
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
> > > indeed covers the difference between phy and vhu ports.
> > > One has to explicitly enable cross-NUMA-polling for individual interfaces
> > with:
> > >
> > >ovs-vsctl set interface  other_config:cross-numa-polling=true
> > >
> > > This would typically only be done by static configuration for the fixed 
> > > set of
> > physical ports. There is no code in the OpenStack's os-vif handler to apply 
> > such
> > configuration for dynamically created vhu ports.
> > >
> > > I would strongly suggest that cross-num-polling be introduced as a per-
> > interface option as in our patch rather than as a per-datapath option as in 
> > your
> > patch. Why not adapt our original patch to the latest OVS code base? We can
> > help you with that.
> > >
> > > BR, Jan
> > >
> >
> > Hi, Jan Scheurich
> >
> > We can achieve the static setting of pinning a phy port by combining 
> > pmd-rxq-
> > isolate and pmd-rxq-affinity.  This setting can get the same result. And we 
> > have
> > seen the benefits.
> > The new issue is the polling of vhu on one numa. Under heavy traffic, 
> > polling
> > vhu + phy will make the pmds reach 100% usage. While other pmds on the
> > other numa with only phy port reaches 70% usage. Enabling cross-numa polling
> > for a vhu port would give us more benefits in this case. Overloads of 
> > different
> > pmds on both numa would be balanced.
> > As you have mentioned, there is no code to apply this config for vhu while
> > creating them. A global setting would save us from dynamically detecting the
> > vhu name or any new creation.
>
> Hi Wan Junjie,
>
> We have done extensive benchmarking and found that we get better overall PMD 
> load balance and resulting OVS performance when we do not statically pin any 
> rx queues and instead let the auto-load-balancing find the optimal 
> distribution of phy rx queues over both NUMA nodes to balance an asymmetric 
> load of vhu rx queues (polled only on the local NUMA node).
>
> Cross-NUMA polling of vhu rx queues comes with a very high latency cost due 
> to cross-NUMA access to volatile virtio ring pointers in every iteration (not 
> only when actually copying packets). Cross-NUMA polling of phy rx queues 
> doesn't have a similar issue.
>
> BR, Jan
>

Hi Jan Scheurich,

Thanks for the info. Yes, I am using static pinning. It should have
saved the cost of calculating the 'load'; I'm with you on that.
Kevin's RFC seems to use dynamic calculation instead.
As for the latency cost of cross-numa on vhu, it does seem to be a
concern. I would say it is a trade-off with a cost. I have no bias for
using cross-numa by default. Maybe for different traffic patterns,
different settings would be the proper way.

Regards,
Wan Junjie


Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-14 Thread Jan Scheurich via dev
> > We do acknowledge the benefit of non-pinned polling of phy rx queues by
> PMD threads on all NUMA nodes. It gives the auto-load balancer much better
> options to utilize spare capacity on PMDs on all NUMA nodes.
> >
> > Our patch proposed in
> > https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
> > indeed covers the difference between phy and vhu ports.
> > One has to explicitly enable cross-NUMA-polling for individual interfaces
> with:
> >
> >ovs-vsctl set interface  other_config:cross-numa-polling=true
> >
> > This would typically only be done by static configuration for the fixed set 
> > of
> physical ports. There is no code in the OpenStack's os-vif handler to apply 
> such
> configuration for dynamically created vhu ports.
> >
> > I would strongly suggest that cross-num-polling be introduced as a per-
> interface option as in our patch rather than as a per-datapath option as in 
> your
> patch. Why not adapt our original patch to the latest OVS code base? We can
> help you with that.
> >
> > BR, Jan
> >
> 
> Hi, Jan Scheurich
> 
> We can achieve the static setting of pinning a phy port by combining pmd-rxq-
> isolate and pmd-rxq-affinity.  This setting can get the same result. And we 
> have
> seen the benefits.
> The new issue is the polling of vhu on one numa. Under heavy traffic, polling
> vhu + phy will make the pmds reach 100% usage. While other pmds on the
> other numa with only phy port reaches 70% usage. Enabling cross-numa polling
> for a vhu port would give us more benefits in this case. Overloads of 
> different
> pmds on both numa would be balanced.
> As you have mentioned, there is no code to apply this config for vhu while
> creating them. A global setting would save us from dynamically detecting the
> vhu name or any new creation.

Hi Wan Junjie,

We have done extensive benchmarking and found that we get better overall PMD 
load balance and resulting OVS performance when we do not statically pin any rx 
queues and instead let the auto-load-balancing find the optimal distribution of 
phy rx queues over both NUMA nodes to balance an asymmetric load of vhu rx 
queues (polled only on the local NUMA node).

Cross-NUMA polling of vhu rx queues comes with a very high latency cost due to 
cross-NUMA access to volatile virtio ring pointers in every iteration (not only 
when actually copying packets). Cross-NUMA polling of phy rx queues doesn't 
have a similar issue.

BR, Jan



Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-14 Thread Wan Junjie
On Mon, Feb 14, 2022 at 6:12 PM Jan Scheurich
 wrote:
>
> > >
> > > Btw, this patch is similar in functionality to the one posted by
> > > Anurag [0] and there was also some discussion about this approach here 
> > > [1].
> > >
> >
> > Thanks for pointing this out.
> > IMO, setting interface cross-numa would be good for phy port but not good 
> > for
> > vhu.  Since vhu can be destroyed and created relatively frequently.
> > But yes the main idea is the same.
> >
>
> We do acknowledge the benefit of non-pinned polling of phy rx queues by PMD 
> threads on all NUMA nodes. It gives the auto-load balancer much better 
> options to utilize spare capacity on PMDs on all NUMA nodes.
>
> Our patch proposed in 
> https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
> indeed covers the difference between phy and vhu ports. One has to explicitly 
> enable cross-NUMA-polling for individual interfaces with:
>
>ovs-vsctl set interface  other_config:cross-numa-polling=true
>
> This would typically only be done by static configuration for the fixed set 
> of physical ports. There is no code in the OpenStack's os-vif handler to 
> apply such configuration for dynamically created vhu ports.
>
> I would strongly suggest that cross-num-polling be introduced as a 
> per-interface option as in our patch rather than as a per-datapath option as 
> in your patch. Why not adapt our original patch to the latest OVS code base? 
> We can help you with that.
>
> BR, Jan
>

Hi, Jan Scheurich

We can achieve the static setting of pinning a phy port by combining
pmd-rxq-isolate and pmd-rxq-affinity. This setting can get the same
result, and we have seen the benefits.
The new issue is the polling of vhu on one numa. Under heavy traffic,
polling vhu + phy will make those pmds reach 100% usage, while other
pmds on the other numa with only the phy port reach 70% usage. Enabling
cross-numa polling for a vhu port would give us more benefits in this
case: the load of pmds on both numas would be balanced.
As you have mentioned, there is no code to apply this config for vhu
ports while creating them. A global setting would save us from
dynamically detecting the vhu name or any new creation.


Regards,
Wan Junjie


Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-14 Thread Jan Scheurich via dev
> >
> > Btw, this patch is similar in functionality to the one posted by
> > Anurag [0] and there was also some discussion about this approach here [1].
> >
> 
> Thanks for pointing this out.
> IMO, setting interface cross-numa would be good for phy port but not good for
> vhu.  Since vhu can be destroyed and created relatively frequently.
> But yes the main idea is the same.
> 

We do acknowledge the benefit of non-pinned polling of phy rx queues by PMD 
threads on all NUMA nodes. It gives the auto-load balancer much better options 
to utilize spare capacity on PMDs on all NUMA nodes.

Our patch proposed in 
https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html
indeed covers the difference between phy and vhu ports. One has to explicitly 
enable cross-NUMA-polling for individual interfaces with:

   ovs-vsctl set interface  other_config:cross-numa-polling=true

This would typically only be done by static configuration for the fixed set of 
physical ports. There is no code in OpenStack's os-vif handler to apply 
such configuration for dynamically created vhu ports.

I would strongly suggest that cross-numa-polling be introduced as a 
per-interface option as in our patch rather than as a per-datapath option as in 
your patch. Why not adapt our original patch to the latest OVS code base? We 
can help you with that.

BR, Jan



Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-02-12 Thread Wan Junjie
Hi Kevin,

Thanks for your reply.

Sorry about forgetting the cc, repost it.

On Sat, Feb 12, 2022 at 1:09 AM Kevin Traynor  wrote:
>
> Hi Wan Junjie,
>
> On 27/01/2022 11:43, Wan Junjie wrote:
> > When assigning an rxq to a pmd, the rxq will only be assigned to
> > the numa it belongs to. An exception is when the numa has no
> > non-isolated pmd there.
> >
> > For example, we have one vm port (vhu) with all its rxqs in numa1,
> > while one phy port (p0) in numa0 with all its rxqs in numa 0.
> > we have four pmds in two numas.
> > See tables below, paras are listed as (port, queue, numa, core)
> >   p0  0 0 20
> >   p0  1 0 21
> >   p0  2 0 20
> >   p0  3 0 21
> >   vhu 0 1 40
> >   vhu 1 1 41
> >   vhu 2 1 40
> >   vhu 3 1 41
> > In this scenario, ovs-dpdk underperforms a setup where all of p0's
> > queues are pinned to the four pmds using pmd-rxq-affinity. With
> > pmd-rxq-isolate=false, we can make vhu get polled on numa 1.
> >   p0  0 0 20
> >   p0  1 1 40
> >   p0  2 0 21
> >   p0  3 1 41
> >   vhu 0 1 40
> >   vhu 1 1 41
> >   vhu 2 1 40
> >   vhu 3 1 41
> > Then we found that with really high throughput, pmds 40 and 41
> > reach 100% busy cycles, while pmds 20 and 21 reach 70% busy
> > cycles. The unbalanced rxqs on these pmds became the bottleneck.
> >
>
> OTOH, you could also look at it that you are contributing to that
> overload by pinning p0 to those cores. So the partial pinning you are
> doing here is not a good solution for your config. If you had of
> continued to pin all the queues, then you would have had a better solution.
>

For a host, usually only one phy port is taken by ovs, and the numa it
uses can be determined at initialization. On the other hand, we can
have multiple vhu ports from several vms. The numa of the vhu ports
can be random, as the vms may use huge pages from different numas.
Partially pinning p0 to all pmd cores can benefit ovs performance, as
p0 is then polled by more threads; otherwise the receiving ovs could
only handle half of the throughput.

The main idea of pinning p0 to all cores is to get it polled more,
even if vhu does not get pinned cross-numa. This setting is good when
the traffic is not that high. Only when the traffic goes really high
does the bottleneck appear, and we did see it happen.
This patch provides an option for high traffic on all threads.
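Under the proposal being discussed (a patch-only option, not in mainline
OVS), the global switch would be enabled roughly like this:

```shell
# Proposed datapath option from the patch: ignore rxq NUMA locality
# when assigning rxqs to pmds.
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-numa=false
```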

> > pmd-rxq-numa=false would ignore the rxq's numa attribute when
> > assign rxq to pmds. This could make full use of the cores from
> > different numas while we have limited core resources for pmds.
> >
>
> Yes, with this config and this traffic, it is beneficial. The challenge
> is trying to also make it not cause performance regressions for other
> configs. Or at least having some control about that.
>
> With the code below (and the user enabling) there is no preference for
> local-numa polling, so it can be always local-numa or cross-numa. Have
> you tested the performance drop that will occur if cross-numa polling is
> selected instead of local-numa for different cases? It can be quite
> large, like 40%.
>
Yes, cross-numa could harm the throughput.
I did some tests placing the phy port, vhu port and pmds across the two
numas. See the combinations and results below (':' is the delimiter
between the numas):

  p0 : PMD + vhu    11 G
  PMD + p0 : vhu    9.7 G
  p0 + vhu : PMD    8.2 G
  PMD + p0 + vhu   14.2 G

The DUT is the receiver side. This data comes from iperf with one flow.
For multiple flows, in all cases, throughput reaches 22.8G, the
line-rate limit of the 25G nic.
For the PPS tests, the results did not show large gaps; all were around
5 Mpps.

In another test, with 100G x2 bonding, the cross-numa multiple-flow
test had the same result as local-numa.
In a real scenario, the multiple-flow results and the PPS data are
probably much more important.

Another thing needs to be mentioned.
From the tests we can see that when the vm and p0 are on the same numa,
performance is better than when they are on different numas. So two vms
on different numas could show different performance. This is inevitable.

> Another issue is that if you allow polling to switch numa commonly you
> also make estimates for assignments less reliable. I don't see any good
> way around this, but there might still be net benefit to lose some
> accuracy in order to be able to utilize all the cores. Especially if you
> consider worst cases where one numa pmds are overloaded and another numa
> pmds are idle.
>
> Btw, this patch is similar in functionality to the one posted by Anurag
> [0] and there was also some discussion about this approach here [1].
>

Thanks for pointing this out.
IMO, setting interface cross-numa would be good for phy port but not
good for vhu.  Since vhu can be destroyed and created relatively
frequently.
But yes the main idea is the same.


> Another option is to only use cross-numa cores when local-numa ones are
>   overloaded. That way we can get the benefit from local-numa polling