Hi,

I'm using OVS with OpenStack Yoga (deployed by kolla-ansible), without
OVN. We have dedicated Neutron L3 network nodes that run neutron-l3-agent
to connect VXLAN tenant networks.

When the VXLAN packet rate sent to a network node stays below 100,000
packets/sec we don't have any problem, but when the packet rate climbs to
300,000 packets/sec we periodically see packet drops on the physical
interfaces.

The strange thing is that sometimes the packet rate climbs to 300,000
packets/sec with no drops at all, while at other times it only reaches
200,000 packets/sec and the drop rate goes up to 2,000 packets/sec.

Drops peaking above 2,000 packets/sec occurred 9 times in 3 hours.
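For reference, the burst rates above come from sampling the interface
drop counters and dividing the delta by the sample interval; a trivial
helper (the function name and the sample values are ours, purely for
illustration):

```shell
# drop_rate OLD NEW SECONDS: drop rate in packets/sec from two samples
# of a counter such as /sys/class/net/<iface>/statistics/rx_dropped.
drop_rate() { echo $(( ($2 - $1) / $3 )); }

# Example with made-up samples taken one second apart:
drop_rate 59721243 59723243 1   # -> 2000
```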


cat /proc/net/dev | column -t
Inter-|  Receive                                                             |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast               |bytes packets errs drop fifo colls carrier compressed
bond0:          1229518716379867 1402282814614 0 1612862140 0 0 0 789948303   808181727486045 1147395595472 0 5 0 0 0 0
bond0.1101:     587766817883843 341996183858 0 0 0 0 0 21777                  417218398136997 589085919748 0 0 0 0 0 0
ens1f0np0:      312755571667466 378057363607 0 28204910 0 0 0 198566646       203881369193882 291861651408 0 0 0 0 0 0
br-tun:         0 0 0 0 0 0 0 0                                               0 0 0 0 0 0 0 0
br-ex:          0 0 0 88026356178 0 0 0 0                                     0 0 0 0 0 0 0 0
ens10f1:        0 0 0 0 0 0 0 0                                               0 0 0 0 0 0 0 0
vxlan_sys_4789: 310591659314200 174122032400 0 0 0 0 0 0                      142088649261411 184955648498 0 0 0 0 0 0
ovs-system:     0 0 0 0 0 0 0 0                                               0 0 0 0 0 0 0 0
tapa52863dc-ab: 0 0 0 31384 0 0 0 0                                           0 0 0 0 0 0 0 0
ens2f0np0:      323323937488958 342547015164 0 59721243 0 0 0 198141664       215335993133985 296977179863 0 0 0 0 0 0
ens1f1np1:      162645543120864 297124381886 0 547 0 0 0 15125913548          220501163493120 228094313440 0 0 0 0 0 0
ens10f2:        0 0 0 0 0 0 0 0                                               0 0 0 0 0 0 0 0
docker0:        0 0 0 0 0 0 0 0                                               0 0 0 0 0 0 0 0
tapc18c46e6-d7: 0 0 0 0 0 0 0 0                                               0 0 0 0 0 0 0 0
tap46037b70-51: 0 0 0 12825 0 0 0 0                                           0 0 0 0 0 0 0 0
br-int:         0 0 0 117588157210 0 0 0 0                                    0 0 0 0 0 0 0 0
bond0.1202:     341091911 7201662 0 4 0 0 0 10852                             13483450128 1685930 0 0 0 0 0 0
bond1:          463256663322004 768463957603 0 10993 0 0 0 45088631734        667594181742640 690319573311 0 2 0 0 0 0
ens10f3:        0 0 0 0 0 0 0 0                                               0 0 0 0 0 0 0 0
bond1.1201:     6485684512877 10099878625 0 0 0 0 0 85801552                  9606602848414 5262009051 0 0 0 0 0 0
lo:             13108706568840 4596904610 0 3 0 0 0 0                         13108706568840 4596904610 0 0 0 0 0 0
ens2f1np1:      151015096573480 237504719792 0 5223 0 0 0 15049082867         226123173562828 233815782870 0 0 0 0 0 0
ens10f0:        0 0 0 0 0 0 0 0                                               0 0 0 0 0 0 0 0

The maximum packet rate into a single NIC queue reaches about 80,000 packets/sec:

./ethq -t ens2f0np0
       nic    txp    rxp        txb        rxb   txmbps   rxmbps
 ens2f0np0 142757 169715  165531909  200542000 1324.255 1604.336
         0  79618   4257          -          -        -        -
         1   1522   2498          -          -        -        -
         2  20556   4058          -          -        -        -
         3   2135   7817          -          -        -        -
         4   9135   4892          -          -        -        -
         5   1415  79387          -          -        -        -
         6   7194   2813          -          -        -        -
         7   1739  23426          -          -        -        -
Our NICs are dual-port Broadcom P210p NetXtreme-E 10Gb Ethernet PCIe
adapters.

The OS is Ubuntu Server 20.04 (kernel 5.4.0).
The OVS version is 2.17.0.

After investigating, we found that the drop count on the physical
interface exactly equals the discarded-packet count reported by
ethtool -S:

ethtool -S ens2f0np0 | grep -e 'rx_discards' -e 'rx_total_discard_pkts'

NIC statistics:
     [0]: rx_discards: 0
     [1]: rx_discards: 377424
     [2]: rx_discards: 29662
     [3]: rx_discards: 903618
     [4]: rx_discards: 919911
     [5]: rx_discards: 20368639
     [6]: rx_discards: 0
     [7]: rx_discards: 37121989
     rx_total_discard_pkts: 59721243
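As a sanity check, the per-queue rx_discards counters do sum exactly to
rx_total_discard_pkts; a quick awk over the figures quoted above:

```shell
# Sum the per-queue rx_discards values; the result should equal
# rx_total_discard_pkts and the /proc/net/dev "drop" count (59721243).
awk '{ sum += $NF } END { print sum }' <<'EOF'
[0]: rx_discards: 0
[1]: rx_discards: 377424
[2]: rx_discards: 29662
[3]: rx_discards: 903618
[4]: rx_discards: 919911
[5]: rx_discards: 20368639
[6]: rx_discards: 0
[7]: rx_discards: 37121989
EOF
# -> 59721243
```

Note that most of the discards are concentrated on queues 5 and 7, which
matches the single-queue saturation seen in the ethq output.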

ethtool -S ens2f0np0 | grep -e 'drop'

NIC statistics:
     [0]: rx_drops: 0    [0]: tx_drops: 0
     [1]: rx_drops: 0    [1]: tx_drops: 0
     [2]: rx_drops: 0    [2]: tx_drops: 0
     [3]: rx_drops: 0    [3]: tx_drops: 0
     [4]: rx_drops: 0    [4]: tx_drops: 0
     [5]: rx_drops: 0    [5]: tx_drops: 0
     [6]: rx_drops: 0    [6]: tx_drops: 0
     [7]: rx_drops: 0    [7]: tx_drops: 0

We tried tuning the RX ring buffer size:

sudo ethtool -G ens2f0np0 rx 4096
sudo ethtool -G ens1f0np0 rx 4096
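(To confirm the new ring size took effect, and to see the hardware
maximum, the settings can be read back; these are just the standard
ethtool query commands for the interfaces above:)

```shell
# Show the pre-set hardware maximum and the currently programmed
# RX/TX ring sizes after the -G change.
ethtool -g ens2f0np0
ethtool -g ens1f0np0
```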

We also tuned the netdev budget sysctls:

net.core.netdev_budget = 1200
net.core.netdev_budget_usecs = 16000
net.core.dev_weight = 128
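The sysctls were applied at runtime; to keep them across reboots we also
write them to a sysctl drop-in file (the file name below is our own
arbitrary choice):

```shell
# Persist the netdev budget tuning; 90-netdev-budget.conf is an
# arbitrary name under /etc/sysctl.d/.
sudo tee /etc/sysctl.d/90-netdev-budget.conf <<'EOF'
net.core.netdev_budget = 1200
net.core.netdev_budget_usecs = 16000
net.core.dev_weight = 128
EOF
sudo sysctl --system
```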

But neither change helped; packets are still being dropped/discarded.

Then we noticed that the CPU usage of the ovs-vswitchd revalidator
threads is quite high:

top - 17:17:05 up 1023 days,  2:42, 24 users,  load average: 7.48, 9.01, 9.37
Threads:  98 total,   2 running,  96 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.6 us,  1.7 sy,  0.0 ni, 87.6 id,  0.0 wa,  0.0 hi,  6.1 si,  0.0 st
MiB Mem : 386392.5 total, 225536.8 free, 125843.6 used,  35012.1 buff/cache
MiB Swap:  40960.0 total,  40960.0 free,      0.0 used. 251973.2 avail Mem

  PID USER     PR  NI    VIRT  RES   SHR S %CPU %MEM     TIME+ COMMAND
21598 root     20   0 7290168 1.2g 11756 S 13.3  0.3  16391:38 ovs-vswitchd
30484 root     20   0 7290168 1.2g 11756 S  8.6  0.3  12804:43 revalidator73
30515 root     20   0 7290168 1.2g 11756 S  8.6  0.3  10094:21 revalidator96
30488 root     20   0 7290168 1.2g 11756 S  8.3  0.3  10192:06 revalidator77
30490 root     20   0 7290168 1.2g 11756 S  8.3  0.3  10176:31 revalidator79
30507 root     20   0 7290168 1.2g 11756 S  8.3  0.3  10205:52 revalidator91
30508 root     20   0 7290168 1.2g 11756 S  8.3  0.3  10098:58 revalidator92
30516 root     20   0 7290168 1.2g 11756 S  8.3  0.3  10101:48 revalidator97
30501 root     20   0 7290168 1.2g 11756 S  8.0  0.3  10092:29 revalidator85
30502 root     20   0 7290168 1.2g 11756 S  8.0  0.3  10095:15 revalidator86
30485 root     20   0 7290168 1.2g 11756 S  7.6  0.3  10236:32 revalidator74

So we tuned some OVS revalidator parameters:

ovs-vsctl set Open_vSwitch . other_config:min-revalidate-pps=150
ovs-vsctl set Open_vSwitch . other_config:max-idle=30000
ovs-vsctl set Open_vSwitch . other_config:max-revalidator=30000
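To confirm the settings were applied and to watch the revalidators while
the traffic peaks, the standard OVS introspection commands can be used
(these are generic ovs-vsctl/ovs-appctl calls, nothing specific to our
setup):

```shell
# Read back the other_config values we set on the Open_vSwitch table:
ovs-vsctl get Open_vSwitch . other_config

# Show datapath flow counts and the revalidators' last dump duration:
ovs-appctl upcall/show

# Show per-datapath statistics (lookup hits/misses, flow count):
ovs-appctl dpctl/show
```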

After this tuning:

- The CPU usage of the OVS revalidator threads decreased to ~4.5%
- Drop events decreased from 9 times in 3 hours to 3 times in 3 hours,
and the peak drop rate fell from 2,000 packets/sec to 1,200 packets/sec

But we still:

- Can't get the drop rate down to zero
- Can't explain the relation between the physical packet discards and
the high revalidator CPU usage
- Can't explain why the receive rate sometimes climbs to 300,000
packets/sec without any drops/discards at all

Has anyone seen this issue, and can you help us understand and fix this
packet drop problem? Thank you
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
