Hi OVS experts,
We are experiencing packet loss with OVS-DPDK and hope the community
can help. Thank you in advance.
Phenomenon:
In the OVS-DPDK environment, when the number of TCP connections
between two virtual machines exceeds 480, packet loss occurs, even
though the traffic is modest: about 182,071 packets per second and
roughly 2 Gbps.
Versions:
ovs: 2.17.2
dpdk: 20.11.7
How to reproduce:
1. start two VMs (8 vCPUs, 16 GB memory, 2 virtio queues each) on two
hosts with dpdkvhostuserclient interfaces; example config:
```
<interface type='vhostuser'>
  <mac address='fa:16:3e:7f:af:17'/>
  <source type='unix' path='/var/run/openvswitch/vhu1b62110d-e0' mode='server'/>
  <target dev='vhu1b62110d-e0'/>
  <model type='virtio'/>
  <driver queues='2' rx_queue_size='1024' tx_queue_size='1024'/>
  <alias name='net1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</interface>
```
2. start iperf3 servers on VM1 (redirection fixed so both stdout and
stderr land in the log file)
nohup taskset -c 0 iperf3 -s -p 12340 > /tmp/0 2>&1 &
nohup taskset -c 1 iperf3 -s -p 12341 > /tmp/1 2>&1 &
nohup taskset -c 2 iperf3 -s -p 12342 > /tmp/2 2>&1 &
nohup taskset -c 3 iperf3 -s -p 12343 > /tmp/3 2>&1 &
nohup taskset -c 4 iperf3 -s -p 12344 > /tmp/4 2>&1 &
nohup taskset -c 5 iperf3 -s -p 12345 > /tmp/5 2>&1 &
3. start iperf3 clients on VM2 (6 ports x 80 parallel streams = 480
connections, 6 x 80 x 4 Mbps = 1.92 Gbps total)
nohup iperf3 -c 10.10.10.185 -p 12340 -i 1 -l 1000 -b 4M -t 3000 -P 80 > /tmp/0 2>&1 &
nohup iperf3 -c 10.10.10.185 -p 12341 -i 1 -l 1000 -b 4M -t 3000 -P 80 > /tmp/1 2>&1 &
nohup iperf3 -c 10.10.10.185 -p 12342 -i 1 -l 1000 -b 4M -t 3000 -P 80 > /tmp/2 2>&1 &
nohup iperf3 -c 10.10.10.185 -p 12343 -i 1 -l 1000 -b 4M -t 3000 -P 80 > /tmp/3 2>&1 &
nohup iperf3 -c 10.10.10.185 -p 12344 -i 1 -l 1000 -b 4M -t 3000 -P 80 > /tmp/4 2>&1 &
nohup iperf3 -c 10.10.10.185 -p 12345 -i 1 -l 1000 -b 4M -t 3000 -P 80 > /tmp/5 2>&1 &
4. bind the server VM1's virtio-0-input and virtio-1-input interrupts
to CPUs 6 and 7
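For reference, step 4 can be scripted. This is only a sketch: the
interrupt names (`virtio0-input.0`, `virtio0-input.1`) are assumptions,
so check /proc/interrupts inside your guest for the exact names.

```shell
# Print the IRQ number whose /proc/interrupts line (read from stdin) matches $1.
irq_of() {
    awk -v name="$1" '$0 ~ name { sub(":", "", $1); print $1; exit }'
}

# Restrict an IRQ to the given CPU list (run as root inside the guest).
pin_irq() {
    echo "$2" > "/proc/irq/$1/smp_affinity_list"
}

# Usage, assuming these interrupt names exist in the guest:
#   pin_irq "$(irq_of 'virtio0-input.0' < /proc/interrupts)" 6
#   pin_irq "$(irq_of 'virtio0-input.1' < /proc/interrupts)" 7
```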
Then we can see tx drops on VM1's vhu interfaces increasing:
output1:
```
{ovs_rx_qos_drops=0, ovs_tx_failure_drops=51484738,
ovs_tx_invalid_hwol_drops=0, ovs_tx_mtu_exceeded_drops=0,
ovs_tx_qos_drops=0, ovs_tx_retries=176374,
rx_1024_to_1522_packets=1835, rx_128_to_255_packets=1637,
rx_1523_to_max_packets=0, rx_1_to_64_packets=2189,
rx_256_to_511_packets=230, rx_512_to_1023_packets=2724,
rx_65_to_127_packets=371660013, rx_bytes=25702390483, rx_dropped=0,
rx_errors=0, rx_packets=371668628, tx_bytes=2918495543672,
tx_dropped=51484760, tx_packets=1981250066}
```
output2:
```
{ovs_rx_qos_drops=0, ovs_tx_failure_drops=51525049,
ovs_tx_invalid_hwol_drops=0, ovs_tx_mtu_exceeded_drops=0,
ovs_tx_qos_drops=0, ovs_tx_retries=176491,
rx_1024_to_1522_packets=1835, rx_128_to_255_packets=1637,
rx_1523_to_max_packets=0, rx_1_to_64_packets=2189,
rx_256_to_511_packets=230, rx_512_to_1023_packets=2724,
rx_65_to_127_packets=372138096, rx_bytes=25734945373, rx_dropped=0,
rx_errors=0, rx_packets=372146711, tx_bytes=2919776711668,
tx_dropped=51525071, tx_packets=1982185204}
```
ovs_tx_failure_drops grows by roughly 100K every 5 seconds.
We read some materials and code:
https://docs.redhat.com/en/documentation/red_hat_openstack_platform/10/html/ovs-dpdk_end_to_end_troubleshooting_guide/tx_drops_on_instance_vhu_interfaces_with_open_vswitch_dpdk
https://github.com/openvswitch/ovs/commit/2f862c712e52fe524e441ab58bb042dcb20214ee
These describe this counter as a "TX queue full" drop. But how can we
prove that the virtio queue is actually full?
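Short of instrumenting the guest driver, one piece of evidence is that
ovs_tx_retries and ovs_tx_failure_drops rise together: a retry is
recorded when the vhost ring has no free descriptors on a send attempt,
and a failure drop when it is still full after the retry limit. A
minimal sampling sketch, assuming the example port name from the config
above (`statistics:<key>` is standard ovs-vsctl syntax):

```shell
# Sample vhost-user TX counters and compute per-second rates.
stat() {  # stat <iface> <key> -> current counter value
    ovs-vsctl get Interface "$1" "statistics:$2"
}
rate() {  # rate <old> <new> <seconds> -> increase per second
    echo $(( ($2 - $1) / $3 ))
}
# Usage (on the host):
#   d0=$(stat vhu1b62110d-e0 ovs_tx_failure_drops)
#   r0=$(stat vhu1b62110d-e0 ovs_tx_retries)
#   sleep 5
#   d1=$(stat vhu1b62110d-e0 ovs_tx_failure_drops)
#   r1=$(stat vhu1b62110d-e0 ovs_tx_retries)
#   echo "failure drops/s: $(rate "$d0" "$d1" 5), retries/s: $(rate "$r0" "$r1" 5)"
```

Applied to output1/output2 above, the failure-drop rate is
(51525049 - 51484738) / 5 = 8062 drops per second for that interval.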
Increasing the number of queues mitigates the problem to some extent,
but it requires restarting the virtual machine, and in a production
environment a restart is sometimes not acceptable to customers.
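One possibly restart-free middle ground, assuming the libvirt XML
already declares a higher maximum than the guest is actively using: the
`queues=` attribute in `<driver>` sets the maximum number of queue
pairs, and the guest can activate pairs up to that maximum at runtime
with ethtool. The interface name `eth0` below is a placeholder.

```shell
# Inside the guest (eth0 is a placeholder for the virtio interface):
ethtool -l eth0              # compare "Pre-set maximums" with "Current hardware settings"
ethtool -L eth0 combined 2   # activate queue pairs, up to the pre-set maximum
```

This does not help when the XML maximum itself has to grow; that still
requires redefining the domain and restarting it.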
In the guest VM, we asked users to change some kernel options:
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_max = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 524288
net.core.netdev_max_backlog = 60000
net.core.netdev_budget = 600
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 16384 1048576 16777216
net.ipv4.tcp_wmem = 16384 1048576 16777216
Unfortunately, there was no improvement.
Packet loss stops only when the number of connections is reduced. For
example, 40 connections at 90 Mbps each (3.6 Gbps total) show no loss,
even though that is more total bandwidth than the failing ~2 Gbps case,
so the trigger seems to be the connection count rather than throughput.
So, are there any other ways to mitigate this?
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss