On 18/2/2022 6:31 pm, Jake Yip wrote:
On 16/2/2022 6:38 pm, Daniel Alvarez wrote:


On 16 Feb 2022, at 06:02, Jake Yip via discuss <ovs-discuss@openvswitch.org> wrote:

Hi all,

We are running VMs on OpenStack with OVN. We have an issue with performance, tested using iperf3 with TCP. We get around ~300 Kbit/s in the problematic scenario, when normal traffic is around 1 Gbit/s. Some observations are:

* Only TCP is affected, not UDP iperf3 tests
* It only happens between some nodes, not others
* It only happens in one direction in some cases
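
For reference, the tests were roughly of this shape (the server address is a placeholder, and the 30 s duration is arbitrary):

  # on the receiving VM
  iperf3 -s

  # TCP test from the sending VM
  iperf3 -c <server-ip> -t 30

  # UDP test for comparison
  iperf3 -c <server-ip> -u -b 1G -t 30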

We've looked into this and found that the poor performance may be due to retransmissions / congestion. Looking deeper, there seems to be some interesting behaviour around fragmentation / reassembly.

Our architecture is like this:

VM23 - [TAP23 - BOND23] -- (internet) -- [BOND21 - TAP21] - VM21

VMs are on hypervisors; on each hypervisor the tap devices egress via the bonds.

We have done tcpdumps from VM23 , VM21, BOND21 and TAP21.
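
The captures were along these lines (interface names as in the diagram above; 5201 is the default iperf3 port, and Geneve uses UDP port 6081):

  # inner traffic on the tap of the receiving hypervisor
  tcpdump -i tap21 -nn -s 0 -w tap21.pcap 'tcp port 5201'

  # encapsulated traffic on the bond of the same hypervisor
  tcpdump -i bond21 -nn -s 0 -w bond21.pcap 'udp port 6081'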

What we have found is that a PSH,ACK packet from VM23 is re-written into an ACK packet when it gets to the bond.

When they get to the other side, these packets don't seem to be reassembled properly before being passed on via the tap into VM21.

We would like to know whether this behaviour (rewriting a PSH,ACK into separate ACK packets) is normal behaviour for OVS/OVN. Is there any other reason why there are so many retransmissions?

I'm not sure if this is an OVN or OVS issue, apologies if this is not the right list. I'm also not sure if I'm debugging this issue correctly. Any help will be welcome!

Which NICs are you using?
Is this Geneve traffic? over VLAN?
If so, could you re-run your tests after disabling VLAN tx offload on both hypervisors (with ethtool)?
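
Something along these lines on each hypervisor (interface name is just an example):

  ethtool -k bond0 | grep vlan    # show current VLAN offload settings
  ethtool -K bond0 txvlan off     # disable VLAN tx offload (needs root)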

Cheers,
Daniel

Thanks!

We tried disabling VLAN tx offload on the hypervisors; no difference. We also tried turning off a few other offloads, with no difference either.

Interestingly, the same VM/hypervisor only has problems to some VMs, not others. If it were an offload issue, wouldn't it affect traffic to all VMs equally?

I'm not sure whether it is related to HW or SW; it is all very peculiar.

Regards,
Jake

Just to answer my own question and close off this discussion: we have found that it was caused by Generic Receive Offload (GRO).

We have GRE tunnels across the internet to connect different sites together. An (iperf3) packet travelling across the tunnel looks like:

 (GRE (GENEVE (TCP)))

We found that, on some tunnel nodes, GRO joins two UDP GENEVE packets (and the inner TCP data) together when de-encapsulating the GRE packet, but does not seem to generate a valid UDP checksum for the resulting packet. This packet appears to be dropped further down the line, causing massive packet loss in iperf3.
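
One way to see this is a verbose tcpdump on the receiving tunnel node; with -vv tcpdump verifies checksums and flags invalid ones as "bad udp cksum" (interface name is just an example; Geneve is UDP port 6081):

  tcpdump -i eth0 -nn -vv udp port 6081

(On the sending side this check is misleading, since checksum offload means locally captured outgoing packets often show bad checksums anyway.)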

Turning off GRO resolves this issue. Thanks to everyone who pointed me in the direction of offloads!
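
For anyone hitting the same problem, checking and disabling GRO on the tunnel-node interface looks roughly like this (interface name is just an example; the setting does not survive a reboot, so it needs to go into your network config to be permanent):

  ethtool -k eth0 | grep generic-receive-offload   # check whether GRO is on
  ethtool -K eth0 gro off                          # disable GRO (needs root)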

Regards,
Jake