Hi,
From my perspective the patch works for all cases. My test environment runs 
several k8s clusters and I haven't noticed any etcd failures so far.

Best regards

Michael

From: Lazuardi Nasution <mrxlazuar...@gmail.com>
Sent: Tuesday, 4 April 2023 09:41
To: Plato, Michael <michael.pl...@tu-berlin.de>
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

Hi Michael,

Does your patch also work for unreachable traffic within the same subnet? In 
my case, crashes happen when there are too many unreachable replies, even from 
the same subnet. For example, when one of the etcd instances is down, there is 
a huge number of reconnection attempts, followed by unreachable replies from 
the destination VM that hosts the downed etcd instance.

Best regards.

On Tue, Apr 4, 2023 at 1:06 PM Plato, Michael <michael.pl...@tu-berlin.de> 
wrote:
Hi,
I have some news on this topic. Unfortunately, I could not find the root 
cause, but I managed to implement a workaround (see the attached patch). The 
basic idea is to mark the NAT flows as invalid if there is no longer an 
associated connection. From my point of view it is a race condition that can 
be triggered by many short-lived connections. With the patch we no longer see 
any crashes. I can't say whether it has any negative effects, though, as I'm 
not an expert; so far I haven't found any problems. Without this patch we had 
hundreds of crashes a day :/
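
The workaround idea described above can be illustrated with a toy model. This 
is only a sketch of the concept, not the attached patch and not OVS code: the 
names `ToyTable`, `Conn`, `classify` and the returned strings are hypothetical, 
and the two connection types merely mirror the CT_CONN_TYPE_DEFAULT and 
CT_CONN_TYPE_UN_NAT names that appear in the original report further down in 
this thread.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple

Key = Tuple  # hypothetical stand-in for a conntrack key


class ConnType(Enum):
    # Names borrowed from the report in this thread; the numeric values
    # are an assumption based on the "conn_type=1" debug output.
    DEFAULT = 0
    UN_NAT = 1


@dataclass
class Conn:
    key: Key
    conn_type: ConnType
    partner_key: Optional[Key] = None  # UN_NAT entry -> its DEFAULT entry


class ToyTable:
    """Toy connection table with a DEFAULT entry plus a reverse NAT entry."""

    def __init__(self) -> None:
        self.conns: Dict[Key, Conn] = {}

    def add_nat_pair(self, key: Key, rev_key: Key) -> None:
        self.conns[key] = Conn(key, ConnType.DEFAULT)
        self.conns[rev_key] = Conn(rev_key, ConnType.UN_NAT, partner_key=key)

    def expire(self, key: Key) -> None:
        # Simulates the suspected race: only one entry of the pair is removed.
        self.conns.pop(key, None)

    def classify(self, key: Key) -> str:
        conn = self.conns.get(key)
        if conn is None:
            return "new"
        if conn.conn_type is ConnType.UN_NAT:
            partner = self.conns.get(conn.partner_key)
            if partner is None or partner.conn_type is not ConnType.DEFAULT:
                # Orphaned NAT entry: drop it and report the packet as
                # invalid instead of treating it as a DEFAULT connection.
                del self.conns[key]
                return "invalid"
        return "established"
```

In this toy model, a lookup that hits an orphaned NAT entry returns "invalid" 
and removes the stale entry, rather than reaching code that assumes a 
default-type connection.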

Best regards

Michael

From: Lazuardi Nasution <mrxlazuar...@gmail.com>
Sent: Monday, 3 April 2023 13:50
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael <michael.pl...@tu-berlin.de>
Subject: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

Hi,

Is this related to the following glibc bug? I'm not sure, because when I 
checked the glibc source of the installed version (2.35), the proposed patch 
had already been applied.

https://sourceware.org/bugzilla/show_bug.cgi?id=12889

I can confirm that this problem only happens if I use stateful ACLs, which 
rely on conntrack. The race occurs when a massive number of unreachable 
replies is received. For example, if I run etcd on VMs and one etcd node has 
been disabled, this causes massive connection attempts and unreachable replies.

Best regards.

On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution <mrxlazuar...@gmail.com> 
wrote:
Hi Michael,

Have you found a solution for this case? I am seeing the same weird problem, 
without any information about which conntrack entries are causing it.

I'm using OVS 3.0.1 with DPDK 21.11.2 on Ubuntu 22.04. By the way, this 
problem disappeared after I removed some Kubernetes cluster VMs and some DB 
cluster VMs.

Best regards.

Date: Thu, 29 Sep 2022 07:56:32 +0000
From: "Plato, Michael" <michael.pl...@tu-berlin.de>
To: "ovs-discuss@openvswitch.org" <ovs-discuss@openvswitch.org>
Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
Message-ID: <8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
Content-Type: text/plain; charset="us-ascii"

Hi,

We are about to roll out our new OpenStack infrastructure based on Yoga, and 
during our testing we observed that the openvswitch-switch systemd unit 
restarts several times a day, causing network interruptions for all VMs on the 
compute node in question.
After some research we found that ovs-vswitchd crashes with the following 
assertion failure:

"2022-09-29T06:51:05.195Z|00003|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
 assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in 
conn_update_state()"
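
To count how often this assertion fires, a log line like the one quoted above 
can be split into its fields. The following is a minimal sketch: the layout 
timestamp|sequence|module(thread)|LEVEL|message is inferred only from that 
single quoted line, and the function names `parse_line` and `count_assertions` 
are hypothetical helpers, not part of OVS.

```python
import re
from collections import Counter

# Field layout inferred from the one log line quoted in this mail:
#   timestamp|sequence|module(thread)|LEVEL|message
LOG_RE = re.compile(
    r"^(?P<ts>[^|]+)\|(?P<seq>\d+)\|(?P<module>[^|(]+)"
    r"(?:\((?P<thread>[^)]*)\))?\|(?P<level>[A-Z]+)\|(?P<msg>.*)$"
)

# The quoted line, joined back into a single line.
SAMPLE = (
    "2022-09-29T06:51:05.195Z|00003|util(pmd-c01/id:8)|EMER|"
    "../lib/conntrack.c:1095: assertion conn->conn_type == "
    "CT_CONN_TYPE_DEFAULT failed in conn_update_state()"
)


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def count_assertions(lines):
    """Count EMER-level assertion failures per file:line location."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["level"] == "EMER" and "assertion" in rec["msg"]:
            location = rec["msg"].split(" ", 1)[0].rstrip(":")
            counts[location] += 1
    return counts


if __name__ == "__main__":
    for location, n in count_assertions([SAMPLE]).items():
        print(location, n)
```

Feeding the helper the lines of the ovs-vswitchd log file would give a 
per-location crash count, which makes it easier to compare behaviour before 
and after a change.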

To get more information about the connection that leads to this assertion 
failure, I added some debug code to conntrack.c.
We have seen that we can trigger this issue by trying to connect from a VM to 
a destination that is unreachable, for example: curl https://www.google.de:444

Shortly after that we get an assertion failure, and the debug code prints:

conn_type=1 (may be CT_CONN_TYPE_UN_NAT) ?
src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst ip 
172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444 zone/rev zone 
2/2 nw_proto/rev nw_proto 6/6

ovs-appctl dpctl/dump-conntrack | grep "444"
tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),zone=2,protoinfo=(state=SYN_SENT)
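
Because grep "444" also matches any address or port that merely contains 
those digits, it can help to filter the dump on the port fields themselves. 
The sketch below parses an entry in the format shown above; the format is 
inferred only from that one line, and `parse_entry` and `matches_port` are 
hypothetical helper names.

```python
import re

# Matches parenthesised groups such as orig=(...), reply=(...), protoinfo=(...)
GROUP_RE = re.compile(r"(\w+)=\(([^)]*)\)")

# The entry quoted in this mail (addresses are masked in the original).
SAMPLE = (
    "tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),"
    "reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),"
    "zone=2,protoinfo=(state=SYN_SENT)"
)


def parse_entry(line):
    """Parse one conntrack dump line into a nested dict of strings."""
    proto, _, rest = line.strip().partition(",")
    entry = {"proto": proto}
    for name, body in GROUP_RE.findall(rest):
        entry[name] = dict(
            kv.split("=", 1) for kv in body.split(",") if "=" in kv
        )
    # Remaining top-level key=value fields, for example zone=2.
    for kv in GROUP_RE.sub("", rest).split(","):
        if "=" in kv:
            key, value = kv.split("=", 1)
            entry[key] = value
    return entry


def matches_port(entry, port):
    """True if the port is a source or destination port in either direction."""
    port = str(port)
    for side in ("orig", "reply"):
        fields = entry.get(side, {})
        if port in (fields.get("sport"), fields.get("dport")):
            return True
    return False


if __name__ == "__main__":
    entry = parse_entry(SAMPLE)
    if matches_port(entry, 444):
        print(entry["proto"], entry["zone"], entry["protoinfo"]["state"])
```

The output of ovs-appctl dpctl/dump-conntrack could be piped into a small 
script built on these helpers to list only the entries for the port under test.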

Versions:
ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.17.2
DB Schema 8.3.0

ovn-controller --version
ovn-controller 22.03.0
Open vSwitch Library 2.17.0
OpenFlow versions 0x6:0x6
SB DB Schema 20.21.0

DPDK 21.11.2

We are now unsure if this is a misconfiguration or if we hit a bug.

Thanks for any feedback

Michael
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
