Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

Lazuardi Nasution via discuss Tue, 04 Apr 2023 00:56:50 -0700

Hi Michael,

I assume that your k8s cluster is on the same subnet, right? Would you mind
testing it by shutting down one of the etcd instances and seeing whether
this bug still occurs?
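
One way to check this during such a test is to count the assertion
failures in the ovs-vswitchd log before and after stopping the etcd
instance. Here is a minimal Python sketch, assuming the log format matches
the single line quoted in the original report below. The function name and
the sample data are illustrative only and are not part of OVS:

```python
import re

# Pattern modelled on the one log line quoted in this thread. Other OVS
# versions may format the message differently (assumption).
ASSERT_RE = re.compile(
    r"^(?P<ts>[^|]+)\|\d+\|util\((?P<thread>[^)]*)\)\|EMER\|"
    r"(?P<file>[^:]+):(?P<line>\d+):\s*"
    r"assertion (?P<expr>.+?) failed in (?P<func>\w+)\(\)"
)

# The log line from the original report, joined back into a single line.
SAMPLE_LINE = (
    "2022-09-29T06:51:05.195Z|00003|util(pmd-c01/id:8)|EMER|"
    "../lib/conntrack.c:1095: assertion conn->conn_type == "
    "CT_CONN_TYPE_DEFAULT failed in conn_update_state()"
)


def find_assertion_failures(lines):
    """Return one dict per log line that matches the assertion pattern."""
    hits = []
    for raw in lines:
        # Drop surrounding whitespace and quotes, as in the quoted email.
        match = ASSERT_RE.match(raw.strip().strip('"'))
        if match:
            hits.append(match.groupdict())
    return hits


if __name__ == "__main__":
    for hit in find_assertion_failures([SAMPLE_LINE]):
        print(hit["ts"], hit["thread"], hit["file"], hit["line"], hit["func"])
```

To run it against a real log, pass any iterable of lines, for example an
open file object for whatever path your installation logs to.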


Best regards.

On Tue, Apr 4, 2023 at 2:50 PM Plato, Michael <michael.pl...@tu-berlin.de>
wrote:

> Hi,
>
> From my perspective, the patch works for all cases. My test environment
> runs several k8s clusters, and I haven't noticed any etcd failures so
> far.
>
>
>
> Best regards
>
>
>
> Michael
>
>
>
> *From:* Lazuardi Nasution <mrxlazuar...@gmail.com>
> *Sent:* Tuesday, April 4, 2023 09:41
> *To:* Plato, Michael <michael.pl...@tu-berlin.de>
> *Cc:* ovs-discuss@openvswitch.org
> *Subject:* Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day
>
>
>
> Hi Michael,
>
>
>
> Does your patch also work for unreachable traffic within the same subnet?
> In my case, crashes happen when there are too many unreachable replies,
> even from the same subnet. For example, when one of the etcd instances is
> down, there is a huge number of reconnection attempts, followed by
> unreachable replies from the destination VM that hosts the down etcd
> instance.
>
>
>
> Best regards.
>
>
>
> On Tue, Apr 4, 2023 at 1:06 PM Plato, Michael <michael.pl...@tu-berlin.de>
> wrote:
>
> Hi,
>
> I have some news on this topic. Unfortunately, I could not find the root
> cause, but I managed to implement a workaround (see patch in attachment).
> The basic idea is to mark the NAT flows as invalid if there is no longer an
> associated connection. From my point of view, it is a race condition that
> can be triggered by many short-lived connections. With the patch we no
> longer have any crashes. I can't say whether it has any negative effects,
> though, as I'm not an expert. So far, at least, I haven't found any
> problems. Without this patch we had hundreds of crashes a day :/
>
>
>
> Best regards
>
>
> Michael
>
>
>
> *From:* Lazuardi Nasution <mrxlazuar...@gmail.com>
> *Sent:* Monday, April 3, 2023 13:50
> *To:* ovs-discuss@openvswitch.org
> *Cc:* Plato, Michael <michael.pl...@tu-berlin.de>
> *Subject:* Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day
>
>
>
> Hi,
>
>
>
> Is this related to the following glibc bug? I'm not sure, because when I
> checked the glibc source of the installed version (2.35), the proposed
> patch had already been applied.
>
>
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=12889
>
>
>
> I can confirm that this problem only happens if I use stateful ACLs, which
> rely on conntrack. The race occurs when a massive number of unreachable
> replies is received. For example, I run etcd on VMs, and when one etcd
> node has been disabled, this causes massive connection attempts and
> unreachable replies.
>
>
>
> Best regards.
>
>
>
> On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution <mrxlazuar...@gmail.com>
> wrote:
>
> Hi Michael,
>
>
>
> Have you found a solution for this case? I am seeing the same strange
> problem, without any information about which conntrack entries are
> causing it.
>
>
>
> I'm using OVS 3.0.1 with DPDK 21.11.2 on Ubuntu 22.04. By the way, the
> problem disappeared after I removed some Kubernetes cluster VMs and some
> DB cluster VMs.
>
>
>
> Best regards.
>
>
>
> Date: Thu, 29 Sep 2022 07:56:32 +0000
> From: "Plato, Michael" <michael.pl...@tu-berlin.de>
> To: "ovs-discuss@openvswitch.org" <ovs-discuss@openvswitch.org>
> Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
> Message-ID: <8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi,
>
> We are about to roll out our new OpenStack infrastructure based on Yoga,
> and during our testing we observed that the openvswitch-switch systemd
> unit restarts several times a day, causing network interruptions for all
> VMs on the compute node in question.
> After some research we found that ovs-vswitchd crashes with the
> following assertion failure:
>
> "2022-09-29T06:51:05.195Z|00003|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095: assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in conn_update_state()"
>
> To get more information about the connection that leads to this assertion
> failure, I added some debug code to conntrack.c.
> We have seen that we can trigger this issue by trying to connect from a
> VM to a destination that is unreachable. For example:
> curl https://www.google.de:444
>
> Shortly after that we get an assertion failure and the debug output says:
>
> conn_type=1 (may be CT_CONN_TYPE_UN_NAT) ?
> src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst
> ip 172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444
> zone/rev zone 2/2 nw_proto/rev nw_proto 6/6
>
> ovs-appctl dpctl/dump-conntrack | grep "444"
>
> tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),zone=2,protoinfo=(state=SYN_SENT)
>
> Versions:
> ovs-vsctl --version
> ovs-vsctl (Open vSwitch) 2.17.2
> DB Schema 8.3.0
>
> ovn-controller --version
> ovn-controller 22.03.0
> Open vSwitch Library 2.17.0
> OpenFlow versions 0x6:0x6
> SB DB Schema 20.21.0
>
> DPDK 21.11.2
>
> We are now unsure if this is a misconfiguration or if we hit a bug.
>
> Thanks for any feedback
>
> Michael
>
>
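
Note that the `grep "444"` in the original report matches any line that
contains the string 444, including other ports or addresses. The following
is a minimal Python sketch that parses a dump entry and compares the ports
exactly. It assumes every entry has the same flat shape as the single
entry quoted above (no nested parentheses); the function names are
illustrative only and are not part of OVS:

```python
import re

# Matches "name=(k=v,k=v,...)" groups such as orig=(...) and reply=(...).
GROUP_RE = re.compile(r"(\w+)=\(([^)]*)\)")

# The conntrack entry from the original report (addresses masked there).
SAMPLE_ENTRY = (
    "tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),"
    "reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),"
    "zone=2,protoinfo=(state=SYN_SENT)"
)


def parse_entry(line):
    """Parse one dump line into a dict; parenthesised groups become sub-dicts."""
    line = line.strip()
    entry = {"proto": line.split(",", 1)[0]}
    for name, body in GROUP_RE.findall(line):
        entry[name] = dict(kv.split("=", 1) for kv in body.split(","))
    # Whatever is left after removing the groups are plain key=value fields.
    for field in GROUP_RE.sub("", line).split(","):
        if "=" in field:
            key, value = field.split("=", 1)
            entry[key] = value
    return entry


def uses_port(entry, port):
    """True if the port appears as sport or dport in either direction."""
    port = str(port)
    return any(
        entry.get(direction, {}).get(key) == port
        for direction in ("orig", "reply")
        for key in ("sport", "dport")
    )


if __name__ == "__main__":
    parsed = parse_entry(SAMPLE_ENTRY)
    print(parsed["proto"], parsed["zone"], parsed["protoinfo"]["state"])
    print(uses_port(parsed, 444))
```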

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
