Hi Ilya,

thank you very much for your detailed answers. We did a lot of research in our environment and tried to troubleshoot the recurring issues. After we upgraded the Linux kernel and the NIC drivers to the latest available versions, the bond NICs are no longer flapping. But since last week we have been getting some curious errors in the kernel log that say something like:

"NETDEV WATCHDOG: enp3s0f1 (ixgbe): transmit queue 23 timed out .... Fake Tx hang detected with timeout of 5 seconds"

I will open a request with Intel about these messages.

While researching, we wrote a little job that saves the flow statistics from "ovs-appctl upcall/show | grep flows" to a simple text file (a rough sketch of the job follows the log extract below), and it turned out that the limit is adjusted dynamically. I guess this is nothing new for you. Here is a little extract of our logs, taken while an "overcloud deployment" was running; as you can see, the limit sometimes drops to only 1000, which is lower than the current flow count at that time.

"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:41:31 CEST 2021\n flows : (current 1048) (avg 1075) (max 7680) (limit 1000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:42:01 CEST 2021\n flows : (current 677) (avg 1525) (max 7680) (limit 28125)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:42:31 CEST 2021\n flows : (current 957) (avg 1809) (max 7680) (limit 1000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:43:02 CEST 2021\n flows : (current 1044) (avg 1162) (max 7680) (limit 1000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:43:32 CEST 2021\n flows : (current 1025) (avg 1091) (max 7680) (limit 1000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:44:02 CEST 2021\n flows : (current 931) (avg 914) (max 7680) (limit 14000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:44:32 CEST 2021\n flows : (current 3772) (avg 2157) (max 7680) (limit 57000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:45:01 CEST 2021\n flows : (current 901) (avg 2383) (max 7680) (limit 1000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:45:31 CEST 2021\n flows : (current 1129) (avg 1205) (max 7680) (limit 1000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:46:01 CEST 2021\n flows : (current 982) (avg 1262) (max 7680) (limit 1000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:46:31 CEST 2021\n flows : (current 1057) (avg 2823) (max 7680) (limit 1000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:47:02 CEST 2021\n flows : (current 935) (avg 1271) (max 7680) (limit 1000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:47:32 CEST 2021\n flows : (current 954) (avg 1310) (max 7680) (limit 1000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:48:02 CEST 2021\n flows : (current 1066) (avg 1067) (max 7680) (limit 1000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:48:32 CEST 2021\n flows : (current 194) (avg 472) (max 1827) (limit 29000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:49:01 CEST 2021\n flows : (current 610) (avg 618) (max 1827) (limit 79000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:49:31 CEST 2021\n flows : (current 883) (avg 924) (max 1827) (limit 139000)"
"2021-04-10T18:59:37.603Z",os-ctrl-01,"ovs.flows","Sat Apr 10 20:50:01 CEST 2021\n flows : (current 476) (avg 483) (max 1827) (limit 199000)"
"2021-04-10T18:59:37.600Z",os-ctrl-01,"ovs.flows","Sat Apr 10 13:22:31 CEST 2021\n flows : (current 557) (avg 552) (max 2773) (limit 180000)" "2021-04-10T18:59:37.600Z",os-ctrl-01,"ovs.flows","Sat Apr 10 13:23:01 CEST 2021\n flows : (current 473) (avg 485) (max 2773) (limit 197000)" "2021-04-10T18:59:37.600Z",os-ctrl-01,"ovs.flows","Sat Apr 10 13:23:31 CEST 2021\n flows : (current 527) (avg 531) (max 2773) (limit 197000)" "2021-04-10T18:59:37.600Z",os-ctrl-01,"ovs.flows","Sat Apr 10 13:24:01 CEST 2021\n flows : (current 577) (avg 573) (max 2773) (limit 197000)" "2021-04-10T18:59:37.599Z",os-ctrl-01,"ovs.flows","Sat Apr 10 12:57:31 CEST 2021\n flows : (current 425) (avg 414) (max 3769) (limit 122000)" "2021-04-10T18:59:37.599Z",os-ctrl-01,"ovs.flows","Sat Apr 10 13:22:01 CEST 2021\n flows : (current 547) (avg 546) (max 2773) (limit 166000)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:41:32 CEST 2021\n flows : (current 4272) (avg 4291) (max 7483) (limit 36095)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:42:01 CEST 2021\n flows : (current 2876) (avg 2881) (max 7483) (limit 150095)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:42:31 CEST 2021\n flows : (current 4184) (avg 4387) (max 7483) (limit 23870)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:43:01 CEST 2021\n flows : (current 1243) (avg 1235) (max 7483) (limit 21000)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:43:31 CEST 2021\n flows : (current 5812) (avg 5738) (max 7483) (limit 113000)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:45:01 CEST 2021\n flows : (current 2107) (avg 917) (max 8250) (limit 3000)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:45:31 CEST 2021\n flows : (current 2129) (avg 1846) (max 8250) (limit 2250)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:46:02 CEST 2021\n flows : (current 1149) (avg 1351) (max 8250) (limit 1000)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:46:32 CEST 2021\n flows : (current 3862) (avg 2457) (max 8250) (limit 8000)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:47:01 CEST 2021\n flows : (current 3567) (avg 2024) (max 8250) (limit 2750)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:47:31 CEST 2021\n flows : (current 413) (avg 872) (max 8250) (limit 3662)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:48:01 CEST 2021\n flows : (current 2098) (avg 1088) (max 8250) (limit 2000)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:48:31 CEST 2021\n flows : (current 1217) (avg 1363) (max 8250) (limit 1000)" "2021-04-10T18:54:37.147Z",os-ctrl-02,"ovs.flows","Sat Apr 10 20:49:01 CEST 2021\n flows : (current 3332) (avg 3364) (max 8250) (limit 144000)" After reading the documentation we set those two options, although the first one is not recommended: $> ovs-vsctl --no-wait set Open_vSwitch . other_config:max-revalidator=10000 $> ovs-appctl upcall/enable-megaflows This seems to be a short-termed workaround, but it was necessary to change the value for "max-revalidator" which is stated to better leave untouched according to the documentation. We figured that the frequent revalidation of the network seems to set the flow limit to 1000 but we are not quite sure how the revalidation is actually deciding that it needs to lower the limit. So, several questions: 1. The dynamic limit adjustment is set by a kind of rating-mechanism in OVS. 
After reading the documentation we set the following two options, although the first one is not recommended:

$> ovs-vsctl --no-wait set Open_vSwitch . other_config:max-revalidator=10000
$> ovs-appctl upcall/enable-megaflows

This looks like a short-term workaround, but it seemed necessary to change the value of "max-revalidator", even though the documentation says it is better left untouched. We figured that the frequent revalidation seems to push the flow limit down to 1000, but we are not quite sure how the revalidation actually decides that it needs to lower the limit.

So, several questions:

1. The dynamic limit adjustment seems to be driven by some kind of rating mechanism in OVS. Is there any way to get deeper insight into what OVS identifies as the problem that makes it lower the limit down to 1000?

2. Are the values we set really a good idea, or more of a no-go? Especially the large increase of the revalidator value?

3. Could enabling megaflows alone already allow our deployment to finish without running into those low limits? From what we understood, it speeds up handling of traffic that comes from the same source host but with different headers a lot, because the wildcard matching lets a single cache entry cover many packets. Can you give us a brief hint whether we are on the right track here?
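In case it helps with question 3: this is roughly how we look at the installed datapath flows and the dynamic limit while the deployment runs (standard commands on the kernel datapath; the interpretation of the masks is ours):

# Kernel datapath flows as actually installed; with megaflows enabled,
# most packet fields should show up masked or be left out entirely.
$> ovs-appctl dpctl/dump-flows

# Current datapath flow counts and the dynamically adjusted limit:
$> ovs-appctl upcall/show

# The overrides we currently have in the database:
$> ovs-vsctl get Open_vSwitch . other_config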
Kind regards
Daniel

-----Original Message-----
From: Ilya Maximets <i.maxim...@ovn.org>
Sent: Thursday, 4 March 2021 20:48
To: ovs-discuss@openvswitch.org
Cc: i.maxim...@ovn.org; Daniel Pfanz <daniel.pf...@werum.com>
Subject: [EXTERNAL] Re: [ovs-discuss] bond members flapping in state up and down

> Hi OVS-team,
>
> we neither know if this is really a bug, nor if it really has something to do with OVS, but we hope so, because we have no other ideas so far to solve our issues.
>
> Environment:
> We are running an OpenStack cluster based on CentOS 8 with nearly 50 host systems in total.
> The controller cluster consists of three nodes. Each of these control nodes is connected to a 4-member switch stack by two cables via two dual-port "Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)" cards. These ports have LACP activated and build a "port-channel", as DELL calls it. In other words, we have two cables, coming from different network cards (but the same model), forming a LACP-based bond.
>
> Issue:
> For two months we have had the recurring issue that both interfaces go down, a few seconds one after the other. Sometimes one of the interfaces comes back online and goes down a second time before the bond finally breaks.
> In most cases the links come back in state "link state up" within 2-10 seconds, and then the link aggregation is negotiated again.
> The last time, on 2021-02-24, this happened 15 times within one hour.
>
> This issue happened multiple times out of the blue in the middle of the day. In most cases only on one of the mentioned host machines, sometimes on two. But the last two times it happened at almost the same time on all three machines while we were running a so-called "overcloud deployment", which actually means a rollout or execution of Ansible- and Puppet-based configuration changes on every host system of the whole cluster.
>
> Do not confuse cause and effect:
> We don't know what exactly happens at the time the links start to flap up and down, but we saw a growing load average on these servers.
> It is totally unclear whether the load is the cause or just an effect, because we know that when the nodes no longer see each other, they start all the (not really) lost resources of the other, unseen cluster nodes. This raises the load in any case and can be reproduced. Another problem is that when the link state is "up" again, the nodes can see each other again and the cluster heals itself, the resources are torn down, and then after a couple of minutes or even seconds the issue appears again, everything starts from the beginning, and the load continues to rise.
>
> So we hope you can tell us whether such a higher load average can cause these problems or not.
> Unfortunately our sysstat reporting was set to a 10-minute interval on this day, so we only got these values from the database:
>
>              CPU    %user   %nice  %system  %iowait  %steal   %idle
> 02:10:00 PM  all    18.67    0.00     8.61     0.01    0.00   72.72
> 02:20:00 PM  all    41.89    0.00    35.00     0.00    0.00   23.11
> 02:30:00 PM  all    30.02    0.00    37.79     0.00    0.00   32.19
> 02:40:00 PM  all    53.12    0.00    12.50     0.01    0.00   34.37
> 02:50:00 PM  all    53.57    0.00    12.28     0.01    0.00   34.15
> 03:00:00 PM  all    47.39    0.00    11.99     0.01    0.00   40.61
> ...
> Average:     all    13.40    0.00     6.19     0.01    0.00   80.40
>
> 03:20:00 PM  DEV       tps   rkB/s     wkB/s  areq-sz  aqu-sz  await  svctm  %util
> 02:10:00 PM  dev8-0  142.92    5.41   6676.74    46.75    0.07   0.76   0.20   2.80
> 02:20:00 PM  dev8-0  239.82   12.81   8050.04    33.62    0.09   0.72   0.14   3.27
> 02:30:00 PM  dev8-0  178.67    2.63   5310.33    29.74    0.04   0.61   0.08   1.45
> 02:40:00 PM  dev8-0  318.51    2.07  10603.47    33.30    0.08   0.56   0.17   5.33
> 02:50:00 PM  dev8-0  389.29    0.11  11048.52    28.38    0.11   0.58   0.16   6.23
> 03:00:00 PM  dev8-0  421.79    0.03  11976.57    28.39    0.12   0.65   0.13   5.27
> ...
> Average:     dev8-0  142.71    2.36   7260.18    50.89    0.09   0.90   0.18   2.52
>
> Information / version:
> All hosts are servers with ...
>
> - 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz with 16 threads each, or 32 threads in total
> - 256GB RAM
> - 2x 450GB SSDs in RAID1
>
> $> ovs-vswitchd --version
> ovs-vswitchd (Open vSwitch) 2.12.0
> # No patches applied by us
>
> # Package version
> network-scripts-openvswitch.x86_64   2.12.0-1.1.el8
> openvswitch.x86_64                   2.12.0-1.1.el8
>
> # Package info
> $> sudo yum info openvswitch.x86_64
> Last metadata expiration check: 0:41:55 ago on Tue 02 Mar 2021 08:47:40 PM CET.
> Installed Packages
> Name         : openvswitch
> Version      : 2.12.0
> Release      : 1.1.el8
> Architecture : x86_64
> Size         : 5.7 M
> Source       : openvswitch-2.12.0-1.1.el8.src.rpm
> Repository   : @System
> From repo    : delorean-ussuri-testing
> Summary      : Open vSwitch daemon/database/utilities
> URL          : http://www.openvswitch.org/
> License      : ASL 2.0 and LGPLv2+ and SISSL
> Description  : Open vSwitch provides standard network bridging functions and
>              : support for the OpenFlow protocol for remote per-flow control of
>              : traffic.
>
> $> cat /etc/centos-release
> CentOS Linux release 8.2.2004 (Core)
>
> $> cat /proc/version
> Linux version 4.18.0-193.28.1.el8_2.x86_64 (mockbuild at kbuilder.bsys.centos.org) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Oct 22 00:20:22 UTC 2020
>
> The output of "ovs-dpctl show" is attached to this e-mail.
>
> /etc/openvswitch/conf.db is attached with shrunk content.
>
> Logs:
> We set "ovs-appctl vlog/set file:bond:dbg syslog:bond:dbg file:lacp:dbg syslog:lacp:dbg" before the last occurrence.
> The following lines mark the first "link state down" event in the logs:
>
> 2021-02-24T13:13:59.710Z|98452|bond|INFO|interface enp3s0f1: link state down
> 2021-02-24T13:13:59.710Z|98453|bond|INFO|interface enp3s0f1: disabled
>
> Also interesting: around this first event, messages like "Unreasonably long XXXXms poll interval" also start to appear.

Hi.  This is likely the main issue here, or a symptom of the main issue.
Looking at the logs, I'd say that what is happening is:

1. OVS is overloaded. I don't know why, but you have too many datapath flows installed in the kernel module. See:

   ofproto_dpif_upcall(handler335)|WARN|upcall: datapath flow limit reached

   The datapath flow limit is 200K by default, so you likely have 200K active datapath flows installed in the kernel module. I don't know if this is expected or not, but it is not good.

2. Revalidator threads are trying to find the datapath flows in the kernel that are not needed anymore (were not used for a long time (10 seconds by default) or don't fit the currently configured OF rules). For this purpose the revalidator threads dump all flows from the kernel and check them. Since there are 200K flows in the kernel, this process takes a lot of time.

3. Since the datapath flow table is full, lots of packets that don't match these flows go to userspace for processing. And they keep going to userspace because the handler threads are not able to install datapath flows for them into the kernel, because the flow table in the kernel is full. So the handler threads are very busy, and packet processing in userspace takes a lot of their time:

   ovs_rcu|WARN|blocked 2000 ms waiting for handler332 to quiesce

   The handlers are even losing packets since they cannot keep up with the number of upcalls:

   dpif_netlink(handler344)|WARN|system@ovs-system: lost packet on port channel 5 of handler 22

4. Since the whole of OVS is very busy, it may not send LACP packets in time, or some LACP packets from the external switch might be dropped. In the first case the external switch will decide that the interface is dead and will shut it down, so OVS will see that the port went down. In the other case OVS itself might shut one of the interfaces down, since it will think that it is dead on the side of the external switch. Either way, the interface flapping is not actually the problem. The main problem is the number of datapath flows, which slows down the whole system and makes OVS delay or drop LACP packets.

So, I'd suggest checking the datapath flows and OF rules. If they are all legitimate, you may try to increase the size of the datapath flow table to at least remove some load from the handler threads. It is somewhat configurable with other_config:flow-limit. Another option is to try to decrease the time that idle flows are allowed to stay in the kernel with other_config:max-idle (in ms).
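For example, roughly like this (the numbers below are placeholders only, to show the syntax; pick values that make sense for your setup):

# Allow more datapath flows than the default of 200000 (placeholder value):
$> ovs-vsctl set Open_vSwitch . other_config:flow-limit=400000

# Or let idle datapath flows expire sooner than the default 10000 ms (placeholder value):
$> ovs-vsctl set Open_vSwitch . other_config:max-idle=2000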
Best regards, Ilya Maximets.

> Tasks until now:
> We have worked hard to find out where the root cause may lie. We have updated the firmware of the switches, updated the BIOS of the server systems and the firmware of the network cards. We also changed cables and ports on the switch. Nothing has helped so far.
>
> Questions:
>
> - What do "link state down" and "disabled" (as seen above) mean exactly? Is this a behavior like pulling out the cable, or did the other side bring the link down? Or could this mean the kernel switched the eth port off?
>
> - In any case, is it possible to debug this or to increase the verbosity of the interface-related logs?
>
> - Do you see a correlation between the rising system/CPU load and the links going down?
>
> Kind regards
> Daniel Pfanz

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss