Hi OVN and OVS folks, during OVN upgrade from 20.06 to 21.12 version I’ve faced with a quite long datapath downtime (48-50 seconds). I suppose this downtime was triggered by ovn-controller installing new openflow rules to ovs. There are relatively not so many flows (13-14k). openvswitch version is 2.13.4. ovn-openflow-probe-interval is configured to 30.
The upgrade procedure was: 1. Set to ovs external_ids an option to force matching ovn-northd internal version (ovn-match-northd-version=true) 2. Upgrade ovn-controller (to first get ovn-controller, which supports version match between controller and northd and not to break datapath after upgrading ovn-northd) 3. Upgrade ovn-northd cluster Network outage timeframe matches to ovs-vswitchd WARN logs. Please see next: 2022-02-01T19:44:32.577Z|00035|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for main to quiesce 2022-02-01T19:44:33.577Z|00036|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for main to quiesce 2022-02-01T19:44:35.577Z|00037|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for main to quiesce 2022-02-01T19:44:39.577Z|00038|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for main to quiesce 2022-02-01T19:44:44.396Z|04063|connmgr|INFO|br-int<->unix#272674: 37 flow_mods 13 s ago (36 adds, 1 deletes) 2022-02-01T19:44:44.441Z|04064|timeval|WARN|Unreasonably long 12864ms poll interval (0ms user, 12833ms system) 2022-02-01T19:44:44.441Z|04065|timeval|WARN|faults: 287 minor, 0 major 2022-02-01T19:44:44.441Z|04066|timeval|WARN|context switches: 1 voluntary, 18 involuntary 2022-02-01T19:44:44.441Z|04067|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour, hash=9105faf6: 2022-02-01T19:44:44.441Z|04068|coverage|INFO|ccmap_expand 0.0/sec 0.000/sec 0.0000/sec total: 1129 2022-02-01T19:44:44.441Z|04069|coverage|INFO|ccmap_shrink 0.0/sec 11.933/sec 0.1989/sec total: 2363 2022-02-01T19:44:44.441Z|04070|coverage|INFO|cmap_expand 0.0/sec 0.833/sec 0.8811/sec total: 11218023 2022-02-01T19:44:44.441Z|04071|coverage|INFO|cmap_shrink 8.2/sec 11.950/sec 1.0736/sec total: 11217475 2022-02-01T19:44:44.441Z|04072|coverage|INFO|dpif_port_add 0.0/sec 0.000/sec 0.0000/sec total: 229 2022-02-01T19:44:44.441Z|04073|coverage|INFO|dpif_port_del 0.0/sec 0.000/sec 0.0000/sec total: 144 2022-02-01T19:44:44.441Z|04074|coverage|INFO|dpif_flow_flush 0.0/sec 0.000/sec 0.0000/sec total: 2 2022-02-01T19:44:44.441Z|04075|coverage|INFO|dpif_flow_get 0.0/sec 0.000/sec 0.0000/sec total: 23 2022-02-01T19:44:44.441Z|04076|coverage|INFO|dpif_flow_put 83.0/sec 97.900/sec 80.8306/sec total: 849051748 2022-02-01T19:44:44.441Z|04077|coverage|INFO|dpif_flow_del 147.8/sec 91.733/sec 78.2797/sec total: 810797137 2022-02-01T19:44:44.441Z|04078|coverage|INFO|dpif_execute 66.6/sec 83.767/sec 87.8397/sec total: 958697586 2022-02-01T19:44:44.441Z|04079|coverage|INFO|dpif_execute_with_help 0.0/sec 4.250/sec 4.9892/sec total: 70713908 2022-02-01T19:44:44.441Z|04080|coverage|INFO|flow_extract 169.0/sec 247.017/sec 333.1194/sec total: 7990486004 2022-02-01T19:44:44.441Z|04081|coverage|INFO|miniflow_malloc 0.0/sec 0.183/sec 0.7983/sec total: 5871853 2022-02-01T19:44:44.441Z|04082|coverage|INFO|hindex_pathological 0.0/sec 0.000/sec 0.0000/sec total: 78 2022-02-01T19:44:44.441Z|04083|coverage|INFO|hindex_expand 0.0/sec 0.000/sec 0.0000/sec total: 11 2022-02-01T19:44:44.441Z|04084|coverage|INFO|hmap_pathological 0.0/sec 0.633/sec 0.9331/sec total: 16638117 2022-02-01T19:44:44.441Z|04085|coverage|INFO|hmap_expand 0.0/sec 114.867/sec 139.8006/sec total: 2184263485 2022-02-01T19:44:44.441Z|04086|coverage|INFO|mac_learning_learned 0.0/sec 0.000/sec 0.0003/sec total: 299 2022-02-01T19:44:44.441Z|04087|coverage|INFO|mac_learning_expired 0.0/sec 0.000/sec 0.0006/sec total: 117 2022-02-01T19:44:44.441Z|04088|coverage|INFO|mac_learning_moved 0.0/sec 0.000/sec 0.0000/sec total: 87 2022-02-01T19:44:44.441Z|04089|coverage|INFO|netdev_get_stats 0.0/sec 36.850/sec 43.3306/sec total: 690673778 2022-02-01T19:44:44.441Z|04090|coverage|INFO|txn_unchanged 0.0/sec 0.100/sec 0.1614/sec total: 2705864 2022-02-01T19:44:44.441Z|04091|coverage|INFO|txn_incomplete 0.0/sec 0.183/sec 0.2239/sec total: 3688261 2022-02-01T19:44:44.441Z|04092|coverage|INFO|txn_success 0.0/sec 0.167/sec 0.1953/sec total: 3236818 2022-02-01T19:44:44.441Z|04093|coverage|INFO|poll_create_node 3930.6/sec 3941.350/sec 5012.2572/sec total: 115959738162 2022-02-01T19:44:44.441Z|04094|coverage|INFO|poll_zero_timeout 159.4/sec 111.417/sec 103.4078/sec total: 1181551227 2022-02-01T19:44:44.441Z|04095|coverage|INFO|rconn_queued 0.0/sec 0.050/sec 0.0817/sec total: 1279558 2022-02-01T19:44:44.441Z|04096|coverage|INFO|rconn_sent 0.0/sec 0.050/sec 0.0817/sec total: 1279558 2022-02-01T19:44:44.441Z|04097|coverage|INFO|seq_change 2205.2/sec 2502.783/sec 2874.6258/sec total: 57335390998 2022-02-01T19:44:44.441Z|04098|coverage|INFO|pstream_open 0.0/sec 0.000/sec 0.0000/sec total: 5 2022-02-01T19:44:44.441Z|04099|coverage|INFO|stream_open 0.0/sec 0.000/sec 0.0000/sec total: 1 2022-02-01T19:44:44.441Z|04100|coverage|INFO|unixctl_received 0.0/sec 0.000/sec 0.0000/sec total: 189 2022-02-01T19:44:44.441Z|04101|coverage|INFO|unixctl_replied 0.0/sec 0.000/sec 0.0000/sec total: 189 2022-02-01T19:44:44.441Z|04102|coverage|INFO|util_xalloc 7935.8/sec 15234.350/sec 18372.4347/sec total: 345308271208 2022-02-01T19:44:44.441Z|04103|coverage|INFO|vconn_received 0.0/sec 0.083/sec 0.3556/sec total: 2018930 2022-02-01T19:44:44.441Z|04104|coverage|INFO|vconn_sent 0.0/sec 0.067/sec 0.0983/sec total: 1552067 2022-02-01T19:44:44.441Z|04105|coverage|INFO|netdev_set_policing 0.0/sec 0.000/sec 0.0000/sec total: 623 2022-02-01T19:44:44.441Z|04106|coverage|INFO|netdev_get_ifindex 0.0/sec 0.100/sec 0.1636/sec total: 2713928 2022-02-01T19:44:44.441Z|04107|coverage|INFO|netdev_set_hwaddr 0.0/sec 0.000/sec 0.0000/sec total: 2 2022-02-01T19:44:44.441Z|04108|coverage|INFO|netdev_get_ethtool 0.0/sec 0.000/sec 0.0000/sec total: 1073 2022-02-01T19:44:44.441Z|04109|coverage|INFO|netdev_set_ethtool 0.0/sec 0.000/sec 0.0000/sec total: 225 2022-02-01T19:44:44.441Z|04110|coverage|INFO|nln_changed 0.0/sec 0.050/sec 0.0492/sec total: 819614 2022-02-01T19:44:44.441Z|04111|coverage|INFO|netlink_received 346.6/sec 472.133/sec 575.4597/sec total: 11288088779 2022-02-01T19:44:44.441Z|04112|coverage|INFO|netlink_recv_jumbo 18.4/sec 113.250/sec 175.1189/sec total: 3935643810 2022-02-01T19:44:44.441Z|04113|coverage|INFO|netlink_sent 308.4/sec 378.133/sec 380.4267/sec total: 4753114969 2022-02-01T19:44:44.441Z|04114|coverage|INFO|ofproto_flush 0.0/sec 0.000/sec 0.0000/sec total: 1 2022-02-01T19:44:44.441Z|04115|coverage|INFO|ofproto_recv_openflow 0.0/sec 0.067/sec 0.3389/sec total: 1746449 2022-02-01T19:44:44.441Z|04116|coverage|INFO|ofproto_update_port 0.0/sec 0.000/sec 0.0000/sec total: 814 2022-02-01T19:44:44.441Z|04117|coverage|INFO|packet_in_overflow 0.0/sec 0.000/sec 0.0000/sec total: 282520 2022-02-01T19:44:44.441Z|04118|coverage|INFO|rev_reconfigure 0.0/sec 0.017/sec 0.0325/sec total: 539904 2022-02-01T19:44:44.441Z|04119|coverage|INFO|rev_port_toggled 0.0/sec 0.000/sec 0.0000/sec total: 27 2022-02-01T19:44:44.441Z|04120|coverage|INFO|rev_flow_table 0.0/sec 0.000/sec 0.0106/sec total: 10876 2022-02-01T19:44:44.441Z|04121|coverage|INFO|rev_mac_learning 0.0/sec 0.000/sec 0.0006/sec total: 312 2022-02-01T19:44:44.441Z|04122|coverage|INFO|dumped_duplicate_flow 0.0/sec 0.000/sec 0.0000/sec total: 39204 2022-02-01T19:44:44.441Z|04123|coverage|INFO|dumped_new_flow 0.0/sec 0.000/sec 0.0000/sec total: 7177 2022-02-01T19:44:44.441Z|04124|coverage|INFO|handler_duplicate_upcall 0.4/sec 1.833/sec 2.2664/sec total: 34762798 2022-02-01T19:44:44.441Z|04125|coverage|INFO|upcall_ukey_contention 0.0/sec 0.000/sec 0.0000/sec total: 9 2022-02-01T19:44:44.441Z|04126|coverage|INFO|upcall_ukey_replace 0.2/sec 0.050/sec 0.0447/sec total: 7927221 2022-02-01T19:44:44.441Z|04127|coverage|INFO|revalidate_missed_dp_flow 0.0/sec 0.000/sec 0.0000/sec total: 1118 2022-02-01T19:44:44.441Z|04128|coverage|INFO|xlate_actions 123.6/sec 273.183/sec 233.8275/sec total: 2110727223 2022-02-01T19:44:44.441Z|04129|coverage|INFO|bridge_reconfigure 0.0/sec 0.017/sec 0.0325/sec total: 539902 2022-02-01T19:44:44.441Z|04130|coverage|INFO|74 events never hit 2022-02-01T19:44:44.441Z|04131|poll_loop|INFO|Dropped 3 log messages in last 1058229 seconds (most recently, 1058228 seconds ago) due to excessive rate 2022-02-01T19:44:44.441Z|04132|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (84% CPU usage) 2022-02-01T19:44:45.441Z|00039|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for main to quiesce 2022-02-01T19:44:46.441Z|00040|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for main to quiesce 2022-02-01T19:44:48.441Z|00041|ovs_rcu(urcu7)|WARN|blocked 4001 ms waiting for main to quiesce 2022-02-01T19:44:52.441Z|00042|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for main to quiesce 2022-02-01T19:44:55.109Z|04133|timeval|WARN|Unreasonably long 10668ms poll interval (0ms user, 10643ms system) 2022-02-01T19:44:55.109Z|04134|timeval|WARN|faults: 70 minor, 0 major 2022-02-01T19:44:55.109Z|04135|timeval|WARN|context switches: 0 voluntary, 4 involuntary 2022-02-01T19:44:55.109Z|04136|coverage|INFO|Skipping details of duplicate event coverage for hash=9105faf6 2022-02-01T19:44:55.109Z|04137|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:44:55.109Z|04138|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:44:56.112Z|00043|ovs_rcu(urcu7)|WARN|blocked 1001 ms waiting for main to quiesce 2022-02-01T19:44:57.112Z|00044|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for main to quiesce 2022-02-01T19:44:59.112Z|00045|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for main to quiesce 2022-02-01T19:45:03.112Z|00046|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for main to quiesce 2022-02-01T19:45:05.249Z|04139|timeval|WARN|Unreasonably long 10140ms poll interval (0ms user, 10115ms system) 2022-02-01T19:45:05.249Z|04140|timeval|WARN|faults: 248 minor, 0 major 2022-02-01T19:45:05.249Z|04141|timeval|WARN|context switches: 1 voluntary, 27 involuntary 2022-02-01T19:45:05.249Z|04142|coverage|INFO|Skipping details of duplicate event coverage for hash=9105faf6 2022-02-01T19:45:05.249Z|04143|poll_loop|INFO|wakeup due to [POLLIN] on fd 14 (NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (100% CPU usage) 2022-02-01T19:45:05.249Z|04144|poll_loop|INFO|wakeup due to [POLLOUT] on fd 235 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:153 (100% CPU usage) 2022-02-01T19:45:05.249Z|04145|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:45:06.250Z|00047|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for main to quiesce 2022-02-01T19:45:07.250Z|00048|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for main to quiesce 2022-02-01T19:45:09.250Z|00049|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for main to quiesce 2022-02-01T19:45:13.250Z|00050|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for main to quiesce 2022-02-01T19:45:15.704Z|04146|rconn|WARN|br-int<->unix#272675: connection dropped (Broken pipe) 2022-02-01T19:45:16.714Z|00051|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for main to quiesce 2022-02-01T19:45:17.713Z|00052|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for main to quiesce 2022-02-01T19:45:19.714Z|00053|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for main to quiesce 2022-02-01T19:45:23.014Z|04147|timeval|WARN|Unreasonably long 17764ms poll interval (0ms user, 17720ms system) 2022-02-01T19:45:23.014Z|04148|timeval|WARN|faults: 63 minor, 0 major 2022-02-01T19:45:23.014Z|04149|timeval|WARN|context switches: 2 voluntary, 24 involuntary 2022-02-01T19:45:23.014Z|04150|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:45:23.014Z|04151|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:45:23.021Z|04152|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:45:23.022Z|04153|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:45:23.022Z|04154|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:45:23.023Z|04155|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:45:23.023Z|04156|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:45:23.023Z|04157|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:45:23.024Z|04158|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:45:23.024Z|04159|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage) 2022-02-01T19:45:41.274Z|04160|connmgr|INFO|br-int<->unix#272674: 13589 flow_mods 18 s ago (13509 adds, 80 modifications) 2022-02-01T19:46:31.285Z|04161|rconn|ERR|br-int<->unix#272673: no response to inactivity probe after 60 seconds, disconnecting In ovn-controller log: 2022-02-01T19:44:31.267Z|00077|features|INFO|unix:/run/openvswitch/br-int.mgmt: connecting to switch 2022-02-01T19:44:31.267Z|00078|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connecting... 2022-02-01T19:44:31.267Z|00079|ofctrl|INFO|unix:/run/openvswitch/br-int.mgmt: connecting to switch 2022-02-01T19:44:31.267Z|00080|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connecting... 2022-02-01T19:44:31.268Z|00081|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connected 2022-02-01T19:44:31.269Z|00082|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connected 2022-02-01T19:44:31.284Z|00083|main|INFO|OVS feature set changed, force recompute. 2022-02-01T19:44:34.074Z|00084|main|INFO|OVNSB commit failed, force recompute next time. 2022-02-01T19:44:34.621Z|00085|reconnect|INFO|ssl:<IP>16.0.61:6642: connection closed by peer 2022-02-01T19:44:35.622Z|00086|reconnect|INFO|ssl:<IP>:6642: connecting... 2022-02-01T19:44:35.672Z|00087|reconnect|INFO|ssl:<IP>:6642: connected 2022-02-01T19:45:01.470Z|00088|lflow_cache|INFO|Detected cache inactivity (last active 30001 ms ago): trimming cache 2022-02-01T19:45:23.351Z|00089|binding|INFO|Setting lport <lport> ovn-installed in OVS 2022-02-01T19:45:23.351Z|00090|binding|INFO|Setting lport <lport> up in Southbound <there are 80 port bindings...> 2022-02-01T19:45:23.352Z|00245|binding|INFO|Setting lport <lportN> ovn-installed in OVS 2022-02-01T19:45:23.352Z|00246|binding|INFO|Setting lport <lportN> up in Southbound 2022-02-01T19:45:53.323Z|00247|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache I’ve got some questions: 1. What OVS is doing while `waiting for main to quiesce` ? I noticed this message appeares each time downtime occurs. 2. Is there any hint to minimize such downtime while reinstalling OF? 3. Could there be a situation where ovn-northd after upgrade still generating new logical flows, ovn-controller constantly gets updates from SB and triggers update to OF in a loop while doesn’t finish getting final lflows "view"? So, could it start installing new OF while having in memory lflows which are not consistent at that moment? 4. What does `br-int<->unix#272675: connection dropped (Broken pipe)` error means? What type of connection is broken? Is it normal? 5. Do you see anything else abnormal in the logs? Let me know if I have to give more inputs... Thanks. Regards, Vladislav Odintsov _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev