Hi OVN and OVS folks,

during OVN upgrade from 20.06 to 21.12 version I’ve faced with a quite long 
datapath downtime (48-50 seconds).
I suppose this downtime was triggered by ovn-controller installing new openflow 
rules to ovs.
There are relatively not so many flows (13-14k).
openvswitch version is 2.13.4.
ovn-openflow-probe-interval is configured to 30.

The upgrade procedure was:
1. Set to ovs external_ids an option to force matching ovn-northd internal 
version (ovn-match-northd-version=true)
2. Upgrade ovn-controller (to first get ovn-controller, which supports version 
match between controller and northd and not to break datapath after upgrading 
ovn-northd)
3. Upgrade ovn-northd cluster

Network outage timeframe matches to ovs-vswitchd WARN logs. Please see next:

2022-02-01T19:44:32.577Z|00035|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for 
main to quiesce
2022-02-01T19:44:33.577Z|00036|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for 
main to quiesce
2022-02-01T19:44:35.577Z|00037|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for 
main to quiesce
2022-02-01T19:44:39.577Z|00038|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for 
main to quiesce
2022-02-01T19:44:44.396Z|04063|connmgr|INFO|br-int<->unix#272674: 37 flow_mods 
13 s ago (36 adds, 1 deletes)
2022-02-01T19:44:44.441Z|04064|timeval|WARN|Unreasonably long 12864ms poll 
interval (0ms user, 12833ms system)
2022-02-01T19:44:44.441Z|04065|timeval|WARN|faults: 287 minor, 0 major
2022-02-01T19:44:44.441Z|04066|timeval|WARN|context switches: 1 voluntary, 18 
involuntary
2022-02-01T19:44:44.441Z|04067|coverage|INFO|Event coverage, avg rate over 
last: 5 seconds, last minute, last hour,  hash=9105faf6:
2022-02-01T19:44:44.441Z|04068|coverage|INFO|ccmap_expand               0.0/sec 
    0.000/sec        0.0000/sec   total: 1129
2022-02-01T19:44:44.441Z|04069|coverage|INFO|ccmap_shrink               0.0/sec 
   11.933/sec        0.1989/sec   total: 2363
2022-02-01T19:44:44.441Z|04070|coverage|INFO|cmap_expand                0.0/sec 
    0.833/sec        0.8811/sec   total: 11218023
2022-02-01T19:44:44.441Z|04071|coverage|INFO|cmap_shrink                8.2/sec 
   11.950/sec        1.0736/sec   total: 11217475
2022-02-01T19:44:44.441Z|04072|coverage|INFO|dpif_port_add              0.0/sec 
    0.000/sec        0.0000/sec   total: 229
2022-02-01T19:44:44.441Z|04073|coverage|INFO|dpif_port_del              0.0/sec 
    0.000/sec        0.0000/sec   total: 144
2022-02-01T19:44:44.441Z|04074|coverage|INFO|dpif_flow_flush            0.0/sec 
    0.000/sec        0.0000/sec   total: 2
2022-02-01T19:44:44.441Z|04075|coverage|INFO|dpif_flow_get              0.0/sec 
    0.000/sec        0.0000/sec   total: 23
2022-02-01T19:44:44.441Z|04076|coverage|INFO|dpif_flow_put             83.0/sec 
   97.900/sec       80.8306/sec   total: 849051748
2022-02-01T19:44:44.441Z|04077|coverage|INFO|dpif_flow_del            147.8/sec 
   91.733/sec       78.2797/sec   total: 810797137
2022-02-01T19:44:44.441Z|04078|coverage|INFO|dpif_execute              66.6/sec 
   83.767/sec       87.8397/sec   total: 958697586
2022-02-01T19:44:44.441Z|04079|coverage|INFO|dpif_execute_with_help     0.0/sec 
    4.250/sec        4.9892/sec   total: 70713908
2022-02-01T19:44:44.441Z|04080|coverage|INFO|flow_extract             169.0/sec 
  247.017/sec      333.1194/sec   total: 7990486004
2022-02-01T19:44:44.441Z|04081|coverage|INFO|miniflow_malloc            0.0/sec 
    0.183/sec        0.7983/sec   total: 5871853
2022-02-01T19:44:44.441Z|04082|coverage|INFO|hindex_pathological        0.0/sec 
    0.000/sec        0.0000/sec   total: 78
2022-02-01T19:44:44.441Z|04083|coverage|INFO|hindex_expand              0.0/sec 
    0.000/sec        0.0000/sec   total: 11
2022-02-01T19:44:44.441Z|04084|coverage|INFO|hmap_pathological          0.0/sec 
    0.633/sec        0.9331/sec   total: 16638117
2022-02-01T19:44:44.441Z|04085|coverage|INFO|hmap_expand                0.0/sec 
  114.867/sec      139.8006/sec   total: 2184263485
2022-02-01T19:44:44.441Z|04086|coverage|INFO|mac_learning_learned       0.0/sec 
    0.000/sec        0.0003/sec   total: 299
2022-02-01T19:44:44.441Z|04087|coverage|INFO|mac_learning_expired       0.0/sec 
    0.000/sec        0.0006/sec   total: 117
2022-02-01T19:44:44.441Z|04088|coverage|INFO|mac_learning_moved         0.0/sec 
    0.000/sec        0.0000/sec   total: 87
2022-02-01T19:44:44.441Z|04089|coverage|INFO|netdev_get_stats           0.0/sec 
   36.850/sec       43.3306/sec   total: 690673778
2022-02-01T19:44:44.441Z|04090|coverage|INFO|txn_unchanged              0.0/sec 
    0.100/sec        0.1614/sec   total: 2705864
2022-02-01T19:44:44.441Z|04091|coverage|INFO|txn_incomplete             0.0/sec 
    0.183/sec        0.2239/sec   total: 3688261
2022-02-01T19:44:44.441Z|04092|coverage|INFO|txn_success                0.0/sec 
    0.167/sec        0.1953/sec   total: 3236818
2022-02-01T19:44:44.441Z|04093|coverage|INFO|poll_create_node         
3930.6/sec  3941.350/sec     5012.2572/sec  total: 115959738162
2022-02-01T19:44:44.441Z|04094|coverage|INFO|poll_zero_timeout        159.4/sec 
  111.417/sec      103.4078/sec   total: 1181551227
2022-02-01T19:44:44.441Z|04095|coverage|INFO|rconn_queued               0.0/sec 
    0.050/sec        0.0817/sec   total: 1279558
2022-02-01T19:44:44.441Z|04096|coverage|INFO|rconn_sent                 0.0/sec 
    0.050/sec        0.0817/sec   total: 1279558
2022-02-01T19:44:44.441Z|04097|coverage|INFO|seq_change               
2205.2/sec  2502.783/sec     2874.6258/sec  total: 57335390998
2022-02-01T19:44:44.441Z|04098|coverage|INFO|pstream_open               0.0/sec 
    0.000/sec        0.0000/sec   total: 5
2022-02-01T19:44:44.441Z|04099|coverage|INFO|stream_open                0.0/sec 
    0.000/sec        0.0000/sec   total: 1
2022-02-01T19:44:44.441Z|04100|coverage|INFO|unixctl_received           0.0/sec 
    0.000/sec        0.0000/sec   total: 189
2022-02-01T19:44:44.441Z|04101|coverage|INFO|unixctl_replied            0.0/sec 
    0.000/sec        0.0000/sec   total: 189
2022-02-01T19:44:44.441Z|04102|coverage|INFO|util_xalloc              
7935.8/sec 15234.350/sec    18372.4347/sec  total: 345308271208
2022-02-01T19:44:44.441Z|04103|coverage|INFO|vconn_received             0.0/sec 
    0.083/sec        0.3556/sec   total: 2018930
2022-02-01T19:44:44.441Z|04104|coverage|INFO|vconn_sent                 0.0/sec 
    0.067/sec        0.0983/sec   total: 1552067
2022-02-01T19:44:44.441Z|04105|coverage|INFO|netdev_set_policing        0.0/sec 
    0.000/sec        0.0000/sec   total: 623
2022-02-01T19:44:44.441Z|04106|coverage|INFO|netdev_get_ifindex         0.0/sec 
    0.100/sec        0.1636/sec   total: 2713928
2022-02-01T19:44:44.441Z|04107|coverage|INFO|netdev_set_hwaddr          0.0/sec 
    0.000/sec        0.0000/sec   total: 2
2022-02-01T19:44:44.441Z|04108|coverage|INFO|netdev_get_ethtool         0.0/sec 
    0.000/sec        0.0000/sec   total: 1073
2022-02-01T19:44:44.441Z|04109|coverage|INFO|netdev_set_ethtool         0.0/sec 
    0.000/sec        0.0000/sec   total: 225
2022-02-01T19:44:44.441Z|04110|coverage|INFO|nln_changed                0.0/sec 
    0.050/sec        0.0492/sec   total: 819614
2022-02-01T19:44:44.441Z|04111|coverage|INFO|netlink_received         346.6/sec 
  472.133/sec      575.4597/sec   total: 11288088779
2022-02-01T19:44:44.441Z|04112|coverage|INFO|netlink_recv_jumbo        18.4/sec 
  113.250/sec      175.1189/sec   total: 3935643810
2022-02-01T19:44:44.441Z|04113|coverage|INFO|netlink_sent             308.4/sec 
  378.133/sec      380.4267/sec   total: 4753114969
2022-02-01T19:44:44.441Z|04114|coverage|INFO|ofproto_flush              0.0/sec 
    0.000/sec        0.0000/sec   total: 1
2022-02-01T19:44:44.441Z|04115|coverage|INFO|ofproto_recv_openflow      0.0/sec 
    0.067/sec        0.3389/sec   total: 1746449
2022-02-01T19:44:44.441Z|04116|coverage|INFO|ofproto_update_port        0.0/sec 
    0.000/sec        0.0000/sec   total: 814
2022-02-01T19:44:44.441Z|04117|coverage|INFO|packet_in_overflow         0.0/sec 
    0.000/sec        0.0000/sec   total: 282520
2022-02-01T19:44:44.441Z|04118|coverage|INFO|rev_reconfigure            0.0/sec 
    0.017/sec        0.0325/sec   total: 539904
2022-02-01T19:44:44.441Z|04119|coverage|INFO|rev_port_toggled           0.0/sec 
    0.000/sec        0.0000/sec   total: 27
2022-02-01T19:44:44.441Z|04120|coverage|INFO|rev_flow_table             0.0/sec 
    0.000/sec        0.0106/sec   total: 10876
2022-02-01T19:44:44.441Z|04121|coverage|INFO|rev_mac_learning           0.0/sec 
    0.000/sec        0.0006/sec   total: 312
2022-02-01T19:44:44.441Z|04122|coverage|INFO|dumped_duplicate_flow      0.0/sec 
    0.000/sec        0.0000/sec   total: 39204
2022-02-01T19:44:44.441Z|04123|coverage|INFO|dumped_new_flow            0.0/sec 
    0.000/sec        0.0000/sec   total: 7177
2022-02-01T19:44:44.441Z|04124|coverage|INFO|handler_duplicate_upcall   0.4/sec 
    1.833/sec        2.2664/sec   total: 34762798
2022-02-01T19:44:44.441Z|04125|coverage|INFO|upcall_ukey_contention     0.0/sec 
    0.000/sec        0.0000/sec   total: 9
2022-02-01T19:44:44.441Z|04126|coverage|INFO|upcall_ukey_replace        0.2/sec 
    0.050/sec        0.0447/sec   total: 7927221
2022-02-01T19:44:44.441Z|04127|coverage|INFO|revalidate_missed_dp_flow   
0.0/sec     0.000/sec        0.0000/sec   total: 1118
2022-02-01T19:44:44.441Z|04128|coverage|INFO|xlate_actions            123.6/sec 
  273.183/sec      233.8275/sec   total: 2110727223
2022-02-01T19:44:44.441Z|04129|coverage|INFO|bridge_reconfigure         0.0/sec 
    0.017/sec        0.0325/sec   total: 539902
2022-02-01T19:44:44.441Z|04130|coverage|INFO|74 events never hit
2022-02-01T19:44:44.441Z|04131|poll_loop|INFO|Dropped 3 log messages in last 
1058229 seconds (most recently, 1058228 seconds ago) due to excessive rate
2022-02-01T19:44:44.441Z|04132|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (84% CPU usage)
2022-02-01T19:44:45.441Z|00039|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for 
main to quiesce
2022-02-01T19:44:46.441Z|00040|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for 
main to quiesce
2022-02-01T19:44:48.441Z|00041|ovs_rcu(urcu7)|WARN|blocked 4001 ms waiting for 
main to quiesce
2022-02-01T19:44:52.441Z|00042|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for 
main to quiesce
2022-02-01T19:44:55.109Z|04133|timeval|WARN|Unreasonably long 10668ms poll 
interval (0ms user, 10643ms system)
2022-02-01T19:44:55.109Z|04134|timeval|WARN|faults: 70 minor, 0 major
2022-02-01T19:44:55.109Z|04135|timeval|WARN|context switches: 0 voluntary, 4 
involuntary
2022-02-01T19:44:55.109Z|04136|coverage|INFO|Skipping details of duplicate 
event coverage for hash=9105faf6
2022-02-01T19:44:55.109Z|04137|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 
(<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:44:55.109Z|04138|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:44:56.112Z|00043|ovs_rcu(urcu7)|WARN|blocked 1001 ms waiting for 
main to quiesce
2022-02-01T19:44:57.112Z|00044|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for 
main to quiesce
2022-02-01T19:44:59.112Z|00045|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for 
main to quiesce
2022-02-01T19:45:03.112Z|00046|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for 
main to quiesce
2022-02-01T19:45:05.249Z|04139|timeval|WARN|Unreasonably long 10140ms poll 
interval (0ms user, 10115ms system)
2022-02-01T19:45:05.249Z|04140|timeval|WARN|faults: 248 minor, 0 major
2022-02-01T19:45:05.249Z|04141|timeval|WARN|context switches: 1 voluntary, 27 
involuntary
2022-02-01T19:45:05.249Z|04142|coverage|INFO|Skipping details of duplicate 
event coverage for hash=9105faf6
2022-02-01T19:45:05.249Z|04143|poll_loop|INFO|wakeup due to [POLLIN] on fd 14 
(NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (100% CPU usage)
2022-02-01T19:45:05.249Z|04144|poll_loop|INFO|wakeup due to [POLLOUT] on fd 235 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:153 (100% CPU usage)
2022-02-01T19:45:05.249Z|04145|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:06.250Z|00047|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for 
main to quiesce
2022-02-01T19:45:07.250Z|00048|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for 
main to quiesce
2022-02-01T19:45:09.250Z|00049|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for 
main to quiesce
2022-02-01T19:45:13.250Z|00050|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting for 
main to quiesce
2022-02-01T19:45:15.704Z|04146|rconn|WARN|br-int<->unix#272675: connection 
dropped (Broken pipe)
2022-02-01T19:45:16.714Z|00051|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting for 
main to quiesce
2022-02-01T19:45:17.713Z|00052|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting for 
main to quiesce
2022-02-01T19:45:19.714Z|00053|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting for 
main to quiesce
2022-02-01T19:45:23.014Z|04147|timeval|WARN|Unreasonably long 17764ms poll 
interval (0ms user, 17720ms system)
2022-02-01T19:45:23.014Z|04148|timeval|WARN|faults: 63 minor, 0 major
2022-02-01T19:45:23.014Z|04149|timeval|WARN|context switches: 2 voluntary, 24 
involuntary
2022-02-01T19:45:23.014Z|04150|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 
(<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.014Z|04151|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.021Z|04152|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.022Z|04153|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.022Z|04154|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.023Z|04155|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.023Z|04156|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.023Z|04157|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.024Z|04158|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:23.024Z|04159|poll_loop|INFO|wakeup due to [POLLIN] on fd 230 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-01T19:45:41.274Z|04160|connmgr|INFO|br-int<->unix#272674: 13589 
flow_mods 18 s ago (13509 adds, 80 modifications)
2022-02-01T19:46:31.285Z|04161|rconn|ERR|br-int<->unix#272673: no response to 
inactivity probe after 60 seconds, disconnecting


In ovn-controller log:

2022-02-01T19:44:31.267Z|00077|features|INFO|unix:/run/openvswitch/br-int.mgmt: 
connecting to switch
2022-02-01T19:44:31.267Z|00078|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
connecting...
2022-02-01T19:44:31.267Z|00079|ofctrl|INFO|unix:/run/openvswitch/br-int.mgmt: 
connecting to switch
2022-02-01T19:44:31.267Z|00080|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
connecting...
2022-02-01T19:44:31.268Z|00081|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
connected
2022-02-01T19:44:31.269Z|00082|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
connected
2022-02-01T19:44:31.284Z|00083|main|INFO|OVS feature set changed, force 
recompute.
2022-02-01T19:44:34.074Z|00084|main|INFO|OVNSB commit failed, force recompute 
next time.
2022-02-01T19:44:34.621Z|00085|reconnect|INFO|ssl:<IP>16.0.61:6642: connection 
closed by peer
2022-02-01T19:44:35.622Z|00086|reconnect|INFO|ssl:<IP>:6642: connecting...
2022-02-01T19:44:35.672Z|00087|reconnect|INFO|ssl:<IP>:6642: connected
2022-02-01T19:45:01.470Z|00088|lflow_cache|INFO|Detected cache inactivity (last 
active 30001 ms ago): trimming cache
2022-02-01T19:45:23.351Z|00089|binding|INFO|Setting lport <lport> ovn-installed 
in OVS
2022-02-01T19:45:23.351Z|00090|binding|INFO|Setting lport <lport> up in 
Southbound
<there are 80 port bindings...>
2022-02-01T19:45:23.352Z|00245|binding|INFO|Setting lport <lportN> 
ovn-installed in OVS
2022-02-01T19:45:23.352Z|00246|binding|INFO|Setting lport <lportN> up in 
Southbound
2022-02-01T19:45:53.323Z|00247|lflow_cache|INFO|Detected cache inactivity (last 
active 30004 ms ago): trimming cache


I’ve got some questions:
1. What OVS is doing while `waiting for main to quiesce` ? I noticed this 
message appeares each time downtime occurs.
2. Is there any hint to minimize such downtime while reinstalling OF?
3. Could there be a situation where ovn-northd after upgrade still generating 
new logical flows, ovn-controller constantly gets updates from SB and triggers 
update to OF in a loop while doesn’t finish getting final lflows "view"? So, 
could it start installing new OF while having in memory lflows which are not 
consistent at that moment?
4. What does `br-int<->unix#272675: connection dropped (Broken pipe)` error 
means? What type of connection is broken? Is it normal?
5. Do you see anything else abnormal in the logs?

Let me know if I have to give more inputs...

Thanks.

Regards,
Vladislav Odintsov

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to