Hi Ilya, thanks for the detailed answer. I tried to just "soft-restart" ovn-controller, and it seems I found the problem that could block OVS.
What I've done:
1. ovs-appctl -t ovs-vswitchd vlog/set file:vconn:dbg
2. ovn-appctl -t ovn-controller exit --restart
3. systemctl restart ovn-controller

In the OVS log I saw a lot of flow_mod lines, and CT_FLUSH_ZONE messages:

2022-02-03T10:43:18.956Z|01574|vconn|DBG|unix#264996: sent (Success): OFPT_HELLO (OF1.5) (xid=0x40a6b): version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
2022-02-03T10:43:18.956Z|01575|vconn|DBG|unix#264996: received: OFPT_HELLO (OF1.5) (xid=0x1): version bitmap: 0x06
2022-02-03T10:43:18.956Z|01576|vconn|DBG|unix#264996: negotiated OpenFlow version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
2022-02-03T10:43:18.956Z|01577|vconn|DBG|unix#264996: received: OFPT_GET_CONFIG_REQUEST (OF1.5) (xid=0x2):
2022-02-03T10:43:18.956Z|01578|vconn|DBG|unix#264996: sent (Success): OFPT_GET_CONFIG_REPLY (OF1.5) (xid=0x2): frags=normal miss_send_len=0
2022-02-03T10:43:18.956Z|01579|vconn|DBG|unix#264996: received: NXT_SET_PACKET_IN_FORMAT (OF1.5) (xid=0x3): format=nxt_packet_in2
2022-02-03T10:43:18.956Z|01580|vconn|DBG|unix#264996: received: OFPT_SET_CONFIG (OF1.5) (xid=0x4): frags=normal miss_send_len=65535
2022-02-03T10:43:19.123Z|01581|vconn|DBG|unix#264997: sent (Success): OFPT_HELLO (OF1.5) (xid=0x40a6c): version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
2022-02-03T10:43:19.123Z|01582|vconn|DBG|unix#264997: received: OFPT_HELLO (OF1.5) (xid=0x5): version bitmap: 0x06
2022-02-03T10:43:19.123Z|01583|vconn|DBG|unix#264997: negotiated OpenFlow version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
2022-02-03T10:43:19.123Z|01584|vconn|DBG|unix#264998: sent (Success): OFPT_HELLO (OF1.5) (xid=0x40a6d): version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
2022-02-03T10:43:19.123Z|01585|vconn|DBG|unix#264998: received: OFPT_HELLO (OF1.5) (xid=0x6): version bitmap: 0x06
2022-02-03T10:43:19.123Z|01586|vconn|DBG|unix#264998: negotiated OpenFlow version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
2022-02-03T10:43:19.124Z|01587|vconn|DBG|unix#264998: received: NXT_TLV_TABLE_REQUEST (OF1.5) (xid=0x7):
2022-02-03T10:43:19.124Z|01588|vconn|DBG|unix#264998: sent (Success): NXT_TLV_TABLE_REPLY (OF1.5) (xid=0x7):
 max option space=256 max fields=64
 allocated option space=4
 mapping table:
  class  type  length  match field
 ------  ----  ------  --------------
  0x102  0x80       4  tun_metadata0
2022-02-03T10:43:19.124Z|01589|vconn|DBG|unix#264998: received: OFPT_FLOW_MOD (OF1.5) (xid=0x8): DEL table:255 priority=0 actions=drop
2022-02-03T10:43:19.126Z|01590|vconn|DBG|unix#264998: received: OFPT_GROUP_MOD (OF1.5) (xid=0x9): DEL group_id=4294967292,type=all
2022-02-03T10:43:19.126Z|01591|vconn|DBG|unix#264998: received: OFPT_METER_MOD (OF1.5) (xid=0xa): DEL meter=all bands=
2022-02-03T10:43:19.126Z|01592|vconn|DBG|unix#264998: received: OFPT_BUNDLE_CONTROL (OF1.5) (xid=0xb): bundle_id=0 type=OPEN_REQUEST flags=atomic ordered
2022-02-03T10:43:19.126Z|01593|vconn|DBG|unix#264998: sent (Success): OFPT_BUNDLE_CONTROL (OF1.5) (xid=0xb): bundle_id=0 type=OPEN_REPLY flags=0
2022-02-03T10:43:19.126Z|01594|vconn|DBG|unix#264998: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xc): bundle_id=0 flags=atomic ordered
 OFPT_FLOW_MOD (OF1.5) (xid=0xc): ADD priority=100,in_port=98 actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
2022-02-03T10:43:19.126Z|01595|vconn|DBG|unix#264998: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xd): bundle_id=0 flags=atomic ordered
 OFPT_FLOW_MOD (OF1.5) (xid=0xd): ADD priority=100,in_port=96 actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
2022-02-03T10:43:19.126Z|01596|vconn|DBG|unix#264998: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xe): bundle_id=0 flags=atomic ordered
 OFPT_FLOW_MOD (OF1.5) (xid=0xe): ADD priority=100,in_port=7 actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
2022-02-03T10:43:19.126Z|01597|vconn|DBG|unix#264998: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xf): bundle_id=0 flags=atomic ordered
 OFPT_FLOW_MOD (OF1.5) (xid=0xf): ADD priority=100,in_port=6 actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
2022-02-03T10:43:19.126Z|01598|vconn|DBG|unix#264998: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x10): bundle_id=0 flags=atomic ordered
 OFPT_FLOW_MOD (OF1.5) (xid=0x10): ADD priority=100,in_port=55 actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
2022-02-03T10:43:19.126Z|01599|vconn|DBG|unix#264998: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x11): bundle_id=0 flags=atomic ordered
….
2022-02-03T10:43:26.293Z|01758|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6b): zone_id=52
2022-02-03T10:43:26.424Z|01759|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6c): zone_id=55
2022-02-03T10:43:26.554Z|01760|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6d): zone_id=54
2022-02-03T10:43:26.685Z|01761|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6e): zone_id=58
2022-02-03T10:43:26.816Z|01762|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6f): zone_id=57
2022-02-03T10:43:26.946Z|01763|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x70): zone_id=56
2022-02-03T10:43:27.031Z|00013|ovs_rcu(urcu2)|WARN|blocked 1000 ms waiting for main to quiesce
2022-02-03T10:43:27.076Z|01764|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x71): zone_id=59
2022-02-03T10:43:27.206Z|01765|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x72): zone_id=61
2022-02-03T10:43:27.336Z|01766|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x73): zone_id=60
2022-02-03T10:43:27.465Z|01767|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x74): zone_id=62
2022-02-03T10:43:27.594Z|01768|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x75): zone_id=66
2022-02-03T10:43:27.726Z|01769|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x76): zone_id=65
2022-02-03T10:43:27.854Z|01770|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x77): zone_id=64
2022-02-03T10:43:27.984Z|01771|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x78): zone_id=63
2022-02-03T10:43:28.031Z|00014|ovs_rcu(urcu2)|WARN|blocked 2000 ms waiting for main to quiesce
2022-02-03T10:43:28.111Z|01772|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x79): zone_id=67
2022-02-03T10:43:28.238Z|01773|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7a): zone_id=68
2022-02-03T10:43:28.371Z|01774|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7b): zone_id=69
2022-02-03T10:43:28.496Z|01775|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7c): zone_id=72
2022-02-03T10:43:28.622Z|01776|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7d): zone_id=71
2022-02-03T10:43:28.746Z|01777|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7e): zone_id=70
2022-02-03T10:43:28.871Z|01778|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7f): zone_id=75
2022-02-03T10:43:29.000Z|01779|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x80): zone_id=74
2022-02-03T10:43:29.123Z|01780|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x81): zone_id=73
2022-02-03T10:43:29.246Z|01781|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x82): zone_id=76
2022-02-03T10:43:29.371Z|01782|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x83): zone_id=78
2022-02-03T10:43:29.495Z|01783|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x84): zone_id=77
2022-02-03T10:43:29.618Z|01784|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x85): zone_id=81
2022-02-03T10:43:29.746Z|01785|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x86): zone_id=80
2022-02-03T10:43:29.867Z|01786|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x87): zone_id=79
2022-02-03T10:43:29.988Z|01787|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x88): zone_id=84
2022-02-03T10:43:30.031Z|00015|ovs_rcu(urcu2)|WARN|blocked 4000 ms waiting for main to quiesce
2022-02-03T10:43:30.108Z|01788|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x89): zone_id=83
2022-02-03T10:43:30.229Z|01789|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8a): zone_id=82
2022-02-03T10:43:30.349Z|01790|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8b): zone_id=86
2022-02-03T10:43:30.470Z|01791|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8c): zone_id=85
2022-02-03T10:43:30.591Z|01792|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8d): zone_id=89
2022-02-03T10:43:30.712Z|01793|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8e): zone_id=88
2022-02-03T10:43:30.833Z|01794|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8f): zone_id=87
2022-02-03T10:43:30.953Z|01795|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x90): zone_id=90
2022-02-03T10:43:31.073Z|01796|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x91): zone_id=91
2022-02-03T10:43:31.193Z|01797|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x92): zone_id=94
2022-02-03T10:43:31.313Z|01798|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x93): zone_id=93
2022-02-03T10:43:31.433Z|01799|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x94): zone_id=92
2022-02-03T10:43:31.554Z|01800|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x95): zone_id=95
2022-02-03T10:43:31.674Z|01801|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x96): zone_id=96
2022-02-03T10:43:31.795Z|01802|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x97): zone_id=97
2022-02-03T10:43:31.915Z|01803|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x98): zone_id=99
2022-02-03T10:43:32.035Z|01804|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x99): zone_id=98
2022-02-03T10:43:32.156Z|01805|vconn|DBG|unix#264998: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x9a): zone_id=100
2022-02-03T10:43:32.275Z|01806|connmgr|INFO|br-int<->unix#264998: 37 flow_mods 13 s ago (36 adds, 1 deletes)
2022-02-03T10:43:32.276Z|01807|timeval|WARN|Unreasonably long 6245ms poll interval (826ms user, 5403ms system)
2022-02-03T10:43:32.276Z|01808|timeval|WARN|faults: 2 minor, 0 major
2022-02-03T10:43:32.276Z|01809|timeval|WARN|context switches: 0 voluntary, 21 involuntary
2022-02-03T10:43:32.276Z|01810|coverage|INFO|Skipping details of duplicate event coverage for hash=4321e6a4
2022-02-03T10:43:32.276Z|01811|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-03T10:43:32.276Z|01812|poll_loop|INFO|wakeup due to [POLLIN] on fd 228 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)

Please note how long each NXT_CT_FLUSH_ZONE line took: only ~8 zones are flushed per second. Maybe it's due to a quite big conntrack table: I noticed a huge number of conntrack records on the system in the host's zone=0. These were connections created from overlay tunnel packets that were not configured as NOTRACK in iptables (there was a configuration management issue). After I applied iptables rules to stop tracking geneve, the conntrack record count decreased from ~60k to ~3k, and the CT zone flush became much faster (about 100 zones per second). Thanks for pointing me to the other OpenFlow problem!

But my question is: why does ovn-controller flush conntrack zones and modify OpenFlow rules even if nothing has changed? I mean, in the "soft-restart" case, when ovn-controller was stopped through the unixctl socket with the "--restart" option. Should I start ovn-controller in some special way? When I trigger a "soft restart", I expect minimal impact on the datapath.

Regards,
Vladislav Odintsov

> On 2 Feb 2022, at 18:20, Ilya Maximets <i.maxim...@ovn.org> wrote:
> 
> On 2/1/22 23:15, Vladislav Odintsov wrote:
>> Hi OVN and OVS folks,
>> 
>> during the OVN upgrade from 20.06 to 21.12 I faced quite a long
>> datapath downtime (48-50 seconds).
>> I suppose this downtime was triggered by ovn-controller installing new
>> openflow rules to OVS.
>> There are relatively not so many flows (13-14k).
>> openvswitch version is 2.13.4.
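(For reference, the iptables change I describe above looked roughly like the sketch below. This is a minimal example under assumptions: geneve on its IANA default UDP port 6081, and the `raw` table used so packets are exempted before conntrack sees them; the exact rules in our setup are managed by configuration management and may differ.)

```shell
# Sketch: skip connection tracking for geneve tunnel traffic.
# The raw table is traversed before conntrack, so NOTRACK here
# prevents tunnel packets from creating entries in zone=0.
iptables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
iptables -t raw -A OUTPUT     -p udp --dport 6081 -j NOTRACK

# Check the effect on the conntrack table size (conntrack-tools):
conntrack -C
```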
>> ovn-openflow-probe-interval is configured to 30. >> >> The upgrade procedure was: >> 1. Set to ovs external_ids an option to force matching ovn-northd internal >> version (ovn-match-northd-version=true) >> 2. Upgrade ovn-controller (to first get ovn-controller, which supports >> version match between controller and northd and not to break datapath after >> upgrading ovn-northd) >> 3. Upgrade ovn-northd cluster >> >> Network outage timeframe matches to ovs-vswitchd WARN logs. Please see next: >> >> 2022-02-01T19:44:32.577Z|00035|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting >> for main to quiesce >> 2022-02-01T19:44:33.577Z|00036|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting >> for main to quiesce >> 2022-02-01T19:44:35.577Z|00037|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting >> for main to quiesce >> 2022-02-01T19:44:39.577Z|00038|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting >> for main to quiesce >> 2022-02-01T19:44:44.396Z|04063|connmgr|INFO|br-int<->unix#272674: 37 >> flow_mods 13 s ago (36 adds, 1 deletes) >> 2022-02-01T19:44:44.441Z|04064|timeval|WARN|Unreasonably long 12864ms poll >> interval (0ms user, 12833ms system) >> 2022-02-01T19:44:44.441Z|04065|timeval|WARN|faults: 287 minor, 0 major >> 2022-02-01T19:44:44.441Z|04066|timeval|WARN|context switches: 1 voluntary, >> 18 involuntary >> 2022-02-01T19:44:44.441Z|04067|coverage|INFO|Event coverage, avg rate over >> last: 5 seconds, last minute, last hour, hash=9105faf6: >> 2022-02-01T19:44:44.441Z|04068|coverage|INFO|ccmap_expand >> 0.0/sec 0.000/sec 0.0000/sec total: 1129 >> 2022-02-01T19:44:44.441Z|04069|coverage|INFO|ccmap_shrink >> 0.0/sec 11.933/sec 0.1989/sec total: 2363 >> 2022-02-01T19:44:44.441Z|04070|coverage|INFO|cmap_expand >> 0.0/sec 0.833/sec 0.8811/sec total: 11218023 >> 2022-02-01T19:44:44.441Z|04071|coverage|INFO|cmap_shrink >> 8.2/sec 11.950/sec 1.0736/sec total: 11217475 >> 2022-02-01T19:44:44.441Z|04072|coverage|INFO|dpif_port_add >> 0.0/sec 0.000/sec 0.0000/sec total: 229 >> 
2022-02-01T19:44:44.441Z|04073|coverage|INFO|dpif_port_del >> 0.0/sec 0.000/sec 0.0000/sec total: 144 >> 2022-02-01T19:44:44.441Z|04074|coverage|INFO|dpif_flow_flush >> 0.0/sec 0.000/sec 0.0000/sec total: 2 >> 2022-02-01T19:44:44.441Z|04075|coverage|INFO|dpif_flow_get >> 0.0/sec 0.000/sec 0.0000/sec total: 23 >> 2022-02-01T19:44:44.441Z|04076|coverage|INFO|dpif_flow_put >> 83.0/sec 97.900/sec 80.8306/sec total: 849051748 >> 2022-02-01T19:44:44.441Z|04077|coverage|INFO|dpif_flow_del >> 147.8/sec 91.733/sec 78.2797/sec total: 810797137 >> 2022-02-01T19:44:44.441Z|04078|coverage|INFO|dpif_execute >> 66.6/sec 83.767/sec 87.8397/sec total: 958697586 >> 2022-02-01T19:44:44.441Z|04079|coverage|INFO|dpif_execute_with_help >> 0.0/sec 4.250/sec 4.9892/sec total: 70713908 >> 2022-02-01T19:44:44.441Z|04080|coverage|INFO|flow_extract >> 169.0/sec 247.017/sec 333.1194/sec total: 7990486004 >> 2022-02-01T19:44:44.441Z|04081|coverage|INFO|miniflow_malloc >> 0.0/sec 0.183/sec 0.7983/sec total: 5871853 >> 2022-02-01T19:44:44.441Z|04082|coverage|INFO|hindex_pathological >> 0.0/sec 0.000/sec 0.0000/sec total: 78 >> 2022-02-01T19:44:44.441Z|04083|coverage|INFO|hindex_expand >> 0.0/sec 0.000/sec 0.0000/sec total: 11 >> 2022-02-01T19:44:44.441Z|04084|coverage|INFO|hmap_pathological >> 0.0/sec 0.633/sec 0.9331/sec total: 16638117 >> 2022-02-01T19:44:44.441Z|04085|coverage|INFO|hmap_expand >> 0.0/sec 114.867/sec 139.8006/sec total: 2184263485 >> 2022-02-01T19:44:44.441Z|04086|coverage|INFO|mac_learning_learned >> 0.0/sec 0.000/sec 0.0003/sec total: 299 >> 2022-02-01T19:44:44.441Z|04087|coverage|INFO|mac_learning_expired >> 0.0/sec 0.000/sec 0.0006/sec total: 117 >> 2022-02-01T19:44:44.441Z|04088|coverage|INFO|mac_learning_moved >> 0.0/sec 0.000/sec 0.0000/sec total: 87 >> 2022-02-01T19:44:44.441Z|04089|coverage|INFO|netdev_get_stats >> 0.0/sec 36.850/sec 43.3306/sec total: 690673778 >> 2022-02-01T19:44:44.441Z|04090|coverage|INFO|txn_unchanged >> 0.0/sec 0.100/sec 0.1614/sec total: 2705864 
>> 2022-02-01T19:44:44.441Z|04091|coverage|INFO|txn_incomplete >> 0.0/sec 0.183/sec 0.2239/sec total: 3688261 >> 2022-02-01T19:44:44.441Z|04092|coverage|INFO|txn_success >> 0.0/sec 0.167/sec 0.1953/sec total: 3236818 >> 2022-02-01T19:44:44.441Z|04093|coverage|INFO|poll_create_node >> 3930.6/sec 3941.350/sec 5012.2572/sec total: 115959738162 >> 2022-02-01T19:44:44.441Z|04094|coverage|INFO|poll_zero_timeout >> 159.4/sec 111.417/sec 103.4078/sec total: 1181551227 >> 2022-02-01T19:44:44.441Z|04095|coverage|INFO|rconn_queued >> 0.0/sec 0.050/sec 0.0817/sec total: 1279558 >> 2022-02-01T19:44:44.441Z|04096|coverage|INFO|rconn_sent >> 0.0/sec 0.050/sec 0.0817/sec total: 1279558 >> 2022-02-01T19:44:44.441Z|04097|coverage|INFO|seq_change >> 2205.2/sec 2502.783/sec 2874.6258/sec total: 57335390998 >> 2022-02-01T19:44:44.441Z|04098|coverage|INFO|pstream_open >> 0.0/sec 0.000/sec 0.0000/sec total: 5 >> 2022-02-01T19:44:44.441Z|04099|coverage|INFO|stream_open >> 0.0/sec 0.000/sec 0.0000/sec total: 1 >> 2022-02-01T19:44:44.441Z|04100|coverage|INFO|unixctl_received >> 0.0/sec 0.000/sec 0.0000/sec total: 189 >> 2022-02-01T19:44:44.441Z|04101|coverage|INFO|unixctl_replied >> 0.0/sec 0.000/sec 0.0000/sec total: 189 >> 2022-02-01T19:44:44.441Z|04102|coverage|INFO|util_xalloc >> 7935.8/sec 15234.350/sec 18372.4347/sec total: 345308271208 >> 2022-02-01T19:44:44.441Z|04103|coverage|INFO|vconn_received >> 0.0/sec 0.083/sec 0.3556/sec total: 2018930 >> 2022-02-01T19:44:44.441Z|04104|coverage|INFO|vconn_sent >> 0.0/sec 0.067/sec 0.0983/sec total: 1552067 >> 2022-02-01T19:44:44.441Z|04105|coverage|INFO|netdev_set_policing >> 0.0/sec 0.000/sec 0.0000/sec total: 623 >> 2022-02-01T19:44:44.441Z|04106|coverage|INFO|netdev_get_ifindex >> 0.0/sec 0.100/sec 0.1636/sec total: 2713928 >> 2022-02-01T19:44:44.441Z|04107|coverage|INFO|netdev_set_hwaddr >> 0.0/sec 0.000/sec 0.0000/sec total: 2 >> 2022-02-01T19:44:44.441Z|04108|coverage|INFO|netdev_get_ethtool >> 0.0/sec 0.000/sec 0.0000/sec total: 1073 
>> 2022-02-01T19:44:44.441Z|04109|coverage|INFO|netdev_set_ethtool >> 0.0/sec 0.000/sec 0.0000/sec total: 225 >> 2022-02-01T19:44:44.441Z|04110|coverage|INFO|nln_changed >> 0.0/sec 0.050/sec 0.0492/sec total: 819614 >> 2022-02-01T19:44:44.441Z|04111|coverage|INFO|netlink_received >> 346.6/sec 472.133/sec 575.4597/sec total: 11288088779 >> 2022-02-01T19:44:44.441Z|04112|coverage|INFO|netlink_recv_jumbo >> 18.4/sec 113.250/sec 175.1189/sec total: 3935643810 >> 2022-02-01T19:44:44.441Z|04113|coverage|INFO|netlink_sent >> 308.4/sec 378.133/sec 380.4267/sec total: 4753114969 >> 2022-02-01T19:44:44.441Z|04114|coverage|INFO|ofproto_flush >> 0.0/sec 0.000/sec 0.0000/sec total: 1 >> 2022-02-01T19:44:44.441Z|04115|coverage|INFO|ofproto_recv_openflow >> 0.0/sec 0.067/sec 0.3389/sec total: 1746449 >> 2022-02-01T19:44:44.441Z|04116|coverage|INFO|ofproto_update_port >> 0.0/sec 0.000/sec 0.0000/sec total: 814 >> 2022-02-01T19:44:44.441Z|04117|coverage|INFO|packet_in_overflow >> 0.0/sec 0.000/sec 0.0000/sec total: 282520 >> 2022-02-01T19:44:44.441Z|04118|coverage|INFO|rev_reconfigure >> 0.0/sec 0.017/sec 0.0325/sec total: 539904 >> 2022-02-01T19:44:44.441Z|04119|coverage|INFO|rev_port_toggled >> 0.0/sec 0.000/sec 0.0000/sec total: 27 >> 2022-02-01T19:44:44.441Z|04120|coverage|INFO|rev_flow_table >> 0.0/sec 0.000/sec 0.0106/sec total: 10876 >> 2022-02-01T19:44:44.441Z|04121|coverage|INFO|rev_mac_learning >> 0.0/sec 0.000/sec 0.0006/sec total: 312 >> 2022-02-01T19:44:44.441Z|04122|coverage|INFO|dumped_duplicate_flow >> 0.0/sec 0.000/sec 0.0000/sec total: 39204 >> 2022-02-01T19:44:44.441Z|04123|coverage|INFO|dumped_new_flow >> 0.0/sec 0.000/sec 0.0000/sec total: 7177 >> 2022-02-01T19:44:44.441Z|04124|coverage|INFO|handler_duplicate_upcall >> 0.4/sec 1.833/sec 2.2664/sec total: 34762798 >> 2022-02-01T19:44:44.441Z|04125|coverage|INFO|upcall_ukey_contention >> 0.0/sec 0.000/sec 0.0000/sec total: 9 >> 2022-02-01T19:44:44.441Z|04126|coverage|INFO|upcall_ukey_replace >> 0.2/sec 0.050/sec 
0.0447/sec total: 7927221 >> 2022-02-01T19:44:44.441Z|04127|coverage|INFO|revalidate_missed_dp_flow >> 0.0/sec 0.000/sec 0.0000/sec total: 1118 >> 2022-02-01T19:44:44.441Z|04128|coverage|INFO|xlate_actions >> 123.6/sec 273.183/sec 233.8275/sec total: 2110727223 >> 2022-02-01T19:44:44.441Z|04129|coverage|INFO|bridge_reconfigure >> 0.0/sec 0.017/sec 0.0325/sec total: 539902 >> 2022-02-01T19:44:44.441Z|04130|coverage|INFO|74 events never hit >> 2022-02-01T19:44:44.441Z|04131|poll_loop|INFO|Dropped 3 log messages in last >> 1058229 seconds (most recently, 1058228 seconds ago) due to excessive rate >> 2022-02-01T19:44:44.441Z|04132|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (84% CPU >> usage) >> 2022-02-01T19:44:45.441Z|00039|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting >> for main to quiesce >> 2022-02-01T19:44:46.441Z|00040|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting >> for main to quiesce >> 2022-02-01T19:44:48.441Z|00041|ovs_rcu(urcu7)|WARN|blocked 4001 ms waiting >> for main to quiesce >> 2022-02-01T19:44:52.441Z|00042|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting >> for main to quiesce >> 2022-02-01T19:44:55.109Z|04133|timeval|WARN|Unreasonably long 10668ms poll >> interval (0ms user, 10643ms system) >> 2022-02-01T19:44:55.109Z|04134|timeval|WARN|faults: 70 minor, 0 major >> 2022-02-01T19:44:55.109Z|04135|timeval|WARN|context switches: 0 voluntary, 4 >> involuntary >> 2022-02-01T19:44:55.109Z|04136|coverage|INFO|Skipping details of duplicate >> event coverage for hash=9105faf6 >> 2022-02-01T19:44:55.109Z|04137|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage) >> 2022-02-01T19:44:55.109Z|04138|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >> usage) >> 2022-02-01T19:44:56.112Z|00043|ovs_rcu(urcu7)|WARN|blocked 1001 ms waiting >> for main to quiesce >> 
2022-02-01T19:44:57.112Z|00044|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting >> for main to quiesce >> 2022-02-01T19:44:59.112Z|00045|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting >> for main to quiesce >> 2022-02-01T19:45:03.112Z|00046|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting >> for main to quiesce >> 2022-02-01T19:45:05.249Z|04139|timeval|WARN|Unreasonably long 10140ms poll >> interval (0ms user, 10115ms system) >> 2022-02-01T19:45:05.249Z|04140|timeval|WARN|faults: 248 minor, 0 major >> 2022-02-01T19:45:05.249Z|04141|timeval|WARN|context switches: 1 voluntary, >> 27 involuntary >> 2022-02-01T19:45:05.249Z|04142|coverage|INFO|Skipping details of duplicate >> event coverage for hash=9105faf6 >> 2022-02-01T19:45:05.249Z|04143|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 14 (NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (100% CPU >> usage) >> 2022-02-01T19:45:05.249Z|04144|poll_loop|INFO|wakeup due to [POLLOUT] on fd >> 235 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:153 (100% CPU >> usage) >> 2022-02-01T19:45:05.249Z|04145|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >> usage) >> 2022-02-01T19:45:06.250Z|00047|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting >> for main to quiesce >> 2022-02-01T19:45:07.250Z|00048|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting >> for main to quiesce >> 2022-02-01T19:45:09.250Z|00049|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting >> for main to quiesce >> 2022-02-01T19:45:13.250Z|00050|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting >> for main to quiesce >> 2022-02-01T19:45:15.704Z|04146|rconn|WARN|br-int<->unix#272675: connection >> dropped (Broken pipe) >> 2022-02-01T19:45:16.714Z|00051|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting >> for main to quiesce >> 2022-02-01T19:45:17.713Z|00052|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting >> for main to quiesce >> 2022-02-01T19:45:19.714Z|00053|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting >> for main 
to quiesce >> 2022-02-01T19:45:23.014Z|04147|timeval|WARN|Unreasonably long 17764ms poll >> interval (0ms user, 17720ms system) >> 2022-02-01T19:45:23.014Z|04148|timeval|WARN|faults: 63 minor, 0 major >> 2022-02-01T19:45:23.014Z|04149|timeval|WARN|context switches: 2 voluntary, >> 24 involuntary >> 2022-02-01T19:45:23.014Z|04150|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage) >> 2022-02-01T19:45:23.014Z|04151|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >> usage) >> 2022-02-01T19:45:23.021Z|04152|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >> usage) >> 2022-02-01T19:45:23.022Z|04153|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >> usage) >> 2022-02-01T19:45:23.022Z|04154|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >> usage) >> 2022-02-01T19:45:23.023Z|04155|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >> usage) >> 2022-02-01T19:45:23.023Z|04156|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >> usage) >> 2022-02-01T19:45:23.023Z|04157|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >> usage) >> 2022-02-01T19:45:23.024Z|04158|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >> usage) >> 2022-02-01T19:45:23.024Z|04159|poll_loop|INFO|wakeup due to [POLLIN] on fd >> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >> usage) >> 
2022-02-01T19:45:41.274Z|04160|connmgr|INFO|br-int<->unix#272674: 13589 >> flow_mods 18 s ago (13509 adds, 80 modifications) >> 2022-02-01T19:46:31.285Z|04161|rconn|ERR|br-int<->unix#272673: no response >> to inactivity probe after 60 seconds, disconnecting >> >> >> In ovn-controller log: >> >> 2022-02-01T19:44:31.267Z|00077|features|INFO|unix:/run/openvswitch/br-int.mgmt: >> connecting to switch >> 2022-02-01T19:44:31.267Z|00078|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: >> connecting... >> 2022-02-01T19:44:31.267Z|00079|ofctrl|INFO|unix:/run/openvswitch/br-int.mgmt: >> connecting to switch >> 2022-02-01T19:44:31.267Z|00080|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: >> connecting... >> 2022-02-01T19:44:31.268Z|00081|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: >> connected >> 2022-02-01T19:44:31.269Z|00082|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: >> connected >> 2022-02-01T19:44:31.284Z|00083|main|INFO|OVS feature set changed, force >> recompute. >> 2022-02-01T19:44:34.074Z|00084|main|INFO|OVNSB commit failed, force >> recompute next time. >> 2022-02-01T19:44:34.621Z|00085|reconnect|INFO|ssl:<IP>16.0.61:6642: >> connection closed by peer >> 2022-02-01T19:44:35.622Z|00086|reconnect|INFO|ssl:<IP>:6642: connecting... 
>> 2022-02-01T19:44:35.672Z|00087|reconnect|INFO|ssl:<IP>:6642: connected
>> 2022-02-01T19:45:01.470Z|00088|lflow_cache|INFO|Detected cache inactivity (last active 30001 ms ago): trimming cache
>> 2022-02-01T19:45:23.351Z|00089|binding|INFO|Setting lport <lport> ovn-installed in OVS
>> 2022-02-01T19:45:23.351Z|00090|binding|INFO|Setting lport <lport> up in Southbound
>> <there are 80 port bindings...>
>> 2022-02-01T19:45:23.352Z|00245|binding|INFO|Setting lport <lportN> ovn-installed in OVS
>> 2022-02-01T19:45:23.352Z|00246|binding|INFO|Setting lport <lportN> up in Southbound
>> 2022-02-01T19:45:53.323Z|00247|lflow_cache|INFO|Detected cache inactivity (last active 30004 ms ago): trimming cache
>> 
>> I've got some questions:
>> 1. What is OVS doing while `waiting for main to quiesce`? I noticed this
>> message appears each time the downtime occurs.
> 
> 'Unreasonably long 17764ms poll interval (0ms user, 17720ms system)'
> is very concerning. It means that the main thread spent almost 18
> seconds inside the kernel executing some syscall, maybe several syscalls.
> The main thread is not responsible for flow installation to the kernel, IIRC,
> so flow updates should not be the cause here. You need to figure out what
> these syscalls are and why they are taking so long to execute inside
> the kernel.
> 
> Maybe you have some suspicious kernel logs at the same time?
> You may also try 'perf' or some other tracing tools to figure out
> what is going on.
> 
>> 2. Is there any hint to minimize such downtime while reinstalling OF?
> 
> Installation of rules is an unlikely root cause here.
> 
>> 3. Could there be a situation where ovn-northd after the upgrade is still
>> generating new logical flows, ovn-controller constantly gets updates from SB
>> and triggers OF updates in a loop before it gets the final lflows "view"?
>> So, could it start installing new OF rules while holding in-memory lflows
>> that are not consistent at that moment?
> 
> northd generates all updates in a single transaction, so the SB is updated
> atomically. The SB will send this update as a single update message, so this
> is also atomic. ovn-controller will install all OF rules as an OF bundle,
> so that is also atomic.
> 
>> 4. What does the `br-int<->unix#272675: connection dropped (Broken pipe)` error
>> mean? What type of connection is broken? Is it normal?
> 
> This looks like a controller connection, i.e. the connection between
> ovn-controller and ovs-vswitchd. Doesn't look very good, but
> the disconnection seems to be caused by inactivity due to overload
> on one of the sides of the connection.
> 
>> 5. Do you see anything else abnormal in the logs?
> 
> '0ms user, 17720ms system' is the most abnormal, AFAICT.
> 
> Best regards, Ilya Maximets.
> 
> _______________________________________________
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev