Hi Ilya,

thanks for detailed answer.
I’ve tried to just "soft-restart" ovn-controller and it seems that I found the 
problem, that could block OVS.

What I’ve done:
1. ovs-appctl -t ovs-vswitchd vlog/set file:vconn:dbg
2. ovn-appctl -t ovn-controller exit --restart
3. systemctl restart ovn-controller

In OVS log I saw a lot of flow-mod lines, and CT_FLUSH_ZONE:

2022-02-03T10:43:18.956Z|01574|vconn|DBG|unix#264996: sent (Success): 
OFPT_HELLO (OF1.5) (xid=0x40a6b):
 version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
2022-02-03T10:43:18.956Z|01575|vconn|DBG|unix#264996: received: OFPT_HELLO 
(OF1.5) (xid=0x1):
 version bitmap: 0x06
2022-02-03T10:43:18.956Z|01576|vconn|DBG|unix#264996: negotiated OpenFlow 
version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
2022-02-03T10:43:18.956Z|01577|vconn|DBG|unix#264996: received: 
OFPT_GET_CONFIG_REQUEST (OF1.5) (xid=0x2):
2022-02-03T10:43:18.956Z|01578|vconn|DBG|unix#264996: sent (Success): 
OFPT_GET_CONFIG_REPLY (OF1.5) (xid=0x2): frags=normal miss_send_len=0
2022-02-03T10:43:18.956Z|01579|vconn|DBG|unix#264996: received: 
NXT_SET_PACKET_IN_FORMAT (OF1.5) (xid=0x3): format=nxt_packet_in2
2022-02-03T10:43:18.956Z|01580|vconn|DBG|unix#264996: received: OFPT_SET_CONFIG 
(OF1.5) (xid=0x4): frags=normal miss_send_len=65535
2022-02-03T10:43:19.123Z|01581|vconn|DBG|unix#264997: sent (Success): 
OFPT_HELLO (OF1.5) (xid=0x40a6c):
 version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
2022-02-03T10:43:19.123Z|01582|vconn|DBG|unix#264997: received: OFPT_HELLO 
(OF1.5) (xid=0x5):
 version bitmap: 0x06
2022-02-03T10:43:19.123Z|01583|vconn|DBG|unix#264997: negotiated OpenFlow 
version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
2022-02-03T10:43:19.123Z|01584|vconn|DBG|unix#264998: sent (Success): 
OFPT_HELLO (OF1.5) (xid=0x40a6d):
 version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
2022-02-03T10:43:19.123Z|01585|vconn|DBG|unix#264998: received: OFPT_HELLO 
(OF1.5) (xid=0x6):
 version bitmap: 0x06
2022-02-03T10:43:19.123Z|01586|vconn|DBG|unix#264998: negotiated OpenFlow 
version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
2022-02-03T10:43:19.124Z|01587|vconn|DBG|unix#264998: received: 
NXT_TLV_TABLE_REQUEST (OF1.5) (xid=0x7):
2022-02-03T10:43:19.124Z|01588|vconn|DBG|unix#264998: sent (Success): 
NXT_TLV_TABLE_REPLY (OF1.5) (xid=0x7):
 max option space=256 max fields=64
 allocated option space=4

 mapping table:
  class  type  length  match field
 ------  ----  ------  --------------
  0x102  0x80       4  tun_metadata0
2022-02-03T10:43:19.124Z|01589|vconn|DBG|unix#264998: received: OFPT_FLOW_MOD 
(OF1.5) (xid=0x8): DEL table:255 priority=0 actions=drop
2022-02-03T10:43:19.126Z|01590|vconn|DBG|unix#264998: received: OFPT_GROUP_MOD 
(OF1.5) (xid=0x9):
 DEL group_id=4294967292,type=all
2022-02-03T10:43:19.126Z|01591|vconn|DBG|unix#264998: received: OFPT_METER_MOD 
(OF1.5) (xid=0xa): DEL meter=all bands=
2022-02-03T10:43:19.126Z|01592|vconn|DBG|unix#264998: received: 
OFPT_BUNDLE_CONTROL (OF1.5) (xid=0xb):
 bundle_id=0 type=OPEN_REQUEST flags=atomic ordered
2022-02-03T10:43:19.126Z|01593|vconn|DBG|unix#264998: sent (Success): 
OFPT_BUNDLE_CONTROL (OF1.5) (xid=0xb):
 bundle_id=0 type=OPEN_REPLY flags=0
2022-02-03T10:43:19.126Z|01594|vconn|DBG|unix#264998: received: 
OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xc):
 bundle_id=0 flags=atomic ordered
OFPT_FLOW_MOD (OF1.5) (xid=0xc): ADD priority=100,in_port=98 
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
2022-02-03T10:43:19.126Z|01595|vconn|DBG|unix#264998: received: 
OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xd):
 bundle_id=0 flags=atomic ordered
OFPT_FLOW_MOD (OF1.5) (xid=0xd): ADD priority=100,in_port=96 
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
2022-02-03T10:43:19.126Z|01596|vconn|DBG|unix#264998: received: 
OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xe):
 bundle_id=0 flags=atomic ordered
OFPT_FLOW_MOD (OF1.5) (xid=0xe): ADD priority=100,in_port=7 
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
2022-02-03T10:43:19.126Z|01597|vconn|DBG|unix#264998: received: 
OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xf):
 bundle_id=0 flags=atomic ordered
OFPT_FLOW_MOD (OF1.5) (xid=0xf): ADD priority=100,in_port=6 
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
2022-02-03T10:43:19.126Z|01598|vconn|DBG|unix#264998: received: 
OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x10):
 bundle_id=0 flags=atomic ordered
OFPT_FLOW_MOD (OF1.5) (xid=0x10): ADD priority=100,in_port=55 
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
2022-02-03T10:43:19.126Z|01599|vconn|DBG|unix#264998: received: 
OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x11):
 bundle_id=0 flags=atomic ordered
….


2022-02-03T10:43:26.293Z|01758|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6b): zone_id=52
2022-02-03T10:43:26.424Z|01759|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6c): zone_id=55
2022-02-03T10:43:26.554Z|01760|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6d): zone_id=54
2022-02-03T10:43:26.685Z|01761|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6e): zone_id=58
2022-02-03T10:43:26.816Z|01762|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6f): zone_id=57
2022-02-03T10:43:26.946Z|01763|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x70): zone_id=56
2022-02-03T10:43:27.031Z|00013|ovs_rcu(urcu2)|WARN|blocked 1000 ms waiting for 
main to quiesce
2022-02-03T10:43:27.076Z|01764|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x71): zone_id=59
2022-02-03T10:43:27.206Z|01765|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x72): zone_id=61
2022-02-03T10:43:27.336Z|01766|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x73): zone_id=60
2022-02-03T10:43:27.465Z|01767|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x74): zone_id=62
2022-02-03T10:43:27.594Z|01768|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x75): zone_id=66
2022-02-03T10:43:27.726Z|01769|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x76): zone_id=65
2022-02-03T10:43:27.854Z|01770|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x77): zone_id=64
2022-02-03T10:43:27.984Z|01771|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x78): zone_id=63
2022-02-03T10:43:28.031Z|00014|ovs_rcu(urcu2)|WARN|blocked 2000 ms waiting for 
main to quiesce
2022-02-03T10:43:28.111Z|01772|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x79): zone_id=67
2022-02-03T10:43:28.238Z|01773|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7a): zone_id=68
2022-02-03T10:43:28.371Z|01774|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7b): zone_id=69
2022-02-03T10:43:28.496Z|01775|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7c): zone_id=72
2022-02-03T10:43:28.622Z|01776|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7d): zone_id=71
2022-02-03T10:43:28.746Z|01777|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7e): zone_id=70
2022-02-03T10:43:28.871Z|01778|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7f): zone_id=75
2022-02-03T10:43:29.000Z|01779|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x80): zone_id=74
2022-02-03T10:43:29.123Z|01780|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x81): zone_id=73
2022-02-03T10:43:29.246Z|01781|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x82): zone_id=76
2022-02-03T10:43:29.371Z|01782|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x83): zone_id=78
2022-02-03T10:43:29.495Z|01783|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x84): zone_id=77
2022-02-03T10:43:29.618Z|01784|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x85): zone_id=81
2022-02-03T10:43:29.746Z|01785|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x86): zone_id=80
2022-02-03T10:43:29.867Z|01786|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x87): zone_id=79
2022-02-03T10:43:29.988Z|01787|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x88): zone_id=84
2022-02-03T10:43:30.031Z|00015|ovs_rcu(urcu2)|WARN|blocked 4000 ms waiting for 
main to quiesce
2022-02-03T10:43:30.108Z|01788|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x89): zone_id=83
2022-02-03T10:43:30.229Z|01789|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8a): zone_id=82
2022-02-03T10:43:30.349Z|01790|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8b): zone_id=86
2022-02-03T10:43:30.470Z|01791|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8c): zone_id=85
2022-02-03T10:43:30.591Z|01792|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8d): zone_id=89
2022-02-03T10:43:30.712Z|01793|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8e): zone_id=88
2022-02-03T10:43:30.833Z|01794|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8f): zone_id=87
2022-02-03T10:43:30.953Z|01795|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x90): zone_id=90
2022-02-03T10:43:31.073Z|01796|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x91): zone_id=91
2022-02-03T10:43:31.193Z|01797|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x92): zone_id=94
2022-02-03T10:43:31.313Z|01798|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x93): zone_id=93
2022-02-03T10:43:31.433Z|01799|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x94): zone_id=92
2022-02-03T10:43:31.554Z|01800|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x95): zone_id=95
2022-02-03T10:43:31.674Z|01801|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x96): zone_id=96
2022-02-03T10:43:31.795Z|01802|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x97): zone_id=97
2022-02-03T10:43:31.915Z|01803|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x98): zone_id=99
2022-02-03T10:43:32.035Z|01804|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x99): zone_id=98
2022-02-03T10:43:32.156Z|01805|vconn|DBG|unix#264998: received: 
NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x9a): zone_id=100
2022-02-03T10:43:32.275Z|01806|connmgr|INFO|br-int<->unix#264998: 37 flow_mods 
13 s ago (36 adds, 1 deletes)
2022-02-03T10:43:32.276Z|01807|timeval|WARN|Unreasonably long 6245ms poll 
interval (826ms user, 5403ms system)
2022-02-03T10:43:32.276Z|01808|timeval|WARN|faults: 2 minor, 0 major
2022-02-03T10:43:32.276Z|01809|timeval|WARN|context switches: 0 voluntary, 21 
involuntary
2022-02-03T10:43:32.276Z|01810|coverage|INFO|Skipping details of duplicate 
event coverage for hash=4321e6a4
2022-02-03T10:43:32.276Z|01811|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 
(<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
2022-02-03T10:43:32.276Z|01812|poll_loop|INFO|wakeup due to [POLLIN] on fd 228 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU usage)


Please note the time for each NXT_CT_FLUSH_ZONE line took. It flushes just ~8 
zones per second. Maybe, it’s due to quite big conntrack size:
I’ve noticed that on the system there were a huge amount of the conntrack 
records in host’s zone=0.
There were connections generated from overlay tunnel packets, that were not 
configured in iptables to be NOTRACK - there was a configuration management 
issue.
After I’ve applied iptables rules to not to track geneve, conntrack connection 
count decreased from ~60k to 3k and CT_ZONE flush became much faster (about 100 
zones per 1 second).

Thanks for pointing to another to openflows problem!

But my question is why ovn-controller flushes conntrack zones and modifies OFs 
even if nothing has changed? I mean, a "soft-restart" condition (when 
ovn-controller was stopped through unixctl socket with "--restart" option. 
Should I start ovn-controller somehow special?
When I trigger "soft-restart", I expect to have a minimal impact on the 
datapath.

Regards,
Vladislav Odintsov

> On 2 Feb 2022, at 18:20, Ilya Maximets <i.maxim...@ovn.org> wrote:
> 
> On 2/1/22 23:15, Vladislav Odintsov wrote:
>> Hi OVN and OVS folks,
>> 
>> during OVN upgrade from 20.06 to 21.12 version I’ve faced with a quite long 
>> datapath downtime (48-50 seconds).
>> I suppose this downtime was triggered by ovn-controller installing new 
>> openflow rules to ovs.
>> There are relatively not so many flows (13-14k).
>> openvswitch version is 2.13.4.
>> ovn-openflow-probe-interval is configured to 30.
>> 
>> The upgrade procedure was:
>> 1. Set to ovs external_ids an option to force matching ovn-northd internal 
>> version (ovn-match-northd-version=true)
>> 2. Upgrade ovn-controller (to first get ovn-controller, which supports 
>> version match between controller and northd and not to break datapath after 
>> upgrading ovn-northd)
>> 3. Upgrade ovn-northd cluster
>> 
>> Network outage timeframe matches to ovs-vswitchd WARN logs. Please see next:
>> 
>> 2022-02-01T19:44:32.577Z|00035|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:44:33.577Z|00036|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:44:35.577Z|00037|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:44:39.577Z|00038|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:44:44.396Z|04063|connmgr|INFO|br-int<->unix#272674: 37 
>> flow_mods 13 s ago (36 adds, 1 deletes)
>> 2022-02-01T19:44:44.441Z|04064|timeval|WARN|Unreasonably long 12864ms poll 
>> interval (0ms user, 12833ms system)
>> 2022-02-01T19:44:44.441Z|04065|timeval|WARN|faults: 287 minor, 0 major
>> 2022-02-01T19:44:44.441Z|04066|timeval|WARN|context switches: 1 voluntary, 
>> 18 involuntary
>> 2022-02-01T19:44:44.441Z|04067|coverage|INFO|Event coverage, avg rate over 
>> last: 5 seconds, last minute, last hour,  hash=9105faf6:
>> 2022-02-01T19:44:44.441Z|04068|coverage|INFO|ccmap_expand               
>> 0.0/sec     0.000/sec        0.0000/sec   total: 1129
>> 2022-02-01T19:44:44.441Z|04069|coverage|INFO|ccmap_shrink               
>> 0.0/sec    11.933/sec        0.1989/sec   total: 2363
>> 2022-02-01T19:44:44.441Z|04070|coverage|INFO|cmap_expand                
>> 0.0/sec     0.833/sec        0.8811/sec   total: 11218023
>> 2022-02-01T19:44:44.441Z|04071|coverage|INFO|cmap_shrink                
>> 8.2/sec    11.950/sec        1.0736/sec   total: 11217475
>> 2022-02-01T19:44:44.441Z|04072|coverage|INFO|dpif_port_add              
>> 0.0/sec     0.000/sec        0.0000/sec   total: 229
>> 2022-02-01T19:44:44.441Z|04073|coverage|INFO|dpif_port_del              
>> 0.0/sec     0.000/sec        0.0000/sec   total: 144
>> 2022-02-01T19:44:44.441Z|04074|coverage|INFO|dpif_flow_flush            
>> 0.0/sec     0.000/sec        0.0000/sec   total: 2
>> 2022-02-01T19:44:44.441Z|04075|coverage|INFO|dpif_flow_get              
>> 0.0/sec     0.000/sec        0.0000/sec   total: 23
>> 2022-02-01T19:44:44.441Z|04076|coverage|INFO|dpif_flow_put             
>> 83.0/sec    97.900/sec       80.8306/sec   total: 849051748
>> 2022-02-01T19:44:44.441Z|04077|coverage|INFO|dpif_flow_del            
>> 147.8/sec    91.733/sec       78.2797/sec   total: 810797137
>> 2022-02-01T19:44:44.441Z|04078|coverage|INFO|dpif_execute              
>> 66.6/sec    83.767/sec       87.8397/sec   total: 958697586
>> 2022-02-01T19:44:44.441Z|04079|coverage|INFO|dpif_execute_with_help     
>> 0.0/sec     4.250/sec        4.9892/sec   total: 70713908
>> 2022-02-01T19:44:44.441Z|04080|coverage|INFO|flow_extract             
>> 169.0/sec   247.017/sec      333.1194/sec   total: 7990486004
>> 2022-02-01T19:44:44.441Z|04081|coverage|INFO|miniflow_malloc            
>> 0.0/sec     0.183/sec        0.7983/sec   total: 5871853
>> 2022-02-01T19:44:44.441Z|04082|coverage|INFO|hindex_pathological        
>> 0.0/sec     0.000/sec        0.0000/sec   total: 78
>> 2022-02-01T19:44:44.441Z|04083|coverage|INFO|hindex_expand              
>> 0.0/sec     0.000/sec        0.0000/sec   total: 11
>> 2022-02-01T19:44:44.441Z|04084|coverage|INFO|hmap_pathological          
>> 0.0/sec     0.633/sec        0.9331/sec   total: 16638117
>> 2022-02-01T19:44:44.441Z|04085|coverage|INFO|hmap_expand                
>> 0.0/sec   114.867/sec      139.8006/sec   total: 2184263485
>> 2022-02-01T19:44:44.441Z|04086|coverage|INFO|mac_learning_learned       
>> 0.0/sec     0.000/sec        0.0003/sec   total: 299
>> 2022-02-01T19:44:44.441Z|04087|coverage|INFO|mac_learning_expired       
>> 0.0/sec     0.000/sec        0.0006/sec   total: 117
>> 2022-02-01T19:44:44.441Z|04088|coverage|INFO|mac_learning_moved         
>> 0.0/sec     0.000/sec        0.0000/sec   total: 87
>> 2022-02-01T19:44:44.441Z|04089|coverage|INFO|netdev_get_stats           
>> 0.0/sec    36.850/sec       43.3306/sec   total: 690673778
>> 2022-02-01T19:44:44.441Z|04090|coverage|INFO|txn_unchanged              
>> 0.0/sec     0.100/sec        0.1614/sec   total: 2705864
>> 2022-02-01T19:44:44.441Z|04091|coverage|INFO|txn_incomplete             
>> 0.0/sec     0.183/sec        0.2239/sec   total: 3688261
>> 2022-02-01T19:44:44.441Z|04092|coverage|INFO|txn_success                
>> 0.0/sec     0.167/sec        0.1953/sec   total: 3236818
>> 2022-02-01T19:44:44.441Z|04093|coverage|INFO|poll_create_node         
>> 3930.6/sec  3941.350/sec     5012.2572/sec  total: 115959738162
>> 2022-02-01T19:44:44.441Z|04094|coverage|INFO|poll_zero_timeout        
>> 159.4/sec   111.417/sec      103.4078/sec   total: 1181551227
>> 2022-02-01T19:44:44.441Z|04095|coverage|INFO|rconn_queued               
>> 0.0/sec     0.050/sec        0.0817/sec   total: 1279558
>> 2022-02-01T19:44:44.441Z|04096|coverage|INFO|rconn_sent                 
>> 0.0/sec     0.050/sec        0.0817/sec   total: 1279558
>> 2022-02-01T19:44:44.441Z|04097|coverage|INFO|seq_change               
>> 2205.2/sec  2502.783/sec     2874.6258/sec  total: 57335390998
>> 2022-02-01T19:44:44.441Z|04098|coverage|INFO|pstream_open               
>> 0.0/sec     0.000/sec        0.0000/sec   total: 5
>> 2022-02-01T19:44:44.441Z|04099|coverage|INFO|stream_open                
>> 0.0/sec     0.000/sec        0.0000/sec   total: 1
>> 2022-02-01T19:44:44.441Z|04100|coverage|INFO|unixctl_received           
>> 0.0/sec     0.000/sec        0.0000/sec   total: 189
>> 2022-02-01T19:44:44.441Z|04101|coverage|INFO|unixctl_replied            
>> 0.0/sec     0.000/sec        0.0000/sec   total: 189
>> 2022-02-01T19:44:44.441Z|04102|coverage|INFO|util_xalloc              
>> 7935.8/sec 15234.350/sec    18372.4347/sec  total: 345308271208
>> 2022-02-01T19:44:44.441Z|04103|coverage|INFO|vconn_received             
>> 0.0/sec     0.083/sec        0.3556/sec   total: 2018930
>> 2022-02-01T19:44:44.441Z|04104|coverage|INFO|vconn_sent                 
>> 0.0/sec     0.067/sec        0.0983/sec   total: 1552067
>> 2022-02-01T19:44:44.441Z|04105|coverage|INFO|netdev_set_policing        
>> 0.0/sec     0.000/sec        0.0000/sec   total: 623
>> 2022-02-01T19:44:44.441Z|04106|coverage|INFO|netdev_get_ifindex         
>> 0.0/sec     0.100/sec        0.1636/sec   total: 2713928
>> 2022-02-01T19:44:44.441Z|04107|coverage|INFO|netdev_set_hwaddr          
>> 0.0/sec     0.000/sec        0.0000/sec   total: 2
>> 2022-02-01T19:44:44.441Z|04108|coverage|INFO|netdev_get_ethtool         
>> 0.0/sec     0.000/sec        0.0000/sec   total: 1073
>> 2022-02-01T19:44:44.441Z|04109|coverage|INFO|netdev_set_ethtool         
>> 0.0/sec     0.000/sec        0.0000/sec   total: 225
>> 2022-02-01T19:44:44.441Z|04110|coverage|INFO|nln_changed                
>> 0.0/sec     0.050/sec        0.0492/sec   total: 819614
>> 2022-02-01T19:44:44.441Z|04111|coverage|INFO|netlink_received         
>> 346.6/sec   472.133/sec      575.4597/sec   total: 11288088779
>> 2022-02-01T19:44:44.441Z|04112|coverage|INFO|netlink_recv_jumbo        
>> 18.4/sec   113.250/sec      175.1189/sec   total: 3935643810
>> 2022-02-01T19:44:44.441Z|04113|coverage|INFO|netlink_sent             
>> 308.4/sec   378.133/sec      380.4267/sec   total: 4753114969
>> 2022-02-01T19:44:44.441Z|04114|coverage|INFO|ofproto_flush              
>> 0.0/sec     0.000/sec        0.0000/sec   total: 1
>> 2022-02-01T19:44:44.441Z|04115|coverage|INFO|ofproto_recv_openflow      
>> 0.0/sec     0.067/sec        0.3389/sec   total: 1746449
>> 2022-02-01T19:44:44.441Z|04116|coverage|INFO|ofproto_update_port        
>> 0.0/sec     0.000/sec        0.0000/sec   total: 814
>> 2022-02-01T19:44:44.441Z|04117|coverage|INFO|packet_in_overflow         
>> 0.0/sec     0.000/sec        0.0000/sec   total: 282520
>> 2022-02-01T19:44:44.441Z|04118|coverage|INFO|rev_reconfigure            
>> 0.0/sec     0.017/sec        0.0325/sec   total: 539904
>> 2022-02-01T19:44:44.441Z|04119|coverage|INFO|rev_port_toggled           
>> 0.0/sec     0.000/sec        0.0000/sec   total: 27
>> 2022-02-01T19:44:44.441Z|04120|coverage|INFO|rev_flow_table             
>> 0.0/sec     0.000/sec        0.0106/sec   total: 10876
>> 2022-02-01T19:44:44.441Z|04121|coverage|INFO|rev_mac_learning           
>> 0.0/sec     0.000/sec        0.0006/sec   total: 312
>> 2022-02-01T19:44:44.441Z|04122|coverage|INFO|dumped_duplicate_flow      
>> 0.0/sec     0.000/sec        0.0000/sec   total: 39204
>> 2022-02-01T19:44:44.441Z|04123|coverage|INFO|dumped_new_flow            
>> 0.0/sec     0.000/sec        0.0000/sec   total: 7177
>> 2022-02-01T19:44:44.441Z|04124|coverage|INFO|handler_duplicate_upcall   
>> 0.4/sec     1.833/sec        2.2664/sec   total: 34762798
>> 2022-02-01T19:44:44.441Z|04125|coverage|INFO|upcall_ukey_contention     
>> 0.0/sec     0.000/sec        0.0000/sec   total: 9
>> 2022-02-01T19:44:44.441Z|04126|coverage|INFO|upcall_ukey_replace        
>> 0.2/sec     0.050/sec        0.0447/sec   total: 7927221
>> 2022-02-01T19:44:44.441Z|04127|coverage|INFO|revalidate_missed_dp_flow   
>> 0.0/sec     0.000/sec        0.0000/sec   total: 1118
>> 2022-02-01T19:44:44.441Z|04128|coverage|INFO|xlate_actions            
>> 123.6/sec   273.183/sec      233.8275/sec   total: 2110727223
>> 2022-02-01T19:44:44.441Z|04129|coverage|INFO|bridge_reconfigure         
>> 0.0/sec     0.017/sec        0.0325/sec   total: 539902
>> 2022-02-01T19:44:44.441Z|04130|coverage|INFO|74 events never hit
>> 2022-02-01T19:44:44.441Z|04131|poll_loop|INFO|Dropped 3 log messages in last 
>> 1058229 seconds (most recently, 1058228 seconds ago) due to excessive rate
>> 2022-02-01T19:44:44.441Z|04132|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (84% CPU 
>> usage)
>> 2022-02-01T19:44:45.441Z|00039|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:44:46.441Z|00040|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:44:48.441Z|00041|ovs_rcu(urcu7)|WARN|blocked 4001 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:44:52.441Z|00042|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:44:55.109Z|04133|timeval|WARN|Unreasonably long 10668ms poll 
>> interval (0ms user, 10643ms system)
>> 2022-02-01T19:44:55.109Z|04134|timeval|WARN|faults: 70 minor, 0 major
>> 2022-02-01T19:44:55.109Z|04135|timeval|WARN|context switches: 0 voluntary, 4 
>> involuntary
>> 2022-02-01T19:44:55.109Z|04136|coverage|INFO|Skipping details of duplicate 
>> event coverage for hash=9105faf6
>> 2022-02-01T19:44:55.109Z|04137|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
>> 2022-02-01T19:44:55.109Z|04138|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>> usage)
>> 2022-02-01T19:44:56.112Z|00043|ovs_rcu(urcu7)|WARN|blocked 1001 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:44:57.112Z|00044|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:44:59.112Z|00045|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:45:03.112Z|00046|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:45:05.249Z|04139|timeval|WARN|Unreasonably long 10140ms poll 
>> interval (0ms user, 10115ms system)
>> 2022-02-01T19:45:05.249Z|04140|timeval|WARN|faults: 248 minor, 0 major
>> 2022-02-01T19:45:05.249Z|04141|timeval|WARN|context switches: 1 voluntary, 
>> 27 involuntary
>> 2022-02-01T19:45:05.249Z|04142|coverage|INFO|Skipping details of duplicate 
>> event coverage for hash=9105faf6
>> 2022-02-01T19:45:05.249Z|04143|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 14 (NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (100% CPU 
>> usage)
>> 2022-02-01T19:45:05.249Z|04144|poll_loop|INFO|wakeup due to [POLLOUT] on fd 
>> 235 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:153 (100% CPU 
>> usage)
>> 2022-02-01T19:45:05.249Z|04145|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>> usage)
>> 2022-02-01T19:45:06.250Z|00047|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:45:07.250Z|00048|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:45:09.250Z|00049|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:45:13.250Z|00050|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:45:15.704Z|04146|rconn|WARN|br-int<->unix#272675: connection 
>> dropped (Broken pipe)
>> 2022-02-01T19:45:16.714Z|00051|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:45:17.713Z|00052|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:45:19.714Z|00053|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting 
>> for main to quiesce
>> 2022-02-01T19:45:23.014Z|04147|timeval|WARN|Unreasonably long 17764ms poll 
>> interval (0ms user, 17720ms system)
>> 2022-02-01T19:45:23.014Z|04148|timeval|WARN|faults: 63 minor, 0 major
>> 2022-02-01T19:45:23.014Z|04149|timeval|WARN|context switches: 2 voluntary, 
>> 24 involuntary
>> 2022-02-01T19:45:23.014Z|04150|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
>> 2022-02-01T19:45:23.014Z|04151|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>> usage)
>> 2022-02-01T19:45:23.021Z|04152|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>> usage)
>> 2022-02-01T19:45:23.022Z|04153|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>> usage)
>> 2022-02-01T19:45:23.022Z|04154|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>> usage)
>> 2022-02-01T19:45:23.023Z|04155|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>> usage)
>> 2022-02-01T19:45:23.023Z|04156|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>> usage)
>> 2022-02-01T19:45:23.023Z|04157|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>> usage)
>> 2022-02-01T19:45:23.024Z|04158|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>> usage)
>> 2022-02-01T19:45:23.024Z|04159|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>> usage)
>> 2022-02-01T19:45:41.274Z|04160|connmgr|INFO|br-int<->unix#272674: 13589 
>> flow_mods 18 s ago (13509 adds, 80 modifications)
>> 2022-02-01T19:46:31.285Z|04161|rconn|ERR|br-int<->unix#272673: no response 
>> to inactivity probe after 60 seconds, disconnecting
>> 
>> 
>> In ovn-controller log:
>> 
>> 2022-02-01T19:44:31.267Z|00077|features|INFO|unix:/run/openvswitch/br-int.mgmt:
>>  connecting to switch
>> 2022-02-01T19:44:31.267Z|00078|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
>> connecting...
>> 2022-02-01T19:44:31.267Z|00079|ofctrl|INFO|unix:/run/openvswitch/br-int.mgmt:
>>  connecting to switch
>> 2022-02-01T19:44:31.267Z|00080|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
>> connecting...
>> 2022-02-01T19:44:31.268Z|00081|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
>> connected
>> 2022-02-01T19:44:31.269Z|00082|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: 
>> connected
>> 2022-02-01T19:44:31.284Z|00083|main|INFO|OVS feature set changed, force 
>> recompute.
>> 2022-02-01T19:44:34.074Z|00084|main|INFO|OVNSB commit failed, force 
>> recompute next time.
>> 2022-02-01T19:44:34.621Z|00085|reconnect|INFO|ssl:<IP>16.0.61:6642: 
>> connection closed by peer
>> 2022-02-01T19:44:35.622Z|00086|reconnect|INFO|ssl:<IP>:6642: connecting...
>> 2022-02-01T19:44:35.672Z|00087|reconnect|INFO|ssl:<IP>:6642: connected
>> 2022-02-01T19:45:01.470Z|00088|lflow_cache|INFO|Detected cache inactivity 
>> (last active 30001 ms ago): trimming cache
>> 2022-02-01T19:45:23.351Z|00089|binding|INFO|Setting lport <lport> 
>> ovn-installed in OVS
>> 2022-02-01T19:45:23.351Z|00090|binding|INFO|Setting lport <lport> up in 
>> Southbound
>> <there are 80 port bindings...>
>> 2022-02-01T19:45:23.352Z|00245|binding|INFO|Setting lport <lportN> 
>> ovn-installed in OVS
>> 2022-02-01T19:45:23.352Z|00246|binding|INFO|Setting lport <lportN> up in 
>> Southbound
>> 2022-02-01T19:45:53.323Z|00247|lflow_cache|INFO|Detected cache inactivity 
>> (last active 30004 ms ago): trimming cache
>> 
>> 
>> I’ve got some questions:
>> 1. What OVS is doing while `waiting for main to quiesce` ? I noticed this 
>> message appeares each time downtime occurs.
> 
> 'Unreasonably long 17764ms poll interval (0ms user, 17720ms system)'
> is very concerning.  It means that the main thread spent almost 18
> seconds inside the kernel executing some syscall, maybe several syscalls.
> Main thread is not responsible for flow installation to the kernel, IIRC,
> so flow updates should not be a case here.  You need to figure out what
> are these syscalls and why they are taking so long to execute inside
> the kernel.
> 
> Maybe you have some suspicious kernel logs at the same time?
> You may also try 'perf' or some other tracing tools to figure out
> what is going on.
> 
>> 2. Is there any hint to minimize such downtime while reinstalling OF?
> 
> Installation of rules is an unlikely root cause here.
> 
>> 3. Could there be a situation where ovn-northd after upgrade still 
>> generating new logical flows, ovn-controller constantly gets updates from SB 
>> and triggers update to OF in a loop while doesn’t finish getting final 
>> lflows "view"? So, could it start installing new OF while having in memory 
>> lflows which are not consistent at that moment?
> 
> northd generates all updates in a single transaction, so SB is updated
> atomically. SB will send this update as a single update message, so this
> is also atomic.  ovn-controller will install all OF rules as an OF bundle,
> so that is also atomic.
> 
>> 4. What does `br-int<->unix#272675: connection dropped (Broken pipe)` error 
>> means? What type of connection is broken? Is it normal?
> 
> This looks like a controller connection, i.e. connection between
> ovn-controller with ovs-vswitchd.  Doesn't look very good, but
> disconnection seems to be caused by inactivity due to overload
> on one of the sides of the connection.
> 
>> 5. Do you see anything else abnormal in the logs?
> 
> '0ms user, 17720ms system' is the most abnormal, AFAICT.
> 
> Best regards, Ilya Maximets.
> 
> _______________________________________________
> dev mailing list
> d...@openvswitch.org <mailto:d...@openvswitch.org>
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev 
> <https://mail.openvswitch.org/mailman/listinfo/ovs-dev>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to