UPD:

I’ve checked ovn-controller dbg logs and checked against its code.
Not sure, but looks like some bug in first processing of ct_zones. This logfile 
is from 20.06 version, but it has similar behaviour on v21.12:

start ovn-controller

2022-02-03T16:09:50.500Z|00032|jsonrpc|DBG|unix:/run/openvswitch/db.sock: 
received reply, result={}, id=3
2022-02-03T16:09:50.500Z|00033|ovsdb_idl|DBG|unix:/run/openvswitch/db.sock: 
received unexpected reply message: {"error":null,"id":3,"result":{}}
2022-02-03T16:09:50.500Z|00034|main|DBG|restoring ct zone 25 for 
'vnet37798-gateway-vif'
2022-02-03T16:09:50.500Z|00035|main|DBG|restoring ct zone 8 for 
'5ae22181-33a5-4596-8143-e5e396fbc17e_dnat'
2022-02-03T16:09:50.500Z|00036|main|DBG|restoring ct zone 57 for 
'3210cc4f-d055-4375-8425-aef0bd4db5fc_dnat'
2022-02-03T16:09:50.500Z|00037|main|DBG|restoring ct zone 10 for 
'vnet38045-gateway-vif'
2022-02-03T16:09:50.500Z|00038|main|DBG|restoring ct zone 60 for 
'910c751a-3d86-420a-9606-3f751de4e8bf_dnat'
2022-02-03T16:09:50.500Z|00039|main|DBG|restoring ct zone 146 for 
'30fda34a-7614-487a-9b17-258781c06e31_snat'
2022-02-03T16:09:50.500Z|00040|main|DBG|restoring ct zone 171 for 
'de72957f-b877-4c13-8b3d-9fe9b3abcc47_snat'
2022-02-03T16:09:50.500Z|00041|main|DBG|restoring ct zone 184 for 
'vnet36210-gateway-vif'
2022-02-03T16:09:50.500Z|00042|main|DBG|restoring ct zone 120 for 
'fdc23a91-fbcc-4fc4-9edd-e2001a827a2a_snat'
2022-02-03T16:09:50.500Z|00043|main|DBG|restoring ct zone 102 for 
'2c9ea3e6-7ba1-445e-b707-1873d9a85e00_dnat'
2022-02-03T16:09:50.500Z|00044|main|DBG|restoring ct zone 152 for 
'fa746f3c-1d4e-450a-80c9-e6cd39676bbc_snat'
2022-02-03T16:09:50.500Z|00045|main|DBG|restoring ct zone 92 for 
'8e81ebd1-f43d-4d90-85d9-dc9e85690f58_dnat'
2022-02-03T16:09:50.500Z|00046|main|DBG|restoring ct zone 135 for 
'ad815732-f341-4a6d-bdce-184dc6c53edf_snat'
2022-02-03T16:09:50.500Z|00047|main|DBG|restoring ct zone 100 for 
'vnet35091-gateway-vif'
2022-02-03T16:09:50.500Z|00048|main|DBG|restoring ct zone 162 for 
'1078be52-1141-4060-b98d-23d96464b8bf_dnat'
2022-02-03T16:09:50.500Z|00049|main|DBG|restoring ct zone 30 for 
'1078be52-1141-4060-b98d-23d96464b8bf_snat'
2022-02-03T16:09:50.500Z|00050|main|DBG|restoring ct zone 130 for 
'605f08c3-f0be-4c0b-a721-5955bf2852d9_snat'
2022-02-03T16:09:50.500Z|00051|main|DBG|restoring ct zone 121 for 
'27e28d02-ff81-4faa-b79b-8b6174e7ceea_dnat'
2022-02-03T16:09:50.500Z|00052|main|DBG|restoring ct zone 70 for 
'vnet35515-gateway-vif'
2022-02-03T16:09:50.500Z|00053|main|DBG|restoring ct zone 26 for 
'b56507c6-ac4b-4de0-b6d3-33936e98ad0b_dnat'
2022-02-03T16:09:50.500Z|00054|main|DBG|restoring ct zone 114 for 
'c7f1776b-0a1f-4434-bac5-db2b59f3363b_snat'
…
2022-02-03T16:09:50.501Z|00226|inc_proc_eng|DBG|Initializing new run
…
2022-02-03T16:09:50.730Z|00287|inc_proc_eng|DBG|Initializing new run
…
2022-02-03T16:09:50.739Z|00311|inc_proc_eng|DBG|Initializing new run
<some updates from SB DB>
…
2022-02-03T16:09:50.754Z|00347|inc_proc_eng|DBG|Initializing new run
…
2022-02-03T16:09:50.779Z|00408|main|DBG|removing ct zone 2 for 
'5601bacd-d246-42f8-aa69-c196f74ca57c_snat'
2022-02-03T16:09:50.779Z|00409|main|DBG|removing ct zone 3 for 
'96b6fa7d-d974-4260-a58e-c78342fd097d_snat'
2022-02-03T16:09:50.779Z|00410|main|DBG|removing ct zone 5 for 
'a8e55770-885f-464e-bfa9-9c80eca0ce62_snat'
2022-02-03T16:09:50.779Z|00411|main|DBG|removing ct zone 6 for 
'2969a6a5-012d-4504-83bf-be5192c4af87_dnat'
2022-02-03T16:09:50.779Z|00412|main|DBG|removing ct zone 9 for 
'03175d56-14c2-4ea9-b671-e427fe6b62e6_dnat'
2022-02-03T16:09:50.779Z|00413|main|DBG|removing ct zone 8 for 
'5ae22181-33a5-4596-8143-e5e396fbc17e_dnat'
2022-02-03T16:09:50.779Z|00414|main|DBG|removing ct zone 12 for 
'11ecffe1-c6f9-4373-948a-fff26113af73_snat'
2022-02-03T16:09:50.779Z|00415|main|DBG|removing ct zone 15 for 
'791f9a0b-81e4-4a9e-8f38-7d56332b6237_snat'
2022-02-03T16:09:50.779Z|00416|main|DBG|removing ct zone 14 for 
'0408d0d9-e7d9-49f7-a0f2-32d611b56686_snat'
2022-02-03T16:09:50.779Z|00417|main|DBG|removing ct zone 13 for 
'd8cd2a0c-bd9e-4472-920e-5443e2a94429_dnat'
2022-02-03T16:09:50.779Z|00418|main|DBG|removing ct zone 17 for 
'4e59bb4f-1ec6-4375-bb1c-63e03d1133ce_snat'
2022-02-03T16:09:50.779Z|00419|main|DBG|removing ct zone 18 for 
'3210cc4f-d055-4375-8425-aef0bd4db5fc_snat'
2022-02-03T16:09:50.779Z|00420|main|DBG|removing ct zone 20 for 
'cfcb7e97-6107-435e-8247-771314968be9_dnat'
2022-02-03T16:09:50.779Z|00421|main|DBG|removing ct zone 23 for 
'aba29611-9a7c-497d-9dbb-2d18314d517b_snat'
2022-02-03T16:09:50.779Z|00422|main|DBG|removing ct zone 21 for 
'57e00565-1250-4fa7-9c11-a7c1c602bf32_snat'
2022-02-03T16:09:50.779Z|00423|main|DBG|removing ct zone 26 for 
'b56507c6-ac4b-4de0-b6d3-33936e98ad0b_dnat'
2022-02-03T16:09:50.779Z|00424|main|DBG|removing ct zone 24 for 
'2486c8ff-9eb4-4326-8601-a2e93343164e_dnat'
2022-02-03T16:09:50.779Z|00425|main|DBG|removing ct zone 27 for 
'278af800-e548-488a-9f5a-4f965718b56d_snat'
2022-02-03T16:09:50.779Z|00426|main|DBG|removing ct zone 30 for 
'1078be52-1141-4060-b98d-23d96464b8bf_snat'
…
...
2022-02-03T16:09:50.780Z|00546|inc_proc_eng|DBG|node: flow_output, recompute 
(forced)
2022-02-03T16:09:50.780Z|00547|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, 
table_id=0, priority=100, in_port=22, 
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
2022-02-03T16:09:50.780Z|00548|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, 
table_id=0, priority=100, in_port=9, 
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
2022-02-03T16:09:50.780Z|00549|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, 
table_id=0, priority=100, in_port=60, 
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
2022-02-03T16:09:50.780Z|00550|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, 
table_id=0, priority=100, in_port=18, 
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
2022-02-03T16:09:50.780Z|00551|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, 
table_id=0, priority=100, in_port=37, 
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
2022-02-03T16:09:50.780Z|00552|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, 
table_id=0, priority=100, in_port=1, 
actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33)
…

After some run iterations:

2022-02-03T16:09:50.792Z|00923|inc_proc_eng|DBG|node: ct_zones, recompute 
(triggered)
2022-02-03T16:09:50.792Z|00924|main|DBG|assigning ct zone 2 to 
'5601bacd-d246-42f8-aa69-c196f74ca57c_snat'
2022-02-03T16:09:50.792Z|00925|main|DBG|assigning ct zone 3 to 
'96b6fa7d-d974-4260-a58e-c78342fd097d_snat'
2022-02-03T16:09:50.792Z|00926|main|DBG|assigning ct zone 5 to 
'a8e55770-885f-464e-bfa9-9c80eca0ce62_snat'
2022-02-03T16:09:50.792Z|00927|main|DBG|assigning ct zone 6 to 
'2969a6a5-012d-4504-83bf-be5192c4af87_dnat'
2022-02-03T16:09:50.792Z|00928|main|DBG|assigning ct zone 8 to 
'5ae22181-33a5-4596-8143-e5e396fbc17e_dnat'
2022-02-03T16:09:50.792Z|00929|main|DBG|assigning ct zone 9 to 
'03175d56-14c2-4ea9-b671-e427fe6b62e6_dnat'
2022-02-03T16:09:50.792Z|00930|main|DBG|assigning ct zone 12 to 
'11ecffe1-c6f9-4373-948a-fff26113af73_snat'
2022-02-03T16:09:50.792Z|00931|main|DBG|assigning ct zone 13 to 
'd8cd2a0c-bd9e-4472-920e-5443e2a94429_dnat'
2022-02-03T16:09:50.792Z|00932|main|DBG|assigning ct zone 14 to 
'0408d0d9-e7d9-49f7-a0f2-32d611b56686_snat'
2022-02-03T16:09:50.792Z|00933|main|DBG|assigning ct zone 15 to 
'791f9a0b-81e4-4a9e-8f38-7d56332b6237_snat'
2022-02-03T16:09:50.792Z|00934|main|DBG|assigning ct zone 17 to 
'4e59bb4f-1ec6-4375-bb1c-63e03d1133ce_snat'
2022-02-03T16:09:50.792Z|00935|main|DBG|assigning ct zone 18 to 
'3210cc4f-d055-4375-8425-aef0bd4db5fc_snat'
2022-02-03T16:09:50.792Z|00936|main|DBG|assigning ct zone 20 to 
'cfcb7e97-6107-435e-8247-771314968be9_dnat'
2022-02-03T16:09:50.792Z|00937|main|DBG|assigning ct zone 21 to 
'57e00565-1250-4fa7-9c11-a7c1c602bf32_snat'
2022-02-03T16:09:50.792Z|00938|main|DBG|assigning ct zone 23 to 
'aba29611-9a7c-497d-9dbb-2d18314d517b_snat'
2022-02-03T16:09:50.792Z|00939|main|DBG|assigning ct zone 24 to 
'2486c8ff-9eb4-4326-8601-a2e93343164e_dnat'
2022-02-03T16:09:50.792Z|00940|main|DBG|assigning ct zone 26 to 
'b56507c6-ac4b-4de0-b6d3-33936e98ad0b_dnat'
...

In my understanding the reason for this could be an empty binding_lports here 
[0] and local_datapaths here [1]
So, this triggers zones removal and on next iteration the required zones are 
created back.
What do you think?

0: https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L651
1: https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L658


Regards,
Vladislav Odintsov

> On 3 Feb 2022, at 14:19, Vladislav Odintsov <odiv...@gmail.com> wrote:
> 
> Hi Ilya,
> 
> thanks for detailed answer.
> I’ve tried to just "soft-restart" ovn-controller and it seems that I found 
> the problem, that could block OVS.
> 
> What I’ve done:
> 1. ovs-appctl -t ovs-vswitchd vlog/set file:vconn:dbg
> 2. ovn-appctl -t ovn-controller exit --restart
> 3. systemctl restart ovn-controller
> 
> In OVS log I saw a lot of flow-mod lines, and CT_FLUSH_ZONE:
> 
> 2022-02-03T10:43:18.956Z|01574|vconn|DBG|unix#264996: sent (Success): 
> OFPT_HELLO (OF1.5) (xid=0x40a6b):
> version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
> 2022-02-03T10:43:18.956Z|01575|vconn|DBG|unix#264996: received: OFPT_HELLO 
> (OF1.5) (xid=0x1):
> version bitmap: 0x06
> 2022-02-03T10:43:18.956Z|01576|vconn|DBG|unix#264996: negotiated OpenFlow 
> version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
> 2022-02-03T10:43:18.956Z|01577|vconn|DBG|unix#264996: received: 
> OFPT_GET_CONFIG_REQUEST (OF1.5) (xid=0x2):
> 2022-02-03T10:43:18.956Z|01578|vconn|DBG|unix#264996: sent (Success): 
> OFPT_GET_CONFIG_REPLY (OF1.5) (xid=0x2): frags=normal miss_send_len=0
> 2022-02-03T10:43:18.956Z|01579|vconn|DBG|unix#264996: received: 
> NXT_SET_PACKET_IN_FORMAT (OF1.5) (xid=0x3): format=nxt_packet_in2
> 2022-02-03T10:43:18.956Z|01580|vconn|DBG|unix#264996: received: 
> OFPT_SET_CONFIG (OF1.5) (xid=0x4): frags=normal miss_send_len=65535
> 2022-02-03T10:43:19.123Z|01581|vconn|DBG|unix#264997: sent (Success): 
> OFPT_HELLO (OF1.5) (xid=0x40a6c):
> version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
> 2022-02-03T10:43:19.123Z|01582|vconn|DBG|unix#264997: received: OFPT_HELLO 
> (OF1.5) (xid=0x5):
> version bitmap: 0x06
> 2022-02-03T10:43:19.123Z|01583|vconn|DBG|unix#264997: negotiated OpenFlow 
> version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
> 2022-02-03T10:43:19.123Z|01584|vconn|DBG|unix#264998: sent (Success): 
> OFPT_HELLO (OF1.5) (xid=0x40a6d):
> version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
> 2022-02-03T10:43:19.123Z|01585|vconn|DBG|unix#264998: received: OFPT_HELLO 
> (OF1.5) (xid=0x6):
> version bitmap: 0x06
> 2022-02-03T10:43:19.123Z|01586|vconn|DBG|unix#264998: negotiated OpenFlow 
> version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
> 2022-02-03T10:43:19.124Z|01587|vconn|DBG|unix#264998: received: 
> NXT_TLV_TABLE_REQUEST (OF1.5) (xid=0x7):
> 2022-02-03T10:43:19.124Z|01588|vconn|DBG|unix#264998: sent (Success): 
> NXT_TLV_TABLE_REPLY (OF1.5) (xid=0x7):
> max option space=256 max fields=64
> allocated option space=4
> 
> mapping table:
>  class  type  length  match field
> ------  ----  ------  --------------
>  0x102  0x80       4  tun_metadata0
> 2022-02-03T10:43:19.124Z|01589|vconn|DBG|unix#264998: received: OFPT_FLOW_MOD 
> (OF1.5) (xid=0x8): DEL table:255 priority=0 actions=drop
> 2022-02-03T10:43:19.126Z|01590|vconn|DBG|unix#264998: received: 
> OFPT_GROUP_MOD (OF1.5) (xid=0x9):
> DEL group_id=4294967292,type=all
> 2022-02-03T10:43:19.126Z|01591|vconn|DBG|unix#264998: received: 
> OFPT_METER_MOD (OF1.5) (xid=0xa): DEL meter=all bands=
> 2022-02-03T10:43:19.126Z|01592|vconn|DBG|unix#264998: received: 
> OFPT_BUNDLE_CONTROL (OF1.5) (xid=0xb):
> bundle_id=0 type=OPEN_REQUEST flags=atomic ordered
> 2022-02-03T10:43:19.126Z|01593|vconn|DBG|unix#264998: sent (Success): 
> OFPT_BUNDLE_CONTROL (OF1.5) (xid=0xb):
> bundle_id=0 type=OPEN_REPLY flags=0
> 2022-02-03T10:43:19.126Z|01594|vconn|DBG|unix#264998: received: 
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xc):
> bundle_id=0 flags=atomic ordered
> OFPT_FLOW_MOD (OF1.5) (xid=0xc): ADD priority=100,in_port=98 
> actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
> 2022-02-03T10:43:19.126Z|01595|vconn|DBG|unix#264998: received: 
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xd):
> bundle_id=0 flags=atomic ordered
> OFPT_FLOW_MOD (OF1.5) (xid=0xd): ADD priority=100,in_port=96 
> actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
> 2022-02-03T10:43:19.126Z|01596|vconn|DBG|unix#264998: received: 
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xe):
> bundle_id=0 flags=atomic ordered
> OFPT_FLOW_MOD (OF1.5) (xid=0xe): ADD priority=100,in_port=7 
> actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
> 2022-02-03T10:43:19.126Z|01597|vconn|DBG|unix#264998: received: 
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xf):
> bundle_id=0 flags=atomic ordered
> OFPT_FLOW_MOD (OF1.5) (xid=0xf): ADD priority=100,in_port=6 
> actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
> 2022-02-03T10:43:19.126Z|01598|vconn|DBG|unix#264998: received: 
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x10):
> bundle_id=0 flags=atomic ordered
> OFPT_FLOW_MOD (OF1.5) (xid=0x10): ADD priority=100,in_port=55 
> actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38)
> 2022-02-03T10:43:19.126Z|01599|vconn|DBG|unix#264998: received: 
> OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x11):
> bundle_id=0 flags=atomic ordered
> ….
> 
> 
> 2022-02-03T10:43:26.293Z|01758|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6b): zone_id=52
> 2022-02-03T10:43:26.424Z|01759|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6c): zone_id=55
> 2022-02-03T10:43:26.554Z|01760|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6d): zone_id=54
> 2022-02-03T10:43:26.685Z|01761|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6e): zone_id=58
> 2022-02-03T10:43:26.816Z|01762|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6f): zone_id=57
> 2022-02-03T10:43:26.946Z|01763|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x70): zone_id=56
> 2022-02-03T10:43:27.031Z|00013|ovs_rcu(urcu2)|WARN|blocked 1000 ms waiting 
> for main to quiesce
> 2022-02-03T10:43:27.076Z|01764|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x71): zone_id=59
> 2022-02-03T10:43:27.206Z|01765|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x72): zone_id=61
> 2022-02-03T10:43:27.336Z|01766|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x73): zone_id=60
> 2022-02-03T10:43:27.465Z|01767|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x74): zone_id=62
> 2022-02-03T10:43:27.594Z|01768|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x75): zone_id=66
> 2022-02-03T10:43:27.726Z|01769|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x76): zone_id=65
> 2022-02-03T10:43:27.854Z|01770|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x77): zone_id=64
> 2022-02-03T10:43:27.984Z|01771|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x78): zone_id=63
> 2022-02-03T10:43:28.031Z|00014|ovs_rcu(urcu2)|WARN|blocked 2000 ms waiting 
> for main to quiesce
> 2022-02-03T10:43:28.111Z|01772|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x79): zone_id=67
> 2022-02-03T10:43:28.238Z|01773|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7a): zone_id=68
> 2022-02-03T10:43:28.371Z|01774|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7b): zone_id=69
> 2022-02-03T10:43:28.496Z|01775|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7c): zone_id=72
> 2022-02-03T10:43:28.622Z|01776|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7d): zone_id=71
> 2022-02-03T10:43:28.746Z|01777|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7e): zone_id=70
> 2022-02-03T10:43:28.871Z|01778|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7f): zone_id=75
> 2022-02-03T10:43:29.000Z|01779|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x80): zone_id=74
> 2022-02-03T10:43:29.123Z|01780|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x81): zone_id=73
> 2022-02-03T10:43:29.246Z|01781|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x82): zone_id=76
> 2022-02-03T10:43:29.371Z|01782|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x83): zone_id=78
> 2022-02-03T10:43:29.495Z|01783|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x84): zone_id=77
> 2022-02-03T10:43:29.618Z|01784|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x85): zone_id=81
> 2022-02-03T10:43:29.746Z|01785|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x86): zone_id=80
> 2022-02-03T10:43:29.867Z|01786|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x87): zone_id=79
> 2022-02-03T10:43:29.988Z|01787|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x88): zone_id=84
> 2022-02-03T10:43:30.031Z|00015|ovs_rcu(urcu2)|WARN|blocked 4000 ms waiting 
> for main to quiesce
> 2022-02-03T10:43:30.108Z|01788|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x89): zone_id=83
> 2022-02-03T10:43:30.229Z|01789|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8a): zone_id=82
> 2022-02-03T10:43:30.349Z|01790|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8b): zone_id=86
> 2022-02-03T10:43:30.470Z|01791|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8c): zone_id=85
> 2022-02-03T10:43:30.591Z|01792|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8d): zone_id=89
> 2022-02-03T10:43:30.712Z|01793|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8e): zone_id=88
> 2022-02-03T10:43:30.833Z|01794|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8f): zone_id=87
> 2022-02-03T10:43:30.953Z|01795|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x90): zone_id=90
> 2022-02-03T10:43:31.073Z|01796|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x91): zone_id=91
> 2022-02-03T10:43:31.193Z|01797|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x92): zone_id=94
> 2022-02-03T10:43:31.313Z|01798|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x93): zone_id=93
> 2022-02-03T10:43:31.433Z|01799|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x94): zone_id=92
> 2022-02-03T10:43:31.554Z|01800|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x95): zone_id=95
> 2022-02-03T10:43:31.674Z|01801|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x96): zone_id=96
> 2022-02-03T10:43:31.795Z|01802|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x97): zone_id=97
> 2022-02-03T10:43:31.915Z|01803|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x98): zone_id=99
> 2022-02-03T10:43:32.035Z|01804|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x99): zone_id=98
> 2022-02-03T10:43:32.156Z|01805|vconn|DBG|unix#264998: received: 
> NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x9a): zone_id=100
> 2022-02-03T10:43:32.275Z|01806|connmgr|INFO|br-int<->unix#264998: 37 
> flow_mods 13 s ago (36 adds, 1 deletes)
> 2022-02-03T10:43:32.276Z|01807|timeval|WARN|Unreasonably long 6245ms poll 
> interval (826ms user, 5403ms system)
> 2022-02-03T10:43:32.276Z|01808|timeval|WARN|faults: 2 minor, 0 major
> 2022-02-03T10:43:32.276Z|01809|timeval|WARN|context switches: 0 voluntary, 21 
> involuntary
> 2022-02-03T10:43:32.276Z|01810|coverage|INFO|Skipping details of duplicate 
> event coverage for hash=4321e6a4
> 2022-02-03T10:43:32.276Z|01811|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 
> (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
> 2022-02-03T10:43:32.276Z|01812|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 228 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
> usage)
> 
> 
> Please note the time for each NXT_CT_FLUSH_ZONE line took. It flushes just ~8 
> zones per second. Maybe, it’s due to quite big conntrack size:
> I’ve noticed that on the system there were a huge amount of the conntrack 
> records in host’s zone=0.
> There were connections generated from overlay tunnel packets, that were not 
> configured in iptables to be NOTRACK - there was a configuration management 
> issue.
> After I’ve applied iptables rules to not to track geneve, conntrack 
> connection count decreased from ~60k to 3k and CT_ZONE flush became much 
> faster (about 100 zones per 1 second).
> 
> Thanks for pointing to another to openflows problem!
> 
> But my question is why ovn-controller flushes conntrack zones and modifies 
> OFs even if nothing has changed? I mean, a "soft-restart" condition (when 
> ovn-controller was stopped through unixctl socket with "--restart" option. 
> Should I start ovn-controller somehow special?
> When I trigger "soft-restart", I expect to have a minimal impact on the 
> datapath.
> 
> Regards,
> Vladislav Odintsov
> 
>> On 2 Feb 2022, at 18:20, Ilya Maximets <i.maxim...@ovn.org> wrote:
>> 
>> On 2/1/22 23:15, Vladislav Odintsov wrote:
>>> Hi OVN and OVS folks,
>>> 
>>> during OVN upgrade from 20.06 to 21.12 version I’ve faced with a quite long 
>>> datapath downtime (48-50 seconds).
>>> I suppose this downtime was triggered by ovn-controller installing new 
>>> openflow rules to ovs.
>>> There are relatively not so many flows (13-14k).
>>> openvswitch version is 2.13.4.
>>> ovn-openflow-probe-interval is configured to 30.
>>> 
>>> The upgrade procedure was:
>>> 1. Set to ovs external_ids an option to force matching ovn-northd internal 
>>> version (ovn-match-northd-version=true)
>>> 2. Upgrade ovn-controller (to first get ovn-controller, which supports 
>>> version match between controller and northd and not to break datapath after 
>>> upgrading ovn-northd)
>>> 3. Upgrade ovn-northd cluster
>>> 
>>> Network outage timeframe matches to ovs-vswitchd WARN logs. Please see next:
>>> 
>>> 2022-02-01T19:44:32.577Z|00035|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:44:33.577Z|00036|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:44:35.577Z|00037|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:44:39.577Z|00038|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:44:44.396Z|04063|connmgr|INFO|br-int<->unix#272674: 37 
>>> flow_mods 13 s ago (36 adds, 1 deletes)
>>> 2022-02-01T19:44:44.441Z|04064|timeval|WARN|Unreasonably long 12864ms poll 
>>> interval (0ms user, 12833ms system)
>>> 2022-02-01T19:44:44.441Z|04065|timeval|WARN|faults: 287 minor, 0 major
>>> 2022-02-01T19:44:44.441Z|04066|timeval|WARN|context switches: 1 voluntary, 
>>> 18 involuntary
>>> 2022-02-01T19:44:44.441Z|04067|coverage|INFO|Event coverage, avg rate over 
>>> last: 5 seconds, last minute, last hour,  hash=9105faf6:
>>> 2022-02-01T19:44:44.441Z|04068|coverage|INFO|ccmap_expand               
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 1129
>>> 2022-02-01T19:44:44.441Z|04069|coverage|INFO|ccmap_shrink               
>>> 0.0/sec    11.933/sec        0.1989/sec   total: 2363
>>> 2022-02-01T19:44:44.441Z|04070|coverage|INFO|cmap_expand                
>>> 0.0/sec     0.833/sec        0.8811/sec   total: 11218023
>>> 2022-02-01T19:44:44.441Z|04071|coverage|INFO|cmap_shrink                
>>> 8.2/sec    11.950/sec        1.0736/sec   total: 11217475
>>> 2022-02-01T19:44:44.441Z|04072|coverage|INFO|dpif_port_add              
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 229
>>> 2022-02-01T19:44:44.441Z|04073|coverage|INFO|dpif_port_del              
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 144
>>> 2022-02-01T19:44:44.441Z|04074|coverage|INFO|dpif_flow_flush            
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 2
>>> 2022-02-01T19:44:44.441Z|04075|coverage|INFO|dpif_flow_get              
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 23
>>> 2022-02-01T19:44:44.441Z|04076|coverage|INFO|dpif_flow_put             
>>> 83.0/sec    97.900/sec       80.8306/sec   total: 849051748
>>> 2022-02-01T19:44:44.441Z|04077|coverage|INFO|dpif_flow_del            
>>> 147.8/sec    91.733/sec       78.2797/sec   total: 810797137
>>> 2022-02-01T19:44:44.441Z|04078|coverage|INFO|dpif_execute              
>>> 66.6/sec    83.767/sec       87.8397/sec   total: 958697586
>>> 2022-02-01T19:44:44.441Z|04079|coverage|INFO|dpif_execute_with_help     
>>> 0.0/sec     4.250/sec        4.9892/sec   total: 70713908
>>> 2022-02-01T19:44:44.441Z|04080|coverage|INFO|flow_extract             
>>> 169.0/sec   247.017/sec      333.1194/sec   total: 7990486004
>>> 2022-02-01T19:44:44.441Z|04081|coverage|INFO|miniflow_malloc            
>>> 0.0/sec     0.183/sec        0.7983/sec   total: 5871853
>>> 2022-02-01T19:44:44.441Z|04082|coverage|INFO|hindex_pathological        
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 78
>>> 2022-02-01T19:44:44.441Z|04083|coverage|INFO|hindex_expand              
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 11
>>> 2022-02-01T19:44:44.441Z|04084|coverage|INFO|hmap_pathological          
>>> 0.0/sec     0.633/sec        0.9331/sec   total: 16638117
>>> 2022-02-01T19:44:44.441Z|04085|coverage|INFO|hmap_expand                
>>> 0.0/sec   114.867/sec      139.8006/sec   total: 2184263485
>>> 2022-02-01T19:44:44.441Z|04086|coverage|INFO|mac_learning_learned       
>>> 0.0/sec     0.000/sec        0.0003/sec   total: 299
>>> 2022-02-01T19:44:44.441Z|04087|coverage|INFO|mac_learning_expired       
>>> 0.0/sec     0.000/sec        0.0006/sec   total: 117
>>> 2022-02-01T19:44:44.441Z|04088|coverage|INFO|mac_learning_moved         
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 87
>>> 2022-02-01T19:44:44.441Z|04089|coverage|INFO|netdev_get_stats           
>>> 0.0/sec    36.850/sec       43.3306/sec   total: 690673778
>>> 2022-02-01T19:44:44.441Z|04090|coverage|INFO|txn_unchanged              
>>> 0.0/sec     0.100/sec        0.1614/sec   total: 2705864
>>> 2022-02-01T19:44:44.441Z|04091|coverage|INFO|txn_incomplete             
>>> 0.0/sec     0.183/sec        0.2239/sec   total: 3688261
>>> 2022-02-01T19:44:44.441Z|04092|coverage|INFO|txn_success                
>>> 0.0/sec     0.167/sec        0.1953/sec   total: 3236818
>>> 2022-02-01T19:44:44.441Z|04093|coverage|INFO|poll_create_node         
>>> 3930.6/sec  3941.350/sec     5012.2572/sec  total: 115959738162
>>> 2022-02-01T19:44:44.441Z|04094|coverage|INFO|poll_zero_timeout        
>>> 159.4/sec   111.417/sec      103.4078/sec   total: 1181551227
>>> 2022-02-01T19:44:44.441Z|04095|coverage|INFO|rconn_queued               
>>> 0.0/sec     0.050/sec        0.0817/sec   total: 1279558
>>> 2022-02-01T19:44:44.441Z|04096|coverage|INFO|rconn_sent                 
>>> 0.0/sec     0.050/sec        0.0817/sec   total: 1279558
>>> 2022-02-01T19:44:44.441Z|04097|coverage|INFO|seq_change               
>>> 2205.2/sec  2502.783/sec     2874.6258/sec  total: 57335390998
>>> 2022-02-01T19:44:44.441Z|04098|coverage|INFO|pstream_open               
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 5
>>> 2022-02-01T19:44:44.441Z|04099|coverage|INFO|stream_open                
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 1
>>> 2022-02-01T19:44:44.441Z|04100|coverage|INFO|unixctl_received           
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 189
>>> 2022-02-01T19:44:44.441Z|04101|coverage|INFO|unixctl_replied            
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 189
>>> 2022-02-01T19:44:44.441Z|04102|coverage|INFO|util_xalloc              
>>> 7935.8/sec 15234.350/sec    18372.4347/sec  total: 345308271208
>>> 2022-02-01T19:44:44.441Z|04103|coverage|INFO|vconn_received             
>>> 0.0/sec     0.083/sec        0.3556/sec   total: 2018930
>>> 2022-02-01T19:44:44.441Z|04104|coverage|INFO|vconn_sent                 
>>> 0.0/sec     0.067/sec        0.0983/sec   total: 1552067
>>> 2022-02-01T19:44:44.441Z|04105|coverage|INFO|netdev_set_policing        
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 623
>>> 2022-02-01T19:44:44.441Z|04106|coverage|INFO|netdev_get_ifindex         
>>> 0.0/sec     0.100/sec        0.1636/sec   total: 2713928
>>> 2022-02-01T19:44:44.441Z|04107|coverage|INFO|netdev_set_hwaddr          
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 2
>>> 2022-02-01T19:44:44.441Z|04108|coverage|INFO|netdev_get_ethtool         
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 1073
>>> 2022-02-01T19:44:44.441Z|04109|coverage|INFO|netdev_set_ethtool         
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 225
>>> 2022-02-01T19:44:44.441Z|04110|coverage|INFO|nln_changed                
>>> 0.0/sec     0.050/sec        0.0492/sec   total: 819614
>>> 2022-02-01T19:44:44.441Z|04111|coverage|INFO|netlink_received         
>>> 346.6/sec   472.133/sec      575.4597/sec   total: 11288088779
>>> 2022-02-01T19:44:44.441Z|04112|coverage|INFO|netlink_recv_jumbo        
>>> 18.4/sec   113.250/sec      175.1189/sec   total: 3935643810
>>> 2022-02-01T19:44:44.441Z|04113|coverage|INFO|netlink_sent             
>>> 308.4/sec   378.133/sec      380.4267/sec   total: 4753114969
>>> 2022-02-01T19:44:44.441Z|04114|coverage|INFO|ofproto_flush              
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 1
>>> 2022-02-01T19:44:44.441Z|04115|coverage|INFO|ofproto_recv_openflow      
>>> 0.0/sec     0.067/sec        0.3389/sec   total: 1746449
>>> 2022-02-01T19:44:44.441Z|04116|coverage|INFO|ofproto_update_port        
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 814
>>> 2022-02-01T19:44:44.441Z|04117|coverage|INFO|packet_in_overflow         
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 282520
>>> 2022-02-01T19:44:44.441Z|04118|coverage|INFO|rev_reconfigure            
>>> 0.0/sec     0.017/sec        0.0325/sec   total: 539904
>>> 2022-02-01T19:44:44.441Z|04119|coverage|INFO|rev_port_toggled           
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 27
>>> 2022-02-01T19:44:44.441Z|04120|coverage|INFO|rev_flow_table             
>>> 0.0/sec     0.000/sec        0.0106/sec   total: 10876
>>> 2022-02-01T19:44:44.441Z|04121|coverage|INFO|rev_mac_learning           
>>> 0.0/sec     0.000/sec        0.0006/sec   total: 312
>>> 2022-02-01T19:44:44.441Z|04122|coverage|INFO|dumped_duplicate_flow      
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 39204
>>> 2022-02-01T19:44:44.441Z|04123|coverage|INFO|dumped_new_flow            
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 7177
>>> 2022-02-01T19:44:44.441Z|04124|coverage|INFO|handler_duplicate_upcall   
>>> 0.4/sec     1.833/sec        2.2664/sec   total: 34762798
>>> 2022-02-01T19:44:44.441Z|04125|coverage|INFO|upcall_ukey_contention     
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 9
>>> 2022-02-01T19:44:44.441Z|04126|coverage|INFO|upcall_ukey_replace        
>>> 0.2/sec     0.050/sec        0.0447/sec   total: 7927221
>>> 2022-02-01T19:44:44.441Z|04127|coverage|INFO|revalidate_missed_dp_flow   
>>> 0.0/sec     0.000/sec        0.0000/sec   total: 1118
>>> 2022-02-01T19:44:44.441Z|04128|coverage|INFO|xlate_actions            
>>> 123.6/sec   273.183/sec      233.8275/sec   total: 2110727223
>>> 2022-02-01T19:44:44.441Z|04129|coverage|INFO|bridge_reconfigure         
>>> 0.0/sec     0.017/sec        0.0325/sec   total: 539902
>>> 2022-02-01T19:44:44.441Z|04130|coverage|INFO|74 events never hit
>>> 2022-02-01T19:44:44.441Z|04131|poll_loop|INFO|Dropped 3 log messages in 
>>> last 1058229 seconds (most recently, 1058228 seconds ago) due to excessive 
>>> rate
>>> 2022-02-01T19:44:44.441Z|04132|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (84% CPU 
>>> usage)
>>> 2022-02-01T19:44:45.441Z|00039|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:44:46.441Z|00040|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:44:48.441Z|00041|ovs_rcu(urcu7)|WARN|blocked 4001 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:44:52.441Z|00042|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:44:55.109Z|04133|timeval|WARN|Unreasonably long 10668ms poll 
>>> interval (0ms user, 10643ms system)
>>> 2022-02-01T19:44:55.109Z|04134|timeval|WARN|faults: 70 minor, 0 major
>>> 2022-02-01T19:44:55.109Z|04135|timeval|WARN|context switches: 0 voluntary, 
>>> 4 involuntary
>>> 2022-02-01T19:44:55.109Z|04136|coverage|INFO|Skipping details of duplicate 
>>> event coverage for hash=9105faf6
>>> 2022-02-01T19:44:55.109Z|04137|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
>>> 2022-02-01T19:44:55.109Z|04138|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>>> usage)
>>> 2022-02-01T19:44:56.112Z|00043|ovs_rcu(urcu7)|WARN|blocked 1001 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:44:57.112Z|00044|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:44:59.112Z|00045|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:45:03.112Z|00046|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:45:05.249Z|04139|timeval|WARN|Unreasonably long 10140ms poll 
>>> interval (0ms user, 10115ms system)
>>> 2022-02-01T19:45:05.249Z|04140|timeval|WARN|faults: 248 minor, 0 major
>>> 2022-02-01T19:45:05.249Z|04141|timeval|WARN|context switches: 1 voluntary, 
>>> 27 involuntary
>>> 2022-02-01T19:45:05.249Z|04142|coverage|INFO|Skipping details of duplicate 
>>> event coverage for hash=9105faf6
>>> 2022-02-01T19:45:05.249Z|04143|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 14 (NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:05.249Z|04144|poll_loop|INFO|wakeup due to [POLLOUT] on fd 
>>> 235 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:153 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:05.249Z|04145|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:06.250Z|00047|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:45:07.250Z|00048|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:45:09.250Z|00049|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:45:13.250Z|00050|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:45:15.704Z|04146|rconn|WARN|br-int<->unix#272675: connection 
>>> dropped (Broken pipe)
>>> 2022-02-01T19:45:16.714Z|00051|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:45:17.713Z|00052|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:45:19.714Z|00053|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting 
>>> for main to quiesce
>>> 2022-02-01T19:45:23.014Z|04147|timeval|WARN|Unreasonably long 17764ms poll 
>>> interval (0ms user, 17720ms system)
>>> 2022-02-01T19:45:23.014Z|04148|timeval|WARN|faults: 63 minor, 0 major
>>> 2022-02-01T19:45:23.014Z|04149|timeval|WARN|context switches: 2 voluntary, 
>>> 24 involuntary
>>> 2022-02-01T19:45:23.014Z|04150|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
>>> 2022-02-01T19:45:23.014Z|04151|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:23.021Z|04152|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:23.022Z|04153|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:23.022Z|04154|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:23.023Z|04155|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:23.023Z|04156|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:23.023Z|04157|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:23.024Z|04158|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:23.024Z|04159|poll_loop|INFO|wakeup due to [POLLIN] on fd 
>>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU 
>>> usage)
>>> 2022-02-01T19:45:41.274Z|04160|connmgr|INFO|br-int<->unix#272674: 13589 
>>> flow_mods 18 s ago (13509 adds, 80 modifications)
>>> 2022-02-01T19:46:31.285Z|04161|rconn|ERR|br-int<->unix#272673: no response 
>>> to inactivity probe after 60 seconds, disconnecting
>>> 
>>> 
>>> In ovn-controller log:
>>> 
>>> 2022-02-01T19:44:31.267Z|00077|features|INFO|unix:/run/openvswitch/br-int.mgmt:
>>>  connecting to switch
>>> 2022-02-01T19:44:31.267Z|00078|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
>>>  connecting...
>>> 2022-02-01T19:44:31.267Z|00079|ofctrl|INFO|unix:/run/openvswitch/br-int.mgmt:
>>>  connecting to switch
>>> 2022-02-01T19:44:31.267Z|00080|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
>>>  connecting...
>>> 2022-02-01T19:44:31.268Z|00081|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
>>>  connected
>>> 2022-02-01T19:44:31.269Z|00082|rconn|INFO|unix:/run/openvswitch/br-int.mgmt:
>>>  connected
>>> 2022-02-01T19:44:31.284Z|00083|main|INFO|OVS feature set changed, force 
>>> recompute.
>>> 2022-02-01T19:44:34.074Z|00084|main|INFO|OVNSB commit failed, force 
>>> recompute next time.
>>> 2022-02-01T19:44:34.621Z|00085|reconnect|INFO|ssl:<IP>16.0.61:6642: 
>>> connection closed by peer
>>> 2022-02-01T19:44:35.622Z|00086|reconnect|INFO|ssl:<IP>:6642: connecting...
>>> 2022-02-01T19:44:35.672Z|00087|reconnect|INFO|ssl:<IP>:6642: connected
>>> 2022-02-01T19:45:01.470Z|00088|lflow_cache|INFO|Detected cache inactivity 
>>> (last active 30001 ms ago): trimming cache
>>> 2022-02-01T19:45:23.351Z|00089|binding|INFO|Setting lport <lport> 
>>> ovn-installed in OVS
>>> 2022-02-01T19:45:23.351Z|00090|binding|INFO|Setting lport <lport> up in 
>>> Southbound
>>> <there are 80 port bindings...>
>>> 2022-02-01T19:45:23.352Z|00245|binding|INFO|Setting lport <lportN> 
>>> ovn-installed in OVS
>>> 2022-02-01T19:45:23.352Z|00246|binding|INFO|Setting lport <lportN> up in 
>>> Southbound
>>> 2022-02-01T19:45:53.323Z|00247|lflow_cache|INFO|Detected cache inactivity 
>>> (last active 30004 ms ago): trimming cache
>>> 
>>> 
>>> I’ve got some questions:
>>> 1. What OVS is doing while `waiting for main to quiesce` ? I noticed this 
>>> message appeares each time downtime occurs.
>> 
>> 'Unreasonably long 17764ms poll interval (0ms user, 17720ms system)'
>> is very concerning.  It means that the main thread spent almost 18
>> seconds inside the kernel executing some syscall, maybe several syscalls.
>> Main thread is not responsible for flow installation to the kernel, IIRC,
>> so flow updates should not be a case here.  You need to figure out what
>> are these syscalls and why they are taking so long to execute inside
>> the kernel.
>> 
>> Maybe you have some suspicious kernel logs at the same time?
>> You may also try 'perf' or some other tracing tools to figure out
>> what is going on.
>> 
>>> 2. Is there any hint to minimize such downtime while reinstalling OF?
>> 
>> Installation of rules is an unlikely root cause here.
>> 
>>> 3. Could there be a situation where ovn-northd after upgrade still 
>>> generating new logical flows, ovn-controller constantly gets updates from 
>>> SB and triggers update to OF in a loop while doesn’t finish getting final 
>>> lflows "view"? So, could it start installing new OF while having in memory 
>>> lflows which are not consistent at that moment?
>> 
>> northd generates all updates in a single transaction, so SB is updated
>> atomically. SB will send this update as a single update message, so this
>> is also atomic.  ovn-controller will install all OF rules as an OF bundle,
>> so that is also atomic.
>> 
>>> 4. What does `br-int<->unix#272675: connection dropped (Broken pipe)` error 
>>> means? What type of connection is broken? Is it normal?
>> 
>> This looks like a controller connection, i.e. connection between
>> ovn-controller with ovs-vswitchd.  Doesn't look very good, but
>> disconnection seems to be caused by inactivity due to overload
>> on one of the sides of the connection.
>> 
>>> 5. Do you see anything else abnormal in the logs?
>> 
>> '0ms user, 17720ms system' is the most abnormal, AFAICT.
>> 
>> Best regards, Ilya Maximets.
>> 
>> _______________________________________________
>> dev mailing list
>> d...@openvswitch.org <mailto:d...@openvswitch.org>
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev 
>> <https://mail.openvswitch.org/mailman/listinfo/ovs-dev>
> _______________________________________________
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to