UPD: I’ve checked ovn-controller dbg logs and checked against its code. Not sure, but looks like some bug in first processing of ct_zones. This logfile is from 20.06 version, but it has similar behaviour on v21.12:
start ovn-controller 2022-02-03T16:09:50.500Z|00032|jsonrpc|DBG|unix:/run/openvswitch/db.sock: received reply, result={}, id=3 2022-02-03T16:09:50.500Z|00033|ovsdb_idl|DBG|unix:/run/openvswitch/db.sock: received unexpected reply message: {"error":null,"id":3,"result":{}} 2022-02-03T16:09:50.500Z|00034|main|DBG|restoring ct zone 25 for 'vnet37798-gateway-vif' 2022-02-03T16:09:50.500Z|00035|main|DBG|restoring ct zone 8 for '5ae22181-33a5-4596-8143-e5e396fbc17e_dnat' 2022-02-03T16:09:50.500Z|00036|main|DBG|restoring ct zone 57 for '3210cc4f-d055-4375-8425-aef0bd4db5fc_dnat' 2022-02-03T16:09:50.500Z|00037|main|DBG|restoring ct zone 10 for 'vnet38045-gateway-vif' 2022-02-03T16:09:50.500Z|00038|main|DBG|restoring ct zone 60 for '910c751a-3d86-420a-9606-3f751de4e8bf_dnat' 2022-02-03T16:09:50.500Z|00039|main|DBG|restoring ct zone 146 for '30fda34a-7614-487a-9b17-258781c06e31_snat' 2022-02-03T16:09:50.500Z|00040|main|DBG|restoring ct zone 171 for 'de72957f-b877-4c13-8b3d-9fe9b3abcc47_snat' 2022-02-03T16:09:50.500Z|00041|main|DBG|restoring ct zone 184 for 'vnet36210-gateway-vif' 2022-02-03T16:09:50.500Z|00042|main|DBG|restoring ct zone 120 for 'fdc23a91-fbcc-4fc4-9edd-e2001a827a2a_snat' 2022-02-03T16:09:50.500Z|00043|main|DBG|restoring ct zone 102 for '2c9ea3e6-7ba1-445e-b707-1873d9a85e00_dnat' 2022-02-03T16:09:50.500Z|00044|main|DBG|restoring ct zone 152 for 'fa746f3c-1d4e-450a-80c9-e6cd39676bbc_snat' 2022-02-03T16:09:50.500Z|00045|main|DBG|restoring ct zone 92 for '8e81ebd1-f43d-4d90-85d9-dc9e85690f58_dnat' 2022-02-03T16:09:50.500Z|00046|main|DBG|restoring ct zone 135 for 'ad815732-f341-4a6d-bdce-184dc6c53edf_snat' 2022-02-03T16:09:50.500Z|00047|main|DBG|restoring ct zone 100 for 'vnet35091-gateway-vif' 2022-02-03T16:09:50.500Z|00048|main|DBG|restoring ct zone 162 for '1078be52-1141-4060-b98d-23d96464b8bf_dnat' 2022-02-03T16:09:50.500Z|00049|main|DBG|restoring ct zone 30 for '1078be52-1141-4060-b98d-23d96464b8bf_snat' 2022-02-03T16:09:50.500Z|00050|main|DBG|restoring ct zone 130 for '605f08c3-f0be-4c0b-a721-5955bf2852d9_snat' 2022-02-03T16:09:50.500Z|00051|main|DBG|restoring ct zone 121 for '27e28d02-ff81-4faa-b79b-8b6174e7ceea_dnat' 2022-02-03T16:09:50.500Z|00052|main|DBG|restoring ct zone 70 for 'vnet35515-gateway-vif' 2022-02-03T16:09:50.500Z|00053|main|DBG|restoring ct zone 26 for 'b56507c6-ac4b-4de0-b6d3-33936e98ad0b_dnat' 2022-02-03T16:09:50.500Z|00054|main|DBG|restoring ct zone 114 for 'c7f1776b-0a1f-4434-bac5-db2b59f3363b_snat' … 2022-02-03T16:09:50.501Z|00226|inc_proc_eng|DBG|Initializing new run … 2022-02-03T16:09:50.730Z|00287|inc_proc_eng|DBG|Initializing new run … 2022-02-03T16:09:50.739Z|00311|inc_proc_eng|DBG|Initializing new run <some updates from SB DB> … 2022-02-03T16:09:50.754Z|00347|inc_proc_eng|DBG|Initializing new run … 2022-02-03T16:09:50.779Z|00408|main|DBG|removing ct zone 2 for '5601bacd-d246-42f8-aa69-c196f74ca57c_snat' 2022-02-03T16:09:50.779Z|00409|main|DBG|removing ct zone 3 for '96b6fa7d-d974-4260-a58e-c78342fd097d_snat' 2022-02-03T16:09:50.779Z|00410|main|DBG|removing ct zone 5 for 'a8e55770-885f-464e-bfa9-9c80eca0ce62_snat' 2022-02-03T16:09:50.779Z|00411|main|DBG|removing ct zone 6 for '2969a6a5-012d-4504-83bf-be5192c4af87_dnat' 2022-02-03T16:09:50.779Z|00412|main|DBG|removing ct zone 9 for '03175d56-14c2-4ea9-b671-e427fe6b62e6_dnat' 2022-02-03T16:09:50.779Z|00413|main|DBG|removing ct zone 8 for '5ae22181-33a5-4596-8143-e5e396fbc17e_dnat' 2022-02-03T16:09:50.779Z|00414|main|DBG|removing ct zone 12 for '11ecffe1-c6f9-4373-948a-fff26113af73_snat' 2022-02-03T16:09:50.779Z|00415|main|DBG|removing ct zone 15 for '791f9a0b-81e4-4a9e-8f38-7d56332b6237_snat' 2022-02-03T16:09:50.779Z|00416|main|DBG|removing ct zone 14 for '0408d0d9-e7d9-49f7-a0f2-32d611b56686_snat' 2022-02-03T16:09:50.779Z|00417|main|DBG|removing ct zone 13 for 'd8cd2a0c-bd9e-4472-920e-5443e2a94429_dnat' 2022-02-03T16:09:50.779Z|00418|main|DBG|removing ct zone 17 for '4e59bb4f-1ec6-4375-bb1c-63e03d1133ce_snat' 2022-02-03T16:09:50.779Z|00419|main|DBG|removing ct zone 18 for '3210cc4f-d055-4375-8425-aef0bd4db5fc_snat' 2022-02-03T16:09:50.779Z|00420|main|DBG|removing ct zone 20 for 'cfcb7e97-6107-435e-8247-771314968be9_dnat' 2022-02-03T16:09:50.779Z|00421|main|DBG|removing ct zone 23 for 'aba29611-9a7c-497d-9dbb-2d18314d517b_snat' 2022-02-03T16:09:50.779Z|00422|main|DBG|removing ct zone 21 for '57e00565-1250-4fa7-9c11-a7c1c602bf32_snat' 2022-02-03T16:09:50.779Z|00423|main|DBG|removing ct zone 26 for 'b56507c6-ac4b-4de0-b6d3-33936e98ad0b_dnat' 2022-02-03T16:09:50.779Z|00424|main|DBG|removing ct zone 24 for '2486c8ff-9eb4-4326-8601-a2e93343164e_dnat' 2022-02-03T16:09:50.779Z|00425|main|DBG|removing ct zone 27 for '278af800-e548-488a-9f5a-4f965718b56d_snat' 2022-02-03T16:09:50.779Z|00426|main|DBG|removing ct zone 30 for '1078be52-1141-4060-b98d-23d96464b8bf_snat' … ... 2022-02-03T16:09:50.780Z|00546|inc_proc_eng|DBG|node: flow_output, recompute (forced) 2022-02-03T16:09:50.780Z|00547|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, table_id=0, priority=100, in_port=22, actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33) 2022-02-03T16:09:50.780Z|00548|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, table_id=0, priority=100, in_port=9, actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33) 2022-02-03T16:09:50.780Z|00549|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, table_id=0, priority=100, in_port=60, actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33) 2022-02-03T16:09:50.780Z|00550|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, table_id=0, priority=100, in_port=18, actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33) 2022-02-03T16:09:50.780Z|00551|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, table_id=0, priority=100, in_port=37, actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33) 2022-02-03T16:09:50.780Z|00552|ofctrl|DBG|ofctrl_add_flow flow: cookie=0, table_id=0, priority=100, in_port=1, actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,33) … After some run iterations: 2022-02-03T16:09:50.792Z|00923|inc_proc_eng|DBG|node: ct_zones, recompute (triggered) 2022-02-03T16:09:50.792Z|00924|main|DBG|assigning ct zone 2 to '5601bacd-d246-42f8-aa69-c196f74ca57c_snat' 2022-02-03T16:09:50.792Z|00925|main|DBG|assigning ct zone 3 to '96b6fa7d-d974-4260-a58e-c78342fd097d_snat' 2022-02-03T16:09:50.792Z|00926|main|DBG|assigning ct zone 5 to 'a8e55770-885f-464e-bfa9-9c80eca0ce62_snat' 2022-02-03T16:09:50.792Z|00927|main|DBG|assigning ct zone 6 to '2969a6a5-012d-4504-83bf-be5192c4af87_dnat' 2022-02-03T16:09:50.792Z|00928|main|DBG|assigning ct zone 8 to '5ae22181-33a5-4596-8143-e5e396fbc17e_dnat' 2022-02-03T16:09:50.792Z|00929|main|DBG|assigning ct zone 9 to '03175d56-14c2-4ea9-b671-e427fe6b62e6_dnat' 2022-02-03T16:09:50.792Z|00930|main|DBG|assigning ct zone 12 to '11ecffe1-c6f9-4373-948a-fff26113af73_snat' 2022-02-03T16:09:50.792Z|00931|main|DBG|assigning ct zone 13 to 'd8cd2a0c-bd9e-4472-920e-5443e2a94429_dnat' 2022-02-03T16:09:50.792Z|00932|main|DBG|assigning ct zone 14 to '0408d0d9-e7d9-49f7-a0f2-32d611b56686_snat' 2022-02-03T16:09:50.792Z|00933|main|DBG|assigning ct zone 15 to '791f9a0b-81e4-4a9e-8f38-7d56332b6237_snat' 2022-02-03T16:09:50.792Z|00934|main|DBG|assigning ct zone 17 to '4e59bb4f-1ec6-4375-bb1c-63e03d1133ce_snat' 2022-02-03T16:09:50.792Z|00935|main|DBG|assigning ct zone 18 to '3210cc4f-d055-4375-8425-aef0bd4db5fc_snat' 2022-02-03T16:09:50.792Z|00936|main|DBG|assigning ct zone 20 to 'cfcb7e97-6107-435e-8247-771314968be9_dnat' 2022-02-03T16:09:50.792Z|00937|main|DBG|assigning ct zone 21 to '57e00565-1250-4fa7-9c11-a7c1c602bf32_snat' 2022-02-03T16:09:50.792Z|00938|main|DBG|assigning ct zone 23 to 'aba29611-9a7c-497d-9dbb-2d18314d517b_snat' 2022-02-03T16:09:50.792Z|00939|main|DBG|assigning ct zone 24 to '2486c8ff-9eb4-4326-8601-a2e93343164e_dnat' 2022-02-03T16:09:50.792Z|00940|main|DBG|assigning ct zone 26 to 'b56507c6-ac4b-4de0-b6d3-33936e98ad0b_dnat' ... In my understanding the reason for this could be an empty binding_lports here [0] and local_datapaths here [1] So, this triggers zones removal and on next iteration the required zones are created back. What do you think? 0: https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L651 1: https://github.com/ovn-org/ovn/blob/v21.12.0/controller/ovn-controller.c#L658 Regards, Vladislav Odintsov > On 3 Feb 2022, at 14:19, Vladislav Odintsov <odiv...@gmail.com> wrote: > > Hi Ilya, > > thanks for detailed answer. > I’ve tried to just "soft-restart" ovn-controller and it seems that I found > the problem, that could block OVS. > > What I’ve done: > 1. ovs-appctl -t ovs-vswitchd vlog/set file:vconn:dbg > 2. ovn-appctl -t ovn-controller exit --restart > 3. systemctl restart ovn-controller > > In OVS log I saw a lot of flow-mod lines, and CT_FLUSH_ZONE: > > 2022-02-03T10:43:18.956Z|01574|vconn|DBG|unix#264996: sent (Success): > OFPT_HELLO (OF1.5) (xid=0x40a6b): > version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06 > 2022-02-03T10:43:18.956Z|01575|vconn|DBG|unix#264996: received: OFPT_HELLO > (OF1.5) (xid=0x1): > version bitmap: 0x06 > 2022-02-03T10:43:18.956Z|01576|vconn|DBG|unix#264996: negotiated OpenFlow > version 0x06 (we support version 0x06 and earlier, peer supports version 0x06) > 2022-02-03T10:43:18.956Z|01577|vconn|DBG|unix#264996: received: > OFPT_GET_CONFIG_REQUEST (OF1.5) (xid=0x2): > 2022-02-03T10:43:18.956Z|01578|vconn|DBG|unix#264996: sent (Success): > OFPT_GET_CONFIG_REPLY (OF1.5) (xid=0x2): frags=normal miss_send_len=0 > 2022-02-03T10:43:18.956Z|01579|vconn|DBG|unix#264996: received: > NXT_SET_PACKET_IN_FORMAT (OF1.5) (xid=0x3): format=nxt_packet_in2 > 2022-02-03T10:43:18.956Z|01580|vconn|DBG|unix#264996: received: > OFPT_SET_CONFIG (OF1.5) (xid=0x4): frags=normal miss_send_len=65535 > 2022-02-03T10:43:19.123Z|01581|vconn|DBG|unix#264997: sent (Success): > OFPT_HELLO (OF1.5) (xid=0x40a6c): > version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06 > 2022-02-03T10:43:19.123Z|01582|vconn|DBG|unix#264997: received: OFPT_HELLO > (OF1.5) (xid=0x5): > version bitmap: 0x06 > 2022-02-03T10:43:19.123Z|01583|vconn|DBG|unix#264997: negotiated OpenFlow > version 0x06 (we support version 0x06 and earlier, peer supports version 0x06) > 2022-02-03T10:43:19.123Z|01584|vconn|DBG|unix#264998: sent (Success): > OFPT_HELLO (OF1.5) (xid=0x40a6d): > version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06 > 2022-02-03T10:43:19.123Z|01585|vconn|DBG|unix#264998: received: OFPT_HELLO > (OF1.5) (xid=0x6): > version bitmap: 0x06 > 2022-02-03T10:43:19.123Z|01586|vconn|DBG|unix#264998: negotiated OpenFlow > version 0x06 (we support version 0x06 and earlier, peer supports version 0x06) > 2022-02-03T10:43:19.124Z|01587|vconn|DBG|unix#264998: received: > NXT_TLV_TABLE_REQUEST (OF1.5) (xid=0x7): > 2022-02-03T10:43:19.124Z|01588|vconn|DBG|unix#264998: sent (Success): > NXT_TLV_TABLE_REPLY (OF1.5) (xid=0x7): > max option space=256 max fields=64 > allocated option space=4 > > mapping table: > class type length match field > ------ ---- ------ -------------- > 0x102 0x80 4 tun_metadata0 > 2022-02-03T10:43:19.124Z|01589|vconn|DBG|unix#264998: received: OFPT_FLOW_MOD > (OF1.5) (xid=0x8): DEL table:255 priority=0 actions=drop > 2022-02-03T10:43:19.126Z|01590|vconn|DBG|unix#264998: received: > OFPT_GROUP_MOD (OF1.5) (xid=0x9): > DEL group_id=4294967292,type=all > 2022-02-03T10:43:19.126Z|01591|vconn|DBG|unix#264998: received: > OFPT_METER_MOD (OF1.5) (xid=0xa): DEL meter=all bands= > 2022-02-03T10:43:19.126Z|01592|vconn|DBG|unix#264998: received: > OFPT_BUNDLE_CONTROL (OF1.5) (xid=0xb): > bundle_id=0 type=OPEN_REQUEST flags=atomic ordered > 2022-02-03T10:43:19.126Z|01593|vconn|DBG|unix#264998: sent (Success): > OFPT_BUNDLE_CONTROL (OF1.5) (xid=0xb): > bundle_id=0 type=OPEN_REPLY flags=0 > 2022-02-03T10:43:19.126Z|01594|vconn|DBG|unix#264998: received: > OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xc): > bundle_id=0 flags=atomic ordered > OFPT_FLOW_MOD (OF1.5) (xid=0xc): ADD priority=100,in_port=98 > actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38) > 2022-02-03T10:43:19.126Z|01595|vconn|DBG|unix#264998: received: > OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xd): > bundle_id=0 flags=atomic ordered > OFPT_FLOW_MOD (OF1.5) (xid=0xd): ADD priority=100,in_port=96 > actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38) > 2022-02-03T10:43:19.126Z|01596|vconn|DBG|unix#264998: received: > OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xe): > bundle_id=0 flags=atomic ordered > OFPT_FLOW_MOD (OF1.5) (xid=0xe): ADD priority=100,in_port=7 > actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38) > 2022-02-03T10:43:19.126Z|01597|vconn|DBG|unix#264998: received: > OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0xf): > bundle_id=0 flags=atomic ordered > OFPT_FLOW_MOD (OF1.5) (xid=0xf): ADD priority=100,in_port=6 > actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38) > 2022-02-03T10:43:19.126Z|01598|vconn|DBG|unix#264998: received: > OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x10): > bundle_id=0 flags=atomic ordered > OFPT_FLOW_MOD (OF1.5) (xid=0x10): ADD priority=100,in_port=55 > actions=move:NXM_NX_TUN_ID[40..54]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_ID[24..39]->NXM_NX_REG15[0..15],move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],resubmit(,38) > 2022-02-03T10:43:19.126Z|01599|vconn|DBG|unix#264998: received: > OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x11): > bundle_id=0 flags=atomic ordered > …. > > > 2022-02-03T10:43:26.293Z|01758|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6b): zone_id=52 > 2022-02-03T10:43:26.424Z|01759|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6c): zone_id=55 > 2022-02-03T10:43:26.554Z|01760|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6d): zone_id=54 > 2022-02-03T10:43:26.685Z|01761|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6e): zone_id=58 > 2022-02-03T10:43:26.816Z|01762|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x6f): zone_id=57 > 2022-02-03T10:43:26.946Z|01763|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x70): zone_id=56 > 2022-02-03T10:43:27.031Z|00013|ovs_rcu(urcu2)|WARN|blocked 1000 ms waiting > for main to quiesce > 2022-02-03T10:43:27.076Z|01764|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x71): zone_id=59 > 2022-02-03T10:43:27.206Z|01765|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x72): zone_id=61 > 2022-02-03T10:43:27.336Z|01766|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x73): zone_id=60 > 2022-02-03T10:43:27.465Z|01767|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x74): zone_id=62 > 2022-02-03T10:43:27.594Z|01768|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x75): zone_id=66 > 2022-02-03T10:43:27.726Z|01769|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x76): zone_id=65 > 2022-02-03T10:43:27.854Z|01770|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x77): zone_id=64 > 2022-02-03T10:43:27.984Z|01771|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x78): zone_id=63 > 2022-02-03T10:43:28.031Z|00014|ovs_rcu(urcu2)|WARN|blocked 2000 ms waiting > for main to quiesce > 2022-02-03T10:43:28.111Z|01772|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x79): zone_id=67 > 2022-02-03T10:43:28.238Z|01773|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7a): zone_id=68 > 2022-02-03T10:43:28.371Z|01774|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7b): zone_id=69 > 2022-02-03T10:43:28.496Z|01775|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7c): zone_id=72 > 2022-02-03T10:43:28.622Z|01776|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7d): zone_id=71 > 2022-02-03T10:43:28.746Z|01777|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7e): zone_id=70 > 2022-02-03T10:43:28.871Z|01778|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x7f): zone_id=75 > 2022-02-03T10:43:29.000Z|01779|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x80): zone_id=74 > 2022-02-03T10:43:29.123Z|01780|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x81): zone_id=73 > 2022-02-03T10:43:29.246Z|01781|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x82): zone_id=76 > 2022-02-03T10:43:29.371Z|01782|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x83): zone_id=78 > 2022-02-03T10:43:29.495Z|01783|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x84): zone_id=77 > 2022-02-03T10:43:29.618Z|01784|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x85): zone_id=81 > 2022-02-03T10:43:29.746Z|01785|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x86): zone_id=80 > 2022-02-03T10:43:29.867Z|01786|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x87): zone_id=79 > 2022-02-03T10:43:29.988Z|01787|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x88): zone_id=84 > 2022-02-03T10:43:30.031Z|00015|ovs_rcu(urcu2)|WARN|blocked 4000 ms waiting > for main to quiesce > 2022-02-03T10:43:30.108Z|01788|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x89): zone_id=83 > 2022-02-03T10:43:30.229Z|01789|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8a): zone_id=82 > 2022-02-03T10:43:30.349Z|01790|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8b): zone_id=86 > 2022-02-03T10:43:30.470Z|01791|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8c): zone_id=85 > 2022-02-03T10:43:30.591Z|01792|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8d): zone_id=89 > 2022-02-03T10:43:30.712Z|01793|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8e): zone_id=88 > 2022-02-03T10:43:30.833Z|01794|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8f): zone_id=87 > 2022-02-03T10:43:30.953Z|01795|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x90): zone_id=90 > 2022-02-03T10:43:31.073Z|01796|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x91): zone_id=91 > 2022-02-03T10:43:31.193Z|01797|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x92): zone_id=94 > 2022-02-03T10:43:31.313Z|01798|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x93): zone_id=93 > 2022-02-03T10:43:31.433Z|01799|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x94): zone_id=92 > 2022-02-03T10:43:31.554Z|01800|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x95): zone_id=95 > 2022-02-03T10:43:31.674Z|01801|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x96): zone_id=96 > 2022-02-03T10:43:31.795Z|01802|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x97): zone_id=97 > 2022-02-03T10:43:31.915Z|01803|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x98): zone_id=99 > 2022-02-03T10:43:32.035Z|01804|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x99): zone_id=98 > 2022-02-03T10:43:32.156Z|01805|vconn|DBG|unix#264998: received: > NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x9a): zone_id=100 > 2022-02-03T10:43:32.275Z|01806|connmgr|INFO|br-int<->unix#264998: 37 > flow_mods 13 s ago (36 adds, 1 deletes) > 2022-02-03T10:43:32.276Z|01807|timeval|WARN|Unreasonably long 6245ms poll > interval (826ms user, 5403ms system) > 2022-02-03T10:43:32.276Z|01808|timeval|WARN|faults: 2 minor, 0 major > 2022-02-03T10:43:32.276Z|01809|timeval|WARN|context switches: 0 voluntary, 21 > involuntary > 2022-02-03T10:43:32.276Z|01810|coverage|INFO|Skipping details of duplicate > event coverage for hash=4321e6a4 > 2022-02-03T10:43:32.276Z|01811|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 > (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage) > 2022-02-03T10:43:32.276Z|01812|poll_loop|INFO|wakeup due to [POLLIN] on fd > 228 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU > usage) > > > Please note the time for each NXT_CT_FLUSH_ZONE line took. It flushes just ~8 > zones per second. Maybe, it’s due to quite big conntrack size: > I’ve noticed that on the system there were a huge amount of the conntrack > records in host’s zone=0. > There were connections generated from overlay tunnel packets, that were not > configured in iptables to be NOTRACK - there was a configuration management > issue. > After I’ve applied iptables rules to not to track geneve, conntrack > connection count decreased from ~60k to 3k and CT_ZONE flush became much > faster (about 100 zones per 1 second). > > Thanks for pointing to another to openflows problem! > > But my question is why ovn-controller flushes conntrack zones and modifies > OFs even if nothing has changed? I mean, a "soft-restart" condition (when > ovn-controller was stopped through unixctl socket with "--restart" option. > Should I start ovn-controller somehow special? > When I trigger "soft-restart", I expect to have a minimal impact on the > datapath. > > Regards, > Vladislav Odintsov > >> On 2 Feb 2022, at 18:20, Ilya Maximets <i.maxim...@ovn.org> wrote: >> >> On 2/1/22 23:15, Vladislav Odintsov wrote: >>> Hi OVN and OVS folks, >>> >>> during OVN upgrade from 20.06 to 21.12 version I’ve faced with a quite long >>> datapath downtime (48-50 seconds). >>> I suppose this downtime was triggered by ovn-controller installing new >>> openflow rules to ovs. >>> There are relatively not so many flows (13-14k). >>> openvswitch version is 2.13.4. >>> ovn-openflow-probe-interval is configured to 30. >>> >>> The upgrade procedure was: >>> 1. Set to ovs external_ids an option to force matching ovn-northd internal >>> version (ovn-match-northd-version=true) >>> 2. Upgrade ovn-controller (to first get ovn-controller, which supports >>> version match between controller and northd and not to break datapath after >>> upgrading ovn-northd) >>> 3. Upgrade ovn-northd cluster >>> >>> Network outage timeframe matches to ovs-vswitchd WARN logs. Please see next: >>> >>> 2022-02-01T19:44:32.577Z|00035|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:44:33.577Z|00036|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:44:35.577Z|00037|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:44:39.577Z|00038|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:44:44.396Z|04063|connmgr|INFO|br-int<->unix#272674: 37 >>> flow_mods 13 s ago (36 adds, 1 deletes) >>> 2022-02-01T19:44:44.441Z|04064|timeval|WARN|Unreasonably long 12864ms poll >>> interval (0ms user, 12833ms system) >>> 2022-02-01T19:44:44.441Z|04065|timeval|WARN|faults: 287 minor, 0 major >>> 2022-02-01T19:44:44.441Z|04066|timeval|WARN|context switches: 1 voluntary, >>> 18 involuntary >>> 2022-02-01T19:44:44.441Z|04067|coverage|INFO|Event coverage, avg rate over >>> last: 5 seconds, last minute, last hour, hash=9105faf6: >>> 2022-02-01T19:44:44.441Z|04068|coverage|INFO|ccmap_expand >>> 0.0/sec 0.000/sec 0.0000/sec total: 1129 >>> 2022-02-01T19:44:44.441Z|04069|coverage|INFO|ccmap_shrink >>> 0.0/sec 11.933/sec 0.1989/sec total: 2363 >>> 2022-02-01T19:44:44.441Z|04070|coverage|INFO|cmap_expand >>> 0.0/sec 0.833/sec 0.8811/sec total: 11218023 >>> 2022-02-01T19:44:44.441Z|04071|coverage|INFO|cmap_shrink >>> 8.2/sec 11.950/sec 1.0736/sec total: 11217475 >>> 2022-02-01T19:44:44.441Z|04072|coverage|INFO|dpif_port_add >>> 0.0/sec 0.000/sec 0.0000/sec total: 229 >>> 2022-02-01T19:44:44.441Z|04073|coverage|INFO|dpif_port_del >>> 0.0/sec 0.000/sec 0.0000/sec total: 144 >>> 2022-02-01T19:44:44.441Z|04074|coverage|INFO|dpif_flow_flush >>> 0.0/sec 0.000/sec 0.0000/sec total: 2 >>> 2022-02-01T19:44:44.441Z|04075|coverage|INFO|dpif_flow_get >>> 0.0/sec 0.000/sec 0.0000/sec total: 23 >>> 2022-02-01T19:44:44.441Z|04076|coverage|INFO|dpif_flow_put >>> 83.0/sec 97.900/sec 80.8306/sec total: 849051748 >>> 2022-02-01T19:44:44.441Z|04077|coverage|INFO|dpif_flow_del >>> 147.8/sec 91.733/sec 78.2797/sec total: 810797137 >>> 2022-02-01T19:44:44.441Z|04078|coverage|INFO|dpif_execute >>> 66.6/sec 83.767/sec 87.8397/sec total: 958697586 >>> 2022-02-01T19:44:44.441Z|04079|coverage|INFO|dpif_execute_with_help >>> 0.0/sec 4.250/sec 4.9892/sec total: 70713908 >>> 2022-02-01T19:44:44.441Z|04080|coverage|INFO|flow_extract >>> 169.0/sec 247.017/sec 333.1194/sec total: 7990486004 >>> 2022-02-01T19:44:44.441Z|04081|coverage|INFO|miniflow_malloc >>> 0.0/sec 0.183/sec 0.7983/sec total: 5871853 >>> 2022-02-01T19:44:44.441Z|04082|coverage|INFO|hindex_pathological >>> 0.0/sec 0.000/sec 0.0000/sec total: 78 >>> 2022-02-01T19:44:44.441Z|04083|coverage|INFO|hindex_expand >>> 0.0/sec 0.000/sec 0.0000/sec total: 11 >>> 2022-02-01T19:44:44.441Z|04084|coverage|INFO|hmap_pathological >>> 0.0/sec 0.633/sec 0.9331/sec total: 16638117 >>> 2022-02-01T19:44:44.441Z|04085|coverage|INFO|hmap_expand >>> 0.0/sec 114.867/sec 139.8006/sec total: 2184263485 >>> 2022-02-01T19:44:44.441Z|04086|coverage|INFO|mac_learning_learned >>> 0.0/sec 0.000/sec 0.0003/sec total: 299 >>> 2022-02-01T19:44:44.441Z|04087|coverage|INFO|mac_learning_expired >>> 0.0/sec 0.000/sec 0.0006/sec total: 117 >>> 2022-02-01T19:44:44.441Z|04088|coverage|INFO|mac_learning_moved >>> 0.0/sec 0.000/sec 0.0000/sec total: 87 >>> 2022-02-01T19:44:44.441Z|04089|coverage|INFO|netdev_get_stats >>> 0.0/sec 36.850/sec 43.3306/sec total: 690673778 >>> 2022-02-01T19:44:44.441Z|04090|coverage|INFO|txn_unchanged >>> 0.0/sec 0.100/sec 0.1614/sec total: 2705864 >>> 2022-02-01T19:44:44.441Z|04091|coverage|INFO|txn_incomplete >>> 0.0/sec 0.183/sec 0.2239/sec total: 3688261 >>> 2022-02-01T19:44:44.441Z|04092|coverage|INFO|txn_success >>> 0.0/sec 0.167/sec 0.1953/sec total: 3236818 >>> 2022-02-01T19:44:44.441Z|04093|coverage|INFO|poll_create_node >>> 3930.6/sec 3941.350/sec 5012.2572/sec total: 115959738162 >>> 2022-02-01T19:44:44.441Z|04094|coverage|INFO|poll_zero_timeout >>> 159.4/sec 111.417/sec 103.4078/sec total: 1181551227 >>> 2022-02-01T19:44:44.441Z|04095|coverage|INFO|rconn_queued >>> 0.0/sec 0.050/sec 0.0817/sec total: 1279558 >>> 2022-02-01T19:44:44.441Z|04096|coverage|INFO|rconn_sent >>> 0.0/sec 0.050/sec 0.0817/sec total: 1279558 >>> 2022-02-01T19:44:44.441Z|04097|coverage|INFO|seq_change >>> 2205.2/sec 2502.783/sec 2874.6258/sec total: 57335390998 >>> 2022-02-01T19:44:44.441Z|04098|coverage|INFO|pstream_open >>> 0.0/sec 0.000/sec 0.0000/sec total: 5 >>> 2022-02-01T19:44:44.441Z|04099|coverage|INFO|stream_open >>> 0.0/sec 0.000/sec 0.0000/sec total: 1 >>> 2022-02-01T19:44:44.441Z|04100|coverage|INFO|unixctl_received >>> 0.0/sec 0.000/sec 0.0000/sec total: 189 >>> 2022-02-01T19:44:44.441Z|04101|coverage|INFO|unixctl_replied >>> 0.0/sec 0.000/sec 0.0000/sec total: 189 >>> 2022-02-01T19:44:44.441Z|04102|coverage|INFO|util_xalloc >>> 7935.8/sec 15234.350/sec 18372.4347/sec total: 345308271208 >>> 2022-02-01T19:44:44.441Z|04103|coverage|INFO|vconn_received >>> 0.0/sec 0.083/sec 0.3556/sec total: 2018930 >>> 2022-02-01T19:44:44.441Z|04104|coverage|INFO|vconn_sent >>> 0.0/sec 0.067/sec 0.0983/sec total: 1552067 >>> 2022-02-01T19:44:44.441Z|04105|coverage|INFO|netdev_set_policing >>> 0.0/sec 0.000/sec 0.0000/sec total: 623 >>> 2022-02-01T19:44:44.441Z|04106|coverage|INFO|netdev_get_ifindex >>> 0.0/sec 0.100/sec 0.1636/sec total: 2713928 >>> 2022-02-01T19:44:44.441Z|04107|coverage|INFO|netdev_set_hwaddr >>> 0.0/sec 0.000/sec 0.0000/sec total: 2 >>> 2022-02-01T19:44:44.441Z|04108|coverage|INFO|netdev_get_ethtool >>> 0.0/sec 0.000/sec 0.0000/sec total: 1073 >>> 2022-02-01T19:44:44.441Z|04109|coverage|INFO|netdev_set_ethtool >>> 0.0/sec 0.000/sec 0.0000/sec total: 225 >>> 2022-02-01T19:44:44.441Z|04110|coverage|INFO|nln_changed >>> 0.0/sec 0.050/sec 0.0492/sec total: 819614 >>> 2022-02-01T19:44:44.441Z|04111|coverage|INFO|netlink_received >>> 346.6/sec 472.133/sec 575.4597/sec total: 11288088779 >>> 2022-02-01T19:44:44.441Z|04112|coverage|INFO|netlink_recv_jumbo >>> 18.4/sec 113.250/sec 175.1189/sec total: 3935643810 >>> 2022-02-01T19:44:44.441Z|04113|coverage|INFO|netlink_sent >>> 308.4/sec 378.133/sec 380.4267/sec total: 4753114969 >>> 2022-02-01T19:44:44.441Z|04114|coverage|INFO|ofproto_flush >>> 0.0/sec 0.000/sec 0.0000/sec total: 1 >>> 2022-02-01T19:44:44.441Z|04115|coverage|INFO|ofproto_recv_openflow >>> 0.0/sec 0.067/sec 0.3389/sec total: 1746449 >>> 2022-02-01T19:44:44.441Z|04116|coverage|INFO|ofproto_update_port >>> 0.0/sec 0.000/sec 0.0000/sec total: 814 >>> 2022-02-01T19:44:44.441Z|04117|coverage|INFO|packet_in_overflow >>> 0.0/sec 0.000/sec 0.0000/sec total: 282520 >>> 2022-02-01T19:44:44.441Z|04118|coverage|INFO|rev_reconfigure >>> 0.0/sec 0.017/sec 0.0325/sec total: 539904 >>> 2022-02-01T19:44:44.441Z|04119|coverage|INFO|rev_port_toggled >>> 0.0/sec 0.000/sec 0.0000/sec total: 27 >>> 2022-02-01T19:44:44.441Z|04120|coverage|INFO|rev_flow_table >>> 0.0/sec 0.000/sec 0.0106/sec total: 10876 >>> 2022-02-01T19:44:44.441Z|04121|coverage|INFO|rev_mac_learning >>> 0.0/sec 0.000/sec 0.0006/sec total: 312 >>> 2022-02-01T19:44:44.441Z|04122|coverage|INFO|dumped_duplicate_flow >>> 0.0/sec 0.000/sec 0.0000/sec total: 39204 >>> 2022-02-01T19:44:44.441Z|04123|coverage|INFO|dumped_new_flow >>> 0.0/sec 0.000/sec 0.0000/sec total: 7177 >>> 2022-02-01T19:44:44.441Z|04124|coverage|INFO|handler_duplicate_upcall >>> 0.4/sec 1.833/sec 2.2664/sec total: 34762798 >>> 2022-02-01T19:44:44.441Z|04125|coverage|INFO|upcall_ukey_contention >>> 0.0/sec 0.000/sec 0.0000/sec total: 9 >>> 2022-02-01T19:44:44.441Z|04126|coverage|INFO|upcall_ukey_replace >>> 0.2/sec 0.050/sec 0.0447/sec total: 7927221 >>> 2022-02-01T19:44:44.441Z|04127|coverage|INFO|revalidate_missed_dp_flow >>> 0.0/sec 0.000/sec 0.0000/sec total: 1118 >>> 2022-02-01T19:44:44.441Z|04128|coverage|INFO|xlate_actions >>> 123.6/sec 273.183/sec 233.8275/sec total: 2110727223 >>> 2022-02-01T19:44:44.441Z|04129|coverage|INFO|bridge_reconfigure >>> 0.0/sec 0.017/sec 0.0325/sec total: 539902 >>> 2022-02-01T19:44:44.441Z|04130|coverage|INFO|74 events never hit >>> 2022-02-01T19:44:44.441Z|04131|poll_loop|INFO|Dropped 3 log messages in >>> last 1058229 seconds (most recently, 1058228 seconds ago) due to excessive >>> rate >>> 2022-02-01T19:44:44.441Z|04132|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (84% CPU >>> usage) >>> 2022-02-01T19:44:45.441Z|00039|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:44:46.441Z|00040|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:44:48.441Z|00041|ovs_rcu(urcu7)|WARN|blocked 4001 ms waiting >>> for main to quiesce >>> 2022-02-01T19:44:52.441Z|00042|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:44:55.109Z|04133|timeval|WARN|Unreasonably long 10668ms poll >>> interval (0ms user, 10643ms system) >>> 2022-02-01T19:44:55.109Z|04134|timeval|WARN|faults: 70 minor, 0 major >>> 2022-02-01T19:44:55.109Z|04135|timeval|WARN|context switches: 0 voluntary, >>> 4 involuntary >>> 2022-02-01T19:44:55.109Z|04136|coverage|INFO|Skipping details of duplicate >>> event coverage for hash=9105faf6 >>> 2022-02-01T19:44:55.109Z|04137|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage) >>> 2022-02-01T19:44:55.109Z|04138|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >>> usage) >>> 2022-02-01T19:44:56.112Z|00043|ovs_rcu(urcu7)|WARN|blocked 1001 ms waiting >>> for main to quiesce >>> 2022-02-01T19:44:57.112Z|00044|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:44:59.112Z|00045|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:45:03.112Z|00046|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:45:05.249Z|04139|timeval|WARN|Unreasonably long 10140ms poll >>> interval (0ms user, 10115ms system) >>> 2022-02-01T19:45:05.249Z|04140|timeval|WARN|faults: 248 minor, 0 major >>> 2022-02-01T19:45:05.249Z|04141|timeval|WARN|context switches: 1 voluntary, >>> 27 involuntary >>> 2022-02-01T19:45:05.249Z|04142|coverage|INFO|Skipping details of duplicate >>> event coverage for hash=9105faf6 >>> 2022-02-01T19:45:05.249Z|04143|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 14 (NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (100% CPU >>> usage) >>> 2022-02-01T19:45:05.249Z|04144|poll_loop|INFO|wakeup due to [POLLOUT] on fd >>> 235 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:153 (100% CPU >>> usage) >>> 2022-02-01T19:45:05.249Z|04145|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >>> usage) >>> 2022-02-01T19:45:06.250Z|00047|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:45:07.250Z|00048|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:45:09.250Z|00049|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:45:13.250Z|00050|ovs_rcu(urcu7)|WARN|blocked 8000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:45:15.704Z|04146|rconn|WARN|br-int<->unix#272675: connection >>> dropped (Broken pipe) >>> 2022-02-01T19:45:16.714Z|00051|ovs_rcu(urcu7)|WARN|blocked 1000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:45:17.713Z|00052|ovs_rcu(urcu7)|WARN|blocked 2000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:45:19.714Z|00053|ovs_rcu(urcu7)|WARN|blocked 4000 ms waiting >>> for main to quiesce >>> 2022-02-01T19:45:23.014Z|04147|timeval|WARN|Unreasonably long 17764ms poll >>> interval (0ms user, 17720ms system) >>> 2022-02-01T19:45:23.014Z|04148|timeval|WARN|faults: 63 minor, 0 major >>> 2022-02-01T19:45:23.014Z|04149|timeval|WARN|context switches: 2 voluntary, >>> 24 involuntary >>> 2022-02-01T19:45:23.014Z|04150|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage) >>> 2022-02-01T19:45:23.014Z|04151|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >>> usage) >>> 2022-02-01T19:45:23.021Z|04152|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >>> usage) >>> 2022-02-01T19:45:23.022Z|04153|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >>> usage) >>> 2022-02-01T19:45:23.022Z|04154|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >>> usage) >>> 2022-02-01T19:45:23.023Z|04155|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >>> usage) >>> 2022-02-01T19:45:23.023Z|04156|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >>> usage) >>> 2022-02-01T19:45:23.023Z|04157|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >>> usage) >>> 2022-02-01T19:45:23.024Z|04158|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >>> usage) >>> 2022-02-01T19:45:23.024Z|04159|poll_loop|INFO|wakeup due to [POLLIN] on fd >>> 230 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (100% CPU >>> usage) >>> 2022-02-01T19:45:41.274Z|04160|connmgr|INFO|br-int<->unix#272674: 13589 >>> flow_mods 18 s ago (13509 adds, 80 modifications) >>> 2022-02-01T19:46:31.285Z|04161|rconn|ERR|br-int<->unix#272673: no response >>> to inactivity probe after 60 seconds, disconnecting >>> >>> >>> In ovn-controller log: >>> >>> 2022-02-01T19:44:31.267Z|00077|features|INFO|unix:/run/openvswitch/br-int.mgmt: >>> connecting to switch >>> 2022-02-01T19:44:31.267Z|00078|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: >>> connecting... >>> 2022-02-01T19:44:31.267Z|00079|ofctrl|INFO|unix:/run/openvswitch/br-int.mgmt: >>> connecting to switch >>> 2022-02-01T19:44:31.267Z|00080|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: >>> connecting... >>> 2022-02-01T19:44:31.268Z|00081|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: >>> connected >>> 2022-02-01T19:44:31.269Z|00082|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: >>> connected >>> 2022-02-01T19:44:31.284Z|00083|main|INFO|OVS feature set changed, force >>> recompute. >>> 2022-02-01T19:44:34.074Z|00084|main|INFO|OVNSB commit failed, force >>> recompute next time. >>> 2022-02-01T19:44:34.621Z|00085|reconnect|INFO|ssl:<IP>16.0.61:6642: >>> connection closed by peer >>> 2022-02-01T19:44:35.622Z|00086|reconnect|INFO|ssl:<IP>:6642: connecting... >>> 2022-02-01T19:44:35.672Z|00087|reconnect|INFO|ssl:<IP>:6642: connected >>> 2022-02-01T19:45:01.470Z|00088|lflow_cache|INFO|Detected cache inactivity >>> (last active 30001 ms ago): trimming cache >>> 2022-02-01T19:45:23.351Z|00089|binding|INFO|Setting lport <lport> >>> ovn-installed in OVS >>> 2022-02-01T19:45:23.351Z|00090|binding|INFO|Setting lport <lport> up in >>> Southbound >>> <there are 80 port bindings...> >>> 2022-02-01T19:45:23.352Z|00245|binding|INFO|Setting lport <lportN> >>> ovn-installed in OVS >>> 2022-02-01T19:45:23.352Z|00246|binding|INFO|Setting lport <lportN> up in >>> Southbound >>> 2022-02-01T19:45:53.323Z|00247|lflow_cache|INFO|Detected cache inactivity >>> (last active 30004 ms ago): trimming cache >>> >>> >>> I’ve got some questions: >>> 1. What OVS is doing while `waiting for main to quiesce` ? I noticed this >>> message appeares each time downtime occurs. >> >> 'Unreasonably long 17764ms poll interval (0ms user, 17720ms system)' >> is very concerning. It means that the main thread spent almost 18 >> seconds inside the kernel executing some syscall, maybe several syscalls. >> Main thread is not responsible for flow installation to the kernel, IIRC, >> so flow updates should not be a case here. You need to figure out what >> are these syscalls and why they are taking so long to execute inside >> the kernel. >> >> Maybe you have some suspicious kernel logs at the same time? >> You may also try 'perf' or some other tracing tools to figure out >> what is going on. >> >>> 2. Is there any hint to minimize such downtime while reinstalling OF? >> >> Installation of rules is an unlikely root cause here. >> >>> 3. Could there be a situation where ovn-northd after upgrade still >>> generating new logical flows, ovn-controller constantly gets updates from >>> SB and triggers update to OF in a loop while doesn’t finish getting final >>> lflows "view"? So, could it start installing new OF while having in memory >>> lflows which are not consistent at that moment? >> >> northd generates all updates in a single transaction, so SB is updated >> atomically. SB will send this update as a single update message, so this >> is also atomic. ovn-controller will install all OF rules as an OF bundle, >> so that is also atomic. >> >>> 4. What does `br-int<->unix#272675: connection dropped (Broken pipe)` error >>> means? What type of connection is broken? Is it normal? >> >> This looks like a controller connection, i.e. connection between >> ovn-controller with ovs-vswitchd. Doesn't look very good, but >> disconnection seems to be caused by inactivity due to overload >> on one of the sides of the connection. >> >>> 5. Do you see anything else abnormal in the logs? >> >> '0ms user, 17720ms system' is the most abnormal, AFAICT. >> >> Best regards, Ilya Maximets. >> >> _______________________________________________ >> dev mailing list >> d...@openvswitch.org <mailto:d...@openvswitch.org> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> <https://mail.openvswitch.org/mailman/listinfo/ovs-dev> > _______________________________________________ > dev mailing list > d...@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev