While working on ovn-controller incremental processing, I was testing the end-to-end latency and comparing with master. I found the latency with master branch is extremely long while changing the way of testing makes it much shorter. The master branch was at commit 593e93e5f (ovsdb clustering was not merged yet). I promised in last ovn meeting to send a report and here are the details.
*Precondition: * there are 1k sandbox HVs connected, and 10k lports already created and bound on these HVs. There are 8 lrouters, each connected with 5 lswitches, and each lswitch has 250 lports. *Test: * - on a sandbox (HV1), create a ovs-port: # ovs-vsctl add-port br-int lsp1 -- set interface lsp1 type=internal -- set interface lsp1 external-ids:iface-id=lsp1 - create the lport on NB on a lswitch that is not yet in local datapath of HV1, wait for port up and then sync HVs # time ovn-nbctl lsp-add lswitch_b9322b_ecpmKt lsp1 -- lsp-set-addresses lsp1 "aa:aa:aa:aa:aa:aa 172.145.222.222"; time ovn-nbctl wait-until Logical_Switch_Port lsp1 up=true; time ovn-nbctl --wait=hv sync *Result:* real 0m0.408s user 0m0.316s sys 0m0.008s real 0m4.837s user 0m0.348s sys 0m0.012s *real 7m48.373s* user 0m0.348s sys 0m0.016s *Notes: * - the time for the last --wait=hv sync is extremely long: 7m48s. - the test is with the patch for nb_cfg separation ( https://patchwork.ozlabs.org/patch/899608/), so that --wait=hv sync is optimized without notification flooding. - the lswitch where the lsp1 is added doesn't belong to any existing local datapath on HV1, i.e., there is no other existing lport on HV1 in this lswitch or any other lswitch that is connected with this lswitch by the same lrouter. This means when adding the lport the HV1 will change monitor condition to watch on a new set of datapaths and port-bindings on them. (tested with adding the lsp1 on an existing local datapath, there is no such extremely long latency.) - adding a command to sync hvs before waiting for port up changes the behavior: the total latency is only 20+ seconds then: # time ovn-nbctl lsp-add lswitch_b9322b_ecpmKt lsp1 -- lsp-set-addresses lsp1 "aa:aa:aa:aa:aa:aa 172.145.222.222"; *time ovn-nbctl --wait=hv sync;* time ovn-nbctl wait-until Logical_Switch_Port lsp1 up=true; time ovn-nbctl --wait=hv sync *Debug log analysis:* With debug log enabled for ovn-controller, I found that the latency was because ovn-controller didn't get any SB update during the 7+ minutes. Here are the related logs and analysis: // HV1 got notification of lsp1 port binding added by northd after the lsp1 is created in NB 2018-03-20T00:00:07.353Z|51461|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"Port_Binding":{"d6f145bf-ac8f-4107-a565-9ef1fcc9ec6c":{"insert":{"tunnel_key":252,"mac":"aa:aa:aa:aa:aa:aa 172.145.222.222","logical_port":"lsp1","datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"]}}}}] // Claim the lsp1 on this chassis 2018-03-20T00:00:07.367Z|51465|binding|INFO|Claiming lport lsp1 for this chassis. 2018-03-20T00:00:07.367Z|51466|binding|INFO|lsp1: Claiming aa:aa:aa:aa:aa:aa 172.145.222.222 // Update port binding to this chassis 2018-03-20T00:00:11.381Z|51588|jsonrpc|DBG|tcp:192.168.10.10:6640: send request, method="transact", params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","d6f145bf-ac8f-4107-a565-9ef1fcc9ec6c"]]],"row":{"chassis":["uuid","b2f297f6-cd8b-4429-94f0-274a24d6640a"]},"op":"update","table":"Port_Binding"}], id=*2169* // Send request to change monitor conditions for new local datapaths 2018-03-20T00:00:11.382Z|51594|jsonrpc|DBG|tcp:192.168.10.10:6640: send request, method=“monitor_cond_change” … id=*2170* // Received notification for its own port-binding update 2018-03-20T00:00:11.383Z|51595|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"Port_Binding":{"d6f145bf-ac8f-4107-a565-9ef1fcc9ec6c":{"modify":{"chassis":["uuid","b2f297f6-cd8b-4429-94f0-274a24d6640a"]}}}}] 2018-03-20T00:00:11.383Z|51596|jsonrpc|DBG|tcp:192.168.10.10:6640: received reply, result=[{"count":1}], id=*2169* *///////////////// 7 minutes later ... but why???* // Received the notification for the new monitor condition related data (it's huge data) 2018-03-20T00:07:56.183Z|52643|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"Multicast_Group":{"944391c4-7bf9-4d46-9662-966f9a5450e0":{"insert":{"ports":["set",[["uuid","00d22282-ac38-4a4e-8f47-3684fa915051"],["uuid","03010cdb-647b-4a1d-b0ca-7f4b0e7a6862"],["uuid","03ba954f-5c96-47f0-b517-4f30b2e61907"],["uuid","0480cf96-5158-45aa-ba49-6053cdec5ce3"],["uuid","052397f5-ebb3-4e81-9970-...(huge data)... 2018-03-20T00:07:56.386Z|52644|jsonrpc|DBG|tcp:192.168.10.10:6640: received reply, result={}, id= *2170* // Received the nb_cfg change triggered by the --wait=hv sync command, together with the lflow update caused by the port status change: up=true 2018-03-20T00:07:56.387Z|52645|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"SB_Global":{"555d0a36-82d4-45b4-88cc-c9164d913066":{"modify":{"nb_cfg":45}}},"Logical_Flow":{"3747dba9-9c5f-46bf-a410-4df2e842db58":{"insert":{"match":"arp.tpa == 172.145.222.222 && arp.op == 1","pipeline":"ingress","priority":50,"logical_datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"],"table_id":11,"external_ids":["map",[["source","ovn-northd.c:3707"],["stage-name","ls_in_arp_rsp"]]],"actions":"eth.dst = eth.src; eth.src = aa:aa:aa:aa:aa:aa; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = aa:aa:aa:aa:aa:aa; arp.tpa = arp.spa; arp.spa = 172.145.222.222; outport = inport; flags.loopback = 1; output;"}},"546d7f4f-f82c-4449-bf93-eaacfc35f929":{"insert":{"match":"arp.tpa == 172.145.222.222 && arp.op == 1 && inport == \"lsp1\"","pipeline":"ingress","priority":100,"logical_datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"],"table_id":11,"external_ids":["map",[["source","ovn-northd.c:3723"],["stage-name","ls_in_arp_rsp"]]],"actions":"next;"}}}}] // Respond to the nb_cfg change 2018-03-20T00:08:05.165Z|58829|jsonrpc|DBG|tcp:192.168.10.10:6640: send request, method="transact", params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","e2e7e93f-f640-4860-a3aa-fbd26e18ca97"]]],"row":{"nb_cfg":45},"op":"update","table":"Chassis_NB_Cfg"}], id=2172 ========= the end ======== *Compare:* Below is the one without so long latency, due to an additional --wait=hv sync in between. (# time ovn-nbctl lsp-add lswitch_b9322b_ecpmKt lsp1 -- lsp-set-addresses lsp1 "aa:aa:aa:aa:aa:aa 172.145.222.222"; *time ovn-nbctl --wait=hv sync;* time ovn-nbctl wait-until Logical_Switch_Port lsp1 up=true; time ovn-nbctl --wait=hv sync) // HV1 got notification of lsp1 port binding added by northd after the lsp1 is created in NB 2018-03-20T00:19:32.350Z|65939|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"Port_Binding":{"57b18b07-c373-4feb-b539-b8ec20edd89e":{"insert":{"tunnel_key":252,"mac":"aa:aa:aa:aa:aa:aa 172.145.222.222","logical_port":"lsp1","datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"]}}}}] // Claim the lsp1 on this chassis 2018-03-20T00:19:32.365Z|65940|binding|INFO|Claiming lport lsp1 for this chassis. 2018-03-20T00:19:32.365Z|65941|binding|INFO|lsp1: Claiming aa:aa:aa:aa:aa:aa 172.145.222.222 // Update port binding to this chassis 2018-03-20T00:19:36.807Z|66061|jsonrpc|DBG|tcp:192.168.10.10:6640: send request, method="transact", params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","57b18b07-c373-4feb-b539-b8ec20edd89e"]]],"row":{"chassis":["uuid","b2f297f6-cd8b-4429-94f0-274a24d6640a"]},"op":"update","table":"Port_Binding"}], id=2177 // Send request to change monitor conditions for new local datapaths 2018-03-20T00:19:36.808Z|66067|jsonrpc|DBG|tcp:192.168.10.10:6640: send request, method=“monitor_cond_change” … id=2178 // Received notification of nb_cfg update triggered by the first --wait=hv sync command 2018-03-20T00:19:36.809Z|66068|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"SB_Global":{"555d0a36-82d4-45b4-88cc-c9164d913066":{"modify":{"nb_cfg":48}}}}] // Received notification for its own port-binding update 2018-03-20T00:19:36.809Z|66069|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"Port_Binding":{"57b18b07-c373-4feb-b539-b8ec20edd89e":{"modify":{"chassis":["uuid","b2f297f6-cd8b-4429-94f0-274a24d6640a"]}}}}] 2018-03-20T00:19:36.809Z|66070|jsonrpc|DBG|tcp:192.168.10.10:6640: received reply, result=[{"count":1}], id=2177 // Send update to respond the nb_cfg update 2018-03-20T00:19:41.315Z|66075|jsonrpc|DBG|tcp:192.168.10.10:6640: send request, method="transact", params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","e2e7e93f-f640-4860-a3aa-fbd26e18ca97"]]],"row":{"nb_cfg":48},"op":"update","table":"Chassis_NB_Cfg"}], id=2179 // Received the notification for the new monitor condition related data (it's huge data) 2018-03-20T00:19:42.249Z|66276|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"Multicast_Group":{"944391c4-7bf9-4d46-9662-966f9a5450e0":{"insert":{"ports":["set",[["uuid","00d22282-ac38-4a4e-8f47-3684fa915051"],["uuid","03010cdb-647b-4a1d-b0ca-7f4b0e7a6862"],["uuid","03ba954f-5c96-47f0-...(huge data)... 2018-03-20T00:19:42.588Z|66277|jsonrpc|DBG|tcp:192.168.10.10:6640: received reply, result={}, id=2178 // Received the lflow update caused by the port status change: up=true 2018-03-20T00:19:42.588Z|66278|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"Logical_Flow":{"fac832c4-461d-4030-8649-9fccca5d6a3b":{"insert":{"match":"arp.tpa == 172.145.222.222 && arp.op == 1 && inport == \"lsp1\"","pipeline":"ingress","priority":100,"logical_datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"],"table_id":11,"external_ids":["map",[["source","ovn-northd.c:3723"],["stage-name","ls_in_arp_rsp"]]],"actions":"next;"}},"0b719145-7bf6-4f26-a2b5-a7a5c8a9494e":{"insert":{"match":"arp.tpa == 172.145.222.222 && arp.op == 1","pipeline":"ingress","priority":50,"logical_datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"],"table_id":11,"external_ids":["map",[["source","ovn-northd.c:3707"],["stage-name","ls_in_arp_rsp"]]],"actions":"eth.dst = eth.src; eth.src = aa:aa:aa:aa:aa:aa; arp.op = 2; /* ARP reply */ arp.tha = arp.sha; arp.sha = aa:aa:aa:aa:aa:aa; arp.tpa = arp.spa; arp.spa = 172.145.222.222; outport = inport; flags.loopback = 1; output;"}}}}] // Received the change notification for its own update for the first nb_cfg change 2018-03-20T00:19:42.588Z|66279|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"Chassis_NB_Cfg":{"e2e7e93f-f640-4860-a3aa-fbd26e18ca97":{"modify":{"nb_cfg":48}}}}] 2018-03-20T00:19:42.588Z|66280|jsonrpc|DBG|tcp:192.168.10.10:6640: received reply, result=[{"count":1}], id=2179 // Received the nb_cfg change triggered by the second --wait=hv sync command 2018-03-20T00:19:54.736Z|72055|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"SB_Global":{"555d0a36-82d4-45b4-88cc-c9164d913066":{"modify":{"nb_cfg":49}}}}] // ... 2018-03-20T00:19:59.693Z|72060|jsonrpc|DBG|tcp:192.168.10.10:6640: send request, method="transact", params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","e2e7e93f-f640-4860-a3aa-fbd26e18ca97"]]],"row":{"nb_cfg":49},"op":"update","table":"Chassis_NB_Cfg"}], id=2181 // ... 2018-03-20T00:19:59.694Z|72065|jsonrpc|DBG|tcp:192.168.10.10:6640: received notification, method="update2", params=["monid",{"Chassis_NB_Cfg":{"e2e7e93f-f640-4860-a3aa-fbd26e18ca97":{"modify":{"nb_cfg":49}}}}] 2018-03-20T00:19:59.694Z|72066|jsonrpc|DBG|tcp:192.168.10.10:6640: received reply, result=[{"count":1}], id=2181 ====== the end (the correct one) ===== Thanks, Han
_______________________________________________ discuss mailing list disc...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-discuss