Hi Han, all,

Lucas, Numan and I have been doing some 'scale' testing of OpenStack using OVN, and we wanted to present some results and issues that we've found with the Incremental Processing feature in ovn-controller. Below is the scenario that we executed:
* 7 baremetal nodes: 3 controllers (running ovn-northd/ovsdb-servers in A/P with pacemaker) + 4 compute nodes. OVS 2.10.
* The test consists of:
  - Create an OpenStack network (OVN LS), subnet and router.
  - Attach the subnet to the router and set the gateway to the external network.
  - Create an OpenStack port and apply a Security Group (ACLs to allow UDP, SSH and ICMP).
  - Bind the port to one of the 4 compute nodes (chosen randomly) by attaching it to a network namespace.
  - Wait for the port to be ACTIVE in Neutron ('up == True' in NB).
  - Wait until the test can ping the port.
* Running browbeat/rally with 16 simultaneous processes to execute the test above 150 times.
* When all 150 'fake VMs' are created, browbeat deletes all the OpenStack/OVN resources.

We first tried with OVS/OVN 2.10 and pulled some results, which showed 100% success, but ovn-controller is quite loaded (as expected) on all the nodes, especially during the deletion phase:

- Compute node: https://imgur.com/a/tzxfrIR
- Controller node (ovn-northd and ovsdb-servers): https://imgur.com/a/8ffKKYF

After conducting the tests above, we replaced ovn-controller on all 7 nodes with the one from the current master branch (actually from last week). We also replaced ovn-northd and ovsdb-servers, but ovs-vswitchd was left untouched (still on 2.10). We expected lower ovn-controller CPU usage and also better times thanks to the recently introduced Incremental Processing feature. However, the results don't look very good:

- Compute node: https://imgur.com/a/wuq87F1
- Controller node (ovn-northd and ovsdb-servers): https://imgur.com/a/99kiyDp

One thing that we can tell from the ovs-vswitchd CPU consumption is that it's much lower in the Incremental Processing (IP) case, which at first sight doesn't make much sense. This led us to think that perhaps ovn-controller was not installing the necessary flows in the switch, and we confirmed this hypothesis by looking into the dataplane results.

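For anyone who wants to check the "missing flows" hypothesis on their own nodes, here is a minimal sketch (not the tooling used in the test above) that counts flows per OpenFlow table from `ovs-ofctl dump-flows br-int` output, so that a node running 2.10 and a node running master can be compared. The function names `flows_per_table` and `dump_flows` are hypothetical, and the parsing assumes that every flow line carries a `table=` field, which may not hold for every OVS version.

```python
import re
import subprocess
from collections import Counter

# Assumption: each flow line in the dump contains "table=<n>" before its actions.
TABLE_RE = re.compile(r"\btable=(\d+)")


def flows_per_table(dump_output: str) -> Counter:
    """Count flow entries per OpenFlow table in `ovs-ofctl dump-flows` text."""
    counts = Counter()
    for line in dump_output.splitlines():
        match = TABLE_RE.search(line)
        if match:  # reply header lines have no table= field and are skipped
            counts[int(match.group(1))] += 1
    return counts


def dump_flows(bridge: str = "br-int") -> str:
    """Return the raw dump for a bridge (needs the OVS tools and root privileges)."""
    result = subprocess.run(
        ["ovs-ofctl", "dump-flows", bridge],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Running `flows_per_table(dump_flows())` on a compute node from each setup and diffing the two counters should show which tables, if any, end up with fewer flows.
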
Out of the 150 VMs, 10% were unreachable via ping when using ovn-controller from master.

@Han, others, do you have any idea what could be happening here? We'll be able to use this setup for a few more days, so let me know if you want us to pull any other data/traces.

Some other interesting things:

On each of the compute nodes (with an almost evenly distributed number of logical ports bound to them), the maximum number of logical flows in br-int is ~90K (by the end of the test, right before deleting the resources).

It looks like ovn-controller leaks some memory with the IP version:
https://imgur.com/a/trQrhWd

While with OVS 2.10, memory usage remains pretty flat during the test:
https://imgur.com/a/KCkIT4O

Looking forward to hearing back :)

Daniel