Public bug reported: This was tested on the stable/ussuri branch, with https://review.opendev.org/c/openstack/neutron/+/752795/ backported.
The test setup was 3 controllers, each with 10 API workers and RPC workers, and 250 chassis running ovn-controller. There are 1k networks and 10k ports in total (4k VM ports, 2k ports for FIPs, 4k ports for routers), 1k routers connected to the same external network, and 2k VMs (2 VMs per network, with all VMs additionally connected to a single shared network between them). The Northbound DB is 15 MB and the Southbound DB is 100 MB.

When a change is made in neutron, an update is written to OVN and the NB_Global.nb_cfg field is incremented. This translates into an SB_Global.nb_cfg change, which is picked up by all ovn-controllers; each of them then updates its own row in Chassis_Private, incrementing Chassis_Private.nb_cfg. After that, the southbound ovsdb sends an update to neutron, either due to https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#249 or https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#264, which is then handled by the Hash Ring implementation to dispatch the update to a worker (see the sketches below).

In my testing, when that happened, all neutron API workers stopped processing API requests until all the Chassis_Private events were handled, which took around 30 seconds on each nb_cfg update. This could be due to the controller nodes in the test environment not being scaled up properly, but it seems to be a potential scaling issue.

** Affects: neutron
   Importance: Undecided
       Status: New

https://bugs.launchpad.net/bugs/1940950
Title: [ovn] neutron api worker gets overloaded processing chassis_private updates
Status in neutron: New
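For illustration, here is a minimal sketch of the event-matching side, assuming the ovsdbapp RowEvent API that neutron's ovsdb_monitor builds on. The class name and handler body are hypothetical, not the actual code at the review links above:

    # Hypothetical sketch: matching Chassis_Private nb_cfg updates with
    # an ovsdbapp row event. Not the actual ovsdb_monitor.py code.
    from ovsdbapp.backend.ovs_idl import event as row_event

    class ChassisPrivateNbCfgEvent(row_event.RowEvent):
        """Fires once per chassis each time Chassis_Private.nb_cfg moves."""

        def __init__(self):
            # Watch UPDATE events on the Chassis_Private table only.
            super().__init__((self.ROW_UPDATE,), 'Chassis_Private', None)

        def match_fn(self, event, row, old):
            # 'old' carries the previous values of the changed columns,
            # so the event matches only when nb_cfg itself was modified.
            return hasattr(old, 'nb_cfg')

        def run(self, event, row, old):
            # With 250 chassis, this fires 250 times per nb_cfg bump;
            # that burst is what the API workers were busy draining.
            pass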
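And a self-contained toy of the hash-ring dispatch idea, to show why each event should land on exactly one worker. This is an illustration under my own naming, not neutron's actual hash-ring code:

    # Toy consistent-hash ring; illustrative only.
    import bisect
    import hashlib

    class ToyHashRing:
        """Maps each key deterministically onto one registered node."""

        def __init__(self, nodes, replicas=32):
            # Each node gets several points on the ring to even out load.
            self._ring = sorted(
                (self._hash('%s-%d' % (node, i)), node)
                for node in nodes for i in range(replicas))
            self._hashes = [h for h, _ in self._ring]

        @staticmethod
        def _hash(value):
            return int(hashlib.sha256(value.encode()).hexdigest(), 16)

        def get_node(self, key):
            # The first ring point clockwise from the key's hash owns it.
            idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
            return self._ring[idx][1]

    ring = ToyHashRing(['api-worker-%d' % i for i in range(10)])
    # Updates for the same chassis always hash to the same worker:
    print(ring.get_node('chassis-private-uuid-1234'))

Even with dispatch spread across workers this way, the stall observed above suggests the burst of Chassis_Private events per nb_cfg update was enough to keep all API workers occupied for around 30 seconds.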