Public bug reported:

This was tested with stable/ussuri branch with
https://review.opendev.org/c/openstack/neutron/+/752795/ backported.

The test setup was 3 controllers, each running 10 API workers and 10 RPC
workers, with 250 chassis running ovn-controller. There are 1k networks
and 10k ports in total (4k VM ports, 2k ports for FIPs, 4k ports for
routers), 1k routers connected to the same external network, and 2k VMs
(2 VMs per network, with all VMs additionally connected to a single
shared network between them). The Northbound DB is 15MB, the Southbound
DB is 100MB.

When a change is made in Neutron, an update is created in OVN and the
NB_Global.nb_cfg field is incremented. This translates into an
SB_Global.nb_cfg change, which is picked up by all ovn-controllers,
which in turn update their entries in Chassis_Private, incrementing
Chassis_Private.nb_cfg.
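
For reference, the propagation can be watched from the command line.
Below is a minimal diagnostic sketch (assuming local ovn-nbctl /
ovn-sbctl access to the NB and SB databases) that reads the current
NB_Global.nb_cfg and polls Chassis_Private until every chassis has
caught up:

import subprocess
import time


def run(cmd):
    # Run an OVN CLI command and return its stripped stdout.
    return subprocess.check_output(cmd, text=True).strip()


# The target value Neutron last wrote into NB_Global.nb_cfg.
target = int(run(["ovn-nbctl", "get", "NB_Global", ".", "nb_cfg"]))

# Poll Chassis_Private until every chassis has acknowledged the target.
start = time.time()
while True:
    values = [int(v) for v in run(
        ["ovn-sbctl", "--bare", "--columns=nb_cfg",
         "list", "Chassis_Private"]).splitlines() if v]
    behind = sum(1 for v in values if v < target)
    print(f"{time.time() - start:5.1f}s: "
          f"{behind}/{len(values)} chassis still behind")
    if behind == 0:
        break
    time.sleep(1)

With 250 chassis, every nb_cfg bump results in 250 Chassis_Private row
updates being reported by this loop.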

After that, the southbound OVSDB sends an update to Neutron, either due
to
https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#249
or
https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#264
which is then handled by the Hash Ring implementation to dispatch the
update to a worker.
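
To illustrate the receiving side, here is a simplified, hypothetical
sketch of such an event handler (not the actual Neutron classes),
assuming ovsdbapp's RowEvent base class; the driver hand-off method at
the end is made up for illustration:

from ovsdbapp.backend.ovs_idl import event as row_event


class ChassisPrivateNbCfgEvent(row_event.RowEvent):
    """Illustrative event fired for updates to Chassis_Private rows."""

    def __init__(self, driver):
        self.driver = driver
        # Watch updates to any row of the SB Chassis_Private table.
        super().__init__((self.ROW_UPDATE,), 'Chassis_Private', None)

    def run(self, event, row, old):
        # Ignore updates where nb_cfg did not actually change.
        if not hasattr(old, 'nb_cfg') or old.nb_cfg == row.nb_cfg:
            return
        # Hypothetical hand-off: the event is routed through the hash
        # ring and ends up queued on one of the API worker processes.
        self.driver.update_chassis_liveness(row.name, row.nb_cfg)

With 250 chassis this means 250 such events per nb_cfg bump, all of
which have to be consumed by the API workers.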

In my testing, when that happened, all Neutron API workers stopped
processing API requests until all Chassis_Private events were handled,
which took around 30 seconds for each nb_cfg update. This could be due
to the controller nodes in the test environment not being scaled
properly, but it looks like a potential scaling issue.
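
For anyone trying to reproduce this, a rough way to observe the stall is
to time a cheap Neutron API call in a loop while the Chassis_Private
events are being processed. A sketch (endpoint and token handling below
are placeholders, not tied to any particular deployment):

import os
import time

import requests

# Placeholders: point these at your own Neutron endpoint and token.
NEUTRON_URL = os.environ.get("NEUTRON_URL", "http://controller:9696")
TOKEN = os.environ["OS_TOKEN"]

# Repeatedly issue a lightweight list call; while the API workers are
# busy draining the Chassis_Private event backlog, the latency of these
# requests spikes instead of staying in the sub-second range.
while True:
    start = time.time()
    resp = requests.get(
        f"{NEUTRON_URL}/v2.0/networks",
        params={"limit": 1},
        headers={"X-Auth-Token": TOKEN},
        timeout=120,
    )
    print(f"{time.time() - start:6.2f}s  HTTP {resp.status_code}")
    time.sleep(1)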

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1940950

Title:
  [ovn] neutron api worker gets overloaded processing chassis_private
  updates

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1940950/+subscriptions

