Re: [ovs-discuss] OpenStack profiling with networking-ovn - port creation is slow

Daniel Alvarez Sanchez Wed, 14 Feb 2018 02:27:39 -0800

Hi Ben,

Thanks for your inputs. I need to look more carefully into the patch you
submitted but it looks like, at least, we'll be reducing the number of
calls to Datum.__cmp__ which should be good.


For the rest of the things, let me answer inline.


On Tue, Feb 13, 2018 at 8:32 PM, Ben Pfaff <b...@ovn.org> wrote:

> On Tue, Feb 13, 2018 at 12:39:56PM +0100, Daniel Alvarez Sanchez wrote:
> > Hi folks,
> >
> > As we're doing some performance tests in OpenStack using OVN,
> > we noticed that as we keep creating ports, the time for creating a
> > single port increases. Also, ovn-northd CPU consumption is quite
> > high (see [0] which shows the CPU consumption when creating
> > 1000 ports and deleting them. Last part where CPU is at 100% is
> > when all the ports get deleted).
> >
> > With 500 ports in the same Logical Switch, I did some profiling
> > of OpenStack neutron-server adding 10 more ports to that Logical
> > Switch. Currently, neutron-server spawns different API workers
> > (separate processes) which open connections to OVN NB so every
> > time an update message is sent from ovsdb-server it'll be processed
> > by all of them.
> >
> > In my profiling, I used GreenletProfiler in all those processes to
> produce
> > a trace file and then merged all of them together to aggregate the
> > results. In those tests I used OVS master branch compiled it with
> > shared libraries to make use of the JSON C parser. Still, I've seen
> > that most of the time's been spent in the following two modules:
> >
> > - python/ovs/db/data.py:  33%
> > - uuid.py:  21%
> >
> > For the data.py module, this is the usage (self time):
> >
> > Atom.__lt__       16.25%     8283 calls
> > from_json:118      6.18%   406935 calls
> > Atom.__hash__      3.48%  1623832 calls
> > from_json:328      2.01%     5040 calls
> >
> > While for the uuid module:
> >
> > UUID.__cmp__       12.84%  3570975 calls
> > UUID.__init__       4.06%   362541 calls
> > UUID.__hash__       2.96%     1800 calls
> > UUID.__str__        1.03%   355016 calls
> >
> > Most of the calls to Atom.__lt__ come from
> > BaseOvnIdl.__process_update2(idl.py)
> > -> BaseOvnIdl.__row_update(idl.py) -> Datum.__cmp__(data.py) ->
> > Atom.__cmp__(data.py).
>
> I don't know Python well enough to whether these are "real" or correct
> optimizations, but the following strike me as possible low-hanging fruit
> micro-optimizations:
>
> diff --git a/python/ovs/db/data.py b/python/ovs/db/data.py
> index 9e57595f7513..dc816f64708f 100644
> --- a/python/ovs/db/data.py
> +++ b/python/ovs/db/data.py
> @@ -76,12 +76,12 @@ class Atom(object):
>      def __eq__(self, other):
>          if not isinstance(other, Atom) or self.type != other.type:
>              return NotImplemented
> -        return True if self.value == other.value else False
> +        return self.value == other.value
>
>      def __lt__(self, other):
>          if not isinstance(other, Atom) or self.type != other.type:
>              return NotImplemented
> -        return True if self.value < other.value else False
> +        return self.value < other.value
>
>      def __cmp__(self, other):
>          if not isinstance(other, Atom) or self.type != other.type:
>
> We could also reduce the number of calls to Datum.__cmp__ by recognizing
> that if we've already found one change we don't have to do any more
> comparisons, something like this:
>
> diff --git a/python/ovs/db/idl.py b/python/ovs/db/idl.py
> index 60548bcf50b6..2e550adfdf1c 100644
> --- a/python/ovs/db/idl.py
> +++ b/python/ovs/db/idl.py
> @@ -447,6 +447,7 @@ class Idl(object):
>              raise error.Error("<table-updates> is not an object",
>                                table_updates)
>
> +        changed = False
>          for table_name, table_update in six.iteritems(table_updates):
>              table = self.tables.get(table_name)
>              if not table:
> @@ -472,8 +473,8 @@ class Idl(object):
>                                        % (table_name, uuid_string))
>
>                  if version == OVSDB_UPDATE2:
> -                    if self.__process_update2(table, uuid, row_update):
> -                        self.change_seqno += 1
> +                    changed = self.__process_update2(table, uuid,
> row_update,
> +                                                     changed)
>                      continue
>
>                  parser = ovs.db.parser.Parser(row_update, "row-update")
> @@ -485,12 +486,12 @@ class Idl(object):
>                      raise error.Error('<row-update> missing "old" and '
>                                        '"new" members', row_update)
>
> -                if self.__process_update(table, uuid, old, new):
> -                    self.change_seqno += 1
> +                changed = self.__process_update(table, uuid, old, new,
> changed)
> +        if changed:
> +            self.change_seqno += 1
>
> -    def __process_update2(self, table, uuid, row_update):
> +    def __process_update2(self, table, uuid, row_update, changed):
>          row = table.rows.get(uuid)
> -        changed = False
>          if "delete" in row_update:
>              if row:
>                  del table.rows[uuid]
> @@ -511,9 +512,8 @@ class Idl(object):
>              else:
>                  row_update = row_update['initial']
>              self.__add_default(table, row_update)
> -            if self.__row_update(table, row, row_update):
> -                changed = True
> -                self.notify(ROW_CREATE, row)
> +            changed = self.__row_update(table, row, row_update, changed)
> +            self.notify(ROW_CREATE, row)
>          elif "modify" in row_update:
>              if not row:
>                  raise error.Error('Modify non-existing row')
> @@ -528,10 +528,9 @@ class Idl(object):
>                                row_update)
>          return changed
>
> -    def __process_update(self, table, uuid, old, new):
> +    def __process_update(self, table, uuid, old, new, changed):
>          """Returns True if a column changed, False otherwise."""
>          row = table.rows.get(uuid)
> -        changed = False
>          if not new:
>              # Delete row.
>              if row:
> @@ -547,13 +546,13 @@ class Idl(object):
>              if not row:
>                  row = self.__create_row(table, uuid)
>                  changed = True
> +                self.__row_update(table, row, new, True)
> +                self.notify(ROW_CREATE, row)
>              else:
>                  # XXX rate-limit
>                  vlog.warn("cannot add existing row %s to table %s"
>                            % (uuid, table.name))
> -            if self.__row_update(table, row, new):
> -                changed = True
> -                self.notify(ROW_CREATE, row)
> +                changed = self.__row_update(table, row, new, changed)
>          else:
>              op = ROW_UPDATE
>              if not row:
> @@ -563,9 +562,8 @@ class Idl(object):
>                  # XXX rate-limit
>                  vlog.warn("cannot modify missing row %s in table %s"
>                            % (uuid, table.name))
> -            if self.__row_update(table, row, new):
> -                changed = True
> -                self.notify(op, row, Row.from_json(self, table, uuid,
> old))
> +            changed = self.__row_update(table, row, new, changed)
> +            self.notify(op, row, Row.from_json(self, table, uuid, old))
>          return changed
>
>      def __column_name(self, column):
> @@ -608,8 +606,7 @@ class Idl(object):
>
>          return old_row_diff_json
>
> -    def __row_update(self, table, row, row_json):
> -        changed = False
> +    def __row_update(self, table, row, row_json, changed):
>          for column_name, datum_json in six.iteritems(row_json):
>              column = table.columns.get(column_name)
>              if not column:
> @@ -626,14 +623,13 @@ class Idl(object):
>                            % (column_name, table.name, e))
>                  continue
>
> -            if datum != row._data[column_name]:
> +            # If the column changed, update it and record that we saw a
> change.
> +            # We also update it if we've already recorded some other
> change,
> +            # because that's cheaper than doing an unnecessary comparison.
> +            if changed or datum != row._data[column_name]:
>                  row._data[column_name] = datum
>                  if column.alert:
>                      changed = True
> -            else:
> -                # Didn't really change but the OVSDB monitor protocol
> always
> -                # includes every value in a row.
> -                pass
>          return changed
>
>      def __create_row(self, table, uuid):
>
> > The aggregated number of calls to BaseOvnIdl.__process_update2 is
> > 1400 (and we're updating only 10 ports!!) while the total connections
> > opened to NB database are 10:
> >
> > # netstat -np | grep 6641 | grep python | wc -l
> > 10
> >
> > * Bear in mind that those results above were aggregated across all
> > processes.
> >
> > Looks like the main culprit for this explosion could be the way we
> > handle ACLs. As every time we create a port, it'll belong to a Neutron
> > security group (OVN Address Set) and we'll add a new ACL for every
> > Neutron security group rule. If we patch the code to skip the ACL part,
> > the time for creating a port remains stable over the time.
> >
> > From the comparison tests against ML2/OVS (reference implementation),
> > OVN outperforms in most of the operations except for the port creation
> > where we can see it can become a bottleneck.
> >
> > Before optimizing/redesigning the ACL part, we could do some other
> > changes to the way we handle notifications from OVSDB: eg., instead of
> > having multiple processes receiving *all* notifications, we could have
> > one single process subscribed to those notifications and send a more
> > optimized (already parsed) multicast notification to all listening
> processes
> > to keep their own in-memory copies of the DB up to date. All processes
> > would connect to NB database in "write-only" mode to commit their
> > transactions however.
>
> I'm not sure I understand this.  What information is it unnecessary to
> have the processes receive?  If they are getting information they don't
> need, then we might be able to eliminate some of it.
>

I probably didn't explain it very well. Right now we have N processes for
Neutron server (in every node). Each of those opens a connection to NB db
and
they subscribe to updates from certain tables. Each time a change happens,
ovsdb-server will send N update2 messages that has to be processed in
this "expensive" way by each of those N processes. My proposal (yet to
be refined) would be to now open N+1 connections to ovsdb-server and
only subscribe to notifications from 1 of those. So every time a new change
happens, ovsdb-server will send 1 update2 message. This message will be
processed (using Py IDL as we do now) and once processed, send it
(mcast maybe?) to the rest N processes. This msg could be simply a Python
object serialized and we'd be saving all this Datum, Atom, etc. processing
by doing it just once.



>
> Can you sketch the rows that are being inserted or modified when a port
> is added?  I would expect something like this as a minimum:
>
>         * Insert one Logical_Switch_Port row.
>
>         * Add pointer to Logical_Switch_Port to ports column in one row
>           in Logical_Switch.
>
> In addition it sounds like currently we're seeing:
>
>         * Add one ACL row per security group rule.
>
>         * Add pointers to ACL rows to acls column in one row in
>           Logical_Switch.
>
> This is what happens when we create a port in OpenStack (without binding
it)
which belongs to a SG which allows ICMP and SSH traffic and drops the rest
[0]

Basically, you were right and only thing missing was adding the new address
to the
Address_Set table.

and that's the bottleneck?
>

I did some testing by creating 800 ports in OpenStack sequentially and
logging the time
it takes to create the port in OVN. This is the patch I applied to log it:

diff --git a/networking_ovn/ml2/mech_driver.py
b/networking_ovn/ml2/mech_driver.py
index 2fb7d35..0ebbc0e 100644
--- a/networking_ovn/ml2/mech_driver.py
+++ b/networking_ovn/ml2/mech_driver.py
@@ -27,6 +27,7 @@ from neutron_lib.services.qos import constants as
qos_consts
 from oslo_config import cfg
 from oslo_db import exception as os_db_exc
 from oslo_log import log
+from oslo_utils import timeutils

 from neutron.common import utils as n_utils
 from neutron.db import provisioning_blocks
@@ -471,7 +472,11 @@ class OVNMechanismDriver(api.MechanismDriver):
         result in the deletion of the resource.
         """
         port = context.current
+        _timer = timeutils.StopWatch()
+        _timer.restart()
         self._ovn_client.create_port(port)
+        _timer.stop()
+        LOG.debug('[*] create_port = %.2f seconds', _timer.elapsed())
         self._notify_dhcp_updated(port['id'])


Then I parsed the logs and plotted the graph of those times [1].
What I want to do now is to repeat the tests without OpenStack in the
picture. However, I wouldn't like to run ovn-nbctl commands in a script
because every time I open a new connection, it'll have to receive the DB
contents and it'll take longer and longer to process.

[0] http://paste.openstack.org/show/671914/
[1] https://imgur.com/a/nmW5R

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

Re: [ovs-discuss] OpenStack profiling with networking-ovn - port creation is slow

Reply via email to