Re: [openstack-dev] Race in FixedIP.associate_pool

melanie witt Thu, 21 Dec 2017 10:53:56 -0800

On Fri, 15 Dec 2017 18:38:00 -0800, Arun Sag wrote:

Here is the sequence of actions that happens in nova-network:


1. allocate_for_instance calls allocate_fixed_ips
2. FixedIPs are successfully associated (we can see this in the log)
3. allocate_for_instance calls get_instance_nw_info, which in turn
gets the fixed IPs associated in step 2 using
objects.FixedIPList.get_by_instance_uuid. This raises a FixedIPNotFound
exception.

We removed the slave and ran with just a single master, and the errors
went away. We also switched to using semi-synchronous replication
between master and slave, and the errors went away too. All of this
points to a race between the write and the read to the DB.
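
The suspected write-then-read race can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyReplicatedDB and its methods are hypothetical names invented for illustration, not nova or MySQL code, and replication lag is modelled by an explicit replicate() call.

```python
class ToyReplicatedDB:
    """Hypothetical master/slave pair; the slave only catches up on replicate()."""

    def __init__(self):
        self.master = {}
        self.slave = {}
        self._pending = []  # writes committed on the master, not yet on the slave

    def write(self, key, value):
        # Writes always go to the master and are queued for the slave.
        self.master[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # Stand-in for asynchronous replication finally being applied.
        for key, value in self._pending:
            self.slave[key] = value
        self._pending.clear()

    def read(self, key, use_slave):
        source = self.slave if use_slave else self.master
        return source.get(key)


db = ToyReplicatedDB()
db.write("instance-1", "10.0.0.5")                 # like step 2: associate the fixed IP
stale = db.read("instance-1", use_slave=True)      # like step 3, served by a lagging slave
fresh = db.read("instance-1", use_slave=False)     # the same read served by the master
db.replicate()
caught_up = db.read("instance-1", use_slave=True)  # slave read after it has caught up
```

In this model the first slave read finds nothing even though the write succeeded, which would match both observations above: no errors with a single master, and no errors once the slave is kept in step with the master.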

Does OpenStack expect synchronous replication to read-only slaves?


No, synchronous replication to read-only slaves is not expected.

The way this is handled is that oslo.db has the notion of an "async reader" which is safe to use on an asynchronously updated slave database and a regular "reader" which is only safe to use on a synchronously updated slave database, else the master database will be used [1].

In nova, we indicate to oslo.db whether a database API method is safe for use on an asynchronously updated slave database using decorators [2][3]. There are few methods decorated this way.

The method you're seeing the race with, fixed_ip_get_by_instance [4] is decorated with the "reader" decorator, indicating that it's only safe for a synchronously updated slave database, else it will use the master.

So, this query should *not* be going to an asynchronously updated slave database. If you're using asynchronous replication, it should be going to the master.
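
As a rough sketch of the routing rule described above (not the actual oslo.db implementation; Mode, pick_database and slave_is_synchronous are hypothetical names made up for this example), the decision looks something like this:

```python
from enum import Enum


class Mode(Enum):
    WRITER = "writer"
    READER = "reader"
    ASYNC_READER = "async reader"


def pick_database(mode, slave_is_synchronous):
    """Return which database a transaction in the given mode should use.

    Hypothetical helper mirroring the rule in the text: an "async reader"
    may always use the slave, a "reader" may use the slave only when it is
    synchronously updated, and everything else goes to the master.
    """
    if mode is Mode.ASYNC_READER:
        return "slave"
    if mode is Mode.READER and slave_is_synchronous:
        return "slave"
    return "master"


# With asynchronous replication, a plain "reader" should land on the master.
reader_target = pick_database(Mode.READER, slave_is_synchronous=False)
async_target = pick_database(Mode.ASYNC_READER, slave_is_synchronous=False)
```

Under this rule, a "reader" method like fixed_ip_get_by_instance would only reach an asynchronously updated slave if something were treating it as an "async reader" or treating the slave as synchronous.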

Have you patched any nova/db/sqlalchemy/api method decorators or patched oslo.db at all to use the "async reader" for more methods? If not, then it's possible there is a bug in oslo.db or nova related to "async reader" state leaking across green threads.

Which reminds me of a fairly recent bug [5] we ran into when doing a concurrent scatter-gather to multiple cell databases. You might try the patch [6] locally to see if it changes the behavior when you have asynchronous replication enabled. We had thought only scatter-gather was affected (which was introduced in pike) but it's possible the async slave database read might also be affected.

If you could try that patch, please let me know whether it helps and we will backport it.
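
To make the idea of state leaking across threads concrete, here is a minimal, generic sketch. It is not the nova or oslo.db code and does not reproduce bug [5]; SharedState, mark_async and observe are hypothetical names, and plain OS threads stand in for green threads. It only shows how a mode stored on a shared object is visible to another thread, while the same mode stored in thread-local storage is not.

```python
import threading


class SharedState:
    """Hypothetical holder whose attribute is shared by every thread."""
    mode = "reader"


shared = SharedState()
local = threading.local()  # attributes set here are private to each thread


def get_local_mode():
    # Threads that never set a mode fall back to the default "reader".
    return getattr(local, "mode", "reader")


def mark_async():
    # One thread switches itself to "async reader" in both places.
    shared.mode = "async reader"
    local.mode = "async reader"


seen = {}


def observe():
    # A different thread then checks which mode it appears to be in.
    seen["shared"] = shared.mode
    seen["local"] = get_local_mode()


first = threading.Thread(target=mark_async)
first.start()
first.join()

second = threading.Thread(target=observe)
second.start()
second.join()
```

Here the second thread sees "async reader" through the shared object but "reader" through the thread-local one. If something similar happened with the transaction mode, a "reader" query could be sent to the async slave without any decorator having been changed.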


Thanks,
-melanie

[1] https://github.com/openstack/oslo.db/blob/0260f0e/oslo_db/sqlalchemy/enginefacade.py#L44-L59
[2] https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L214-L219
[3] https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L272
[4] https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L1469-L1470
[5] https://bugs.launchpad.net/nova/+bug/1722404
[6] https://review.openstack.org/#/c/511651

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
