On Fri, 15 Dec 2017 18:38:00 -0800, Arun Sag wrote:
Here is the sequence of actions that happens in nova-network:

1. allocate_for_instance calls allocate_fixed_ips
2. FixedIPs are successfully associated (we can see this in the log)
3. allocate_for_instance calls get_instance_nw_info, which in turn
gets the fixed IPs associated in step 2 using
objects.FixedIPList.get_by_instance_uuid. This raises a FixedIPNotFound
exception.

We removed the slave and ran with just a single master, and the errors
went away. We also switched to semi-synchronous replication between
master and slave, and the errors went away too. All of this points to
a race between write and read to the DB.

Does OpenStack expect synchronous replication to read-only slaves?

No, synchronous replication to read-only slaves is not expected.

oslo.db handles this with two notions: an "async reader", which is safe to use on an asynchronously updated slave database, and a regular "reader", which is only safe to use on a synchronously updated slave database; otherwise the master database is used [1].
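To make the routing rule concrete, here is a toy model of it as I described it above. This is not oslo.db code: Mode, pick_database and slave_is_synchronous are names I made up for illustration, and the real logic lives in enginefacade [1].

```python
from enum import Enum


class Mode(Enum):
    # Hypothetical labels for the three kinds of database access.
    WRITER = "writer"
    READER = "reader"
    ASYNC_READER = "async reader"


def pick_database(mode, slave_is_synchronous):
    """Return which database a query of the given mode should hit."""
    if mode is Mode.WRITER:
        # Writes always go to the master.
        return "master"
    if mode is Mode.ASYNC_READER:
        # Explicitly marked as tolerating replication lag.
        return "slave"
    # A plain reader may use the slave only if it is kept in sync.
    return "slave" if slave_is_synchronous else "master"
```

With asynchronous replication (slave_is_synchronous=False), only the "async reader" case ever returns "slave" in this model.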

In nova, we use decorators to indicate to oslo.db whether a database API method is safe for use on an asynchronously updated slave database [2][3]. Only a few methods are marked as safe in this way.

The method you're seeing the race with, fixed_ip_get_by_instance [4], is decorated with the "reader" decorator, indicating that it's only safe for a synchronously updated slave database, else it will use the master.
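As a rough sketch of the idea of marking methods with decorators (not nova's actual decorators, which are at [2][3]): the names reader, async_reader, db_mode and lag_tolerant_query below are stand-ins I invented, and the function bodies are placeholders.

```python
def _mark(mode):
    """Build a decorator that records the access mode on a function."""
    def decorator(func):
        func.db_mode = mode  # hypothetical attribute, for illustration only
        return func
    return decorator


# Stand-ins for the two kinds of read decorators.
reader = _mark("reader")
async_reader = _mark("async reader")


@reader
def fixed_ip_get_by_instance(instance_uuid):
    # Placeholder body; the real method queries the database.
    return []


@async_reader
def lag_tolerant_query():
    # Hypothetical example of a method that tolerates stale data.
    return []
```

The point is only that the choice is made per method, at definition time, and that the method in question is on the "reader" side.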

So, this query should *not* be going to an asynchronously updated slave database. If you're using asynchronous replication, it should be going to the master.
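For anyone following along, this is the failure mode you would get if a read that needs fresh data did land on a lagging slave. It is a minimal simulation with made-up classes (Master, AsyncSlave) and a made-up key and address, not a model of MySQL replication internals.

```python
class Master:
    def __init__(self):
        self.rows = {}
        self.pending = []  # changes not yet shipped to the slave

    def write(self, key, value):
        self.rows[key] = value
        self.pending.append((key, value))


class AsyncSlave:
    def __init__(self):
        self.rows = {}

    def apply(self, master):
        # Replication catches up some time after the write commits.
        for key, value in master.pending:
            self.rows[key] = value
        master.pending.clear()


master = Master()
slave = AsyncSlave()

# Step 2 of your sequence: the fixed IP is associated on the master.
master.write("instance-1", "10.0.0.5")

# Step 3, if routed to the slave before replication catches up.
read_from_slave_early = slave.rows.get("instance-1")

# Step 3, if routed to the master as intended.
read_from_master = master.rows.get("instance-1")

slave.apply(master)
read_from_slave_late = slave.rows.get("instance-1")
```

Here read_from_slave_early comes back empty while read_from_master finds the row, which is the same shape as the FixedIPNotFound you are seeing.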

Have you patched any nova/db/sqlalchemy/api method decorators or patched oslo.db at all to use the "async reader" for more methods? If not, then it's possible there is a bug in oslo.db or nova related to "async reader" state leaking across green threads.

This reminds me of a fairly recent bug [5] we ran into when doing a concurrent scatter-gather to multiple cell databases. You might try the patch [6] locally to see if it changes the behavior when you have asynchronous replication enabled. We had thought only scatter-gather (which was introduced in Pike) was affected, but it's possible the async slave database read might also be affected.
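To illustrate what I mean by state leaking between concurrent tasks, here is a deliberately simplified sketch. It does not reproduce the actual bug [5] or show what the patch [6] does; Context, mode and run_interleaved are hypothetical, and the interleaving is written out by hand rather than using real green threads.

```python
import copy


class Context:
    """Hypothetical per-request object carrying the DB access mode."""

    def __init__(self):
        self.mode = "reader"


def run_interleaved(ctx_for_a, ctx_for_b):
    # Task A switches its context to "async reader" for a lag-tolerant query.
    ctx_for_a.mode = "async reader"
    # Before A finishes, task B runs a query that needs fresh data and
    # looks up the mode on the context it was given.
    mode_seen_by_b = ctx_for_b.mode
    # Task A restores its context.
    ctx_for_a.mode = "reader"
    return mode_seen_by_b


shared = Context()

# Both tasks hold the same object, so B sees A's temporary setting.
leaked = run_interleaved(shared, shared)

# Each task gets its own copy, so B is unaffected by A.
isolated = run_interleaved(copy.copy(shared), copy.copy(shared))
```

If something like the first case were happening, a "reader" query could end up on the asynchronous slave even though its decorator says otherwise.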

If you could try that patch, please let me know whether it helps and we will backport it.

Thanks,
-melanie

[1] https://github.com/openstack/oslo.db/blob/0260f0e/oslo_db/sqlalchemy/enginefacade.py#L44-L59
[2] https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L214-L219
[3] https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L272
[4] https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L1469-L1470
[5] https://bugs.launchpad.net/nova/+bug/1722404
[6] https://review.openstack.org/#/c/511651

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
