Re: [openstack-dev] Race in FixedIP.associate_pool
> On Jan 29, 2018, at 16:45, Arun SAG wrote:
>
> If anyone is running into a db race while running the database in
> master-slave mode with async replication, the bug has been identified
> and is getting fixed here:
> https://bugs.launchpad.net/oslo.db/+bug/1746116

Thanks for your persistence in tracking down the problem and raising
the bug. If you get a chance, do please test the proposed patch in your
environment to help ensure there aren't any loose ends left.

Once the fix is merged, I think we should propose backports to
stable/pike and stable/ocata, do .z releases for them, and bump oslo.db
in upper-constraints.txt in the openstack/requirements repo for the
stable/pike and stable/ocata branches. That way, operators running from
stable can get the fix by upgrading their oslo.db packages to those .z
releases.

-melanie
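For anyone following along, the constraints bump on each stable branch
would be a one-line change to upper-constraints.txt, roughly like the
diff below (the version numbers here are purely illustrative, not the
actual .z releases):

# openstack/requirements, stable/pike branch, upper-constraints.txt
-oslo.db===4.25.0
+oslo.db===4.25.1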
Re: [openstack-dev] Race in FixedIP.associate_pool
Hello,

On Tue, Dec 12, 2017 at 12:22 PM, Arun SAG wrote:
> Hello,
>
> We are running nova-network in ocata. We use mysql in a master-slave
> configuration. The master is read/write, and all reads go to the slave
> (slave_connection is set). When we tried to boot multiple VMs in
> parallel (let's say 15), we saw a race in allocate_for_instance's
> FixedIP.associate_pool. We see FixedIP.associate_pool associate an IP,
> but later in the code we try to read the allocated FixedIP using
> objects.FixedIPList.get_by_instance_uuid and it throws
> FixedIPNotFoundException. We also checked the slave replication
> status, and Seconds_Behind_Master is 0.
> [snip]
>
> This is roughly how the logs look:
>
> 2017-12-08 22:33:37,124 DEBUG
> [yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
> /opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:894
> Fixed IP NOT found for instance
> 2017-12-08 22:33:37,125 DEBUG
> [yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
> /opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:965
> Built network info: |[]|
> 2017-12-08 22:33:37,126 INFO [nova.network.manager]
> /opt/openstack/venv/nova/lib/python2.7/site-packages/nova/network/manager.py:allocate_for_instance:428
> Allocated network: '[]' for instance
> 2017-12-08 22:33:37,126 ERROR [oslo_messaging.rpc.server]
> /opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/server.py:_process_incoming:164
> Exception during message handling
> Traceback (most recent call last):
>   File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
>     res = self.dispatcher.dispatch(message)
>   File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
>     return self._do_dispatch(endpoint, method, ctxt, args)
>   File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
>     result = func(ctxt, **new_args)
>   File "/opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py", line 347, in allocate_for_instance
>     vif = nw_info[0]
> IndexError: list index out of range
>
> This problem goes away when we get rid of the slave_connection setting
> and just use a single master. Has anyone else seen this? Any
> recommendations to fix this issue?
>
> This issue is kind of similar to
> https://bugs.launchpad.net/nova/+bug/1249065

If anyone is running into a db race while running the database in
master-slave mode with async replication, the bug has been identified
and is getting fixed here:
https://bugs.launchpad.net/oslo.db/+bug/1746116

-- 
Arun S A G
http://zer0c00l.in/
Re: [openstack-dev] Race in FixedIP.associate_pool
On Fri, 15 Dec 2017 18:38:00 -0800, Arun Sag wrote:
> Here is the sequence of actions in nova-network:
>
> 1. allocate_for_instance calls allocate_fixed_ips.
> 2. FixedIPs are successfully associated (we can see this in the log).
> 3. allocate_for_instance calls get_instance_nw_info, which in turn
>    reads the FixedIPs associated in step 2 using
>    objects.FixedIPList.get_by_instance_uuid. This raises the
>    FixedIPNotFound exception.
>
> We removed the slave and ran with just a single master, and the
> errors went away. We also switched to semi-synchronous replication
> between master and slave, and the errors went away too. All of this
> points to a race between writes and reads to the DB. Does OpenStack
> expect synchronous replication to read-only slaves?

No, synchronous replication to read-only slaves is not expected. The
way this is handled is that oslo.db has the notion of an "async
reader", which is safe to use on an asynchronously updated slave
database, and a regular "reader", which is only safe to use on a
synchronously updated slave database; otherwise the master database
will be used [1].

In nova, we indicate to oslo.db whether a database API method is safe
for use on an asynchronously updated slave database using decorators
[2][3]. There are only a few methods decorated this way. The method
you're seeing the race with, fixed_ip_get_by_instance [4], is decorated
with the "reader" decorator, indicating that it's only safe for a
synchronously updated slave database; otherwise it will use the master.
So this query should *not* be going to an asynchronously updated slave
database. If you're using asynchronous replication, it should be going
to the master.

Have you patched any nova/db/sqlalchemy/api method decorators, or
patched oslo.db at all, to use the "async reader" for more methods? If
not, then it's possible there is a bug in oslo.db or nova related to
"async reader" state leaking across green threads. That reminds me of a
fairly recent bug [5] we ran into when doing a concurrent
scatter-gather to multiple cell databases. You might try the patch [6]
locally to see whether it changes the behavior when you have
asynchronous replication enabled. We had thought only scatter-gather
(which was introduced in pike) was affected, but it's possible the
async slave database read might also be affected. If you could try that
patch, please let me know whether it helps and we will backport it.

Thanks,
-melanie

[1] https://github.com/openstack/oslo.db/blob/0260f0e/oslo_db/sqlalchemy/enginefacade.py#L44-L59
[2] https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L214-L219
[3] https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L272
[4] https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L1469-L1470
[5] https://bugs.launchpad.net/nova/+bug/1722404
[6] https://review.openstack.org/#/c/511651
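To make the reader-mode distinction concrete, here is a minimal sketch
of the decorator pattern using the enginefacade API directly (the
FixedIp model and queries are illustrative only, not the real nova
schema or the exact nova decorator wrappers):

from oslo_db.sqlalchemy import enginefacade
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class FixedIp(Base):
    __tablename__ = 'fixed_ips'
    id = Column(Integer, primary_key=True)
    instance_uuid = Column(String(36))


# Plain "reader": only safe on a synchronously updated slave, so with
# asynchronous replication oslo.db routes this query to the master.
@enginefacade.reader
def fixed_ips_get(context, instance_uuid):
    return context.session.query(FixedIp).filter_by(
        instance_uuid=instance_uuid).all()


# "Async reader": explicitly allowed to run against the asynchronously
# replicated slave (slave_connection); callers must tolerate slightly
# stale results.
@enginefacade.reader.allow_async
def fixed_ips_get_allow_async(context, instance_uuid):
    return context.session.query(FixedIp).filter_by(
        instance_uuid=instance_uuid).all()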
Re: [openstack-dev] Race in FixedIP.associate_pool
Hi Jay,

On Fri, Dec 15, 2017 at 1:56 PM, Jay Pipes wrote:
> Can you point us to the code that is generating the above? It seems
> that get_instance_nw_info() in the Yahoo! manager.py contrib module,
> line 965, is trying to build network information for an empty list of
> vNICs... where is that list of vNICs coming from?

The vNICs are empty because objects.FixedIPList.get_by_instance_uuid
returns an empty list:
https://github.com/openstack/nova/blob/master/nova/network/manager.py#L527

The Yahoo! manager.py's get_instance_nw_info is essentially the same as
the upstream code, except that we change the VIF_TYPE:

@messaging.expected_exceptions(exception.InstanceNotFound)
def get_instance_nw_info(self, context, instance_id, rxtx_factor,
                         host, instance_uuid=None, **kwargs):
    """Creates network info list for instance.

    called by allocate_for_instance and network_api
    context needs to be elevated
    :returns: network info list [(network,info),(network,info)...]
     where network = dict containing pertinent data from a network db
     object and info = dict containing pertinent networking data
    """
    if not uuidutils.is_uuid_like(instance_id):
        instance_id = instance_uuid
    instance_uuid = instance_id
    LOG.debug('Get instance network info',
              instance_uuid=instance_uuid)

    try:
        fixed_ips = objects.FixedIPList.get_by_instance_uuid(
            context, instance_uuid)
    except exception.FixedIpNotFoundForInstance:
        fixed_ips = []

    LOG.debug('Found %d fixed IPs associated to the instance in the '
              'database.', len(fixed_ips),
              instance_uuid=instance_uuid)

    nw_info = network_model.NetworkInfo()

    # (saga): The default VIF_TYPE is bridge. We need to use OVS.
    # This is the only reason we copied this method from the base
    # class.
    if (not CONF.network_driver or
            CONF.network_driver == 'nova.network.linux_net'):
        if (CONF.linuxnet_interface_driver ==
                'nova.network.linux_net.LinuxOVSInterfaceDriver'):
            vif_type = network_model.VIF_TYPE_OVS

Here is the sequence of actions in nova-network:

1. allocate_for_instance calls allocate_fixed_ips.
2. FixedIPs are successfully associated (we can see this in the log).
3. allocate_for_instance calls get_instance_nw_info, which in turn
   reads the FixedIPs associated in step 2 using
   objects.FixedIPList.get_by_instance_uuid. This raises the
   FixedIPNotFound exception.

We removed the slave and ran with just a single master, and the errors
went away. We also switched to semi-synchronous replication between
master and slave, and the errors went away too. All of this points to a
race between writes and reads to the DB. Does OpenStack expect
synchronous replication to read-only slaves?

-- 
Arun S A G
http://zer0c00l.in/
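To spell out the sequence above in code form, this is effectively the
write-then-read race we observe (a simplified sketch; the
FixedIP.associate_pool and FixedIPList.get_by_instance_uuid calls are
the real nova object methods named in the thread, while the surrounding
variables are illustrative):

# Steps 1-2: allocate_fixed_ips associates an IP for the instance.
# This is a write, committed on the master.
fixed_ip = objects.FixedIP.associate_pool(
    context, network['id'], instance_uuid=instance.uuid)

# Step 3: get_instance_nw_info immediately reads the association back.
# If this SELECT is served by a slave that has not yet replayed the
# write above, no rows are found, FixedIpNotFoundForInstance is raised,
# nw_info ends up as [], and allocate_for_instance fails with
# "IndexError: list index out of range" at vif = nw_info[0].
fixed_ips = objects.FixedIPList.get_by_instance_uuid(
    context, instance.uuid)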
Re: [openstack-dev] Race in FixedIP.associate_pool
On 12/12/2017 03:22 PM, Arun SAG wrote:
> This is roughly how the logs look:
>
> 2017-12-08 22:33:37,124 DEBUG
> [yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
> /opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:894
> Fixed IP NOT found for instance
> 2017-12-08 22:33:37,125 DEBUG
> [yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
> /opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:965
> Built network info: |[]|

Can you point us to the code that is generating the above? It seems
that get_instance_nw_info() in the Yahoo! manager.py contrib module,
line 965, is trying to build network information for an empty list of
vNICs... where is that list of vNICs coming from?

Best,
-jay
[openstack-dev] Race in FixedIP.associate_pool
Hello,

We are running nova-network in ocata. We use mysql in a master-slave
configuration. The master is read/write, and all reads go to the slave
(slave_connection is set). When we tried to boot multiple VMs in
parallel (let's say 15), we saw a race in allocate_for_instance's
FixedIP.associate_pool. We see FixedIP.associate_pool associate an IP,
but later in the code we try to read the allocated FixedIP using
objects.FixedIPList.get_by_instance_uuid and it throws
FixedIPNotFoundException. We also checked the slave replication status,
and Seconds_Behind_Master is 0:

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: master.hostname
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysqld-bin.03
          Read_Master_Log_Pos: 142605121
               Relay_Log_File: mysqld-relay-bin.06
                Relay_Log_Pos: 142602813
        Relay_Master_Log_File: mysqld-bin.03
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB: mysql,mysql
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 142605121
              Relay_Log_Space: 142603151
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 172687130
                  Master_UUID: 8112eeef-de1e-11e7-8098-ac162da46cbc
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0

This is roughly how the logs look:

2017-12-08 22:33:37,124 DEBUG
[yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
/opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:894
Fixed IP NOT found for instance
2017-12-08 22:33:37,125 DEBUG
[yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
/opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:965
Built network info: |[]|
2017-12-08 22:33:37,126 INFO [nova.network.manager]
/opt/openstack/venv/nova/lib/python2.7/site-packages/nova/network/manager.py:allocate_for_instance:428
Allocated network: '[]' for instance
2017-12-08 22:33:37,126 ERROR [oslo_messaging.rpc.server]
/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/server.py:_process_incoming:164
Exception during message handling
Traceback (most recent call last):
  File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py", line 347, in allocate_for_instance
    vif = nw_info[0]
IndexError: list index out of range

This problem goes away when we get rid of the slave_connection setting
and just use a single master. Has anyone else seen this? Any
recommendations to fix this issue?

This issue is kind of similar to
https://bugs.launchpad.net/nova/+bug/1249065

-- 
Arun S A G
http://zer0c00l.in/