Re: [openstack-dev] Race in FixedIP.associate_pool

2018-01-30 Thread melanie witt
> On Jan 29, 2018, at 16:45, Arun SAG  wrote:
> 
> If anyone is running into this DB race while running the database in
> master-slave mode with async replication, the bug has been identified
> and is being fixed here:
> https://bugs.launchpad.net/oslo.db/+bug/1746116

Thanks for your persistence in tracking down the problem and raising the bug. 
If you get a chance, do please test the proposed patch in your environment to 
help ensure there aren’t any loose ends left.

Once the fix is merged, I think we should propose backports to stable/pike and 
stable/ocata, do .z releases for them, and bump oslo.db in 
upper-constraints.txt in the openstack/requirements repo for the stable/pike 
and stable/ocata branches. That way, operators running from stable can get the 
fix by upgrading their oslo.db packages to those .z releases.
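
For clarity, the bump would be a one-line change to upper-constraints.txt in
openstack/requirements on each stable branch, along these lines (the version
numbers here are purely illustrative, not the actual .z releases):

  -oslo.db===4.25.0
  +oslo.db===4.25.1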

-melanie


Re: [openstack-dev] Race in FixedIP.associate_pool

2018-01-29 Thread Arun SAG
Hello,


On Tue, Dec 12, 2017 at 12:22 PM, Arun SAG  wrote:
> Hello,
>
> We are running nova-network in ocata. We use MySQL in a master-slave
> configuration: the master is read/write, and all reads go to the slave
> (slave_connection is set). When we try to boot multiple VMs in
> parallel (let's say 15), we see a race in allocate_for_instance's
> FixedIP.associate_pool. FixedIP.associate_pool associates an
> IP, but later in the code we try to read the allocated FixedIP using
> objects.FixedIPList.get_by_instance_uuid and it throws
> FixedIPNotFoundException. We also checked the slave replication status,
> and Seconds_Behind_Master is 0.
>
[snip]
>
> This is how the logs look:
> 2017-12-08 22:33:37,124 DEBUG
> [yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
> /opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:894
> Fixed IP NOT found for instance
> 2017-12-08 22:33:37,125 DEBUG
> [yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
> /opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:965
> Built network info: |[]|
> 2017-12-08 22:33:37,126 INFO [nova.network.manager]
> /opt/openstack/venv/nova/lib/python2.7/site-packages/nova/network/manager.py:allocate_for_instance:428
> Allocated network: '[]' for instance
> 2017-12-08 22:33:37,126 ERROR [oslo_messaging.rpc.server]
> /opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/server.py:_process_incoming:164
> Exception during message handling
> Traceback (most recent call last):
>   File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
>     res = self.dispatcher.dispatch(message)
>   File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
>     return self._do_dispatch(endpoint, method, ctxt, args)
>   File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
>     result = func(ctxt, **new_args)
>   File "/opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py", line 347, in allocate_for_instance
>     vif = nw_info[0]
> IndexError: list index out of range
>
>
> This problem goes away when we get rid of the slave_connection setting
> and just use a single master. Has anyone else seen this? Any
> recommendations for fixing this issue?
>
> This issue is similar to https://bugs.launchpad.net/nova/+bug/1249065
>

If anyone is running into this DB race while running the database in
master-slave mode with async replication, the bug has been identified
and is being fixed here:
https://bugs.launchpad.net/oslo.db/+bug/1746116

-- 
Arun S A G
http://zer0c00l.in/



Re: [openstack-dev] Race in FixedIP.associate_pool

2017-12-21 Thread melanie witt

On Fri, 15 Dec 2017 18:38:00 -0800, Arun Sag wrote:

Here is the sequence of actions that happens in nova-network:

1. allocate_for_instance calls allocate_fixed_ips
2. FixedIPs are successfully associated (we can see this in the log)
3. allocate_for_instance calls get_instance_nw_info, which in turn
fetches the fixed IPs associated in step 2 using
objects.FixedIPList.get_by_instance_uuid. This raises a FixedIPNotFound
exception.

When we removed the slave and ran with just a single master, the
errors went away. We also switched to semi-synchronous replication
between master and slave, and the errors went away too. All of this
points to a race between writes and reads to the DB.

Does OpenStack expect synchronous replication to read-only slaves?


No, synchronous replication to read-only slaves is not expected.

The way this is handled is that oslo.db has the notion of an "async 
reader", which is safe to use on an asynchronously updated slave database, 
and a regular "reader", which is only safe to use on a synchronously 
updated slave database; otherwise the master database will be used [1].


In nova, we indicate to oslo.db whether a database API method is safe 
for use on an asynchronously updated slave database using decorators 
[2][3]. There are only a few methods decorated this way.


The method you're seeing the race with, fixed_ip_get_by_instance [4], is 
decorated with the "reader" decorator, indicating that it's only safe 
for a synchronously updated slave database; otherwise it will use the master.
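
For anyone not familiar with what that looks like in practice, here is a
minimal sketch of the two decorator flavours using oslo.db's enginefacade
directly (this is not the actual nova code; the function names and the raw
SQL are just for illustration):

from oslo_db.sqlalchemy import enginefacade
from sqlalchemy import text

# One enginefacade transaction context, as a DB API module would set up.
context_manager = enginefacade.transaction_context()

@context_manager.reader
def fixed_ip_get_sync_reader(context, instance_uuid):
    # Plain "reader": oslo.db only routes this to slave_connection if
    # the slave is synchronously updated; with async replication it is
    # expected to go to the master.
    return context.session.execute(
        text("SELECT * FROM fixed_ips WHERE instance_uuid = :uuid"),
        {"uuid": instance_uuid}).fetchall()

@context_manager.reader.allow_async
def fixed_ip_get_async_reader(context, instance_uuid):
    # "Async reader": explicitly marked as tolerating stale data, so it
    # may be served from an asynchronously replicated slave.
    return context.session.execute(
        text("SELECT * FROM fixed_ips WHERE instance_uuid = :uuid"),
        {"uuid": instance_uuid}).fetchall()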


So, this query should *not* be going to an asynchronously updated slave 
database. If you're using asynchronous replication, it should be going 
to the master.


Have you patched any nova/db/sqlalchemy/api method decorators or patched 
oslo.db at all to use the "async reader" for more methods? If not, then 
it's possible there is a bug in oslo.db or nova related to "async 
reader" state leaking across green threads.


Which reminds me of a fairly recent bug [5] we ran into when doing a 
concurrent scatter-gather to multiple cell databases. You might try the 
patch [6] locally to see if it changes the behavior when you have 
asynchronous replication enabled. We had thought only scatter-gather was 
affected (which was introduced in pike) but it's possible the async 
slave database read might also be affected.


If you could try that patch, please let me know whether it helps and we 
will backport it.


Thanks,
-melanie

[1] 
https://github.com/openstack/oslo.db/blob/0260f0e/oslo_db/sqlalchemy/enginefacade.py#L44-L59
[2] 
https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L214-L219
[3] 
https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L272
[4] 
https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L1469-L1470

[5] https://bugs.launchpad.net/nova/+bug/1722404
[6] https://review.openstack.org/#/c/511651



Re: [openstack-dev] Race in FixedIP.associate_pool

2017-12-15 Thread Arun SAG
Hi Jay,

On Fri, Dec 15, 2017 at 1:56 PM, Jay Pipes  wrote:

> Can you point us to the code that is generating the above? It seems that
> get_instance_nw_info() in the Yahoo! manager.py contrib module line 965 is
> trying to build network information for an empty list of vNICs... where is
> that list of vNICs coming from?


The vNICs are empty because objects.FixedIPList.get_by_instance_uuid
returns no fixed IPs:
https://github.com/openstack/nova/blob/master/nova/network/manager.py#L527
The Yahoo! manager.py's get_instance_nw_info is essentially the same as
the upstream code, except that we change the VIF_TYPE:

@messaging.expected_exceptions(exception.InstanceNotFound)
def get_instance_nw_info(self, context, instance_id, rxtx_factor,
                         host, instance_uuid=None, **kwargs):
    """Creates network info list for instance.

    called by allocate_for_instance and network_api
    context needs to be elevated
    :returns: network info list [(network,info),(network,info)...]
              where network = dict containing pertinent data from a
              network db object and info = dict containing pertinent
              networking data
    """
    if not uuidutils.is_uuid_like(instance_id):
        instance_id = instance_uuid
    instance_uuid = instance_id
    LOG.debug('Get instance network info', instance_uuid=instance_uuid)

    try:
        fixed_ips = objects.FixedIPList.get_by_instance_uuid(
            context, instance_uuid)
    except exception.FixedIpNotFoundForInstance:
        fixed_ips = []

    LOG.debug('Found %d fixed IPs associated to the instance in the '
              'database.',
              len(fixed_ips), instance_uuid=instance_uuid)

    nw_info = network_model.NetworkInfo()
    # (saga): The default VIF_TYPE is bridge. We need to use OVS.
    # This is the only reason we copied this method from the base class.
    if (not CONF.network_driver or
            CONF.network_driver == 'nova.network.linux_net'):
        if (CONF.linuxnet_interface_driver ==
                'nova.network.linux_net.LinuxOVSInterfaceDriver'):
            vif_type = network_model.VIF_TYPE_OVS


Here is the sequence of actions that happens in nova-network:

1. allocate_for_instance calls allocate_fixed_ips
2. FixedIPs are successfully associated (we can see this in the log)
3. allocate_for_instance calls get_instance_nw_info, which in turn
fetches the fixed IPs associated in step 2 using
objects.FixedIPList.get_by_instance_uuid. This raises a FixedIPNotFound
exception.

When we removed the slave and ran with just a single master, the
errors went away. We also switched to semi-synchronous replication
between master and slave, and the errors went away too. All of this
points to a race between writes and reads to the DB.
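
Stripped of the nova machinery, what we believe we are seeing is a plain
read-after-write race against an asynchronously replicated slave. A
minimal, hypothetical sketch (connection URLs, table and column names are
illustrative only):

from sqlalchemy import create_engine, text

master = create_engine("mysql+pymysql://nova:secret@master.hostname/nova")
slave = create_engine("mysql+pymysql://nova:secret@slave.hostname/nova")

# Steps 1-2: the fixed IP association is committed on the master.
with master.begin() as conn:
    conn.execute(
        text("UPDATE fixed_ips SET instance_uuid = :uuid WHERE id = :id"),
        {"uuid": "11111111-2222-3333-4444-555555555555", "id": 42})

# Step 3: the follow-up read goes to the slave.
with slave.connect() as conn:
    row = conn.execute(
        text("SELECT address FROM fixed_ips WHERE instance_uuid = :uuid"),
        {"uuid": "11111111-2222-3333-4444-555555555555"}).first()

# With asynchronous replication the UPDATE may not have reached the
# slave yet, so `row` can be None even though the write committed --
# which is essentially the FixedIPNotFound we are seeing.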

Does OpenStack expect synchronous replication to read-only slaves?


-- 
Arun S A G
http://zer0c00l.in/



Re: [openstack-dev] Race in FixedIP.associate_pool

2017-12-15 Thread Jay Pipes

On 12/12/2017 03:22 PM, Arun SAG wrote:

This is how the logs look:
2017-12-08 22:33:37,124 DEBUG
[yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
/opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:894
Fixed IP NOT found for instance
2017-12-08 22:33:37,125 DEBUG
[yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
/opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:965
Built network info: |[]|


Can you point us to the code that is generating the above? It seems that 
get_instance_nw_info() in the Yahoo! manager.py contrib module line 965 
is trying to build network information for an empty list of vNICs... 
where is that list of vNICs coming from?


Best,
-jay



[openstack-dev] Race in FixedIP.associate_pool

2017-12-12 Thread Arun SAG
Hello,

We are running nova-network in ocata. We use MySQL in a master-slave
configuration: the master is read/write, and all reads go to the slave
(slave_connection is set). When we try to boot multiple VMs in
parallel (let's say 15), we see a race in allocate_for_instance's
FixedIP.associate_pool. FixedIP.associate_pool associates an
IP, but later in the code we try to read the allocated FixedIP using
objects.FixedIPList.get_by_instance_uuid and it throws
FixedIPNotFoundException. We also checked the slave replication status,
and Seconds_Behind_Master is 0.
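
For reference, the relevant part of our nova.conf looks roughly like this
(hostnames and credentials are placeholders):

[database]
connection = mysql+pymysql://nova:secret@master.hostname/nova
slave_connection = mysql+pymysql://nova:secret@slave.hostname/nova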

mysql> show slave status \G
*** 1. row ***
   Slave_IO_State: Waiting for master to send event
  Master_Host: master.hostname
  Master_User: repl
  Master_Port: 3306
Connect_Retry: 60
  Master_Log_File: mysqld-bin.03
  Read_Master_Log_Pos: 142605121
   Relay_Log_File: mysqld-relay-bin.06
Relay_Log_Pos: 142602813
Relay_Master_Log_File: mysqld-bin.03
 Slave_IO_Running: Yes
Slave_SQL_Running: Yes
  Replicate_Do_DB:
  Replicate_Ignore_DB: mysql,mysql
   Replicate_Do_Table:
   Replicate_Ignore_Table:
  Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
   Last_Errno: 0
   Last_Error:
 Skip_Counter: 0
  Exec_Master_Log_Pos: 142605121
  Relay_Log_Space: 142603151
  Until_Condition: None
   Until_Log_File:
Until_Log_Pos: 0
   Master_SSL_Allowed: No
   Master_SSL_CA_File:
   Master_SSL_CA_Path:
  Master_SSL_Cert:
Master_SSL_Cipher:
   Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
   Last_SQL_Errno: 0
   Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
 Master_Server_Id: 172687130
  Master_UUID: 8112eeef-de1e-11e7-8098-ac162da46cbc
 Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
  SQL_Remaining_Delay: NULL
  Slave_SQL_Running_State: Slave has read all relay log; waiting
for the slave I/O thread to update it
   Master_Retry_Count: 86400
  Master_Bind:
  Last_IO_Error_Timestamp:
 Last_SQL_Error_Timestamp:
   Master_SSL_Crl:
   Master_SSL_Crlpath:
   Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0

This is how the logs look:
2017-12-08 22:33:37,124 DEBUG
[yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
/opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:894
Fixed IP NOT found for instance
2017-12-08 22:33:37,125 DEBUG
[yahoo.contrib.ocata_openstack_yahoo_plugins.nova.network.manager]
/opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py:get_instance_nw_info:965
Built network info: |[]|
2017-12-08 22:33:37,126 INFO [nova.network.manager]
/opt/openstack/venv/nova/lib/python2.7/site-packages/nova/network/manager.py:allocate_for_instance:428
Allocated network: '[]' for instance
2017-12-08 22:33:37,126 ERROR [oslo_messaging.rpc.server]
/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/server.py:_process_incoming:164
Exception during message handling
Traceback (most recent call last):
  File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/opt/openstack/venv/nova/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/opt/openstack/venv/nova/lib/python2.7/site-packages/yahoo/contrib/ocata_openstack_yahoo_plugins/nova/network/manager.py", line 347, in allocate_for_instance
    vif = nw_info[0]
IndexError: list index out of range


This problem goes away when we get rid of the slave_connection setting
and just use a single master. Has anyone else seen this? Any
recommendations for fixing this issue?

This issue is similar to https://bugs.launchpad.net/nova/+bug/1249065

-- 
Arun S A G
http://zer0c00l.in/

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev