Re: [Openstack] instance_info_caches table, nova not populating for some instances

2015-01-22 Thread Don Waterloo
On 3 December 2014 at 11:58, Don Waterloo don.water...@gmail.com wrote:

 I am having a problem that I hope someone can comment on.

 Periodically, an instance ends up w/ 0 rows in 'instance_info_caches' in
 the nova database.


If anyone else has a chance to run this SQL query against their nova
database, it will show whether you are seeing the same problem I am:

select instances.host,instances.hostname,instances.uuid,instances.user_id
from instance_info_caches,instances where network_info = '[]' and
instances.deleted = 0 and instances.uuid =
instance_info_caches.instance_uuid;

The expectation is this returns 0 rows. I'm finding about 1 instance in 100
ends up with [] for the network_info.
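
For anyone who wants to try the check without touching a production
database, here is a minimal sketch in Python. It assumes a toy SQLite
schema holding only the columns the query touches (the real nova tables
have many more), and the names find_empty_caches and demo are hypothetical.
The query is the same one, written with an explicit JOIN.

```python
import sqlite3

# Toy schema: only the columns the query touches (an assumption; the real
# nova tables have many more columns).
SCHEMA = """
CREATE TABLE instances (
    uuid TEXT, host TEXT, hostname TEXT, user_id TEXT, deleted INTEGER
);
CREATE TABLE instance_info_caches (
    instance_uuid TEXT, network_info TEXT
);
"""

QUERY = """
SELECT i.host, i.hostname, i.uuid, i.user_id
FROM instances AS i
JOIN instance_info_caches AS c ON c.instance_uuid = i.uuid
WHERE c.network_info = '[]' AND i.deleted = 0
"""


def find_empty_caches(conn):
    """Return live instances whose cached network_info is the empty list."""
    return conn.execute(QUERY).fetchall()


def demo():
    """Load three made-up instances and run the check against them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO instances VALUES (?, ?, ?, ?, ?)",
        [("u1", "host1", "vm1", "alice", 0),   # healthy instance
         ("u2", "host1", "vm2", "bob", 0),     # live, but empty cache
         ("u3", "host2", "vm3", "carol", 1)],  # deleted, should be ignored
    )
    conn.executemany(
        "INSERT INTO instance_info_caches VALUES (?, ?)",
        [("u1", '[{"id": "port-a"}]'), ("u2", "[]"), ("u3", "[]")],
    )
    return find_empty_caches(conn)
```

Only the live instance with an empty cache should come back; deleted
instances are filtered out by the instances.deleted = 0 condition.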
___
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to : openstack@lists.openstack.org
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack


Re: [Openstack] instance_info_caches table, nova not populating for some instances

2015-01-22 Thread Don Waterloo
On 22 January 2015 at 09:30, Don Waterloo don.water...@gmail.com wrote:



 On 3 December 2014 at 11:58, Don Waterloo don.water...@gmail.com wrote:

 I am having a problem that I hope someone can comment on.

 Periodically, an instance ends up w/ 0 rows in 'instance_info_caches' in
 the nova database.


 If anyone else has a chance, and can try running this sql query against
 their nova database, it will show if you are seeing the same problem I am:

 select instances.host,instances.hostname,instances.uuid,instances.user_id
 from instance_info_caches,instances where network_info = '[]' and
 instances.deleted = 0 and instances.uuid =
 instance_info_caches.instance_uuid;

 The expectation is this returns 0 rows. I'm finding about 1 instance in
 100 ends up with [] for the network_info.


After some more digging, what is happening in the bad case is a race
condition.

In the 'good' case, _allocate_network_async() is called followed
by _get_guest_xml()

In the 'bad' case, _allocate_network_async() is called, followed
by _get_instance_nw_info(), i.e. it is doing a refresh_cache() on this
instance while it is being created (before it's finished).
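
To make the ordering problem concrete, here is a toy model in Python. It
is not nova code: FakeInstance, finish_allocation, refresh_cache and build
are hypothetical names, and the only assumption carried over from the
description above is that a refresh writes whatever network state exists
at the moment it runs.

```python
class FakeInstance:
    """Toy stand-in for an instance being built (not a nova object)."""

    def __init__(self):
        self.allocated_ports = []  # ports the network side has finished creating
        self.info_cache = None     # network_info as it would have been cached


def finish_allocation(instance):
    """Pretend the asynchronous network allocation has completed."""
    instance.allocated_ports.append({"id": "port-1"})


def refresh_cache(instance):
    """Overwrite the cache with whatever is allocated at this moment."""
    instance.info_cache = list(instance.allocated_ports)


def build(order):
    """Run the named steps in the given order and return the resulting cache."""
    steps = {"allocate": finish_allocation, "refresh": refresh_cache}
    instance = FakeInstance()
    for name in order:
        steps[name](instance)
    return instance.info_cache


good = build(["allocate", "refresh"])  # refresh sees the finished port
bad = build(["refresh", "allocate"])   # refresh runs too early and stores []
```

In this model the early refresh leaves an empty list behind even though the
port exists afterwards, which matches the symptom of a cache holding [].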


Re: [Openstack] instance_info_caches table, nova not populating for some instances

2015-01-21 Thread Don Waterloo
On 20 January 2015 at 22:20, Don Waterloo don.water...@gmail.com wrote:

 For any one else who hits this, I entered
 https://bugs.launchpad.net/nova/+bug/1413049
 which contains a patch.

 In a nutshell, @ the bottom end of reloading/healing the cache, it turns
 around and uses the cache in _gather_port_ids_and_networks().

 My fix is to use the neutron api and call list_ports.



Sigh, my patch is not correct. The order of the ports is not preserved on
the neutron /ports.json call, and that is important to nova.

Does anyone know a solution to this?

E.g. if I have an instance that has 2 ports, GET
/...ports.json?device_id=UUID might return them in either order, presumably
since it's stored as a hashmap or something.


Re: [Openstack] instance_info_caches table, nova not populating for some instances

2015-01-21 Thread Kevin Benton
Couldn't you sort the result from the neutron port query by port UUID so
the ordering is maintained until ports are added or deleted?
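
The suggestion above could be sketched roughly as follows in Python. The
helper name sort_ports_by_id is hypothetical, and the sample data only
mimics the shape of a neutron ports listing (a "ports" list of dicts with
an "id" key); the IDs are made up and shortened.

```python
def sort_ports_by_id(ports):
    """Return the port dicts ordered by their 'id' field (the port UUID)."""
    return sorted(ports, key=lambda port: port["id"])


# Two responses for the same instance, with the ports in different orders.
# The dict shape mimics a neutron "ports" listing; the values are made up.
response_a = {"ports": [{"id": "b7e1", "device_id": "vm-1"},
                        {"id": "3c9f", "device_id": "vm-1"}]}
response_b = {"ports": [{"id": "3c9f", "device_id": "vm-1"},
                        {"id": "b7e1", "device_id": "vm-1"}]}

ordered_a = [p["id"] for p in sort_ports_by_id(response_a["ports"])]
ordered_b = [p["id"] for p in sort_ports_by_id(response_b["ports"])]
```

Both responses end up in the same order after sorting. Note that this
order is by UUID, not by the order in which the ports were attached, and a
port that is added or deleted can shift the positions of the others.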

On Wed, Jan 21, 2015 at 1:49 PM, Don Waterloo don.water...@gmail.com
wrote:



 On 20 January 2015 at 22:20, Don Waterloo don.water...@gmail.com wrote:

 For any one else who hits this, I entered
 https://bugs.launchpad.net/nova/+bug/1413049
 which contains a patch.

 In a nutshell, @ the bottom end of reloading/healing the cache, it turns
 around and uses the cache in _gather_port_ids_and_networks().

 My fix is to use the neutron api and call list_ports.



 sigh, my patch is not correct. the order of the ports is not preserved on
 the neutron /ports.json call, and that is important to nova.

 Does anyone know a solution to this?

 e.g. if i have an instance that has 2 ports, GET
 /...ports.json?device_id=UUID might return them in either order, presumably
 since its stored as a hashmap or something.







-- 
Kevin Benton


Re: [Openstack] instance_info_caches table, nova not populating for some instances

2015-01-20 Thread Nathanael Burton
Don,

I created this bug [1] for Nova a while ago which sounds similar to the
problem you're having although we were running nova-network not neutron.  I
proposed a fix [2] for that bug but it never got merged because I didn't
have time to write the tests.

[1] - https://bugs.launchpad.net/nova/+bug/1378459
[2] - https://review.openstack.org/126633

Nate

On Mon, Jan 19, 2015 at 7:36 PM, Don Waterloo don.water...@gmail.com
wrote:



 On 19 January 2015 at 15:40, Michael Still mi...@stillhq.com wrote:

 I've never heard of anything like this.

 What release of OpenStack are you running? What hypervisor driver?

 Thanks,
 Michael


 This is Juno on Ubuntu 14.10 with libvirt kvm.
 After more digging, the periodic task for
 heal_instance_info_cache_interval is returning [] for the affected
 instances, which is overwriting the field. Later, when the user does a
 rebuild, the .xml file gets no interfaces.

 So my bug lies in there somewhere.

 _get_instance_nw_info() is returning the [].






Re: [Openstack] instance_info_caches table, nova not populating for some instances

2015-01-20 Thread Don Waterloo
On 20 January 2015 at 14:04, Don Waterloo don.water...@gmail.com wrote:



 On 20 January 2015 at 13:22, Nathanael Burton 
 nathanael.i.bur...@gmail.com wrote:

 Don,

 I created this bug [1] for Nova a while ago which sounds similar to the
 problem you're having although we were running nova-network not neutron.  I
 proposed a fix [2] for that bug but it never got merged because I didn't
 have time to write the tests.

 [1] - https://bugs.launchpad.net/nova/+bug/1378459
 [2] - https://review.openstack.org/126633

 Nate


It seems there may be a circular issue. What ends up happening is that in
nova/network/neutronv2/api.py, in _gather_port_ids_and_networks()... when I
come in there and my cache is set to [] (e.g. trying to heal or reconcile),
it calls:
ifaces = compute_utils.get_nw_info_for_instance(instance)

which in turn goes and tries to fill the cache.

So while trying to fill the cache, one of the things it relies on must
already be in the cache. Since it's not there, it gets [] for the ifaces,
which in turn makes [] for the ports, and it all falls apart.
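
The circular dependency can be illustrated with a toy model in Python.
This is not the nova code: heal_from_cache and heal_from_neutron are
hypothetical names, and the port dicts are made up. The first function
assumes the heal derives its list of known ports from the cache itself;
the second assumes it asks the authoritative source instead.

```python
def heal_from_cache(cached_ifaces, neutron_ports):
    """Rebuild network info, using the cache as the list of known ports."""
    known_ids = {iface["id"] for iface in cached_ifaces}
    return [port for port in neutron_ports if port["id"] in known_ids]


def heal_from_neutron(neutron_ports):
    """Rebuild network info from the ports reported for the instance."""
    return list(neutron_ports)


ports = [{"id": "p1"}, {"id": "p2"}]

# Once the cache is [], healing from the cache can never recover:
cache = []
for _ in range(3):
    cache = heal_from_cache(cache, ports)
stuck = cache

# Healing from the authoritative port list recovers in one pass:
recovered = heal_from_neutron(ports)
```

However many times the first heal runs, an empty cache stays empty, which
is the behaviour seen with the periodic heal task on affected instances.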


Re: [Openstack] instance_info_caches table, nova not populating for some instances

2015-01-20 Thread Don Waterloo
On 20 January 2015 at 13:22, Nathanael Burton nathanael.i.bur...@gmail.com
wrote:

 Don,

 I created this bug [1] for Nova a while ago which sounds similar to the
 problem you're having although we were running nova-network not neutron.  I
 proposed a fix [2] for that bug but it never got merged because I didn't
 have time to write the tests.

 [1] - https://bugs.launchpad.net/nova/+bug/1378459
 [2] - https://review.openstack.org/126633

 Nate


Thank you for that. I'm working on finding the root cause now. Mine for
sure is neither an upgrade nor a manual change to the db. In my case, the
instance starts life with a good info_cache, and then it mysteriously goes
to [] (an empty Python list). The heal keeps getting the value [] each time
for these instances.

So some debuggery is ahead of me. It sounds related for sure.


Re: [Openstack] instance_info_caches table, nova not populating for some instances

2015-01-19 Thread Don Waterloo
On 3 December 2014 at 11:58, Don Waterloo don.water...@gmail.com wrote:

 I am having a problem that I hope someone can comment on.

 Periodically, an instance ends up w/ 0 rows in 'instance_info_caches' in
 the nova database.

 as a consequence, when i do 'nova list', it ends up without knowing
 anything about the networks. The instance is allocated an IP, has booted,
 is able to use that IP. Neutron owns the port for it, all is good from that
 standpoint, it's just nova knows nothing about it.

 Is 'info_caches' something that is truly a cache? it seems the only known
 repository.



Sorry to follow up my own email, but... is anyone else hitting this? I'm
getting more than just a 'no ip in nova list' symptom: once in a while some
instance ends up w/ 0 bridges in its virsh xml file. What happens is it
comes up normally, all is good and happy, but then some number of days(?)
later, it ends up with no source bridges, no interfaces, and a [] for the
'network_info' field in the instance_info_caches table.

Any idea how this could happen? It's Juno on Ubuntu. In ~10K instances
started/stopped since ~Jan 1, I now have 15 in this state, so it's not super
common. This symptom is more severe, so I cannot live with it.

A reboot does not solve it, nor does a rebuild (it rebuilds from this info).
Neutron still says the instance is connected, but nova gets it wrong.

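select * from instance_info_caches where network_info = '[]' and deleted = 0;
+---------------------+---------------------+------------+------+--------------+--------------------------------------+---------+
| created_at          | updated_at          | deleted_at | id   | network_info | instance_uuid                        | deleted |
+---------------------+---------------------+------------+------+--------------+--------------------------------------+---------+
| 2014-11-03 21:47:44 | 2014-11-03 21:48:05 | NULL       | 4762 | []           | 6996aa1c-7c05-4e36-a86e-d45f7af14352 |       0 |
 ...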


Re: [Openstack] instance_info_caches table, nova not populating for some instances

2015-01-19 Thread Michael Still
I've never heard of anything like this.

What release of OpenStack are you running? What hypervisor driver?

Thanks,
Michael

On Tue, Jan 20, 2015 at 7:46 AM, Don Waterloo don.water...@gmail.com wrote:


 On 3 December 2014 at 11:58, Don Waterloo don.water...@gmail.com wrote:

 I am having a problem that I hope someone can comment on.

 Periodically, an instance ends up w/ 0 rows in 'instance_info_caches' in
 the nova database.

 as a consequence, when i do 'nova list', it ends up without knowing
 anything about the networks. The instance is allocated an IP, has booted, is
 able to use that IP. Neutron owns the port for it, all is good from that
 standpoint, it's just nova knows nothing about it.

 Is 'info_caches' something that is truly a cache? it seems the only known
 repository.



 Sorry to follow up my own email, but... is anyone else hitting this? I'm
 getting more than just a 'no ip in nova list' symptom: once in a while some
 instance ends up w/ 0 bridges in its virsh xml file. What happens is it
 comes up normally, all is good and happy, but then some number of days(?)
 later, it ends up with no source bridges, no interfaces, and a [] for the
 'network_info' field in the instance_info_caches table.

 Any idea how this could happen? It's Juno on Ubuntu. In ~10K instances
 started/stopped since ~Jan 1, I now have 15 in this state, so it's not super
 common. This symptom is more severe, so I cannot live with it.

 A reboot does not solve it, nor does a rebuild (it rebuilds from this info).
 Neutron still says the instance is connected, but nova gets it wrong.

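 select * from instance_info_caches where network_info = '[]' and deleted = 0;
 +---------------------+---------------------+------------+------+--------------+--------------------------------------+---------+
 | created_at          | updated_at          | deleted_at | id   | network_info | instance_uuid                        | deleted |
 +---------------------+---------------------+------------+------+--------------+--------------------------------------+---------+
 | 2014-11-03 21:47:44 | 2014-11-03 21:48:05 | NULL       | 4762 | []           | 6996aa1c-7c05-4e36-a86e-d45f7af14352 |       0 |
  ...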







-- 
Rackspace Australia



[Openstack] instance_info_caches table, nova not populating for some instances

2014-12-03 Thread Don Waterloo
I am having a problem that I hope someone can comment on.

Periodically, an instance ends up w/ 0 rows in 'instance_info_caches' in
the nova database.

As a consequence, when I do 'nova list', it ends up without knowing
anything about the networks. The instance is allocated an IP, has booted,
is able to use that IP. Neutron owns the port for it, all is good from that
standpoint, it's just nova knows nothing about it.

Is 'info_caches' something that is truly a cache? It seems to be the only
known repository.

There are some spots where it's possible the cache is not written and no
message is logged, e.g. in network/manager.py, in
_do_trigger_security_group_members_refresh_for_instance():

    try:
        # NOTE(vish): We need to make sure the instance info cache has been
        #             updated with new ip info before we trigger the
        #             security group refresh. This is somewhat inefficient
        #             but avoids doing some dangerous refactoring for a
        #             bug fix.
        nw_info = self.get_instance_nw_info(admin_context, instance_id,
                                            None, None)
        ic = objects.InstanceInfoCache.new(admin_context, instance_id)
        ic.network_info = nw_info
        ic.save(update_cells=False)
    except exception.InstanceInfoCacheNotFound:
        pass

No error, no message will be thrown.
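
As a minimal sketch of what a less silent version might look like, here is
a standalone Python example. It does not use nova: InstanceInfoCacheNotFound
is a local stand-in for the nova exception of the same name, and
refresh_cache_logged and the save_cache callable are hypothetical. The only
point is that the except branch logs a warning instead of passing.

```python
import logging

LOG = logging.getLogger(__name__)


class InstanceInfoCacheNotFound(Exception):
    """Local stand-in for the nova exception of the same name."""


def refresh_cache_logged(instance_id, save_cache):
    """Call save_cache(instance_id); log a warning if the cache row is missing.

    Returns True if the cache was saved, False if the row was not found.
    """
    try:
        save_cache(instance_id)
        return True
    except InstanceInfoCacheNotFound:
        # Instead of a bare 'pass', leave a trace that the update was skipped.
        LOG.warning("No info cache row for instance %s; cache not updated",
                    instance_id)
        return False


def missing_row(instance_id):
    """Hypothetical save function that always fails to find the row."""
    raise InstanceInfoCacheNotFound(instance_id)


def working_save(instance_id):
    """Hypothetical save function that succeeds."""
    return None
```

With something like this, an instance whose cache row is missing would at
least show up in the logs rather than failing without a trace.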

It appears it's intended to update periodically
(update_instance_cache_with_nw_info()); not sure if that is my problem
(e.g. the refresh wiped it) or the workaround (e.g. it will eventually(?)
fix it).

Anyone have any light to shed on this?