Re: [Openstack] instance_info_caches table, nova not populating for some instances
On 3 December 2014 at 11:58, Don Waterloo don.water...@gmail.com wrote:

> I am having a problem that I hope someone can comment on. Periodically, an instance ends up with 0 rows in 'instance_info_caches' in the nova database.

If anyone else has a chance and can run this SQL query against their nova database, it will show whether you are seeing the same problem I am:

  select instances.host, instances.hostname, instances.uuid, instances.user_id
  from instance_info_caches, instances
  where network_info = '[]'
    and instances.deleted = 0
    and instances.uuid = instance_info_caches.instance_uuid;

The expectation is that this returns 0 rows. I'm finding that about 1 instance in 100 ends up with [] for network_info.

___
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to : openstack@lists.openstack.org
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
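For anyone who would rather script the check, here is a minimal sketch of the same query run from Python. It assumes a toy two-table schema in an in-memory SQLite database that contains only the columns the query touches; the real nova schema is much larger and normally lives in MySQL or similar. The names make_toy_db and find_empty_nw_info are made up for illustration.

```python
import sqlite3

# The query from the message above, rewritten with an explicit JOIN.
QUERY = """
SELECT i.host, i.hostname, i.uuid, i.user_id
FROM instances AS i
JOIN instance_info_caches AS c ON c.instance_uuid = i.uuid
WHERE c.network_info = '[]' AND i.deleted = 0
ORDER BY i.uuid
"""


def make_toy_db():
    """Build an in-memory stand-in for the two nova tables (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE instances (
            uuid TEXT, host TEXT, hostname TEXT, user_id TEXT, deleted INTEGER);
        CREATE TABLE instance_info_caches (
            instance_uuid TEXT, network_info TEXT);
        INSERT INTO instances VALUES
            ('uuid-good', 'host1', 'vm-good', 'user1', 0),
            ('uuid-bad',  'host1', 'vm-bad',  'user1', 0),
            ('uuid-gone', 'host2', 'vm-gone', 'user2', 1);
        INSERT INTO instance_info_caches VALUES
            ('uuid-good', '[{"id": "port-1"}]'),
            ('uuid-bad',  '[]'),
            ('uuid-gone', '[]');
    """)
    return conn


def find_empty_nw_info(conn):
    """Return live instances whose cached network_info is the empty list."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in find_empty_nw_info(make_toy_db()):
        print(row)
```

Only the live instance with an empty cache is reported; the deleted instance is filtered out by the deleted = 0 condition.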
Re: [Openstack] instance_info_caches table, nova not populating for some instances
On 22 January 2015 at 09:30, Don Waterloo don.water...@gmail.com wrote:

> On 3 December 2014 at 11:58, Don Waterloo don.water...@gmail.com wrote:
>
> > I am having a problem that I hope someone can comment on. Periodically, an instance ends up with 0 rows in 'instance_info_caches' in the nova database.
>
> If anyone else has a chance and can run this SQL query against their nova database, it will show whether you are seeing the same problem I am:
>
>   select instances.host, instances.hostname, instances.uuid, instances.user_id
>   from instance_info_caches, instances
>   where network_info = '[]'
>     and instances.deleted = 0
>     and instances.uuid = instance_info_caches.instance_uuid;
>
> The expectation is that this returns 0 rows. I'm finding that about 1 instance in 100 ends up with [] for network_info.

After some more digging, what is happening in the bad case is a race condition.

In the 'good' case, _allocate_network_async() is called, followed by _get_guest_xml().

In the 'bad' case, _allocate_network_async() is called, followed by _get_instance_nw_info(). That is, it is doing a refresh_cache() on this instance while it is still being created (before it's finished).
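One way to picture a guard against this ordering is sketched below. This is not the patch from the bug report and not nova code: refresh_cache, the instance dict layout, and the fetch_nw_info callback are all hypothetical, and the "building" state string is an assumption about how an instance under construction is marked. The idea is only that a refresh which comes back empty for an instance that is still being built should not overwrite the cache.

```python
BUILDING = "building"  # assumed marker for an instance that is still being created


def refresh_cache(cache, instance, fetch_nw_info):
    """Refresh cache[uuid], but never store [] for an instance still building.

    cache         -- dict mapping instance uuid to a list of VIF dicts (toy model)
    instance      -- dict with 'uuid' and 'vm_state' keys (hypothetical layout)
    fetch_nw_info -- callable returning the current network info for a uuid
    Returns True if the cache was written, False if the refresh was skipped.
    """
    nw_info = fetch_nw_info(instance["uuid"])
    if instance["vm_state"] == BUILDING and not nw_info:
        # Network allocation may still be in flight; an empty answer here
        # says nothing about the final state, so leave the cache alone.
        return False
    cache[instance["uuid"]] = nw_info
    return True
```

With this guard, the 'bad' ordering becomes a skipped refresh rather than a cache entry of [].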
Re: [Openstack] instance_info_caches table, nova not populating for some instances
On 20 January 2015 at 22:20, Don Waterloo don.water...@gmail.com wrote:

> For anyone else who hits this, I entered https://bugs.launchpad.net/nova/+bug/1413049 which contains a patch.
>
> In a nutshell, at the bottom end of reloading/healing the cache, it turns around and uses the cache in _gather_port_ids_and_networks(). My fix is to use the neutron API and call list_ports.

Sigh, my patch is not correct. The order of the ports is not preserved by the neutron /ports.json call, and that order is important to nova.

Does anyone know a solution to this? E.g. if I have an instance that has 2 ports, GET /...ports.json?device_id=UUID might return them in either order, presumably because they are stored as a hashmap or something.
Re: [Openstack] instance_info_caches table, nova not populating for some instances
Couldn't you sort the result from the neutron port query by port UUID so the ordering is maintained until ports are added or deleted?

On Wed, Jan 21, 2015 at 1:49 PM, Don Waterloo don.water...@gmail.com wrote:

> On 20 January 2015 at 22:20, Don Waterloo don.water...@gmail.com wrote:
>
> > For anyone else who hits this, I entered https://bugs.launchpad.net/nova/+bug/1413049 which contains a patch.
> >
> > In a nutshell, at the bottom end of reloading/healing the cache, it turns around and uses the cache in _gather_port_ids_and_networks(). My fix is to use the neutron API and call list_ports.
>
> Sigh, my patch is not correct. The order of the ports is not preserved by the neutron /ports.json call, and that order is important to nova.
>
> Does anyone know a solution to this? E.g. if I have an instance that has 2 ports, GET /...ports.json?device_id=UUID might return them in either order, presumably because they are stored as a hashmap or something.

--
Kevin Benton
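A minimal sketch of the sort-by-UUID suggestion, assuming each port is a dict with an "id" key holding the port UUID, as in the "ports" list of a neutron port listing. The function name order_ports is hypothetical, and the optional cached_port_ids argument is an addition that goes beyond Kevin's suggestion: when a previously cached order is available, it is kept, and only ports not seen before fall back to UUID order.

```python
def order_ports(ports, cached_port_ids=()):
    """Return ports in a deterministic order.

    Ports whose id appears in cached_port_ids come first, in cached order.
    Remaining ports follow, sorted by id so repeated calls agree.
    """
    by_id = {p["id"]: p for p in ports}
    known = [by_id[pid] for pid in cached_port_ids if pid in by_id]
    known_ids = {p["id"] for p in known}
    new = sorted((p for p in ports if p["id"] not in known_ids),
                 key=lambda p: p["id"])
    return known + new


if __name__ == "__main__":
    a = {"id": "aaa", "device_id": "vm-1"}
    b = {"id": "bbb", "device_id": "vm-1"}
    # The result is the same whichever order the API returned.
    print(order_ports([b, a]) == order_ports([a, b]))
```

Sorting by UUID alone gives a stable order, but not necessarily the order in which the ports were attached, which is why the sketch prefers a cached order when one exists.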
Re: [Openstack] instance_info_caches table, nova not populating for some instances
Don,

I created this bug [1] for Nova a while ago. It sounds similar to the problem you're having, although we were running nova-network, not neutron. I proposed a fix [2] for that bug, but it never got merged because I didn't have time to write the tests.

[1] - https://bugs.launchpad.net/nova/+bug/1378459
[2] - https://review.openstack.org/126633

Nate

On Mon, Jan 19, 2015 at 7:36 PM, Don Waterloo don.water...@gmail.com wrote:

> On 19 January 2015 at 15:40, Michael Still mi...@stillhq.com wrote:
>
> > I've never heard of anything like this. What release of OpenStack are you running? What hypervisor driver?
> >
> > Thanks,
> > Michael
>
> This is Juno on Ubuntu 14.10 with libvirt kvm.
>
> After more digging, the periodic task controlled by heal_instance_info_cache_interval is returning [] for the affected instances, which overwrites the field. Later, when the user does a rebuild, the .xml file gets no interfaces.
>
> So my bug lies in there somewhere: _get_instance_nw_info() is returning the [].
Re: [Openstack] instance_info_caches table, nova not populating for some instances
On 20 January 2015 at 14:04, Don Waterloo don.water...@gmail.com wrote:

> On 20 January 2015 at 13:22, Nathanael Burton nathanael.i.bur...@gmail.com wrote:
>
> > Don,
> >
> > I created this bug [1] for Nova a while ago. It sounds similar to the problem you're having, although we were running nova-network, not neutron. I proposed a fix [2] for that bug, but it never got merged because I didn't have time to write the tests.
> >
> > [1] - https://bugs.launchpad.net/nova/+bug/1378459
> > [2] - https://review.openstack.org/126633
> >
> > Nate

It seems there may be a circular issue.

What ends up happening is in nova/network/neutronv2/api.py, in _gather_port_ids_and_networks(). When I come in there and my cache is set to [] (e.g. while trying to heal or reconcile), it calls:

  ifaces = compute_utils.get_nw_info_for_instance(instance)

which in turn goes and tries to fill the cache.

So while trying to fill the cache, one of the things it relies on must already be in the cache. Since it's not there, it gets [] for the ifaces, which in turn makes [] for the ports, and it all falls apart.
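The circularity can be shown with a toy model. The sketch below is not nova code: the function names are hypothetical, cached entries are assumed to be VIF dicts with an "id" key, and ports are assumed to be dicts with "id" and "device_id" keys. It contrasts a heal that derives its port list from the cache with one that asks the port list directly by device id.

```python
def ports_from_cache(cache):
    """Derive the wanted port ids from the cached network_info (toy model)."""
    return [vif["id"] for vif in cache]


def heal_using_cache(cache, neutron_ports):
    """Heal that depends on the cache it is trying to rebuild."""
    wanted = set(ports_from_cache(cache))
    return [p for p in neutron_ports if p["id"] in wanted]


def heal_using_neutron(instance_uuid, neutron_ports):
    """Heal that ignores the cache and filters ports by owning instance."""
    return [p for p in neutron_ports if p["device_id"] == instance_uuid]


if __name__ == "__main__":
    ports = [{"id": "p1", "device_id": "vm-1"},
             {"id": "p2", "device_id": "vm-2"}]
    # Once the cache is [], the cache-driven heal can never recover.
    print(heal_using_cache([], ports))
    print(heal_using_neutron("vm-1", ports))
```

An empty cache is a fixed point of the first function: [] in always gives [] out, no matter what ports exist for the instance.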
Re: [Openstack] instance_info_caches table, nova not populating for some instances
On 20 January 2015 at 13:22, Nathanael Burton nathanael.i.bur...@gmail.com wrote:

> Don,
>
> I created this bug [1] for Nova a while ago. It sounds similar to the problem you're having, although we were running nova-network, not neutron. I proposed a fix [2] for that bug, but it never got merged because I didn't have time to write the tests.
>
> [1] - https://bugs.launchpad.net/nova/+bug/1378459
> [2] - https://review.openstack.org/126633
>
> Nate

Thank you for that. I'm working on finding the root cause now.

Mine is definitely not caused by an upgrade or a manual change to the DB. In my case, the instance starts life with a good info_cache, and then it mysteriously goes to [] (an empty Python list). The heal keeps getting the value [] each time for these instances. So some debuggery is ahead of me.

It sounds related for sure.
Re: [Openstack] instance_info_caches table, nova not populating for some instances
On 3 December 2014 at 11:58, Don Waterloo don.water...@gmail.com wrote:

> I am having a problem that I hope someone can comment on. Periodically, an instance ends up with 0 rows in 'instance_info_caches' in the nova database.
>
> As a consequence, when I do 'nova list', it ends up without knowing anything about the networks. The instance is allocated an IP, has booted, and is able to use that IP. Neutron owns the port for it, so all is good from that standpoint; it's just that nova knows nothing about it.
>
> Is 'info_caches' something that is truly a cache? It seems to be the only known repository.

Sorry to follow up my own email, but... is anyone else hitting this?

I'm getting more than just a 'no IP in nova list' symptom: once in a while some instance ends up with 0 bridges in its virsh XML file.

What happens is that it comes up normally, and all is good and happy. But then some number of days(?) later, it ends up with no source bridges, no interfaces, and a [] for the 'network_info' field in the instance_info_caches table.

Any idea how this could happen? It's Juno on Ubuntu. In ~10K instances started/stopped since ~Jan 1, I now have 15 in this state, so it's not super common.

This symptom is more severe, so I cannot live with it. A reboot does not solve it, nor does a rebuild (it rebuilds from this info). Neutron still says the instance is connected, but nova gets it wrong.

  select * from instance_info_caches where network_info = '[]' and deleted = 0;

  +---------------------+---------------------+------------+------+--------------+--------------------------------------+---------+
  | created_at          | updated_at          | deleted_at | id   | network_info | instance_uuid                        | deleted |
  +---------------------+---------------------+------------+------+--------------+--------------------------------------+---------+
  | 2014-11-03 21:47:44 | 2014-11-03 21:48:05 | NULL       | 4762 | []           | 6996aa1c-7c05-4e36-a86e-d45f7af14352 | 0       |
  ...
Re: [Openstack] instance_info_caches table, nova not populating for some instances
I've never heard of anything like this. What release of OpenStack are you running? What hypervisor driver?

Thanks,
Michael

On Tue, Jan 20, 2015 at 7:46 AM, Don Waterloo don.water...@gmail.com wrote:

> On 3 December 2014 at 11:58, Don Waterloo don.water...@gmail.com wrote:
>
> > I am having a problem that I hope someone can comment on. Periodically, an instance ends up with 0 rows in 'instance_info_caches' in the nova database.
> >
> > As a consequence, when I do 'nova list', it ends up without knowing anything about the networks. The instance is allocated an IP, has booted, and is able to use that IP. Neutron owns the port for it, so all is good from that standpoint; it's just that nova knows nothing about it.
> >
> > Is 'info_caches' something that is truly a cache? It seems to be the only known repository.
>
> Sorry to follow up my own email, but... is anyone else hitting this?
>
> I'm getting more than just a 'no IP in nova list' symptom: once in a while some instance ends up with 0 bridges in its virsh XML file.
>
> What happens is that it comes up normally, and all is good and happy. But then some number of days(?) later, it ends up with no source bridges, no interfaces, and a [] for the 'network_info' field in the instance_info_caches table.
>
> Any idea how this could happen? It's Juno on Ubuntu. In ~10K instances started/stopped since ~Jan 1, I now have 15 in this state, so it's not super common.
>
> This symptom is more severe, so I cannot live with it. A reboot does not solve it, nor does a rebuild (it rebuilds from this info). Neutron still says the instance is connected, but nova gets it wrong.
>
>   select * from instance_info_caches where network_info = '[]' and deleted = 0;
>
>   +---------------------+---------------------+------------+------+--------------+--------------------------------------+---------+
>   | created_at          | updated_at          | deleted_at | id   | network_info | instance_uuid                        | deleted |
>   +---------------------+---------------------+------------+------+--------------+--------------------------------------+---------+
>   | 2014-11-03 21:47:44 | 2014-11-03 21:48:05 | NULL       | 4762 | []           | 6996aa1c-7c05-4e36-a86e-d45f7af14352 | 0       |
>   ...

--
Rackspace Australia
[Openstack] instance_info_caches table, nova not populating for some instances
I am having a problem that I hope someone can comment on. Periodically, an instance ends up with 0 rows in 'instance_info_caches' in the nova database.

As a consequence, when I do 'nova list', it ends up without knowing anything about the networks. The instance is allocated an IP, has booted, and is able to use that IP. Neutron owns the port for it, so all is good from that standpoint; it's just that nova knows nothing about it.

Is 'info_caches' something that is truly a cache? It seems to be the only known repository.

There are some spots where it's possible that it is not written, with no message logged, e.g. in network/manager.py, in _do_trigger_security_group_members_refresh_for_instance():

  try:
      # NOTE(vish): We need to make sure the instance info cache has been
      #             updated with new ip info before we trigger the
      #             security group refresh. This is somewhat inefficient
      #             but avoids doing some dangerous refactoring for a
      #             bug fix.
      nw_info = self.get_instance_nw_info(admin_context, instance_id,
                                          None, None)
      ic = objects.InstanceInfoCache.new(admin_context, instance_id)
      ic.network_info = nw_info
      ic.save(update_cells=False)
  except exception.InstanceInfoCacheNotFound:
      pass

No error is raised and no message is logged.

It appears it's intended to update periodically (update_instance_cache_with_nw_info()). I'm not sure if that is my problem (e.g. the refresh wiped it) or the workaround (e.g. it will eventually(?) fix itself).

Does anyone have any light to shed on this?
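To make the "no message" point concrete, here is a minimal sketch of the same try/except shape with a log line in place of the bare pass. It is a stand-alone toy, not a patch to nova: the exception class is a local stand-in for nova's exception of the same name, and save_nw_info and its save callback are hypothetical.

```python
import logging

LOG = logging.getLogger(__name__)


class InstanceInfoCacheNotFound(Exception):
    """Local stand-in for nova's exception of the same name."""


def save_nw_info(save, instance_id, nw_info):
    """Call save(instance_id, nw_info) and report a missing cache row.

    Returns True if the save went through, False if the cache row was
    missing. Unlike a bare 'pass', the failure leaves a trace in the log.
    """
    try:
        save(instance_id, nw_info)
        return True
    except InstanceInfoCacheNotFound:
        LOG.warning("No info cache row for instance %s; "
                    "network_info was not saved", instance_id)
        return False
```

A warning like this would at least show whether the silent path is ever taken on an affected system.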