Re: [Openstack-operators] mitaka/xenial libvirt issues

2017-11-23 Thread Chris Sarginson
I think we may have pinned libvirt-bin as well (1.3.1), but I can't
guarantee that, sorry - I would suggest it's worth trying to pin both
initially.
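
For reference, one way to pin these (I don't recall exactly how we did it,
so treat this as a sketch - the file path and package list are guesses,
adjust them to what's actually installed in your environment) is an apt
preferences entry along these lines:

Package: qemu-system-x86 qemu-system-common qemu-utils qemu-block-extra
Pin: version 1:2.5+dfsg-5ubuntu10.5
Pin-Priority: 1001

Package: libvirt-bin libvirt0
Pin: version 1.3.1*
Pin-Priority: 1001

Dropping that into something like /etc/apt/preferences.d/qemu-pin with a
priority above 1000 should let apt downgrade to, and then stay on, those
versions.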

Chris

On Thu, 23 Nov 2017 at 17:42 Joe Topjian <j...@topjian.net> wrote:

> Hi Chris,
>
> Thanks - we will definitely look into this. To confirm: did you downgrade
> libvirt as well, or was it all qemu?
>
> Thanks,
> Joe
>
> On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csarg...@gmail.com>
> wrote:
>
>> We hit the same issue a while back (I suspect), which we seemed to
>> resolve by pinning QEMU and related packages at the following version (you
>> might need to hunt down the debs manually):
>>
>> 1:2.5+dfsg-5ubuntu10.5
>>
>> I'm certain there's a launchpad bug for Ubuntu qemu regarding this, but
>> don't have it to hand.
>>
>> Hope this helps,
>> Chris
>>
>> On Thu, 23 Nov 2017 at 15:33 Joe Topjian <j...@topjian.net> wrote:
>>
>>> Hi all,
>>>
>>> We're seeing some strange libvirt issues in an Ubuntu 16.04 environment.
>>> It's running Mitaka, but I don't think this is a problem with OpenStack
>>> itself.
>>>
>>> We're in the process of upgrading this environment from Ubuntu 14.04
>>> with the Mitaka cloud archive to 16.04. Instances are being live migrated
>>> (NFS share) to a new 16.04 compute node (fresh install), so there's a
>>> change between libvirt versions (1.2.2 to 1.3.1). The problem we're seeing
>>> is only happening on the 16.04/1.3.1 nodes.
>>>
>>> We're getting occasional reports of instances not able to be
>>> snapshotted. Upon investigation, the snapshot process quits early with a
>>> libvirt/qemu lock timeout error. We then see that the instance's xml file
>>> has disappeared from /etc/libvirt/qemu and must restart libvirt and
>>> hard-reboot the instance to get things back to a normal state. Trying to
>>> live-migrate the instance to another node causes the same thing to happen.
>>>
>>> However, at some random time, either the snapshot or the migration will
>>> work without error. I haven't been able to reproduce this issue on my own
>>> and haven't been able to figure out the root cause by inspecting instances
>>> reported to me.
>>>
>>> One thing that has stood out is the length of time it takes for libvirt
>>> to start. If I run "/etc/init.d/libvirt-bin start", it takes at least 5
>>> minutes before a simple "virsh list" will work. The command will hang
>>> otherwise. If I increase libvirt's logging level, I can see that during
>>> this period of time, libvirt is working on iptables and ebtables (looks
>>> like it's shelling out commands).
>>>
>>> But if I run "libvirtd -l" straight on the command line, all of this
>>> completes within 5 seconds (including all of the shelling out).
>>>
>>> My initial thought is that systemd is doing some type of throttling
>>> between the system and user slice, but I've tried comparing slice
>>> attributes and, probably due to my lack of understanding of systemd, can't
>>> find anything to prove this.
>>>
>>> Is anyone else running into this problem? Does anyone know what might be
>>> the cause?
>>>
>>> Thanks,
>>> Joe
>>> ___
>>> OpenStack-operators mailing list
>>> OpenStack-operators@lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>
>>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] mitaka/xenial libvirt issues

2017-11-23 Thread Chris Sarginson
We hit the same issue a while back (I suspect), which we seemed to resolve
by pinning QEMU and related packages at the following version (you might
need to hunt down the debs manually):

1:2.5+dfsg-5ubuntu10.5

I'm certain there's a launchpad bug for Ubuntu qemu regarding this, but
don't have it to hand.
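
If you do end up grabbing the debs by hand, something along these lines
should stop apt pulling them forward again afterwards (the package list here
is a guess - match it to whatever you actually download and have installed):

sudo dpkg -i ./qemu-system-x86_*.deb ./qemu-system-common_*.deb ./qemu-utils_*.deb
sudo apt-mark hold qemu-system-x86 qemu-system-common qemu-utils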

Hope this helps,
Chris

On Thu, 23 Nov 2017 at 15:33 Joe Topjian  wrote:

> Hi all,
>
> We're seeing some strange libvirt issues in an Ubuntu 16.04 environment.
> It's running Mitaka, but I don't think this is a problem with OpenStack
> itself.
>
> We're in the process of upgrading this environment from Ubuntu 14.04 with
> the Mitaka cloud archive to 16.04. Instances are being live migrated (NFS
> share) to a new 16.04 compute node (fresh install), so there's a change
> between libvirt versions (1.2.2 to 1.3.1). The problem we're seeing is only
> happening on the 16.04/1.3.1 nodes.
>
> We're getting occasional reports of instances not able to be snapshotted.
> Upon investigation, the snapshot process quits early with a libvirt/qemu
> lock timeout error. We then see that the instance's xml file has
> disappeared from /etc/libvirt/qemu and must restart libvirt and hard-reboot
> the instance to get things back to a normal state. Trying to live-migrate
> the instance to another node causes the same thing to happen.
>
> However, at some random time, either the snapshot or the migration will
> work without error. I haven't been able to reproduce this issue on my own
> and haven't been able to figure out the root cause by inspecting instances
> reported to me.
>
> One thing that has stood out is the length of time it takes for libvirt to
> start. If I run "/etc/init.d/libvirt-bin start", it takes at least 5
> minutes before a simple "virsh list" will work. The command will hang
> otherwise. If I increase libvirt's logging level, I can see that during
> this period of time, libvirt is working on iptables and ebtables (looks
> like it's shelling out commands).
>
> But if I run "libvirtd -l" straight on the command line, all of this
> completes within 5 seconds (including all of the shelling out).
>
> My initial thought is that systemd is doing some type of throttling
> between the system and user slice, but I've tried comparing slice
> attributes and, probably due to my lack of understanding of systemd, can't
> find anything to prove this.
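>
> For what it's worth, the kind of thing I've been comparing is just the
> unit/slice properties, e.g.:
>
> systemctl show libvirt-bin.service -p Slice -p CPUAccounting -p CPUShares
> systemctl show system.slice -p CPUAccounting -p CPUShares
>
> but nothing there has jumped out at me so far.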
>
> Is anyone else running into this problem? Does anyone know what might be
> the cause?
>
> Thanks,
> Joe
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Neutron Issues

2017-05-02 Thread Chris Sarginson
If you're using openvswitch: with Newton, the default driver the openvswitch
agent uses to configure OpenFlow changed to the native one (the python ryu
library). I think it's been mentioned on here recently, so it's probably
worth having a poke through the archives for more information. I'd check your
neutron openvswitch agent logs for errors pertaining to openflow
configuration specifically, and if you see anything, it's probably worth
applying the following config to your ml2 ini file under the [ovs] section:

of_interface = ovs-ofctl

https://docs.openstack.org/mitaka/config-reference/networking/networking_options_reference.html
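
For context, the stanza normally ends up looking like this - the exact file
(ml2_conf.ini vs openvswitch_agent.ini) depends on how your deployment lays
its config out, so adjust the path to your environment:

[ovs]
of_interface = ovs-ofctl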

Then restart the neutron openvswitch agent and watch the logs - hopefully
this is of some use to you.

On Tue, 2 May 2017 at 21:30 Steve Powell  wrote:

> I forgot to mention that I'm running Newton and haproxy; my neutron.conf
> file is below.
>
>
>
> [DEFAULT]
>
> core_plugin = ml2
>
> service_plugins = router
>
> allow_overlapping_ips = True
>
> notify_nova_on_port_status_changes = True
>
> notify_nova_on_port_data_changes = True
>
> transport_url = rabbit://openstack:#@x.x.x.x
>
> auth_strategy = keystone
>
>
>
> [agent]
>
> root_helper = sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf
>
>
>
> [cors]
>
>
>
> [cors.subdomain]
>
>
>
> [database]
>
> connection = mysql+pymysql://neutron:###@10.10.6.220/neutron
>
>
>
> [keystone_authtoken]
>
> auth_url = http://x.x.x.x:35357/v3
>
> auth_uri = https://xxx..xxx:5000/v3
>
> memcached_servers = x.x.x.x:11211
>
> auth_type = password
>
> project_domain_name = Default
>
> user_domain_name = Default
>
> project_name = service
>
> username = neutron
>
> password = ##
>
>
>
>
>
> [matchmaker_redis]
>
>
>
> [nova]
>
>
>
> auth_url = http://x.x.x.x:35357/v3
>
> auth_type = password
>
> project_domain_name = Default
>
> user_domain_name = Default
>
> region_name = RegionOne
>
> project_name = service
>
> username = nova
>
> password = ###
>
>
>
> [oslo_concurrency]
>
>
>
> [oslo_messaging_amqp]
>
>
>
> [oslo_messaging_notifications]
>
>
>
> [oslo_messaging_rabbit]
>
>
>
> [oslo_messaging_zmq]
>
>
>
> [oslo_middleware]
>
> enable_proxy_headers_parsing = True
>
> enable_http_proxy_to_wsgi = True
>
>
>
> [oslo_policy]
>
>
>
> [qos]
>
>
>
> [quotas]
>
>
>
> [ssl]
>
>
>
> *From:* Steve Powell [mailto:spow...@silotechgroup.com]
> *Sent:* Tuesday, May 2, 2017 4:16 PM
> *To:* openstack-operators@lists.openstack.org
> *Subject:* [Openstack-operators] Neutron Issues
>
>
>
>
> Hello Ops!
>
>
>
> I have a major issue slapping me in the face and seek any assistance
> possible. When trying to spin up an instance, whether from the command
> line, manually in Horizon, or with a HEAT template, I receive the following
> error in nova and, where applicable, heat logs:
>
>
>
> Failed to allocate the network(s), not rescheduling.
>
>
>
> I see in the neutron logs that the request makes it through to completion,
> but that info is obviously not making it back to nova.
>
>
>
> INFO neutron.notifiers.nova [-] Nova event response: {u'status':
> u'completed', u'code': 200, u'name': u'network-changed', u'server_uuid':
> u'6892bb9e-4256-4fc9-a313-331f0c576a03'}
>
>
>
> What am I missing? Why would the response from neutron not make it back to
> nova?
>
>
>
>
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] live migration fails and libvirtd(?) lockup after trusty > xenial/mitaka > newton upgrade

2016-12-20 Thread Chris Sarginson
Hi Vladimir,

The packages are available on launchpad here:
https://launchpad.net/ubuntu/+source/qemu/1:2.5+dfsg-5ubuntu10.5/+build/10938755

On Tue, 20 Dec 2016 at 12:49 Vladimir Prokofev  wrote:

> Using
> compute1:~$ dpkg-query -W qemu-system-x86
> qemu-system-x86 1:2.5+dfsg-5ubuntu10.6
> compute1:~$ qemu-system-x86_64 --version
> QEMU emulator version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.6), Copyright (c)
> 2003-2008 Fabrice Bellard
>
> So I guess you may be right. Now I have to confirm this.
> Did you build your package from source? 1:2.5+dfsg-5ubuntu10.5 is no
> longer available in the official repository, and I can't find it anywhere
> except for the source code.
>
>
> 2016-12-20 12:17 GMT+03:00 Sean Redmond :
>
> What version of qemu are you running? I was hit by this bug[1] that seems
> to give off the same faults you are reporting and had to downgrade qemu
> packages to version '1:2.5+dfsg-5ubuntu10.5'
>
> [1] https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
>
> On Tue, Dec 20, 2016 at 12:10 AM, Vladimir Prokofev  wrote:
>
> Hello Ops.
>
> I want to pick your brains on a live migration issue, cause I'm kinda
> stuck atm.
>
> I'm running a small HA Ubuntu Openstack cloud - 3 controllers(VMs on
> Ubuntu/KVM) with ha-proxy LB, corosync/pacemaker manages VIP, Galera DB, 3
> compute nodes(using KVM hypervisor), 3 network nodes(VMs on Ubuntu/KVM),
> CEPH cluster(4 OSD nodes, 3 MON nodes(VMs on Ubuntu/KVM)).
> Nova, cinder, and glance use CEPH as backend. Neutron uses linux-bridge.
>
> We were running Ubuntu Trusty 14.04/Mitaka, and decided to upgrade to
> Ubuntu Xenial 16.04/Newton. The issue appeared sometime after upgrade of
> all Mitaka packages to the latest version prior to Xenial upgrade, and
> stayed even after upgrade to Xenial/Newton. It's hard to tell for sure as
> we didn't use LM for a while. It worked fine under Trusty/Mitaka, and broke
> under Xenial/latest-Mitaka.
>
>
> With a chance of 20-30%, live migration will fail. The instance will pause
> on the source node, but will not be resumed on the target node. The target
> node will destroy the instance, assuming LM failed, and on the source node
> the instance will stay paused. On the source node no new messages appear in
> nova-compute.log, and commands such as "virsh list" won't provide any
> output, or even exit for that matter. nova-compute can be restarted, but
> after the normal startup messages it once again doesn't produce any new log
> entries. Nova considers the compute node up and running.
> In the dashboard you can see that the instance is now residing on the new
> host, and is in shutoff state.
>
> If I restart libvirtd on the target or source node then the whole system
> "unfreezes". The source host unpauses the instance, and it is live again.
> But it now resides on the source node, while the database thinks it's on
> the target node and is shut down. Warning messages will appear on the
> source host:
>
> 2016-12-20 01:13:37.095 20025 WARNING nova.compute.manager
> [req-b3879539-989f-4075-9ef0-d23ef8868102 - - - - -] While synchronizing
> instance power states, found 3 instances in the database and 4 instances on
> the hypervisor.
>
> Currently I'm stopping/destroying/undefining the instance on the source
> node and launching it again via standard OpenStack means, but this leads to
> an instance reboot.
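>
> (In practice that's roughly "virsh destroy <domain>" followed by "virsh
> undefine <domain>" on the source node, then starting the instance again
> through the normal API, which is where the reboot comes from.)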
>
>
> Last messages in nova-compute.log on source node after LM start:
> 2016-12-20 00:09:35.961 16834 INFO nova.virt.libvirt.migration
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Increasing downtime to 1251 ms after
> 0 sec elapsed time
> 2016-12-20 00:09:36.127 16834 INFO nova.virt.libvirt.driver
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Migration running for 0 secs, memory
> 100% remaining; (bytes processed=0, remaining=0, total=0)
> 2016-12-20 00:09:36.894 16834 INFO nova.compute.manager
> [req-3a2a828f-8b76-4a65-b49b-ea8d232a3de5 - - - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Paused (Lifecycle Event)
> 2016-12-20 00:09:37.046 16834 INFO nova.compute.manager
> [req-3a2a828f-8b76-4a65-b49b-ea8d232a3de5 - - - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] During sync_power_state the instance
> has a pending task (migrating). Skip.
> 2016-12-20 00:09:37.300 16834 INFO nova.virt.libvirt.driver
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Migration operation has completed
> 2016-12-20 00:09:37.301 16834 INFO nova.compute.manager
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] _post_live_migration() is started..
> 2016-12-20 

Re: [Openstack-operators] Instances failing to launch when rbd backed (ansible Liberty setup)

2016-10-21 Thread Chris Sarginson
It seems like it may be an occurrence of this bug, as you look to be using
python venvs:

https://bugs.launchpad.net/openstack-ansible/+bug/1509837

2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
5633d98e-5f79-4c13-8d45-7544069f0e6f]   File "*/openstack/venvs/*nova-12.0.
16/lib/python2.7/site-packages/nova/virt/libvirt/storage/rbd_utils.py",
line 117, in __init__
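
A quick way to confirm is to try importing the bindings with that venv's own
interpreter, e.g.:

/openstack/venvs/nova-12.0.16/bin/python -c "import rados, rbd; print('rbd bindings OK')"

If that raises an ImportError, the venv can't see the distro python-rados /
python-rbd modules, which is essentially what the bug above describes.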

Chris

On Fri, 21 Oct 2016 at 13:19 Grant Morley  wrote:

> Hi all,
>
> We have a openstack-ansible setup and have ceph installed for the backend.
> However whenever we try and launch a new instance it fails to launch and we
> get the following error:
>
> 2016-10-21 12:08:06.241 70661 INFO nova.virt.libvirt.driver
> [req-79811c40-8394-4e33-b16d-ff5fa7341b6a 41c60f65ae914681b6a6ca27a42ff780
> 324844c815084205995aff10b03a85e1 - - -] [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f] Creating image
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager
> [req-79811c40-8394-4e33-b16d-ff5fa7341b6a 41c60f65ae914681b6a6ca27a42ff780
> 324844c815084205995aff10b03a85e1 - - -] [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f] Instance failed to spawn
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f] Traceback (most recent call last):
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f]   File
> "/openstack/venvs/nova-12.0.16/lib/python2.7/site-packages/nova/compute/manager.py",
> line 2156, in _build_resources
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f] yield resources
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f]   File
> "/openstack/venvs/nova-12.0.16/lib/python2.7/site-packages/nova/compute/manager.py",
> line 2009, in _build_and_run_instance
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f]
> block_device_info=block_device_info)
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f]   File
> "/openstack/venvs/nova-12.0.16/lib/python2.7/site-packages/nova/virt/libvirt/driver.py",
> line 2527, in spawn
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f] admin_pass=admin_password)
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f]   File
> "/openstack/venvs/nova-12.0.16/lib/python2.7/site-packages/nova/virt/libvirt/driver.py",
> line 2939, in _create_image
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f] backend = image('disk')
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f]   File
> "/openstack/venvs/nova-12.0.16/lib/python2.7/site-packages/nova/virt/libvirt/driver.py",
> line 2884, in image
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f] fname + suffix, image_type)
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f]   File
> "/openstack/venvs/nova-12.0.16/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py",
> line 967, in image
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f] return backend(instance=instance,
> disk_name=disk_name)
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f]   File
> "/openstack/venvs/nova-12.0.16/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py",
> line 748, in __init__
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f] rbd_user=self.rbd_user)
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f]   File
> "/openstack/venvs/nova-12.0.16/lib/python2.7/site-packages/nova/virt/libvirt/storage/rbd_utils.py",
> line 117, in __init__
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f] raise RuntimeError(_('rbd python
> libraries not found'))
> 2016-10-21 12:08:06.242 70661 ERROR nova.compute.manager [instance:
> 5633d98e-5f79-4c13-8d45-7544069f0e6f] RuntimeError: rbd python libraries
> not found
>
> It moans about the rbd python libraries not being found; however, all of
> the rbd libraries appear to be installed fine via apt. (We are running
> Ubuntu)
>
> Compute host packages:
>
> dpkg -l | grep ceph
> ii  ceph-common      10.2.3-1trusty   amd64   common utilities to mount and interact with a ceph storage cluster
> ii  libcephfs1       10.2.3-1trusty   amd64   Ceph distributed file system