[Yahoo-eng-team] [Bug 1646181] Re: NFS: Fail to boot VM out of large snapshots (30GB+)
Reviewed:  https://review.openstack.org/443752
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=52310fa8645cc10b91de7d2b4e10a3b42d4ef073
Submitter: Jenkins
Branch:    master

commit 52310fa8645cc10b91de7d2b4e10a3b42d4ef073
Author: Eric Harney
Date:   Thu Mar 9 11:25:53 2017 -0500

    Bump prlimit cpu time for qemu-img from 2 to 8

    Users have reported that the current CPU limit is not
    sufficient for processing large enough images when
    downloading images to volumes.

    This mirrors a similar increase made in Nova (b78b1f8ce).

    Closes-Bug: #1646181
    Change-Id: I5edea7d1d19fd991e51dca963d2beb7004177498

** Changed in: cinder
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1646181

Title:
  NFS: Fail to boot VM out of large snapshots (30GB+)

Status in Cinder:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) newton series:
  Fix Committed

Bug description:
  Description
  ===========
  Using NFS shared storage, when I try to boot a VM out of a smaller
  snapshot (1GB) it works fine. However, when I try to do the same out
  of a larger snapshot (30GB+) it fails, regardless of the OpenStack
  release (Newton or Mitaka).

  Steps to reproduce
  ==================
  A chronological list of steps which will bring about the issue you
  noticed:
  * I have OpenStack RDO Newton (or Mitaka) installed and functional
  * I boot a VM out of a QCOW2 image of about 1GB
  * Then I log into that VM and create a large file (33GB) to inflate
    the VM image
  * Then I shut off the VM and take a snapshot of it that I call
    "largeVMsnapshotImage"

  Alternatively to the steps above,
  * I have a snapshot from a large VM (30GB+) that I upload to Glance
    and call "largeVMsnapshotImage"

  Then I do:
  * I try to boot a new VM out of that snapshot using the same network
  * Although the image seems to be copied to the compute node, the VM
    creation fails on the "qemu-img info" command

  If I run the same command manually, it works:

  /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
  image: /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
  file format: raw
  virtual size: 80G (85899345920 bytes)
  disk size: 37G

  However, when Nova runs it, the command fails and the VM creation is
  interrupted; see this excerpt from nova-compute.log on the compute
  node:

  ...
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] BuildAbortException: Build of instance d6889ea2-f277-40e5-afdc-b3b0698537ed aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885 : Unexpected error while running command.
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Exit code: -9
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Stdout: u''
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Stderr: u''
  ...

  Expected result
  ===============
  The VM should have been created/booted out of the large snapshot
  image.

  Actual result
  =============
  The command fails with exit code -9 when Nova runs it.

  Environment
  ===========
  1. Running RDO Newton on CentOS 7.2 (or Oracle Linux 7.2), and
     reproduced on RDO Mitaka as well.

     If this is from a distro please provide

     [root@controller ~]# rpm -qa|grep nova
     openstack-nova-console-14.0.0-1.el7.noarch
     puppet-nova-9.4.0-1.el7.noarch
     python-nova-14.0.0-1.el7.noarch
     openstack-nova-novncproxy-14.0.0-1.el7.noarch
     openstack-nova-conductor-14.0.0-1.el7.noarch
     openstack-nova-api-14.0.0-1.el7.noarch
     openstack-nova-common-14.0.0-1.el7.noarch
     openstack-nova-scheduler-14.0.0-1.el7.noarch
     openstack-nova-serialproxy-14.0.0-1.el7.noarch
     python2-novaclient-6.0.0-1.el7.noarch
     openstack-nova-cert-14.0.0-1.el7.noarch

  2. Which hypervisor did you use?
     KVM

     details:
     [root@compute4 nova]# rpm -qa|grep -Ei "kvm|qemu|libvirt"
     libvirt-gobject-0.1.9-1.el7.x86_64
     libvirt-gconfig-0.1.9-1.el7.x86_64
     libvirt-daemon-1.2.17-13.0.1.el7.x86_64
     qemu-kvm-common-1.5.3-105.el7.x86_64
     qemu-img-1.5.3-105.el7.x86_64
     ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
     libvirt-client-1.2.17-13.0.1.el7.x86_64
     libvirt-daemon-driver-nodedev-1.2.17-13.0.1.el7.x86_64
     libvirt-daemon-driver-lxc-1.2.17-13.0.1.el7.x86_64
     libvirt-daemon-kvm-1.2.17-13.0.1.el7.x86_64
     libvirt-daemon-driver-secret-1.2.17-13.0.1.el7.x86_64
     libvirt-python-1.2.17-2.el7.x86_64
     libvirt-daemon-config-network-1.2.17-13.0.1.el7.x86_64
     libvirt-daemon-config-nwfilter-1.2.17-13.0.1.el7.x86_64
     libvirt-daemon-driver-storage-1.2.17-13.0.1.el7.x86_64
     qemu-kvm-1.5.3-105.el7.x86_64
     libvirt-1.2.17-13.0.1.el7.x86_64
     libvirt-daemon-driver-interface-1.2.17-13.0.1.el7.x86_64
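
The "Exit code: -9" with empty stdout and stderr in the log above is consistent with the CPU-time limit killing qemu-img. On Linux, when the soft and hard RLIMIT_CPU values are equal, the kernel sends SIGKILL once the process has consumed that much CPU time, and Python reports death by signal N as return code -N. The following is only a minimal sketch of that mechanism using the standard library. It assumes Linux and assumes the prlimit wrapper sets the soft and hard limits to the same value; `run_with_cpu_limit` and `BUSY_CHILD` are hypothetical names for illustration, not part of oslo.concurrency.

```python
import resource
import subprocess
import sys


def run_with_cpu_limit(argv, cpu_seconds):
    """Run argv with RLIMIT_CPU (soft and hard) set to cpu_seconds.

    Returns (returncode, stdout, stderr). A negative returncode means the
    child was terminated by that signal number.
    """

    def apply_limit():
        # Executed in the child after fork() and before exec(), so the
        # limit applies only to the command being run.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    proc = subprocess.run(
        argv,
        preexec_fn=apply_limit,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return proc.returncode, proc.stdout, proc.stderr


# Stand-in for a qemu-img run that needs more CPU time than the limit
# allows: a child process that spins forever.
BUSY_CHILD = [sys.executable, "-c", "while True: pass"]
```

Calling `run_with_cpu_limit(BUSY_CHILD, 1)` should, under the assumptions above, return -9 with empty output, the same shape as the failure in the log. Raising the limit gives a slow but legitimate command more room to finish, which is the approach the fix takes.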
[Yahoo-eng-team] [Bug 1646181] Re: NFS: Fail to boot VM out of large snapshots (30GB+)
For Cinder, this same issue affects image->volume and needs the same
fix.

** Also affects: cinder
   Importance: Undecided
       Status: New

Status in Cinder:
  In Progress
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) newton series:
  Fix Committed
[Yahoo-eng-team] [Bug 1646181] Re: NFS: Fail to boot VM out of large snapshots (30GB+)
Reviewed:  https://review.openstack.org/408668
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b78b1f8ce3aa407307a6adc5c60de1e960547897
Submitter: Jenkins
Branch:    master

commit b78b1f8ce3aa407307a6adc5c60de1e960547897
Author: Sean Dague
Date:   Thu Dec 8 10:09:06 2016 -0500

    Bump prlimit cpu time for qemu from 2 to 8

    We've got user-reported bugs that when operating with slow NFS
    backends with large (30+ GB) disk files, the prlimit of cpu_time 2
    is guessed to be the issue at hand, because if folks hot patch a
    qemu-img that runs before the prlimited one, the prlimited one
    succeeds.

    This increases the allowed cpu timeout, as well as tweaking the
    error message so that we return something more prescriptive when
    the qemu-img command fails with prlimit abort.

    In the original bug (#1449062), the main mitigation concern was a
    carefully crafted image that gets qemu-img to generate > 1G of
    json, and hence could be a node attack vector. cpu_time was never
    mentioned, and I think it was added originally as a belt and
    suspenders addition. As such, bumping it to 8 seconds shouldn't
    impact our protection in any real way.

    Change-Id: I1f4549b787fd3b458e2c48a90bf80025987f08c4
    Closes-Bug: #1646181

** Changed in: nova
       Status: In Progress => Fix Released

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) newton series:
  Confirmed
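
The commit message above mentions returning something more prescriptive when qemu-img is aborted by prlimit. The following is a rough sketch of that idea, not Nova's actual code: `explain_qemu_img_failure` and the message wording are hypothetical, and it assumes that an exit code of -9 means the process was killed for exceeding the CPU-time limit.

```python
import signal


def explain_qemu_img_failure(path, exit_code, stderr=""):
    """Return a human-readable reason for a failed `qemu-img info` run."""
    if exit_code == -signal.SIGKILL:
        # A negative exit code means "killed by that signal". On Linux,
        # SIGKILL is what a process receives on exceeding a hard RLIMIT_CPU.
        return (
            "qemu-img was killed (exit code -9) while inspecting %s; "
            "it probably exceeded the prlimit CPU time limit" % path
        )
    # Any other failure: pass along whatever qemu-img reported.
    detail = stderr.strip() or "no error output"
    return "qemu-img failed on %s with exit code %d: %s" % (
        path, exit_code, detail)
```

With a message along these lines, an operator reading nova-compute.log would be pointed at the resource limit instead of the generic "Unexpected error while running command" seen in the report.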
[Yahoo-eng-team] [Bug 1646181] Re: NFS: Fail to boot VM out of large snapshots (30GB+)
** Also affects: nova/newton
   Importance: Undecided
       Status: New

** Changed in: nova/newton
       Status: New => Confirmed

** Changed in: nova/newton
   Importance: Undecided => Medium

Status in OpenStack Compute (nova):
  In Progress
Status in OpenStack Compute (nova) newton series:
  Confirmed