[Yahoo-eng-team] [Bug 1646181] Re: NFS: Fail to boot VM out of large snapshots (30GB+)

2017-03-13 Thread OpenStack Infra
Reviewed:  https://review.openstack.org/443752
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=52310fa8645cc10b91de7d2b4e10a3b42d4ef073
Submitter: Jenkins
Branch:    master

commit 52310fa8645cc10b91de7d2b4e10a3b42d4ef073
Author: Eric Harney 
Date:   Thu Mar 9 11:25:53 2017 -0500

Bump prlimit cpu time for qemu-img from 2 to 8

Users have reported that the current CPU limit is not
sufficient for processing large enough images when
downloading images to volumes.

This mirrors a similar increase made in Nova (b78b1f8ce).

Closes-Bug: #1646181

Change-Id: I5edea7d1d19fd991e51dca963d2beb7004177498


** Changed in: cinder
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1646181

Title:
  NFS: Fail to boot VM out of large snapshots (30GB+)

Status in Cinder:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) newton series:
  Fix Committed

Bug description:
  Description
  ===========
  Using NFS shared storage, booting a VM from a small snapshot (1GB) works fine.
  However, booting from a larger snapshot (30GB+) fails, on both the Newton and
  Mitaka releases of OpenStack.

  Steps to reproduce
  ==================
  A chronological list of steps that reproduce the issue:
  * I have OpenStack RDO Newton (or Mitaka) installed and functional
  * I boot a VM from a QCOW2 image of about 1GB
  * I log into that VM and create a large file (33GB) to inflate the VM image
  * I shut off the VM and take a snapshot of it, which I call
    "largeVMsnapshotImage"

  Alternatively to the steps above:
  * I have a snapshot of a large VM (30GB+) that I upload to Glance and call
    "largeVMsnapshotImage"

  Then:
  * I try to boot a new VM from that snapshot using the same network
  * Although the image seems to be copied to the compute node, VM creation
    fails on the "qemu-img info" command

  If I run the same command manually, it works:
  /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
  image: /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
  file format: raw
  virtual size: 80G (85899345920 bytes)
  disk size: 37G
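
  For anyone who wants to check such output in a script, here is a minimal
  sketch that splits the human-readable "qemu-img info" output into fields.
  The helper name parse_qemu_img_info is made up for illustration and is not
  Nova's parser; it only assumes the "key: value" layout shown above.

```python
import re


def parse_qemu_img_info(text):
    """Parse the 'key: value' lines printed by `qemu-img info` into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:
            info[key] = value
    # "80G (85899345920 bytes)" carries the exact size in parentheses.
    match = re.search(r"\((\d+) bytes\)", info.get("virtual size", ""))
    if match:
        info["virtual size bytes"] = int(match.group(1))
    return info


# Sample text copied from the manual run in this report.
SAMPLE = """\
image: /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
file format: raw
virtual size: 80G (85899345920 bytes)
disk size: 37G
"""

info = parse_qemu_img_info(SAMPLE)
print(info["file format"], info["virtual size bytes"] // 2**30, "GiB")
```

  The byte count in parentheses is the reliable figure; the rounded "80G" is
  only for display.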

  However, the logs show it failing and VM creation being interrupted; see this
  excerpt from nova-compute.log on the compute node:
  ...
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] BuildAbortException: Build of instance d6889ea2-f277-40e5-afdc-b3b0698537ed aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885 : Unexpected error while running command.
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Exit code: -9
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Stdout: u''
  2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Stderr: u''
  ...

  
  Expected result
  ===============
  The VM should have been created and booted from the large snapshot image.

  Actual result
  =============
  The command fails with exit code -9 when Nova runs it.
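
  A negative exit code like this is how Python's subprocess module reports a
  child that was terminated by a signal: the value is the negated signal
  number. A small sketch for decoding it follows; describe_exit is a
  hypothetical helper, and the signal names assume a POSIX system.

```python
import signal


def describe_exit(returncode):
    """Explain a subprocess-style return code such as the -9 in the log."""
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


print(describe_exit(-9))
```

  So the empty Stdout and Stderr in the log are expected: the process was
  killed from outside rather than exiting with an error of its own.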

  Environment
  ===========
  1. Running RDO Newton on CentOS 7.2 (or Oracle Linux 7.2); also reproduced on
  RDO Mitaka

  If this is from a distro please provide
  [root@controller ~]# rpm -qa|grep nova
  openstack-nova-console-14.0.0-1.el7.noarch
  puppet-nova-9.4.0-1.el7.noarch
  python-nova-14.0.0-1.el7.noarch
  openstack-nova-novncproxy-14.0.0-1.el7.noarch
  openstack-nova-conductor-14.0.0-1.el7.noarch
  openstack-nova-api-14.0.0-1.el7.noarch
  openstack-nova-common-14.0.0-1.el7.noarch
  openstack-nova-scheduler-14.0.0-1.el7.noarch
  openstack-nova-serialproxy-14.0.0-1.el7.noarch
  python2-novaclient-6.0.0-1.el7.noarch
  openstack-nova-cert-14.0.0-1.el7.noarch

  
  2. Which hypervisor did you use?
 KVM
 
 details:
 [root@compute4 nova]# rpm -qa|grep -Ei "kvm|qemu|libvirt"
  libvirt-gobject-0.1.9-1.el7.x86_64
  libvirt-gconfig-0.1.9-1.el7.x86_64
  libvirt-daemon-1.2.17-13.0.1.el7.x86_64
  qemu-kvm-common-1.5.3-105.el7.x86_64
  qemu-img-1.5.3-105.el7.x86_64
  ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
  libvirt-client-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-nodedev-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-lxc-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-kvm-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-secret-1.2.17-13.0.1.el7.x86_64
  libvirt-python-1.2.17-2.el7.x86_64
  libvirt-daemon-config-network-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-config-nwfilter-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-storage-1.2.17-13.0.1.el7.x86_64
  qemu-kvm-1.5.3-105.el7.x86_64
  libvirt-1.2.17-13.0.1.el7.x86_64
  libvirt-daemon-driver-interface-1.2.17-13.0.1.el7.x86_64
  

[Yahoo-eng-team] [Bug 1646181] Re: NFS: Fail to boot VM out of large snapshots (30GB+)

2017-03-09 Thread Eric Harney
For Cinder, this same issue affects image->volume and needs the same
fix.

** Also affects: cinder
   Importance: Undecided
   Status: New

Status in Cinder:
  In Progress
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) newton series:
  Fix Committed

[Yahoo-eng-team] [Bug 1646181] Re: NFS: Fail to boot VM out of large snapshots (30GB+)

2016-12-09 Thread OpenStack Infra
Reviewed:  https://review.openstack.org/408668
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b78b1f8ce3aa407307a6adc5c60de1e960547897
Submitter: Jenkins
Branch:    master

commit b78b1f8ce3aa407307a6adc5c60de1e960547897
Author: Sean Dague 
Date:   Thu Dec 8 10:09:06 2016 -0500

Bump prlimit cpu time for qemu from 2 to 8

Users have reported failures when operating on slow NFS backends with
large (30+ GB) disk files. The prlimit of cpu_time 2 is the suspected
cause, because if folks hot patch in a qemu-img call that runs before
the prlimited one, the prlimited one succeeds.

This increases the allowed cpu timeout, as well as tweaking the error
message so that we return something more prescriptive when the
qemu-img command fails with prlimit abort.

In the original bug (#1449062), the main mitigation concern was a
carefully crafted image that gets qemu-img to generate > 1G of json,
and hence could be a node attack vector. cpu_time was never mentioned,
and I think it was added originally as a belt-and-suspenders measure.
As such, bumping it to 8 seconds shouldn't impact our protection in
any real way.

Change-Id: I1f4549b787fd3b458e2c48a90bf80025987f08c4
Closes-Bug: #1646181
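
To illustrate the mechanism the commit message describes, here is a rough
sketch using only the Python standard library. It is not the Nova code and
does not use oslo_concurrency.prlimit; run_limited is a made-up name, and the
busy loop merely stands in for a slow qemu-img call. It assumes a POSIX
system; on Linux, as far as I know, a process that exceeds a hard RLIMIT_CPU
is killed with SIGKILL, which would match the -9 in the report.

```python
import resource
import signal
import subprocess
import sys


def _lower(limit, value):
    """Set soft and hard limits to value, never above the current hard limit."""
    _, hard = resource.getrlimit(limit)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(limit, (value, value))


def run_limited(cmd, cpu_seconds, address_space):
    """Run cmd with CPU-time and address-space limits applied to the child."""

    def apply_limits():
        # Runs in the child between fork() and exec().
        _lower(resource.RLIMIT_CPU, cpu_seconds)
        _lower(resource.RLIMIT_AS, address_space)

    proc = subprocess.run(cmd, preexec_fn=apply_limits,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if proc.returncode in (-signal.SIGKILL, -signal.SIGXCPU):
        # A more prescriptive message than "Unexpected error".
        raise RuntimeError(
            "%s was killed by signal %d; it may have exceeded its limit of "
            "%d CPU seconds" % (cmd[0], -proc.returncode, cpu_seconds))
    return proc


# A busy loop stands in for a slow command; allow it 1 CPU second.
busy = [sys.executable, "-c", "while True: pass"]
try:
    run_limited(busy, cpu_seconds=1, address_space=1073741824)
except RuntimeError as exc:
    print(exc)
```

Note that RLIMIT_CPU counts CPU seconds consumed by the process, not elapsed
wall-clock time.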


** Changed in: nova
   Status: In Progress => Fix Released

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) newton series:
  Confirmed

[Yahoo-eng-team] [Bug 1646181] Re: NFS: Fail to boot VM out of large snapshots (30GB+)

2016-12-08 Thread Matt Riedemann
** Also affects: nova/newton
   Importance: Undecided
   Status: New

** Changed in: nova/newton
   Status: New => Confirmed

** Changed in: nova/newton
   Importance: Undecided => Medium

Status in OpenStack Compute (nova):
  In Progress
Status in OpenStack Compute (nova) newton series:
  Confirmed
