On Tue, Jan 19, 2016 at 12:31:48PM +0100, Kashyap Chamarthy wrote:
> On Mon, Jan 18, 2016 at 04:19:58PM +0000, Richard W.M. Jones wrote:
> > On Mon, Jan 18, 2016 at 03:33:25PM +0000, Richard W.M. Jones wrote:
> > > I tried another workaround which was to get virt-resize to fsync the
> > > output file before closing the libvirt connection, but that doesn't
> > > work for reasons I don't understand so far - still studying this.
> >
> > I worked out what was happening here -- I'd inserted the fsync at the
> > wrong place in virt-resize. So I have now successfully worked around
> > this for the virt-resize case, however it's still a problem that could
> > manifest itself in other uses of libvirt + qemu + slow devices.
>
> We've seen the "Failed to terminate process 1275 with SIGTERM: Device or
> resource busy" error occur in context of OpenStack as well[1][2].
>
> The behavior is from virDomainDestroy() API (src/libvirt-domain.c):
>
> [...]
> * virDomainDestroy first requests that a guest terminate (e.g.
> * SIGTERM), then waits for it to comply. After a reasonable timeout,
> * if the guest still exists, virDomainDestroy will forcefully
> * terminate the guest (e.g. SIGKILL) if necessary (which may produce
> * undesirable results, for example unflushed disk cache in the
> * guest). To avoid this possibility, it's recommended to instead
> * call virDomainDestroyFlags, sending the
> * VIR_DOMAIN_DESTROY_GRACEFUL flag.
> [...]
>
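As a rough illustration of the behaviour that comment describes, here
is a minimal Python sketch of the SIGTERM, wait, SIGKILL sequence and
of how a "graceful" mode changes it.  This is not libvirt's code: the
names destroy_process and spawn_stubborn_child, the timeout values and
the use of subprocess are all assumptions made for the example.  The
point it shows is that when escalation to SIGKILL is disabled, a
process that does not exit after SIGTERM leaves the caller with an
EBUSY ("Device or resource busy") error rather than a dead process.

```python
import errno
import signal
import subprocess
import sys
import time


def destroy_process(proc, graceful, timeout=1.0, poll=0.05):
    """Terminate proc (a subprocess.Popen); return the signal that worked.

    Hypothetical helper, not a libvirt API.
    graceful=True:  send only SIGTERM and raise OSError(EBUSY) if the
                    process is still alive once the timeout expires.
    graceful=False: escalate to SIGKILL once the timeout expires.
    """
    proc.send_signal(signal.SIGTERM)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return "SIGTERM"
        time.sleep(poll)
    if proc.poll() is not None:  # exited right at the deadline
        return "SIGTERM"
    if graceful:
        # Mirror the shape of the error in this thread: give up, no SIGKILL.
        raise OSError(
            errno.EBUSY,
            "Failed to terminate process %d with SIGTERM" % proc.pid,
        )
    proc.kill()  # SIGKILL cannot be caught or ignored
    proc.wait()
    return "SIGKILL"


def spawn_stubborn_child():
    """Start a child that ignores SIGTERM, standing in for a slow process."""
    code = (
        "import signal, time;"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN);"
        "print('ready', flush=True);"
        "time.sleep(60)"
    )
    proc = subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the SIGTERM handler is installed
    return proc
```

Calling destroy_process(spawn_stubborn_child(), graceful=True) raises
the EBUSY error once the timeout expires, while graceful=False falls
back to SIGKILL and returns normally.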
> Dan Berrange explains[1]:
>
> There are two reasons why you'd get this failure ("Failed to terminate
> process: Device or resource busy") from libvirt.
>
> - The host is so overloaded that the kernel was not able to clean up
> the process in the time that libvirt was prepared to wait. If this
> is the case, the process should eventually go away on its own
> after a short while longer and everything should return to normal
>
> - There is some problem, causing the process to get stuck in an
> uninterruptible wait state. This is usually due to something going
> wrong in the storage stack, causing some I/O read/write operation
> to hang in kernel space. In this case the process will stay around
> in the zombie state forever, or until the storage problem is
> resolved.

Thanks for finding this documentation.  The problem with this theory
is we are passing the VIR_DOMAIN_DESTROY_GRACEFUL flag, so that would
indicate that this flag is buggy.

I think what we need is a test case, so here goes.  Note you must run
these steps as *non-root*.

(1) Download the attachment to /var/tmp

(2) chmod +x /var/tmp/qemu.sh

(3) killall libvirtd   ;# kills the session libvirtd

(4) LIBGUESTFS_HV=/var/tmp/qemu.sh guestfish -N fs exit -vx

You should see at the end of the output:

libguestfs: calling virDomainDestroy "guestfs-q94hsiz89t8jp418" flags=VIR_DOMAIN_DESTROY_GRACEFUL
[pause of a few seconds]
libguestfs: error: could not destroy libvirt domain: Failed to terminate process 11412 with SIGTERM: Device or resource busy [code=38 domain=0]

If someone else can reproduce this, then I will file a bug.

Rich.

> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1205647 --
>     nova.virt.libvirt.driver fails to shutdown reboot instance with
>     error 'Code=38 Error=Failed to terminate process 4260 with SIGKILL:
>     Device or resource busy'
> [2] https://bugs.launchpad.net/nova/+bug/1353939 -- Rescue fails with
>     'Failed to terminate process: Device or resource busy' in the n-cpu
>     log
>
> --
> /kashyap

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
qemu.sh
Description: Bourne shell script
