Hi, I finally got a chance to take another look at this issue. We've reproduced it in another test lab. New information below.

On 03/18/2017 12:41 AM, Michal Privoznik wrote:
On 17.03.2017 23:21, Chris Friesen wrote:
Hi,

We've recently run into an issue with libvirt 1.2.17 in the context of
an OpenStack deployment.

Let me just say that 1.2.17 is a rather old libvirt. Can you try with one
of the latest ones to see whether the bug still reproduces?

That's difficult, since the version mix itself seems likely to be part of the problem. We haven't seen this issue with migrations between hosts both running libvirtd 1.2.17 or both running libvirtd 2.0.0, only when the versions are mismatched.

The issue occurs when we are trying to do an in-service upgrade, so the source host is running libvirt 1.2.17 and we're trying to live-migrate to a dest host that has been upgraded to libvirt 2.0.0.

Interestingly, the issue doesn't always happen; it's intermittent. We recently reproduced it on the fourth guest we live-migrated from the "old" host to the "new" host--the first three migrated without difficulty. (And the first three were configured almost identically to the fourth: boot from iSCSI, same number/type of NICs, same number of vCPUs and amount of RAM, same topology, etc.)

To answer a previous question, yes we're doing tunneled migration in this case.
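
(For anyone trying this outside of OpenStack: nova drives the migration through the libvirt API rather than virsh, but as far as I know the rough command-line equivalent of what we're doing would be something like the following, with the domain name and destination URI obviously being placeholders for our particular setup:

virsh migrate --live --p2p --tunnelled instance-0000000e qemu+tcp://compute-0/system
)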

Interestingly, when I hit "c" to continue in the debugger, I got this:

(gdb) c
Continuing.

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f0573fff700 (LWP 186865)]
0x00007f05b5cbb1cd in write () from /lib64/libpthread.so.0
(gdb) c
Continuing.
[Thread 0x7f0573fff700 (LWP 186865) exited]
(gdb) quit
A debugging session is active.

         Inferior 1 [process 37471] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/sbin/libvirtd, process 37471


This is because there might be some keepalive traffic going on. Introduced in
0.9.8, libvirt has a keepalive mechanism in place (repeatedly sending
ping/pong between client & server). Now, should 5 consecutive pings get
lost (this is configurable, of course) libvirt thinks the connection is
broken and closes it. If you attach a debugger to libvirtd, the whole
daemon is paused, along with the event loop, so the server cannot reply to
the client's pings, which in turn makes the client think the connection is
broken. Thus it closes the connection, which is observed as a broken pipe
in the daemon.
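
(For reference, the knobs I assume Michal is referring to are the keepalive settings in /etc/libvirt/libvirtd.conf; in the stock file they're commented out, with what I believe are the defaults:

#keepalive_interval = 5
#keepalive_count = 5

so, if I understand the semantics correctly, an unresponsive peer should get dropped on the order of 25-30 seconds after it stops answering.)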

I've reproduced the issue in another test lab; in this case compute-2 is the "old" host while compute-0 and compute-1 are the "new" hosts. Three guests have live-migrated from compute-2 to compute-0, and a fourth appears to be stuck in progress, but libvirtd on compute-2 is hung so any "virsh" commands there also hang.

Running "netstat -apn |grep libvirtd" shows an open connection between compute-2 (192.168.205.134) and compute-0 (192.168.205.24). Presumably this corresponds to the migration that appears to be "stuck" in progress.

compute-2:/home/wrsroot# netstat -atpn|grep libvirtd
tcp        0      0 0.0.0.0:16509           0.0.0.0:*               LISTEN      35787/libvirtd
tcp        0      0 192.168.205.134:51760   192.168.205.24:16509    ESTABLISHED 35787/libvirtd
tcp6       0      0 :::16509                :::*                    LISTEN      35787/libvirtd



Running "virsh list" on compute-0 shows 9 guests, which agrees with the number of running "qemu-kvm" processes. Interestingly, the guest from the migration with an open connection in libvirtd is *not* running and doesn't show up in the "virsh list" output.

The /var/log/libvirt/qemu/instance-0000000e.log file on compute-0 corresponds to the instance that libvirtd is "stuck" migrating, and it ends with these lines:

2017-03-29T06:38:37.886940Z qemu-kvm: VQ 2 size 0x80 < last_avail_idx 0x47b - used_idx 0x47c
2017-03-29T06:38:37.886974Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:07.0/virtio-balloon'
2017-03-29T06:38:37.888684Z qemu-kvm: load of migration failed: Operation not permitted
2017-03-29 06:38:37.896+0000: shutting down


I think this implies an incompatibility of some sort between the different qemu versions on the "old" and "new" hosts, but it doesn't explain why libvirtd didn't close down the migration connection between the two hosts.

The corresponding libvirtd logs on compute-0 are:

2017-03-29T06:38:35.000 401: warning : qemuDomainObjTaint:3580 : Domain id=10 name='instance-0000000e' uuid=57ae849f-aa66-422a-90a2-62db6c59db29 is tainted: high-privileges
2017-03-29T06:38:37.000 49075: error : qemuMonitorIO:695 : internal error: End of file from monitor
2017-03-29T06:38:37.000 49075: error : qemuProcessReportLogError:1810 : internal error: qemu unexpectedly closed the monitor:
EAL:eal_memory.c:1591: WARNING: Address Space Layout Randomization (ASLR) is enabled in the kernel.
EAL:eal_memory.c:1593: This may cause issues with mapping memory into secondary processes
2017-03-29T06:38:37.886940Z qemu-kvm: VQ 2 size 0x80 < last_avail_idx 0x47b - used_idx 0x47c
2017-03-29T06:38:37.886974Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:07.0/virtio-balloon'
2017-03-29T06:38:37.888684Z qemu-kvm: load of migration failed: Operation not permitted


So the question remains: why is this connection still up between the two libvirtd processes when the migration was aborted?

I ran tcpdump looking for TCP traffic between the two libvirtd processes and was unable to see any after several minutes, so it doesn't look like there is any regular keepalive messaging going on. (/etc/libvirt/libvirtd.conf doesn't specify any keepalive settings, so I believe we'd be using the defaults.) And yet the TCP connection is stuck open.
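
One more data point on my reading of the keepalive mechanism Michal described: as far as I can tell, the ping/pong only appears on the wire if at least one side actually requests it (either via the keepalive_* settings or programmatically with virConnectSetKeepAlive()), so the silence tcpdump sees may not be surprising by itself. Purely for illustration (a minimal, untested sketch, not the code path nova/libvirtd actually take for a tunnelled migration), a client would enable it roughly like this:

/* minimal sketch: a client explicitly requesting libvirt keepalive */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn;

    /* keepalive needs a client-side event loop; a real client would also
     * run virEventRunDefaultImpl() in a loop after this */
    if (virEventRegisterDefaultImpl() < 0)
        return 1;

    conn = virConnectOpen("qemu:///system");
    if (!conn)
        return 1;

    /* ping every 5 seconds, give up after 5 unanswered pings
     * (matching the commented-out defaults in libvirtd.conf) */
    if (virConnectSetKeepAlive(conn, 5, 5) < 0)
        fprintf(stderr, "failed to enable keepalive\n");

    virConnectClose(conn);
    return 0;
}

If neither end asked for keepalive on this particular connection, that would at least be consistent with an ESTABLISHED-but-silent socket like the one above.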

Chris
