Re: [libvirt] libvirtd not responding to virsh, results in virsh hanging

2017-03-31 Thread Chris Friesen

On 03/31/2017 11:30 AM, Chris Friesen wrote:

On 03/31/2017 11:21 AM, Chris Friesen wrote:


I ran tcpdump looking for TCP traffic between the two libvirtd processes, and
was unable to see any after several minutes.  So it doesn't look like there is
any regular keepalive messaging going on (/etc/libvirt/libvirtd.conf doesn't
specify any keepalive settings so we'd be using the defaults I think).  And yet
the TCP connection is stuck open.


Turns out I ran tcpdump in the wrong window... oops.  There's what appears to be
a keepalive sequence every 5 seconds.
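
For reference, the knobs that control this live in /etc/libvirt/libvirtd.conf; in a stock config they are commented out, so the built-in defaults apply.  A sketch of the relevant section (values as documented in the stock file; worth verifying against the versions in question):

```
# /etc/libvirt/libvirtd.conf -- keepalive settings, shown at their documented defaults
#keepalive_interval = 5   # seconds between keepalive probes sent to clients
#keepalive_count = 5      # unanswered probes tolerated before the connection is closed
```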

I still don't understand why the connection wasn't taken down when qemu exited
on the destination host.


One final update for now... I attached gdb to libvirtd on the source host and 
then killed libvirtd on the destination host.  I saw the TCP connection get 
closed down, and gdb showed this:


[Thread 0x7f8948ab3700 (LWP 4514) exited]

At this point "virsh" commands on the source host work as expected, it's no 
longer hung.


So it appears we have a number of factors contributing to the hang:
1) failure of migration in qemu
2) connection between hosts not getting torn down when migration fails
3) the libvirtd thread managing the migration on the source side appears to be 
sleeping indefinitely while holding a resource of some sort, which causes the 
apparent hang when we try to do other operations


Chris

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] libvirtd not responding to virsh, results in virsh hanging -- correction

2017-03-31 Thread Chris Friesen

On 03/31/2017 11:21 AM, Chris Friesen wrote:


I ran tcpdump looking for TCP traffic between the two libvirtd processes, and
was unable to see any after several minutes.  So it doesn't look like there is
any regular keepalive messaging going on (/etc/libvirt/libvirtd.conf doesn't
specify any keepalive settings so we'd be using the defaults I think).  And yet
the TCP connection is stuck open.


Turns out I ran tcpdump in the wrong window... oops.  There's what appears to be 
a keepalive sequence every 5 seconds.


I still don't understand why the connection wasn't taken down when qemu exited 
on the destination host.


Chris



Re: [libvirt] libvirtd not responding to virsh, results in virsh hanging

2017-03-31 Thread Chris Friesen
Hi, I finally got a chance to take another look at this issue.  We've reproduced 
it in another test lab.  New information below.


On 03/18/2017 12:41 AM, Michal Privoznik wrote:

On 17.03.2017 23:21, Chris Friesen wrote:

Hi,

We've recently run into an issue with libvirt 1.2.17 in the context of
an OpenStack deployment.



Let me just say that 1.2.17 is a rather old libvirt. Can you try with one
of the latest releases to see whether the bug still reproduces?


That's difficult, since the version mismatch seems likely to be part of the problem. 
We haven't seen this issue with migrations between hosts both running libvirtd 1.2.17 
or both running libvirtd 2.0.0, only when the versions are mismatched.


The issue occurs when we are trying to do an in-service upgrade, so the source 
host is running libvirt 1.2.17 and we're trying to live-migrate to a dest host 
that has been upgraded to libvirt 2.0.0.


Interestingly, the issue doesn't always happen, it's intermittent.  We recently 
reproduced it on the fourth guest we live-migrated from the "old" host to the 
"new" host--the first three migrated without difficulty.  (And the first three 
were configured almost identically to the fourth: boot from iscsi, same 
number/type of NICs, same number of vCPUs and amount of RAM, same topology, etc.)


To answer a previous question: yes, we're doing tunnelled migration in this case.


Interestingly, when I hit "c" to continue in the debugger, I got this:

(gdb) c
Continuing.

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f0573fff700 (LWP 186865)]
0x7f05b5cbb1cd in write () from /lib64/libpthread.so.0
(gdb) c
Continuing.
[Thread 0x7f0573fff700 (LWP 186865) exited]
(gdb) quit
A debugging session is active.

 Inferior 1 [process 37471] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/sbin/libvirtd, process 37471



This is because there might be some keepalive traffic going on.  Introduced in
0.9.8, libvirt has a keepalive mechanism in place (repeatedly sending
ping/pong between client & server).  Now, should 5 successive pings get
lost (this is configurable, of course) libvirt thinks the connection is
broken and closes it.  If you attach a debugger to libvirtd, the whole
daemon is paused, along with the event loop, so the server cannot reply to
the client's pings, which in turn makes the client think the connection is
broken.  Thus it closes the connection, which is observed as a broken pipe
in the daemon.
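
Taking that description at face value, with the defaults (a ping every 5 seconds, 5 unanswered pings tolerated) a dead peer should be detected after roughly 25 seconds.  A trivial sketch of that arithmetic, assuming those default values:

```shell
# Assumed defaults from libvirtd.conf (keepalive_interval / keepalive_count);
# their product is roughly how long a silent peer survives before teardown.
keepalive_interval=5   # seconds between keepalive probes
keepalive_count=5      # unanswered probes tolerated
echo "$((keepalive_interval * keepalive_count))"   # -> 25
```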


I've reproduced the issue in another test lab, in this case compute-2 is the 
"old" host while compute-0 and compute-1 are the "new" hosts.  Three guests have 
live-migrated from compute-2 to compute-0, and a fourth appears to be stuck 
in-progress, but libvirtd is hung so any "virsh" commands also hang.


Running "netstat -apn |grep libvirtd" shows an open connection between compute-2 
(192.168.205.134) and compute-0 (192.168.205.24).  Presumably this corresponds 
to the migration that appears to be "stuck" in progress.


compute-2:/home/wrsroot# netstat -atpn|grep libvirtd
tcp        0      0 0.0.0.0:16509           0.0.0.0:*               LISTEN      35787/libvirtd
tcp        0      0 192.168.205.134:51760   192.168.205.24:16509    ESTABLISHED 35787/libvirtd
tcp6       0      0 :::16509                :::*                    LISTEN      35787/libvirtd
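
When scripting this check, the remote endpoint of the stuck connection can be pulled out with awk.  A minimal sketch, using a copy of the netstat output above as canned sample input (in practice you would pipe `netstat -atpn` straight into the same filter):

```shell
# Filter ESTABLISHED libvirtd connections and print the remote endpoint.
# The sample mirrors the netstat output shown above.
sample='tcp        0      0 0.0.0.0:16509           0.0.0.0:*               LISTEN      35787/libvirtd
tcp        0      0 192.168.205.134:51760   192.168.205.24:16509    ESTABLISHED 35787/libvirtd
tcp6       0      0 :::16509                :::*                    LISTEN      35787/libvirtd'

printf '%s\n' "$sample" | awk '$6 == "ESTABLISHED" && $7 ~ /libvirtd/ { print $5 }'
# -> 192.168.205.24:16509
```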




Running "virsh list" on compute-0 shows 9 guests, which agrees with the number 
of running "qemu-kvm" processes.  Interestingly, the guest corresponding to the 
migration with the open libvirtd connection is *not* running and doesn't show up 
in the "virsh list" output.


The /var/log/libvirt/qemu/instance-000e.log file on compute-0 corresponds to 
the instance that libvirtd is "stuck" migrating, and it ends with these lines:


2017-03-29T06:38:37.886940Z qemu-kvm: VQ 2 size 0x80 < last_avail_idx 0x47b - 
used_idx 0x47c
2017-03-29T06:38:37.886974Z qemu-kvm: error while loading state for instance 0x0 
of device ':00:07.0/virtio-balloon'
2017-03-29T06:38:37.888684Z qemu-kvm: load of migration failed: Operation not 
permitted

2017-03-29 06:38:37.896+: shutting down


I think this implies some sort of incompatibility between the different qemu 
versions on the "old" and "new" hosts, but it doesn't explain why libvirtd 
didn't close down the migration connection between the two hosts.


The corresponding libvirtd logs on compute-0 are:

2017-03-29T06:38:35.000 401: warning : qemuDomainObjTaint:3580 : Domain id=10 
name='instance-000e' uuid=57ae849f-aa66-422a-90a2-62db6c59db29 is tainted: 
high-privileges
2017-03-29T06:38:37.000 49075: error : qemuMonitorIO:695 : internal error: End 
of file from monitor
2017-03-29T06:38:37.000 49075: error : qemuProcessReportLogError:1810 : internal 
error: qemu unexpectedly closed the monitor: EAL:eal_memory.c:1591: WARNING: 
Address Space Layout Randomization (ASLR) is enabled in the kernel.
EAL:eal_memory.c:1593:This may cause issues with mapping memory into 
secondary processes
2017-03-29T06:38:37.886940Z qemu-kvm: VQ 2 size 0x80 < last_ava

Re: [libvirt] libvirtd not responding to virsh, results in virsh hanging

2017-03-17 Thread Michal Privoznik
On 17.03.2017 23:21, Chris Friesen wrote:
> Hi,
> 
> We've recently run into an issue with libvirt 1.2.17 in the context of
> an OpenStack deployment.
> 

Let me just say that 1.2.17 is a rather old libvirt. Can you try with one
of the latest releases to see whether the bug still reproduces?

> Occasionally after doing live migrations from a compute node with
> libvirt 1.2.17 to a compute node with libvirt 2.0.0 we see libvirtd on
> the 1.2.17 side stop responding.  When this happens, if you run a
> command like "sudo virsh list" then it just hangs waiting for a response
> from libvirtd.
> 
> Running "ps -elfT|grep libvirtd" shows many threads waiting on a futex,
> but two threads in poll_schedule_timeout() as part of the poll()
> syscall.  On a non-hung libvirtd I only see one thread in
> poll_schedule_timeout().

So it looks like libvirt is waiting for something.

> 
> If I kill and restart libvirtd (this took two tries, it didn't actually
> die the first time) then the problem seems to go away.
> 
> I just tried attaching gdb to the "hung" libvirtd process and running
> "thread apply all backtrace".  This printed backtraces for the threads,
> including the one that was apparently stuck in poll():
> 
> Thread 17 (Thread 0x7f0573fff700 (LWP 186865)):
> #0  0x7f05b59d769d in poll () from /lib64/libc.so.6
> #1  0x7f05b7f01b9a in virNetClientIOEventLoop () from
> /lib64/libvirt.so.0
> #2  0x7f05b7f0234b in virNetClientSendInternal () from
> /lib64/libvirt.so.0
> #3  0x7f05b7f036f3 in virNetClientSendWithReply () from
> /lib64/libvirt.so.0
> #4  0x7f05b7f04eb3 in virNetClientStreamSendPacket () from
> /lib64/libvirt.so.0
> #5  0x7f05b7ed8db5 in remoteStreamFinish () from /lib64/libvirt.so.0
> #6  0x7f05b7ec7eaa in virStreamFinish () from /lib64/libvirt.so.0
> #7  0x7f059bd9323d in qemuMigrationIOFunc () from
> /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so
> #8  0x7f05b7e09aa2 in virThreadHelper () from /lib64/libvirt.so.0
> #9  0x7f05b5cb4dc5 in start_thread () from /lib64/libpthread.so.0
> #10 0x7f05b59e1ced in clone () from /lib64/libc.so.6

This means that libvirt is trying to finish the migration stream (which
happens at the end of the migration process), but the client socket is
not writable. So perhaps the client is broken? Or the connection got
interrupted?

> 
> 
> Interestingly, when I hit "c" to continue in the debugger, I got this:
> 
> (gdb) c
> Continuing.
> 
> Program received signal SIGPIPE, Broken pipe.
> [Switching to Thread 0x7f0573fff700 (LWP 186865)]
> 0x7f05b5cbb1cd in write () from /lib64/libpthread.so.0
> (gdb) c
> Continuing.
> [Thread 0x7f0573fff700 (LWP 186865) exited]
> (gdb) quit
> A debugging session is active.
> 
> Inferior 1 [process 37471] will be detached.
> 
> Quit anyway? (y or n) y
> Detaching from program: /usr/sbin/libvirtd, process 37471
> 

This is because there might be some keepalive traffic going on.  Introduced in
0.9.8, libvirt has a keepalive mechanism in place (repeatedly sending
ping/pong between client & server).  Now, should 5 successive pings get
lost (this is configurable, of course) libvirt thinks the connection is
broken and closes it.  If you attach a debugger to libvirtd, the whole
daemon is paused, along with the event loop, so the server cannot reply to
the client's pings, which in turn makes the client think the connection is
broken.  Thus it closes the connection, which is observed as a broken pipe
in the daemon.

> 
> Now thread 186865 seems to be gone, and libvirtd is no longer hung.
> 
> Has anyone seen anything like this before?  Anyone have an idea where to
> start looking?

I'd say the logs. We need to find out where the problem actually occurs:
whether it is on one side of the migration (I'd bet on the destination), or
perhaps in virsh itself (are you doing tunnelled migration?).

Michal



[libvirt] libvirtd not responding to virsh, results in virsh hanging

2017-03-17 Thread Chris Friesen

Hi,

We've recently run into an issue with libvirt 1.2.17 in the context of an 
OpenStack deployment.


Occasionally after doing live migrations from a compute node with libvirt 1.2.17 
to a compute node with libvirt 2.0.0 we see libvirtd on the 1.2.17 side stop 
responding.  When this happens, if you run a command like "sudo virsh list" then 
it just hangs waiting for a response from libvirtd.


Running "ps -elfT|grep libvirtd" shows many threads waiting on a futex, but two 
threads in poll_schedule_timeout() as part of the poll() syscall.  On a non-hung 
libvirtd I only see one thread in poll_schedule_timeout().
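
A quick way to count them is to group libvirtd's threads by kernel wait channel.  A minimal sketch; on a live system you would feed something like `ps -eLo wchan:30,comm | grep libvirtd` into the filter (the exact `ps` format options are an assumption about your procps build), with a canned sample standing in for the live output here:

```shell
# Group libvirtd threads by wait channel (WCHAN).  Two threads parked in
# poll_schedule_timeout matches the hung state described; a healthy daemon
# shows only one (the event loop).  Sample stands in for live ps output.
sample='futex_wait_queue_me            libvirtd
poll_schedule_timeout          libvirtd
futex_wait_queue_me            libvirtd
poll_schedule_timeout          libvirtd
futex_wait_queue_me            libvirtd'

printf '%s\n' "$sample" | awk '{ n[$1]++ } END { print n["poll_schedule_timeout"] }'
# -> 2
```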


If I kill and restart libvirtd (this took two tries, it didn't actually die the 
first time) then the problem seems to go away.


I just tried attaching gdb to the "hung" libvirtd process and running "thread 
apply all backtrace".  This printed backtraces for the threads, including the 
one that was apparently stuck in poll():


Thread 17 (Thread 0x7f0573fff700 (LWP 186865)):
#0  0x7f05b59d769d in poll () from /lib64/libc.so.6
#1  0x7f05b7f01b9a in virNetClientIOEventLoop () from /lib64/libvirt.so.0
#2  0x7f05b7f0234b in virNetClientSendInternal () from /lib64/libvirt.so.0
#3  0x7f05b7f036f3 in virNetClientSendWithReply () from /lib64/libvirt.so.0
#4  0x7f05b7f04eb3 in virNetClientStreamSendPacket () from 
/lib64/libvirt.so.0
#5  0x7f05b7ed8db5 in remoteStreamFinish () from /lib64/libvirt.so.0
#6  0x7f05b7ec7eaa in virStreamFinish () from /lib64/libvirt.so.0
#7  0x7f059bd9323d in qemuMigrationIOFunc () from 
/usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so

#8  0x7f05b7e09aa2 in virThreadHelper () from /lib64/libvirt.so.0
#9  0x7f05b5cb4dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x7f05b59e1ced in clone () from /lib64/libc.so.6


Interestingly, when I hit "c" to continue in the debugger, I got this:

(gdb) c
Continuing.

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f0573fff700 (LWP 186865)]
0x7f05b5cbb1cd in write () from /lib64/libpthread.so.0
(gdb) c
Continuing.
[Thread 0x7f0573fff700 (LWP 186865) exited]
(gdb) quit
A debugging session is active.

Inferior 1 [process 37471] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/sbin/libvirtd, process 37471


Now thread 186865 seems to be gone, and libvirtd is no longer hung.

Has anyone seen anything like this before?  Anyone have an idea where to start 
looking?


Thanks,
Chris

