Bug#861225: [Regression] Soft lockup in KVM/QEMU virtual machine

2017-04-30 Thread Olav Seyfarth
> try 3.16.43-2, should be available on mirrors in a day or two

Did that just now, since my guests received that kernel update.
Booting fine again, no issues starting/running guests.

Seems Solved, please close this bug.
Thanks a lot for your work, all kernel devs!!



Bug#861225: [Regression] Soft lockup in KVM/QEMU virtual machine

2017-04-30 Thread Salvatore Bonaccorso
Hi Olav,

On Sun, Apr 30, 2017 at 08:54:30PM +0200, Olav Seyfarth wrote:
> Will it appear in proposed? I will test it for sure and will let you
> know. But I am AFK till next Saturday ...

Yes, in -proposed. It is scheduled for inclusion in the next point
release, which is indeed due next weekend:

https://lists.debian.org/debian-live/2017/04/msg3.html

Regards,
Salvatore



Bug#861225: [Regression] Soft lockup in KVM/QEMU virtual machine

2017-04-30 Thread Olav Seyfarth
Will it appear in proposed? I will test it for sure and will let you know.
But I am AFK till next Saturday ...



Bug#861225: [Regression] Soft lockup in KVM/QEMU virtual machine

2017-04-30 Thread Ben Hutchings
On Fri, 2017-04-28 at 18:11 +0200, Olav Seyfarth wrote:
> Hi Ben,
> 
> first, thanks for your patience, very much appreciated. I know how hard
> debugging can be, I'm helping with Enigmail if I have time to do so.
> I tried to write clearly but now see that I did not succeed:
> 
> > Based on your original report, giving a kernel log from the guest
> > (which has also been upgraded), I thought you were reporting an issue
> > triggered by upgrading the guest kernel. Now I think what you're
> > actually reporting is that upgrading the host kernel causes guests
> > to crash. Is that correct?
> 
> No. Host and guests received the (unattended) upgrade, but downgrading
> the _host_ (only) returned the system to a stable state. You might have
> spotted "Guests still are on 3.16.43-1" in my original report and
> deduced that the host seems to be the culprit. Well hidden, I agree.
> Sorry for that! So maybe you want to rephrase the bug title (again).
> 
> While investigating why my guests did not start, I tried to start them
> using virsh --console - and received (nothing) for some minutes. Just as
> I was about to kill the terminal, those kernel panic messages appeared.
> So I saved them, not aware at that time that it was the host's console
> messages being shown. (At least I now think that it was.)
[...]

I think this is the same bug as #861313, which is now fixed in version
3.16.43-2.  That should be available on mirrors in a day or two.  Let
us know if it works for you.

Ben.

-- 
Ben Hutchings
This sentence contradicts itself - no actually it doesn't.



Bug#861225: [Regression] Soft lockup in KVM/QEMU virtual machine

2017-04-28 Thread Olav Seyfarth
Hi Ben,

first, thanks for your patience, very much appreciated. I know how hard
debugging can be, I'm helping with Enigmail if I have time to do so.
I tried to write clearly but now see that I did not succeed:

> Based on your original report, giving a kernel log from the guest
> (which has also been upgraded), I thought you were reporting an issue
> triggered by upgrading the guest kernel. Now I think what you're
> actually reporting is that upgrading the host kernel causes guests
> to crash. Is that correct?

No. Host and guests received the (unattended) upgrade, but downgrading
the _host_ (only) returned the system to a stable state. You might have
spotted "Guests still are on 3.16.43-1" in my original report and
deduced that the host seems to be the culprit. Well hidden, I agree.
Sorry for that! So maybe you want to rephrase the bug title (again).

While investigating why my guests did not start, I tried to start them
using virsh --console - and received (nothing) for some minutes. Just as
I was about to kill the terminal, those kernel panic messages appeared.
So I saved them, not aware at that time that it was the host's console
messages being shown. (At least I now think that it was.)

> were reporting an issue triggered by upgrading the guest kernel

To clarify: The crash only happened upon rebooting the whole system.
The unattended upgrade installed the new kernel but did not reload it.
I rebooted because of a PHP and MySQL upgrade, to make sure the new
versions were active and that THEY would come up correctly upon reboot.

> If so, can you check whether the host kernel logs anything when this
> happens, and send that?

Now, since it seems necessary, what I can do (on the host) is remove the
APT pin, run apt update and upgrade, then reboot, open a terminal from my
laptop to the host (how do I make sure to get console output there?) and
start some VM guests to make the host crash. Then copy the console output
from the terminal and revert all changes (re-activate the pin, downgrade,
reboot, fire up guests).

Since I want to avoid having to do this multiple times, what exactly do
I need to capture?
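
As a rough illustration of what "capturing" could look like once the console output has been saved to a file, here is a minimal Python sketch that keeps everything from shortly before the first problem report to the end of the log. The marker strings and the sample lines are assumptions made up for illustration; they are not taken from this bug, and the actual messages may look different.

```python
import re

# Strings that commonly introduce a kernel problem report.
# Assumption: this list is illustrative, not exhaustive.
MARKER = re.compile(r"(BUG:|Oops|Call Trace:|soft lockup)")

def from_first_report(log_lines, context=5):
    """Return the log from `context` lines before the first marker to the end."""
    for i, line in enumerate(log_lines):
        if MARKER.search(line):
            return log_lines[max(0, i - context):]
    return []  # no marker found

# Hypothetical sample lines, invented for this sketch.
sample = [
    "[  100.000000] example: unrelated message",
    "[  200.000000] BUG: hypothetical first report line",
    "[  200.000001] Call Trace:",
    "[  300.000000] BUG: hypothetical later report line",
]

for line in from_first_report(sample, context=1):
    print(line)
```

Starting from the first report rather than the last one matters because later messages are often follow-on effects of the earlier one.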

> But I can't fix a bug if I don't understand what the bug is or how to
> reproduce it!

Thought you'd say that. I just hoped that the material I sent in answer to
your remark "you cut too much" might already have helped.

Olav



Bug#861225: [Regression] Soft lockup in KVM/QEMU virtual machine

2017-04-27 Thread Ben Hutchings
On Thu, 2017-04-27 at 11:34 +0200, Olav Seyfarth wrote:
> Hi Ben,
> 
> > [Reply to all, not just to me]
> 
> sorry, using my mobile phone email client I did not notice that.
> 
> > You cut too much.
> 
> Below my message is what I did cut (running the older, stable kernel).
> 
> Might any of the packages installed unattended tonight have any
> influence on the "Soft lockup in KVM/QEMU virtual machine"?

Probably not.

> tail -4 /var/log/apt/history.log
> Start-Date: 2017-04-26  06:26:19
> Commandline: /usr/bin/unattended-upgrade
> Upgrade: multiarch-support:amd64 (2.19-18+deb8u7, 2.19-18+deb8u8),
> libc-bin:amd64 (2.19-18+deb8u7, 2.19-18+deb8u8), libc6:amd64
> (2.19-18+deb8u7, 2.19-18+deb8u8), minicom:amd64 (2.7-1, 2.7-1+deb8u1),
> libxslt1.1:amd64 (1.1.28-2+deb8u2, 1.1.28-2+deb8u3)
> End-Date: 2017-04-26  06:26:55
> 
> 
> > This indicates there was an earlier BUG logged; please send that too.
> 
> Is that necessary for this bug? I ask since I hesitate to deliberately
> break my production server (it hosts all my internal and external services
> like e-mail and file service). "Never touch a running system." ...

Based on your original report, giving a kernel log from the guest
(which has also been upgraded), I thought you were reporting an issue
triggered by upgrading the guest kernel.

Now I think what you're actually reporting is that upgrading the host
kernel causes guests to crash.  Is that correct?  If so, can you check
whether the host kernel logs anything when this happens, and send that?

> I would like to do that once there is a newer kernel in proposed or
> security that needs to be tested anyway. Would that be OK, too?
[...]

But I can't fix a bug if I don't understand what the bug is or how to
reproduce it!

Ben.

-- 
Ben Hutchings
I say we take off; nuke the site from orbit.  It's the only way to be
sure.



Bug#861225: [Regression] Soft lockup in KVM/QEMU virtual machine

2017-04-27 Thread Olav Seyfarth
Hi Ben,

> [Reply to all, not just to me]

sorry, using my mobile phone email client I did not notice that.

> You cut too much.

Below my message is what I did cut (running the older, stable kernel).

Might any of the packages installed unattended tonight have any
influence on the "Soft lockup in KVM/QEMU virtual machine"?

tail -4 /var/log/apt/history.log
Start-Date: 2017-04-26  06:26:19
Commandline: /usr/bin/unattended-upgrade
Upgrade: multiarch-support:amd64 (2.19-18+deb8u7, 2.19-18+deb8u8),
libc-bin:amd64 (2.19-18+deb8u7, 2.19-18+deb8u8), libc6:amd64
(2.19-18+deb8u7, 2.19-18+deb8u8), minicom:amd64 (2.7-1, 2.7-1+deb8u1),
libxslt1.1:amd64 (1.1.28-2+deb8u2, 1.1.28-2+deb8u3)
End-Date: 2017-04-26  06:26:55
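
To make the upgrade list above easier to scan, here is a minimal Python sketch that splits an "Upgrade:" field into package, architecture, old version and new version. The function name and the regular expression are my own assumptions about the field layout, based only on the excerpt above; the sample text is copied from it.

```python
import re

# One entry looks like: "libc6:amd64 (2.19-18+deb8u7, 2.19-18+deb8u8)"
ENTRY = re.compile(
    r"(?P<pkg>[^\s:,]+):(?P<arch>\S+) \((?P<old>[^,\s]+), (?P<new>[^)\s]+)\)"
)

def parse_upgrade_field(text):
    """Return (package, arch, old_version, new_version) tuples."""
    # Mail clients wrap long lines, so collapse all whitespace runs first.
    flat = " ".join(text.split())
    if flat.startswith("Upgrade:"):
        flat = flat[len("Upgrade:"):]
    return [m.group("pkg", "arch", "old", "new") for m in ENTRY.finditer(flat)]

sample = """Upgrade: multiarch-support:amd64 (2.19-18+deb8u7, 2.19-18+deb8u8),
libc-bin:amd64 (2.19-18+deb8u7, 2.19-18+deb8u8), libc6:amd64
(2.19-18+deb8u7, 2.19-18+deb8u8), minicom:amd64 (2.7-1, 2.7-1+deb8u1),
libxslt1.1:amd64 (1.1.28-2+deb8u2, 1.1.28-2+deb8u3)"""

for pkg, arch, old, new in parse_upgrade_field(sample):
    print(f"{pkg} ({arch}): {old} -> {new}")
```

Run on the excerpt, this lists five packages, none of which is a kernel package.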


> This indicates there was an earlier BUG logged; please send that too.

Is that necessary for this bug? I ask since I hesitate to deliberately
break my production server (it hosts all my internal and external services
like e-mail and file service). "Never touch a running system." ...

I would like to do that once there is a newer kernel in proposed or
security that needs to be tested anyway. Would that be OK, too?

> connect a serial port on the VM to a pty device, then use 'screen'

I think I'll manage that.

Olav
___

-- Package-specific info:
** Version:
Linux version 3.16.0-4-amd64 (debian-ker...@lists.debian.org) (gcc
version 4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07)

** Command line:
BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64
root=UUID=2a4af6bf-3d76-491c-900b-e3462dafe143 ro quiet

** Not tainted

** Kernel log:
[ 9851.668942] ata1.00: exception Emask 0x10 SAct 0x1 SErr 0x400100 action 0x6 frozen
[ 9851.668978] ata1.00: irq_stat 0x0800, interface fatal error
[ 9851.668994] ata1: SError: { UnrecovData Handshk }
[ 9851.669007] ata1.00: failed command: WRITE FPDMA QUEUED
[ 9851.669024] ata1.00: cmd 61/20:80:a0:22:cd/00:00:04:00:00/40 tag 16 ncq 16384 out
 res 40/00:84:a0:22:cd/00:00:04:00:00/40 Emask 0x10 (ATA bus error)
[ 9851.669061] ata1.00: status: { DRDY }
[ 9851.669072] ata1: hard resetting link
[ 9851.988985] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 9851.990798] ata1.00: configured for UDMA/133
[ 9851.990804] ata1: EH complete
[19431.306953] ata1.00: exception Emask 0x10 SAct 0x10 SErr 0x400100 action 0x6 frozen
[19431.307001] ata1.00: irq_stat 0x0800, interface fatal error
[19431.307034] ata1: SError: { UnrecovData Handshk }
[19431.307062] ata1.00: failed command: WRITE FPDMA QUEUED
[19431.307094] ata1.00: cmd 61/f8:a0:00:ec:db/03:00:a3:00:00/40 tag 20 ncq 520192 out
 res 40/00:a4:00:ec:db/00:00:a3:00:00/40 Emask 0x10 (ATA bus error)
[19431.307172] ata1.00: status: { DRDY }
[19431.307194] ata1: hard resetting link
[19431.626962] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[19431.628131] ata1.00: configured for UDMA/133
[19431.628137] ata1: EH complete
[20792.458164] ata1.00: exception Emask 0x10 SAct 0x2000 SErr 0x400100 action 0x6 frozen
[20792.458201] ata1.00: irq_stat 0x0800, interface fatal error
[20792.458225] ata1: SError: { UnrecovData Handshk }
[20792.458246] ata1.00: failed command: WRITE FPDMA QUEUED
[20792.458269] ata1.00: cmd 61/00:68:00:5c:df/04:00:a3:00:00/40 tag 13 ncq 524288 out
 res 40/00:6c:00:5c:df/00:00:a3:00:00/40 Emask 0x10 (ATA bus error)
[20792.458327] ata1.00: status: { DRDY }
[20792.458343] ata1: hard resetting link
[20792.778167] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[20792.779473] ata1.00: configured for UDMA/133
[20792.779480] ata1: EH complete
[20812.766645] ata1.00: exception Emask 0x10 SAct 0x30 SErr 0x400100 action 0x6 frozen
[20812.766681] ata1.00: irq_stat 0x0800, interface fatal error
[20812.766697] ata1: SError: { UnrecovData Handshk }
[20812.766711] ata1.00: failed command: WRITE FPDMA QUEUED
[20812.766727] ata1.00: cmd 61/00:20:00:54:e1/04:00:a3:00:00/40 tag 4 ncq 524288 out
 res 40/00:24:00:54:e1/00:00:a3:00:00/40 Emask 0x10 (ATA bus error)
[20812.768142] ata1.00: status: { DRDY }
[20812.768849] ata1.00: failed command: WRITE FPDMA QUEUED
[20812.769548] ata1.00: cmd 61/40:28:c0:58:e1/03:00:a3:00:00/40 tag 5 ncq 425984 out
 res 40/00:24:00:54:e1/00:00:a3:00:00/40 Emask 0x10 (ATA bus error)
[20812.770946] ata1.00: status: { DRDY }
[20812.771793] ata1: hard resetting link
[20813.090689] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[20813.100289] ata1.00: configured for UDMA/133
[20813.100297] ata1: EH complete
[36112.342766] ata1.00: exception Emask 0x10 SAct 0x100 SErr 0x400100 action 0x6 frozen
[36112.343493] ata1.00: irq_stat 0x0800, interface fatal error
[36112.344197] ata1: SError: { UnrecovData Handshk }
[36112.344894] ata1.00: failed command: WRITE FPDMA QUEUED
[36112.345594] ata1.00: cmd 61/08:40:98:41:30/00:00:00:00:00/40 tag 8 ncq 4096 out
 res 40/00:44:98:41:30/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[36112.347006]
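
The kernel log excerpt above repeats the same ata1 error pattern several times. As a reading aid, here is a minimal Python sketch that tallies the exception events and the failed commands in such a log. The function name and patterns are my own assumptions based on the lines shown; the sample is copied from the excerpt, and the sketch says nothing about whether these SATA errors are related to the soft lockup.

```python
import re
from collections import Counter

# Match only the start of each line, so wrapped continuation lines are ignored.
EXCEPTION = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\] (?P<dev>ata\d+(?:\.\d+)?): exception Emask"
)
FAILED = re.compile(
    r"^\[\s*\d+\.\d+\] (?P<dev>ata\d+(?:\.\d+)?): failed command: (?P<cmd>.+)$"
)

def summarize_ata_errors(log_text):
    """Return (exception timestamps in seconds, Counter of failed commands)."""
    timestamps = []
    commands = Counter()
    for line in log_text.splitlines():
        m = EXCEPTION.match(line)
        if m:
            timestamps.append(float(m.group("ts")))
            continue
        m = FAILED.match(line)
        if m:
            commands[m.group("cmd").strip()] += 1
    return timestamps, commands

sample = """\
[ 9851.668942] ata1.00: exception Emask 0x10 SAct 0x1 SErr 0x400100
action 0x6 frozen
[ 9851.669007] ata1.00: failed command: WRITE FPDMA QUEUED
[19431.306953] ata1.00: exception Emask 0x10 SAct 0x10 SErr 0x400100
action 0x6 frozen
[19431.307062] ata1.00: failed command: WRITE FPDMA QUEUED
"""

timestamps, commands = summarize_ata_errors(sample)
print(len(timestamps), "exceptions, first at", timestamps[0], "s since boot")
print(dict(commands))
```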