Re: Guest reboot issues since QEMU 6.0 and Linux 5.11

2022-08-02 Thread Fiona Ebner
On 28.07.22 at 12:13, Yan Vugenfirer wrote:
> Hi Fabian,
> 
> Can you save the dump file with QEMU monitor using dump-guest-memory or with 
> virsh dump?
> Then you can use elf2dmp (it is compiled with QEMU and found in the “contrib” 
> folder) to convert the dump file to WinDbg format and examine the stack. 
> 

Hi Yan,
thank you for the suggestion!

So for the two VMs in the KVM_EXIT_SHUTDOWN-qemu_system_reset-loop, I get

> 2 CPU states has been found
> CPU #0 CR3 is 0x0080
> DirectoryTableBase = 0x000fffd08000 has been found from CPU #0 as 
> interrupt handling CR3
> [1]4169758 segmentation fault (core dumped)  elf2dmp memdump.elf 
> memdump.dmp

I tried twice more, hoping for better timing, but the results were the
same (I haven't looked into why it segfaults yet). For the second VM
there is no segfault, but still an error upon conversion:

> 2 CPU states has been found
> CPU #0 CR3 is 0x0080
> DirectoryTableBase = 0x has been found from CPU #0 as 
> interrupt handling CR3
> Failed to find paging base


For the VM with the spinning circles, the dump was at least converted
successfully, but I don't have any experience with WinDbg and
nothing interesting jumps out at me:

> Microsoft (R) Windows Debugger Version 10.0.22621.1 AMD64
> Copyright (c) Microsoft Corporation. All rights reserved.
> 
> 
> Loading Dump File [F:\win-reboot-dump\memdump-circles.dmp]
> Kernel Complete Dump File: Full address space is available
> 
> Comment: 'Hello from elf2dmp!'
> Symbol search path is: srv*
> Executable search path is: 
> Windows 8.1 Kernel Version 9600 MP (2 procs) Free x64
> Product: Server, suite: TerminalServer SingleUserTS
> Edition build lab: 9600.19913.amd64fre.winblue_ltsb_escrow.201207-1920
> Machine Name:
> Kernel base = 0xf802`05073000 PsLoadedModuleList = 0xf802`053385d0
> System Uptime: 0 days 0:00:52.919
> Loading Kernel Symbols
> ...
> .
> Loading User Symbols
> 
> Loading unloaded module list
> ..
> Unknown exception - code  (first/second chance not available)
> For analysis of this file, run !analyze -v
> 0: kd> ~0
> 0: kd> kb
>  # RetAddr   : Args to Child  
>  : Call Site
> 00 f802`0526a0ad : e002`00519650 e002`00519410 
> ` f802`05354180 : hal!HalProcessorIdle+0xf
> 01 f802`05168a50 : f802`05354180 f802`066a2300 
> ` e002`00519410 : nt!PpmIdleDefaultExecute+0x1d
> 02 f802`050dd186 : f802`05354180 f802`066a23cc 
> f802`066a23d0 f802`066a23d8 : nt!PpmIdleExecuteTransition+0x400
> 03 f802`051b71ac : f802`05354180 f802`05354180 
> f802`053bba00 ` : nt!PoIdle+0x2f6
> 04 ` : f802`066a3000 f802`0669c000 
> ` ` : nt!KiIdleLoop+0x2c
> 0: kd> r
> rax= rbx= rcx=0086
> rdx= rsi= rdi=e00200519410
> rip=f8020501e81f rsp=f802066a2258 rbp=f802066a2339
>  r8=  r9= r10=0002
> r11=0001 r12= r13=0001
> r14=1f8de167 r15=
> iopl=0 nv up ei ng nz na pe nc
> cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b efl=0282
> hal!HalProcessorIdle+0xf:
> f802`0501e81f c3  ret
> 0: kd> ~1
> 1: kd> kb
>  # RetAddr   : Args to Child  
>  : Call Site
> 00 f802`0526a0ad : e002`00521b20 e002`005218e0 
> ` d000`203da180 : hal!HalProcessorIdle+0xf
> 01 f802`05168a50 : d000`203da180 d000`203f8300 
> ` 018e`b7f04213 : nt!PpmIdleDefaultExecute+0x1d
> 02 f802`050dd186 : d000`203da180 d000`203f83cc 
> d000`203f83d0 d000`203f83d8 : nt!PpmIdleExecuteTransition+0x400
> 03 f802`051b71ac : d000`203da180 d000`203da180 
> d000`203ea300 ` : nt!PoIdle+0x2f6
> 04 ` : d000`203f9000 d000`203f2000 
> ` ` : nt!KiIdleLoop+0x2c
> 1: kd> r
> rax=0020 rbx= rcx=0086
> rdx= rsi= rdi=e002005218e0
> rip=f8020501e81f rsp=d000203f8258 rbp=d000203f8339
>  r8=  r9= r10=0002
> r11=0001 r12= r13=0001
> r14=000128d624996edd r15=
> iopl=0 nv up ei ng nz na pe nc
> cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b efl=0282
> hal!HalProcessorIdle+0xf:
> f802`0501e81f c3  ret

Is there anything I should be looking at in particular?

I took a second 

Re: Guest reboot issues since QEMU 6.0 and Linux 5.11

2022-07-28 Thread Yan Vugenfirer
Hi Fabian,

Can you save the dump file with QEMU monitor using dump-guest-memory or with 
virsh dump?
Then you can use elf2dmp (it is compiled with QEMU and found in the “contrib” folder) 
to convert the dump file to WinDbg format and examine the stack. 
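
A minimal sketch of the two steps above, assuming the dump is requested
over a QMP socket rather than the human monitor. The dump-guest-memory
QMP command and its paging/protocol/format arguments are QEMU's; the
helper names and the /tmp paths are made up for illustration, and the
code only builds the messages and the elf2dmp command line without
contacting a running VM.

```python
import json


def qmp_dump_guest_memory(path, paging=False, fmt="elf"):
    """Build the QMP message asking QEMU to write a guest memory dump to `path`."""
    return {
        "execute": "dump-guest-memory",
        "arguments": {
            "paging": paging,            # False: dump guest-physical memory as-is
            "protocol": "file:" + path,  # QEMU expects a "file:" prefix here
            "format": fmt,               # "elf" is the input format elf2dmp reads
        },
    }


def qmp_session(path):
    """Messages to send, in order, after reading the QMP greeting."""
    return [
        {"execute": "qmp_capabilities"},  # leave capability negotiation mode
        qmp_dump_guest_memory(path),
    ]


def elf2dmp_argv(elf_path, dmp_path):
    """Command line for converting the ELF dump into a WinDbg dump file."""
    return ["elf2dmp", elf_path, dmp_path]


if __name__ == "__main__":
    # Hypothetical paths, chosen only for this example.
    for msg in qmp_session("/tmp/memdump.elf"):
        print(json.dumps(msg))
    print(" ".join(elf2dmp_argv("/tmp/memdump.elf", "/tmp/memdump.dmp")))
```

Each printed JSON line would be written to the VM's QMP socket, and the
last line is the conversion command to run once the dump has completed.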


Best regards,
Yan.


> On 21 Jul 2022, at 3:49 PM, Fabian Ebner wrote:
> 
> Hi,
> since about half a year ago, we're getting user reports about guest
> reboot issues with KVM/QEMU[0].
> 
> The most common scenario is a Windows Server VM (2012R2/2016/2019,
> UEFI/OVMF and SeaBIOS) getting stuck during the screen with the Windows
> logo and the spinning circles after a reboot was triggered from within
> the guest. Quitting the kvm process and booting with a fresh instance
> works. The issue seems to become more likely, the longer the kvm
> instance runs.
> 
> We did not get such reports while we were providing Linux 5.4 and QEMU
> 5.2.0, but we do with Linux 5.11/5.13/5.15 and QEMU 6.x.
> 
> I'm just wondering if anybody has seen this issue before or might have a
> hunch what it's about? Any tips on what to look out for when debugging
> are also greatly appreciated!
> 
> We do have debug access to a user's test VM and the VM state was saved
> before a problematic reboot, but I can't modify the host system there.
> AFAICT QEMU just executes guest code as usual, but I'm really not sure
> what to look out for.
> 
> That VM has CPU type host, and a colleague did have a similar enough CPU
> to load the VM state, but for him, the reboot went through normally. On
> the user's system, it triggers consistently after loading the VM state
> and rebooting.
> 
> So unfortunately, we didn't manage to reproduce the issue locally yet.
> With two other images provided by users, we ran into a boot loop, where
> QEMU resets the CPUs and does a few KVM_RUNs before the exit reason is
> KVM_EXIT_SHUTDOWN (which to my understanding indicates a triple fault)
> and then it repeats. It's not clear if the issues are related.
> 
> There are also a few reports about non-Windows VMs, mostly Ubuntu 20.04
> with UEFI/OVMF, but again, it's not clear if the issues are related.
> 
> [0]: https://forum.proxmox.com/threads/100744/
> (the forum thread is a bit chaotic unfortunately).
> 
> Best Regards,
> Fabi




Re: Guest reboot issues since QEMU 6.0 and Linux 5.11

2022-07-22 Thread Fiona Ebner
On 21.07.22 at 17:51, Maxim Levitsky wrote:
> On Thu, 2022-07-21 at 14:49 +0200, Fabian Ebner wrote:
>> Hi,
>> since about half a year ago, we're getting user reports about guest
>> reboot issues with KVM/QEMU[0].
>>
>> The most common scenario is a Windows Server VM (2012R2/2016/2019,
>> UEFI/OVMF and SeaBIOS) getting stuck during the screen with the Windows
>> logo and the spinning circles after a reboot was triggered from within
>> the guest. Quitting the kvm process and booting with a fresh instance
>> works. The issue seems to become more likely, the longer the kvm
>> instance runs.
>>
>> We did not get such reports while we were providing Linux 5.4 and QEMU
>> 5.2.0, but we do with Linux 5.11/5.13/5.15 and QEMU 6.x.
>>
>> I'm just wondering if anybody has seen this issue before or might have a
>> hunch what it's about? Any tips on what to look out for when debugging
>> are also greatly appreciated!
>>
>> We do have debug access to a user's test VM and the VM state was saved
>> before a problematic reboot, but I can't modify the host system there.
>> AFAICT QEMU just executes guest code as usual, but I'm really not sure
>> what to look out for.
>>
>> That VM has CPU type host, and a colleague did have a similar enough CPU
>> to load the VM state, but for him, the reboot went through normally. On
>> the user's system, it triggers consistently after loading the VM state
>> and rebooting.
>>
>> So unfortunately, we didn't manage to reproduce the issue locally yet.
>> With two other images provided by users, we ran into a boot loop, where
>> QEMU resets the CPUs and does a few KVM_RUNs before the exit reason is
>> KVM_EXIT_SHUTDOWN (which to my understanding indicates a triple fault)
>> and then it repeats. It's not clear if the issues are related.
> 
> 
> Does the guest have Hyper-V enabled in it (that is, nested virtualization)?
> 

For all three machines described above
Get-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V
indicates that Hyper-V is disabled.

> Intel or AMD?
> 

We do have reports for both Intel and AMD.

> Does the VM use secure boot / SMM?
> 

The customer VM which can reliably trigger the issue after loading the
state and rebooting uses SeaBIOS. For the other two VMs,
Confirm-SecureBootUEFI
returns "False".

SMM might be a lead! We did disable SMM in the past, because apparently
there were problems with it (didn't dig out which, was before I worked
here), and the timing of enabling it and the reports coming in would
match. I guess (some) guest OSes don't expect it to be suddenly turned on?

However, there is a report of a user with two clusters with QEMU 5.2,
one with kernel 5.4 without the issue and one with kernel 5.11 with the
issue (Windows VM with spinning circles). So that's confusing :/


We do use some additional options if the OS type is "Windows" in our
high-level configuration, including Hyper-V enlightenments:

> -cpu 'host,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt'
> -no-hpet
> -rtc 'driftfix=slew,base=localtime'
> -global 'kvm-pit.lost_tick_policy=discard'

But one user reported running into the issue even with OS type "other",
i.e. when the above options are not present and CPU flags should be just
'+kvm_pv_eoi,+kvm_pv_unhalt'. There are also reports with CPU type
different from 'host', also with 'kvm64' (where we automatically set the
flags +lahf_lm,+sep).
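
To make the difference between the two configurations explicit, here is
a rough sketch that splits a -cpu value into its model and flags and
lists what only the "Windows" variant sets. The parser is a simplified
stand-in written for this mail, not QEMU's own option parsing, and the
'host' model used for the "other" case is an assumption (only its flags
are given above).

```python
def parse_cpu_arg(value):
    """Split a QEMU -cpu value into (model, {flag_name: setting})."""
    model, *props = value.split(",")
    flags = {}
    for prop in props:
        if not prop:
            continue  # tolerate a stray comma
        if "=" in prop:
            # key=value form, e.g. hv_spinlocks=0x1fff
            name, setting = prop.split("=", 1)
        elif prop[0] in "+-":
            # +flag / -flag form, e.g. +kvm_pv_eoi
            name, setting = prop[1:], prop[0] == "+"
        else:
            # bare flag, e.g. hv_relaxed
            name, setting = prop, True
        flags[name] = setting
    return model, flags


# -cpu value used for OS type "Windows" (copied from the options above).
WINDOWS = ("host,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,"
           "hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,"
           "+kvm_pv_eoi,+kvm_pv_unhalt")
# OS type "other": the flags are from the report, the model is assumed.
OTHER = "host,+kvm_pv_eoi,+kvm_pv_unhalt"

_, windows_flags = parse_cpu_arg(WINDOWS)
_, other_flags = parse_cpu_arg(OTHER)

# Flags present only in the "Windows" configuration.
only_windows = sorted(set(windows_flags) - set(other_flags))
print(only_windows)
```

The resulting list contains only the hv_* enlightenments, so a report
with OS type "other" suggests they are not required to hit the issue.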


Thank you and Best Regards,
Fiona

P.S. Please don't mind the (from your perspective sudden) name change.
I'm still the same person and don't intend to change it again :)

> Best regards,
>   Maxim Levitsky
> 
>>
>> There are also a few reports about non-Windows VMs, mostly Ubuntu 20.04
>> with UEFI/OVMF, but again, it's not clear if the issues are related.
>>
>> [0]: https://forum.proxmox.com/threads/100744/
>> (the forum thread is a bit chaotic unfortunately).
>>
>> Best Regards,
>> Fabi
>>




Re: Guest reboot issues since QEMU 6.0 and Linux 5.11

2022-07-21 Thread Maxim Levitsky
On Thu, 2022-07-21 at 14:49 +0200, Fabian Ebner wrote:
> Hi,
> since about half a year ago, we're getting user reports about guest
> reboot issues with KVM/QEMU[0].
> 
> The most common scenario is a Windows Server VM (2012R2/2016/2019,
> UEFI/OVMF and SeaBIOS) getting stuck during the screen with the Windows
> logo and the spinning circles after a reboot was triggered from within
> the guest. Quitting the kvm process and booting with a fresh instance
> works. The issue seems to become more likely, the longer the kvm
> instance runs.
> 
> We did not get such reports while we were providing Linux 5.4 and QEMU
> 5.2.0, but we do with Linux 5.11/5.13/5.15 and QEMU 6.x.
> 
> I'm just wondering if anybody has seen this issue before or might have a
> hunch what it's about? Any tips on what to look out for when debugging
> are also greatly appreciated!
> 
> We do have debug access to a user's test VM and the VM state was saved
> before a problematic reboot, but I can't modify the host system there.
> AFAICT QEMU just executes guest code as usual, but I'm really not sure
> what to look out for.
> 
> That VM has CPU type host, and a colleague did have a similar enough CPU
> to load the VM state, but for him, the reboot went through normally. On
> the user's system, it triggers consistently after loading the VM state
> and rebooting.
> 
> So unfortunately, we didn't manage to reproduce the issue locally yet.
> With two other images provided by users, we ran into a boot loop, where
> QEMU resets the CPUs and does a few KVM_RUNs before the exit reason is
> KVM_EXIT_SHUTDOWN (which to my understanding indicates a triple fault)
> and then it repeats. It's not clear if the issues are related.


Does the guest have Hyper-V enabled in it (that is, nested virtualization)?

Intel or AMD?

Does the VM use secure boot / SMM?

Best regards,
Maxim Levitsky

> 
> There are also a few reports about non-Windows VMs, mostly Ubuntu 20.04
> with UEFI/OVMF, but again, it's not clear if the issues are related.
> 
> [0]: https://forum.proxmox.com/threads/100744/
> (the forum thread is a bit chaotic unfortunately).
> 
> Best Regards,
> Fabi