Public bug reported:

BugLink: https://bugs.launchpad.net/bugs/1911848

[Impact]

On Cascade Lake based KVM hosts, Windows Server 2k16 and 2k19 guests will
fail to start once they have enabled the Hyper-V role for nested
virtualisation.

On Windows Server systems with the desktop environment installed, the
guest gets stuck in the late stages of boot, before the graphical login
screen appears.

If you look at performance metrics for the guest, the CPU will appear to
be stuck at 100% and never drops. The Windows Server guest is
unresponsive.

The KVM settings use Cascadelake-Server-noTSX virtual CPUs, with some
very specific settings needed for nested virtualisation. See the
[Testcase] section. If you use any other vcpu type, the problem does not
reproduce.

The known workaround is to install the 5.8 HWE kernel on the KVM host, in
which case the guest will come up as expected.
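
As a rough aid for the workaround above, the following sketch checks whether a kernel release string is at least 5.8. The function names are hypothetical, and the check is only a heuristic: a distro kernel older than 5.8 that carries a backport of the fix would still report False here.

```python
import platform
import re


def kernel_at_least(release, minimum=(5, 8)):
    """Return True if a release string such as '5.8.0-38-generic' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # Compare only the major and minor numbers as a tuple.
    return (int(match.group(1)), int(match.group(2))) >= tuple(minimum)


def running_kernel_at_least(minimum=(5, 8)):
    """Apply the same check to the kernel this script is running on."""
    return kernel_at_least(platform.release(), minimum)
```

Run running_kernel_at_least() on the KVM host itself, not inside the guest.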

[Fix]

The following commit fixes the issue, and landed in mainline 5.8-rc1:

commit 8081ad06b68a728e676d3b08e9ab70ce4039747b
Author: Sean Christopherson <sea...@google.com>
Date:   Wed Apr 22 19:25:40 2020 -0700
Subject: KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit set
Link: https://github.com/torvalds/linux/commit/8081ad06b68a728e676d3b08e9ab70ce4039747b

It appears that pending requests to the hypervisor can be lost or
delayed if an immediate exit was requested in vcpu_enter_guest(). As the
commit message mentions, only the !injected case is affected, so the fix
adds a check at the cancel_injection label to see whether the run was
cancelled with req_immediate_exit set, and re-issues a KVM_REQ_EVENT
request if it was.

The Windows guest is waiting for an event to be processed, which never
happens, and so gets stuck.
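
The lost-request behaviour described above can be illustrated with a toy model. This is not kernel code: ToyVcpu, toy_enter_guest and simulate are hypothetical names, and the model assumes only what the description states, namely that the immediate-exit flag is decided while KVM_REQ_EVENT is processed and that a cancelled run discards it unless the request is re-issued.

```python
from dataclasses import dataclass, field

KVM_REQ_EVENT = "KVM_REQ_EVENT"


@dataclass
class ToyVcpu:
    requests: set = field(default_factory=set)
    event_waiting: bool = True      # an event the guest is waiting on
    event_delivered: bool = False


def toy_enter_guest(vcpu, run_cancelled, with_fix):
    # Assumption: the flag is per-call state and starts cleared every time.
    req_immediate_exit = False
    if KVM_REQ_EVENT in vcpu.requests:
        vcpu.requests.discard(KVM_REQ_EVENT)
        if vcpu.event_waiting:
            # The event cannot be delivered yet, so ask for an immediate exit.
            req_immediate_exit = True
    if run_cancelled:
        # Toy stand-in for the cancel_injection label.
        if with_fix and req_immediate_exit:
            vcpu.requests.add(KVM_REQ_EVENT)
        return
    if req_immediate_exit:
        # The run went ahead and exited immediately; the event is handled.
        vcpu.event_waiting = False
        vcpu.event_delivered = True


def simulate(with_fix):
    """Cancel the first run, let the second proceed, report delivery."""
    vcpu = ToyVcpu(requests={KVM_REQ_EVENT})
    toy_enter_guest(vcpu, run_cancelled=True, with_fix=with_fix)
    toy_enter_guest(vcpu, run_cancelled=False, with_fix=with_fix)
    return vcpu.event_delivered
```

In this model simulate(with_fix=False) never delivers the event, because the second call sees no pending request, while simulate(with_fix=True) delivers it on the second call.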

Even though the above commit has a Fixes: tag to a commit in 3.15-rc1,
in my testing the 4.15 kernel with a Bionic-ussuri userspace does not
reproduce the issue, so an SRU to Bionic is not needed.

[Testcase]

A Cascade Lake based Xeon server is required. The bug will not reproduce
on anything else.

I used a c5.metal server on AWS. It has the following processor:
Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
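
To check a candidate host before setting up the testcase, a sketch like the following parses /proc/cpuinfo text. The function names are hypothetical, and the identification rule (GenuineIntel, family 6, model 85, stepping 5 to 7) is an assumption about how Cascade Lake server parts report themselves; verify it against Intel documentation before relying on it.

```python
def parse_first_cpu(cpuinfo_text):
    """Return the key/value fields of the first processor block."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def looks_like_cascade_lake(cpuinfo_text):
    """Heuristic only: see the stated assumption about family/model/stepping."""
    cpu = parse_first_cpu(cpuinfo_text)
    try:
        family = int(cpu["cpu family"])
        model = int(cpu["model"])
        stepping = int(cpu["stepping"])
    except (KeyError, ValueError):
        return False
    return (
        cpu.get("vendor_id") == "GenuineIntel"
        and family == 6
        and model == 85
        and stepping in (5, 6, 7)
    )
```

On the host, pass it the contents of /proc/cpuinfo, for example looks_like_cascade_lake(open("/proc/cpuinfo").read()).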

Install a KVM stack, and ubuntu-desktop. Set up xrdp and confirm you can
reach the desktop. Copy a Windows Server 2k19 image to the destination
server, as well as a recent ISO image of virtio drivers.

Install virt-manager.

Create a new virtual machine using the Windows 2k19 defaults. Use 8
vcpus and 16 GB of RAM. Click the customise button to change settings
before install.

Change the hard disk to be SATA, and attach a new CD-ROM drive for the
virtio drivers. Change networking to virtio. Change the CPU to
Cascadelake-Server-noTSX.

Edit the virsh XML, and ensure you set the following features for the CPU:

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Cascadelake-Server-noTSX</model>
    <topology sockets='8' cores='1' threads='1'/>
    <feature policy='require' name='invpcid'/>
    <feature policy='require' name='pcid'/>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='mpx'/>
    <feature policy='require' name='pku'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='rdctl-no'/>
    <feature policy='require' name='ibrs-all'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='mds-no'/>
  </cpu>

Those settings are an absolute must.
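
Since the settings above have to match exactly, a small checker can help confirm the guest definition before booting. This is a sketch using only the Python standard library; cpu_config_problems is a hypothetical helper, and the expected values are copied from the XML above. It does not check the topology line.

```python
import xml.etree.ElementTree as ET

REQUIRED_MODEL = "Cascadelake-Server-noTSX"
# Feature name -> policy, copied from the <cpu> block in this testcase.
EXPECTED_FEATURES = {
    "invpcid": "require",
    "pcid": "require",
    "vmx": "require",
    "hypervisor": "require",
    "mpx": "disable",
    "pku": "require",
    "arch-capabilities": "require",
    "rdctl-no": "require",
    "ibrs-all": "require",
    "skip-l1dfl-vmentry": "require",
    "mds-no": "require",
}


def cpu_config_problems(domain_xml):
    """Return a list of mismatches between the XML and the testcase settings."""
    root = ET.fromstring(domain_xml)
    # Accept either a full <domain> document or a bare <cpu> fragment.
    cpu = root if root.tag == "cpu" else root.find("cpu")
    if cpu is None:
        return ["no <cpu> element found"]
    problems = []
    model = cpu.findtext("model")
    if model != REQUIRED_MODEL:
        problems.append(f"model: expected {REQUIRED_MODEL!r}, found {model!r}")
    found = {f.get("name"): f.get("policy") for f in cpu.findall("feature")}
    for name, policy in EXPECTED_FEATURES.items():
        if found.get(name) != policy:
            problems.append(
                f"feature {name}: expected policy {policy!r}, "
                f"found {found.get(name)!r}"
            )
    return problems
```

Feed it the output of virsh dumpxml for the guest; an empty list means the model and all listed feature policies match.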

Boot the VM, and install Windows 2k19 with the desktop environment. Once
it is installed, open up Computer Management > Device Manager and
install drivers from the virtio ISO for missing hardware, likely the
network and balloon devices.

From there, go to server manager, and install the hyper-v role.

Reboot the server. It will reboot a few times, and on the final reboot it
will lock up before it reaches the login screen.

In virt-manager, go to the performance tab. The CPU will be stuck at
100%. The Windows guest will be unresponsive.

A patched kernel is available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf296306-test

If you install this kernel and boot the Windows 2k19 guest, it will come
up normally when the hyper-v role is enabled, and you will be able to
log in.

[Where problems could occur]

This is a change to a core part of the kvm subsystem, so there is
potential for regression which could affect all users of KVM.

If a regression were to occur, there are no workarounds. Users would
need to downgrade their kernel while a fix is developed.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Focal)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress


** Tags: focal sts

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Focal)
       Status: New => In Progress

** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Focal)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Tags added: focal sts

Title:
  kvm: Windows 2k19 with Hyper-v role gets stuck on pending hypervisor
  requests on cascadelake based kvm hosts
