Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

Radim Krčmář Thu, 16 Mar 2017 07:09:12 -0700

2017-03-16 09:24-0400, Gabriel L. Somlo:
> On Thu, Mar 16, 2017 at 01:41:28AM +0200, Michael S. Tsirkin wrote:
> > On Wed, Mar 15, 2017 at 07:35:34PM -0400, Gabriel L. Somlo wrote:
> > > On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> > > > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> > > > unless explicitly provided with kernel command line argument
> > > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> > > > without checking CPUID.
> > > > 
> > > > We currently emulate that as a NOP but on VMX we can do better: let
> > > > guest stop the CPU until timer, IPI or memory change.  CPU will be busy
> > > > but that isn't any worse than a NOP emulation.
> > > > 
> > > > Note that mwait within guests is not the same as on real hardware
> > > > because halt causes an exit while mwait doesn't.  For this reason it
> > > > might not be a good idea to use the regular MWAIT flag in CPUID to
> > > > signal this capability.  Add a flag in the hypervisor leaf instead.
> > > > 
> > > > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> > > > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> > > > to improve guest behaviour.
> > > 
> > > Same behavior (on the mac pro 1,1 running F22 with custom-compiled
> > > kernel from kvm git master, plus this patch on top).
> > > 
> > > The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
> > > on boot, does not bring up guest graphical interface within the first
> > > 10 minutes that I waited for it. That, in contrast with the default
> > > nop-based emulation where the guest comes up within 30 seconds.
> > 
> > 
> > Thanks a lot, meanwhile I'll try to write a unit-test and experiment
> > with various behaviours.
> > 
> > > I will run another round of tests on a newer Mac (4-year-old macbook
> > > air) and report back tomorrow.
> > > 
> > > Going off on a tangent, why would encouraging otherwise well-behaved
> > > guests (like linux ones, for example) to use MWAIT be desirable to
> > > begin with ? Is it a matter of minimizing the overhead associated with
> > > exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
> > > running guest-mode MWAIT in a tight loop will actually waste the host
> > > CPU without the opportunity to yield to some other L0 thread. Sorry if
> > > I fell into the middle of an ongoing conversation on this and missed
> > > most of the relevant context, in which case please feel free to ignore
> > > me... :)
> > > 
> > > Thanks,
> > > --G
> > 
> > It's just some experiments I'm running, I'm not ready to describe it
> > yet. I thought this part might be useful to at least some guests, so
> > trying to upstream it right now.
> 
> OK, so on a macbook air running F25 and the latest kvm git master plus
> your v5 patch (4.11.0-rc2+), things appear to work.
> 
> host-side cpuid output:
> eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x021120
> 
> guest-side cpuid output:
> eax=00000000 ebx=00000000 ecx=0x000003 edx=00000000
> 
> processor     : 3
> vendor_id     : GenuineIntel
> cpu family    : 6
> model         : 42
> model name    : Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz
> stepping      : 7
> microcode     : 0x29
> cpu MHz               : 1157.849
> cache size    : 4096 KB
> physical id   : 0
> siblings      : 4
> core id               : 1
> cpu cores     : 2
> apicid                : 3
> initial apicid        : 3
> fpu           : yes
> fpu_exception : yes
> cpuid level   : 13
> wp            : yes
> flags         : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp 
> lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc 
> cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 
> cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave 
> avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat 
> pln pts
> bugs          :
> bogomips      : 3604.68
> clflush size  : 64
> cache_alignment       : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
> 
> After studying your patch a bit more carefully (sorry, it's crazy
> around here right now :) ) I realized you're simply trying to
> (selectively) decide when to exit L1 and emulate as NOP vs. when to
> just allow L1 to execute MONITOR & MWAIT natively.
> 
> Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> natively was one of the options Alex Graf and Rene Rebe used back in
> the very early days of OS X on QEMU, at the time I got involved with
> that project. Here's part of an out of tree patch against 3.4 which did
> just that, and worked as far as I remember on *any* MWAIT capable
> intel chip I had access to back in 2010:
> 
> ##############################################################################
> # 99-mwait.patch.kvm-kmod (Rene Rebe <r...@exactcode.de>) 2010-04-27
> ##############################################################################
> diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> --- linux-3.4/arch/x86/kvm/cpuid.c    2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/cpuid.c        2012-10-09 11:42:59.921215750 
> -0400
> @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
>               f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
>               F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
>               0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
>       /* cpuid 1.ecx */
>       const u32 kvm_supported_word4_x86_features =
> -             F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> +             F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
>               0 /* DS-CPL, VMX, SMX, EST */ |
>               0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
>               F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
>               0 /* Reserved, DCA */ | F(XMM4_1) |
>               F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> --- linux-3.4/arch/x86/kvm/svm.c      2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/svm.c  2012-10-09 11:44:41.598997481 -0400
> @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
>       set_intercept(svm, INTERCEPT_VMSAVE);
>       set_intercept(svm, INTERCEPT_STGI);
>       set_intercept(svm, INTERCEPT_CLGI);
>       set_intercept(svm, INTERCEPT_SKINIT);
>       set_intercept(svm, INTERCEPT_WBINVD);
> -     set_intercept(svm, INTERCEPT_MONITOR);
> -     set_intercept(svm, INTERCEPT_MWAIT);
>       set_intercept(svm, INTERCEPT_XSETBV);
>  
>       control->iopm_base_pa = iopm_base;
>       control->msrpm_base_pa = __pa(svm->msrpm);
>       control->int_ctl = V_INTR_MASKING_MASK;
> diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> --- linux-3.4/arch/x86/kvm/vmx.c      2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/vmx.c  2012-10-09 11:42:59.925215977 -0400
> @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
>               nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
>       nested_vmx_procbased_ctls_low = 0;
>       nested_vmx_procbased_ctls_high &=
>               CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
>               CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> -             CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> +             CPU_BASED_CR3_LOAD_EXITING |
>               CPU_BASED_CR3_STORE_EXITING |
>  #ifdef CONFIG_X86_64
>               CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
>  #endif
>               CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
>             CPU_BASED_CR3_LOAD_EXITING |
>             CPU_BASED_CR3_STORE_EXITING |
>             CPU_BASED_USE_IO_BITMAPS |
>             CPU_BASED_MOV_DR_EXITING |
>             CPU_BASED_USE_TSC_OFFSETING |
> -           CPU_BASED_MWAIT_EXITING |
> -           CPU_BASED_MONITOR_EXITING |
>             CPU_BASED_INVLPG_EXITING |
>             CPU_BASED_RDPMC_EXITING;
>  
>       opt = CPU_BASED_TPR_SHADOW |
>             CPU_BASED_USE_MSR_BITMAPS |
> 
> If all you're trying to do is (selectively) revert to this behavior,
> that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> confused at this point :)
> 
> Back in 2010, running MWAIT in L>=1  behaved 100% exactly like a NOP,
> didn't power down the physical CPU, just immediately moved on to the
> next instruction. As such, there was no power saving and no
> opportunity to yield to another L0 thread either, unlike with NOP
> emulation at L0.
> 
> Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> doing something smarter than just acting as a guest-mode NOP) ?


Probably, MWAIT in new intel chips enters power saving mode normally.

If hardware-executed MWAIT acted as a NOP in your old chip, then that
shouldn't be a problem either ...  Maybe OS X gets confused into doing
something really dumb because we do not expose the MONITOR/MWAIT feature
bit correctly.

Can you try this QEMU patch on the old hardware?

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 7aa762245a54..4b112e12188a 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -2764,10 +2764,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
         break;
     case 5:
         /* mwait info: needed for Core compatibility */
-        *eax = 0; /* Smallest monitor-line size in bytes */
-        *ebx = 0; /* Largest monitor-line size in bytes */
-        *ecx = CPUID_MWAIT_EMX | CPUID_MWAIT_IBE;
-        *edx = 0;
+        host_cpuid(index, 0, eax, ebx, ecx, edx);
         break;
     case 6:
         /* Thermal and Power Leaf */
diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 55865dbee0aa..1eb78291b093 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -360,6 +360,7 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t 
function,
         if (!kvm_irqchip_in_kernel()) {
             ret &= ~CPUID_EXT_X2APIC;
         }
+        ret |= CPUID_EXT_MONITOR;
     } else if (function == 6 && reg == R_EAX) {
         ret |= CPUID_6_EAX_ARAT; /* safe to allow because of emulated APIC */
     } else if (function == 7 && index == 0 && reg == R_EBX) {


Thanks.

Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

Reply via email to