Re: [PATCH 01/12] KVM: x86: Collect information for setting TSC scaling ratio
On Sun, Sep 27, 2015 at 10:38 PM, Haozhong Zhang wrote:
>
> The number of bits of the fractional part of the 64-bit TSC scaling
> ratio in VMX and SVM is different. This patch makes the architecture
> code collect the number of fractional bits and other related
> information into variables that can be accessed in the common code.
>
> Signed-off-by: Haozhong Zhang
> ---
>  arch/x86/include/asm/kvm_host.h | 8 ++++++++
>  arch/x86/kvm/svm.c              | 5 +++++
>  arch/x86/kvm/x86.c              | 8 ++++++++
>  3 files changed, 21 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2beee03..5b9b86e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -965,6 +965,14 @@ extern bool kvm_has_tsc_control;
>  extern u32 kvm_min_guest_tsc_khz;
>  /* maximum supported tsc_khz for guests */
>  extern u32 kvm_max_guest_tsc_khz;
> +/* number of bits of the fractional part of the TSC scaling ratio */
> +extern u8 kvm_tsc_scaling_ratio_frac_bits;
> +/* reserved bits of TSC scaling ratio (SBZ) */
> +extern u64 kvm_tsc_scaling_ratio_rsvd;
> +/* default TSC scaling ratio (= 1.0) */
> +extern u64 kvm_default_tsc_scaling_ratio;
> +/* maximum allowed value of TSC scaling ratio */
> +extern u64 kvm_max_tsc_scaling_ratio;

Do we need all 3 of kvm_max_guest_tsc_khz, kvm_max_tsc_scaling_ratio, and
kvm_tsc_scaling_ratio_rsvd? Only SVM has reserved bits, and they are used
just for complaining if the high bits are set, which can already be
expressed by kvm_max_tsc_scaling_ratio.

kvm_max_tsc_scaling_ratio seems to be write-only.

>
>  enum emulation_result {
>         EMULATE_DONE,         /* no further processing */
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 94b7d15..eff7db7 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -963,7 +963,12 @@ static __init int svm_hardware_setup(void)
>                 max = min(0x7fffULL, __scale_tsc(tsc_khz, TSC_RATIO_MAX));
>
>                 kvm_max_guest_tsc_khz = max;
> +
> +               kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX;
> +               kvm_tsc_scaling_ratio_frac_bits = 32;
> +               kvm_tsc_scaling_ratio_rsvd = TSC_RATIO_RSVD;
>         }
> +       kvm_default_tsc_scaling_ratio = TSC_RATIO_DEFAULT;
>
>         if (nested) {
>                 printk(KERN_INFO "kvm: Nested Virtualization enabled\n");
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 991466b..f888225 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -106,6 +106,14 @@ bool kvm_has_tsc_control;
>  EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
>  u32 kvm_max_guest_tsc_khz;
>  EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz);
> +u8 kvm_tsc_scaling_ratio_frac_bits;
> +EXPORT_SYMBOL_GPL(kvm_tsc_scaling_ratio_frac_bits);
> +u64 kvm_tsc_scaling_ratio_rsvd;
> +EXPORT_SYMBOL_GPL(kvm_tsc_scaling_ratio_rsvd);
> +u64 kvm_default_tsc_scaling_ratio;
> +EXPORT_SYMBOL_GPL(kvm_default_tsc_scaling_ratio);
> +u64 kvm_max_tsc_scaling_ratio;
> +EXPORT_SYMBOL_GPL(kvm_max_tsc_scaling_ratio);
>
>  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
>  static u32 tsc_tolerance_ppm = 250;
> --
> 2.4.8

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
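The frac-bits difference the review is probing: SVM's ratio format has 32 fractional bits (8.32 fixed point) while VMX's has 48 (16.48), which is why the patch parameterizes kvm_tsc_scaling_ratio_frac_bits instead of hard-coding either width. A standalone sketch of the arithmetic a common helper has to perform — helper names here are illustrative, not KVM's actual functions:

```c
#include <assert.h>
#include <stdint.h>

/* Scale a raw TSC value by a fixed-point ratio whose fractional part is
 * `frac_bits` wide (32 on SVM, 48 on VMX).  The 128-bit intermediate
 * keeps the multiply from overflowing. */
static uint64_t scale_tsc(uint64_t tsc, uint64_t ratio, unsigned frac_bits)
{
	return (uint64_t)(((unsigned __int128)tsc * ratio) >> frac_bits);
}

/* A ratio of exactly 1.0 is 1 << frac_bits, i.e. the "default" ratio. */
static uint64_t default_ratio(unsigned frac_bits)
{
	return 1ULL << frac_bits;
}
```

With 32 fractional bits, a ratio of `3ULL << 31` means 1.5, so a host TSC delta of 1000 scales to a guest delta of 1500.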
Re: [PATCH 00/12] KVM: x86: add support for VMX TSC scaling
On Sun, Sep 27, 2015 at 10:37 PM, Haozhong Zhang <haozhong.zh...@intel.com> wrote:
> This patchset adds support for the VMX TSC scaling feature, which is
> available on Intel Skylake CPUs. The specification of VMX TSC scaling
> can be found at
> http://www.intel.com/content/www/us/en/processors/timestamp-counter-scaling-virtualization-white-paper.html
>
> VMX TSC scaling allows the guest TSC, which is read by guest rdtsc(p)
> instructions, to increase at a rate that is customized by the
> hypervisor and can differ from the host TSC rate. Basically, VMX TSC
> scaling adds a 64-bit field called the TSC multiplier to the VMCS so
> that, if VMX TSC scaling is enabled, the TSC read by guest rdtsc(p)
> instructions is calculated by the following formula:
>
>     guest EDX:EAX = (Host TSC * TSC multiplier) >> 48 + VMX TSC Offset
>
> where Host TSC = Host MSR_IA32_TSC + Host MSR_IA32_TSC_ADJUST.
>
> This patchset, when cooperating with another QEMU patchset (sent in
> another email, "target-i386: save/restore vcpu's TSC rate during
> migration"), allows guest programs to observe a consistent TSC rate
> even though they are migrated among machines with different host TSC
> rates.
>
> VMX TSC scaling shares some common logic with SVM TSC scaling, which
> is already supported by KVM. Patches 1-8 move that common logic from
> the SVM code into common code. On top of them, patches 9-12 add
> VMX-specific support for VMX TSC scaling.

Reviewed-by: Eric Northup <digitale...@google.com>

> Haozhong Zhang (12):
>   KVM: x86: Collect information for setting TSC scaling ratio
>   KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
>   KVM: x86: Add a common TSC scaling function
>   KVM: x86: Replace call-back set_tsc_khz() with a common function
>   KVM: x86: Replace call-back compute_tsc_offset() with a common function
>   KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
>   KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
>   KVM: x86: Use the correct vcpu's TSC rate to compute time scale
>   KVM: VMX: Enable and initialize VMX TSC scaling
>   KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
>   KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
>   KVM: VMX: Dump TSC multiplier in dump_vmcs()
>
>  arch/x86/include/asm/kvm_host.h |  24 +++
>  arch/x86/include/asm/vmx.h      |   4 +-
>  arch/x86/kvm/lapic.c            |   5 +-
>  arch/x86/kvm/svm.c              | 113 +++--
>  arch/x86/kvm/vmx.c              |  60
>  arch/x86/kvm/x86.c              | 154 +---
>  include/linux/kvm_host.h        |  21 +-
>  7 files changed, 221 insertions(+), 160 deletions(-)
>
> --
> 2.4.8
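The cover letter's formula can be exercised end to end in plain C: first derive a 16.48 fixed-point multiplier from the guest and host frequencies, then apply `(host_tsc * multiplier) >> 48 + offset`. A minimal sketch assuming VMX's 48 fractional bits; the helper names are illustrative, not KVM's:

```c
#include <assert.h>
#include <stdint.h>

/* 16.48 fixed-point multiplier: guest_khz / host_khz scaled by 2^48. */
static uint64_t tsc_multiplier(uint64_t guest_khz, uint64_t host_khz)
{
	return (uint64_t)(((unsigned __int128)guest_khz << 48) / host_khz);
}

/* The cover letter's guest-TSC formula, minus the EDX:EAX split. */
static uint64_t guest_tsc(uint64_t host_tsc, uint64_t mult, int64_t offset)
{
	return (uint64_t)(((unsigned __int128)host_tsc * mult) >> 48) + offset;
}
```

For a 1.5 GHz guest on a 3 GHz host the multiplier is 2^48 / 2, so the guest observes the host TSC ticking at half rate, plus whatever VMX TSC offset is programmed.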
Re: [PATCH] vhost: support upto 509 memory regions
On Tue, Feb 17, 2015 at 4:32 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Feb 17, 2015 at 11:59:48AM +0100, Paolo Bonzini wrote:
>> On 17/02/2015 10:02, Michael S. Tsirkin wrote:
>>>> Increasing VHOST_MEMORY_MAX_NREGIONS from 65 to 509 to match
>>>> KVM_USER_MEM_SLOTS fixes the issue for vhost-net.
>>>>
>>>> Signed-off-by: Igor Mammedov <imamm...@redhat.com>
>>>
>>> This scares me a bit: each region is 32 bytes, so we are talking
>>> about a 16K allocation that userspace can trigger.
>>
>> What's bad with a 16K allocation?
>
> It fails when memory is fragmented.
>
>> How does kvm handle this issue?
>>
>> It doesn't.
>>
>> Paolo
>
> I'm guessing kvm doesn't do memory scans on the data path; vhost does.
>
> qemu is just doing things that the kernel didn't expect it to need.
>
> Instead, I suggest reducing the number of GPA->HVA mappings: if you
> have GPAs 1, 5, 7, map them at HVAs 11, 15, 17; then you can have one
> slot: 1 -> 11. To avoid libc reusing the memory holes, reserve them
> with MAP_NORESERVE or something like this.

This works beautifully when host virtual address bits are more plentiful
than guest physical address bits. Not all architectures have that
property, though.

> We can discuss smarter lookup algorithms, but I'd rather userspace
> didn't do things that we then have to work around in the kernel.
>
> --
> MST
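The "16K allocation" being debated is simple arithmetic: each vhost memory-region descriptor is 32 bytes, so 509 regions need just under 16KB in one physically contiguous kmalloc, which can fail on a fragmented host. A sketch with an illustrative stand-in for the uapi struct layout (field names assumed for the example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in mirroring the 32-bytes-per-region math from the thread;
 * field names are illustrative, not the exact uapi definition. */
struct region_desc {
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
	uint64_t flags_padding;
};				/* 4 x 8 = 32 bytes */

/* Size of the lookup table userspace can make the kernel allocate. */
static size_t region_table_bytes(unsigned nregions)
{
	return nregions * sizeof(struct region_desc);
}
```

At the old limit of 65 regions the table is about 2KB (an order-0 page); at 509 it is 16288 bytes, i.e. an order-2 allocation that needs four contiguous pages.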
Re: [PATCH] KVM: x86 emulator: emulate MOVNTDQ
On Fri, Jul 11, 2014 at 10:56 AM, Alex Williamson <alex.william...@redhat.com> wrote:
> Windows 8.1 guest with NVIDIA driver and GPU fails to boot with an
> emulation failure. The KVM spew suggests the fault is with lack of
> movntdq emulation (courtesy of Paolo):
>
> Code=02 00 00 b8 08 00 00 00 f3 0f 6f 44 0a f0 f3 0f 6f 4c 0a e0 66 0f
> e7 41 f0 66 0f e7 49 e0 48 83 e9 40 f3 0f 6f 44 0a 10 f3 0f 6f 0c 0a
> 66 0f e7 41 10
>
> $ as -o a.out
>         .section .text
>         .byte 0x66, 0x0f, 0xe7, 0x41, 0xf0
>         .byte 0x66, 0x0f, 0xe7, 0x49, 0xe0
> $ objdump -d a.out
>    0:   66 0f e7 41 f0          movntdq %xmm0,-0x10(%rcx)
>    5:   66 0f e7 49 e0          movntdq %xmm1,-0x20(%rcx)
>
> Add the necessary emulation.
>
> Signed-off-by: Alex Williamson <alex.william...@redhat.com>
> Cc: Paolo Bonzini <pbonz...@redhat.com>
> ---
>
> Hope I got all the flags correct from copying similar MOV ops, but it
> allows the guest to boot, so I suspect it's ok.
>
>  arch/x86/kvm/emulate.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index e4e833d..ae39f08 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -3681,6 +3681,10 @@ static const struct gprefix pfx_0f_28_0f_29 = {
>         I(Aligned, em_mov), I(Aligned, em_mov), N, N,
>  };
>
> +static const struct gprefix pfx_0f_e7 = {
> +       N, I(Sse, em_mov), N, N,

I think you need the 'Aligned' flag in here - from my reading of the
manual, this instruction will #GP if the memory operand isn't aligned.

> +};
> +
>  static const struct escape escape_d9 = { {
>         N, N, N, N, N, N, N, I(DstMem, em_fnstcw),
>  }, {
> @@ -3951,7 +3955,8 @@ static const struct opcode twobyte_table[256] = {
>         /* 0xD0 - 0xDF */
>         N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,
>         /* 0xE0 - 0xEF */
> -       N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,
> +       N, N, N, N, N, N, N, GP(SrcReg | DstMem | ModRM | Mov, pfx_0f_e7),
> +       N, N, N, N, N, N, N, N,
>         /* 0xF0 - 0xFF */
>         N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N
>  };
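The review point about the `Aligned` flag: MOVNTDQ stores a full 16-byte XMM register, and an unaligned memory operand raises #GP. The decode-flag machinery boils down to a check like this (hypothetical helper, not the emulator's actual code path):

```c
#include <assert.h>
#include <stdint.h>

/* MOVNTDQ moves 128 bits, so the effective address must be 16-byte
 * aligned; otherwise the instruction faults with #GP. */
static int movntdq_operand_ok(uint64_t ea)
{
	return (ea & 15) == 0;
}
```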
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
On Wed, Jun 18, 2014 at 7:19 AM, Nadav Amit <na...@cs.technion.ac.il> wrote:
> mwait and monitor are currently handled as nop. Considering this
> behavior, they should still be handled correctly, i.e., check execution
> conditions and generate exceptions when required. mwait and monitor may
> also be executed in real-mode and are not handled in that case. This
> patch performs the emulation of monitor-mwait according to the Intel
> SDM (other than checking whether an interrupt can be used as a break
> event).
>
> Signed-off-by: Nadav Amit <na...@cs.technion.ac.il>
> ---
>  arch/x86/kvm/emulate.c | 41 +++--
>  arch/x86/kvm/svm.c     | 22 ++
>  arch/x86/kvm/vmx.c     | 27 +++
>  3 files changed, 52 insertions(+), 38 deletions(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index ef7a5a0..424b58d 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -3344,6 +3344,43 @@ static int em_bswap(struct x86_emulate_ctxt *ctxt)
>         return X86EMUL_CONTINUE;
>  }
>
> +static int em_monitor(struct x86_emulate_ctxt *ctxt)
> +{
> +       int rc;
> +       struct segmented_address addr;
> +       u64 rcx = reg_read(ctxt, VCPU_REGS_RCX);
> +       u64 rax = reg_read(ctxt, VCPU_REGS_RAX);
> +       u8 byte;

I'd request:

        u32 ebx, ecx, edx, eax = 1;

        ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx);
        if (!(ecx & FFL(MWAIT)))
                return emulate_ud(ctxt);

and also in em_mwait.

> +
> +       if (ctxt->mode != X86EMUL_MODE_PROT64)
> +               rcx = (u32)rcx;
> +       if (rcx != 0)
> +               return emulate_gp(ctxt, 0);
> +
> +       addr.seg = seg_override(ctxt);
> +       addr.ea = ctxt->ad_bytes == 8 ? rax : (u32)rax;
> +       rc = segmented_read(ctxt, addr, &byte, 1);
> +       if (rc != X86EMUL_CONTINUE)
> +               return rc;
> +
> +       printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> +       return X86EMUL_CONTINUE;
> +}
> +
> +static int em_mwait(struct x86_emulate_ctxt *ctxt)
> +{
> +       u64 rcx = reg_read(ctxt, VCPU_REGS_RCX);
> +
> +       if (ctxt->mode != X86EMUL_MODE_PROT64)
> +               rcx = (u32)rcx;
> +       if ((rcx & ~1UL) != 0)
> +               return emulate_gp(ctxt, 0);
> +
> +       /* Accepting interrupt as break event regardless of cpuid */
> +       printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
> +       return X86EMUL_CONTINUE;
> +}
> +
>  static bool valid_cr(int nr)
>  {
>         switch (nr) {
> @@ -3557,8 +3594,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
>         F2bv(((_f) & ~Lock) | DstAcc | SrcImm, _e)
>
>  static const struct opcode group7_rm1[] = {
> -       DI(SrcNone | Priv, monitor),
> -       DI(SrcNone | Priv, mwait),
> +       II(SrcNone | Priv | NoBigReal | UDOnPriv, em_monitor, monitor),
> +       II(SrcNone | Priv | NoBigReal | UDOnPriv, em_mwait, mwait),
>         N, N, N, N, N, N,
>  };
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index ec8366c..a524e04 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3274,24 +3274,6 @@ static int pause_interception(struct vcpu_svm *svm)
>         return 1;
>  }
>
> -static int nop_interception(struct vcpu_svm *svm)
> -{
> -       skip_emulated_instruction(&(svm->vcpu));
> -       return 1;
> -}
> -
> -static int monitor_interception(struct vcpu_svm *svm)
> -{
> -       printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> -       return nop_interception(svm);
> -}
> -
> -static int mwait_interception(struct vcpu_svm *svm)
> -{
> -       printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
> -       return nop_interception(svm);
> -}
> -
>  static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
>         [SVM_EXIT_READ_CR0]                     = cr_interception,
>         [SVM_EXIT_READ_CR3]                     = cr_interception,
> @@ -3349,8 +3331,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
>         [SVM_EXIT_CLGI]                         = clgi_interception,
>         [SVM_EXIT_SKINIT]                       = skinit_interception,
>         [SVM_EXIT_WBINVD]                       = emulate_on_interception,
> -       [SVM_EXIT_MONITOR]                      = monitor_interception,
> -       [SVM_EXIT_MWAIT]                        = mwait_interception,
> +       [SVM_EXIT_MONITOR]                      = emulate_on_interception,
> +       [SVM_EXIT_MWAIT]                        = emulate_on_interception,
>         [SVM_EXIT_XSETBV]                       = xsetbv_interception,
>         [SVM_EXIT_NPF]                          = pf_interception,
>  };
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 801332e..7023e71 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -5672,22 +5672,17 @@ static int handle_pause(struct kvm_vcpu *vcpu)
>         return 1;
>  }
>
> -static int handle_nop(struct kvm_vcpu *vcpu)
> +static int handle_emulate(struct kvm_vcpu *vcpu)
>  {
> -       skip_emulated_instruction(vcpu);
> -       return
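The operand checks the patch adds, isolated from the emulator plumbing: MONITOR takes #GP on any nonzero (R)CX, while MWAIT tolerates only ECX bit 0 ("treat interrupts as break events"); outside 64-bit mode only the low 32 bits count. A sketch with numeric stand-ins for emulate_gp()/X86EMUL_CONTINUE, not the real emulator return codes:

```c
#include <assert.h>
#include <stdint.h>

enum { EMUL_CONTINUE = 0, EMUL_GP = 1 };	/* illustrative stand-ins */

static int check_monitor_rcx(uint64_t rcx, int long_mode)
{
	if (!long_mode)
		rcx = (uint32_t)rcx;	/* only ECX outside 64-bit mode */
	return rcx != 0 ? EMUL_GP : EMUL_CONTINUE;
}

static int check_mwait_rcx(uint64_t rcx, int long_mode)
{
	if (!long_mode)
		rcx = (uint32_t)rcx;
	/* bit 0 = treat interrupts as break events; the rest is reserved */
	return (rcx & ~1ULL) ? EMUL_GP : EMUL_CONTINUE;
}
```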
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
Quoting Gabriel's post http://www.spinics.net/lists/kvm/msg103792.html :

  [...] E.g., OS X 10.5 *does* check CPUID, and panics if it doesn't
  find it. It needs the MONITOR cpuid flag to be on, *and* the actual
  instructions to work.

On Wed, Jun 18, 2014 at 11:23 AM, Nadav Amit <nadav.a...@gmail.com> wrote:
> On 6/18/14, 8:59 PM, Eric Northup wrote:
>> On Wed, Jun 18, 2014 at 7:19 AM, Nadav Amit <na...@cs.technion.ac.il> wrote:
>>> mwait and monitor are currently handled as nop. [...]
>>
>> I'd request:
>>
>>         u32 ebx, ecx, edx, eax = 1;
>>
>>         ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx);
>>         if (!(ecx & FFL(MWAIT)))
>>                 return emulate_ud(ctxt);
>>
>> and also in em_mwait.
>
> I had a similar implementation in a previous version, which also
> checked on mwait whether "interrupt as break event" matches the ECX
> value. However, I was under the impression that it was decided that
> MWAIT will always be emulated as NOP to avoid misbehaving VMs that
> ignore CPUID (see the discussion at
> http://www.spinics.net/lists/kvm/msg102766.html ).
>
> Nadav
Re: [PATCH v2] kvm: x86: emulate monitor and mwait instructions as nop
On Wed, May 7, 2014 at 1:52 PM, Gabriel L. Somlo <gso...@gmail.com> wrote:
> Treat monitor and mwait instructions as nop, which is architecturally
> correct (but inefficient) behavior. We do this to prevent misbehaving
> guests (e.g. OS X <= 10.7) from crashing after they fail to check for
> monitor/mwait availability via cpuid.
>
> Since mwait-based idle loops relying on these nop-emulated instructions
> would keep the host CPU pegged at 100%, do NOT advertise their presence
> via cpuid, to prevent compliant guests from using them inadvertently.

If it's going to peg the host CPU at 100% anyway, why bother emulating
it? Just let the guest run the mwait instruction!

Have a condition that controls whether CPU_BASED_MWAIT_EXITING gets set
in the VMCS processor execution controls.

Go ahead and put it in CPUID, since it's actually allowed.

> Signed-off-by: Gabriel L. Somlo <so...@cmu.edu>
> ---
>
> New in v2: remove invalid_op handler functions which were only used to
> handle exits caused by monitor and mwait
>
> On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
>> On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
>>> If we really want to be paranoid and worry about guests that use
>>> this strange way to trigger invalid opcode, we can make it possible
>>> for userspace to enable/disable this hack, and teach qemu to set it.
>>> That would make it even safer than it was. Not sure it's worth it,
>>> just a thought.
>>
>> Since we don't trap on non-exposed other instructions (new SSE and
>> whatdoiknow) I don't think it's really bad to just expose
>> MONITOR/MWAIT as nops.
>
> So AFAICT, linux prefers to use mwait for idling if cpuid tells it
> that it's available. If we keep telling everyone that we do NOT have
> monitor and mwait available, compliant guests will never end up using
> them, and this hack would remain completely invisible to them, which
> is good (better to use hlt-based idle loops when you're a vm guest,
> that would actually allow the host to relax while you're halted :)
>
> So the only time anyone would be able to tell we have this hack would
> be when they're about to receive an invalid opcode for using
> monitor/mwait in violation of what CPUID (would have) told them.
> That's what happens to OS X prior to 10.8, which is when I'm
> hypothesizing the Apple devs began to seriously think about their OS
> running as a vm guest (on fusion and parallels)...
>
> Instead of killing the misbehaving guest with an invalid opcode, we'd
> allow them to peg the host CPU with their monitor == mwait == nop idle
> loop instead, which, at least on OS X, should be tolerable long enough
> to run 'rm -rf System/Library/Extensions/AppleIntelCPUPowerManagement.kext'
> and reboot the guest, after which things would settle down by
> reverting the guest to a hlt-based idle loop.
>
> The only reason I can think of to add functionality for
> enabling/disabling this hack would be to protect against a malicious
> guest which would use mwait *on purpose* to peg the host CPU. But a
> malicious guest could just run for(;;); in ring 0 and accomplish the
> same goal, so we wouldn't really gain anything in exchange for the
> added complexity...
>
> Thanks,
>    Gabriel
>
>  arch/x86/kvm/cpuid.c |  2 ++
>  arch/x86/kvm/svm.c   | 28
>  arch/x86/kvm/vmx.c   | 20
>  3 files changed, 38 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index f47a104..d094fc6 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>                 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
>
>         /* cpuid 1.ecx */
>         const u32 kvm_supported_word4_x86_features =
> +               /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> +                * but *not* advertised to guests via CPUID ! */
>                 F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
>                 0 /* DS-CPL, VMX, SMX, EST */ |
>                 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 7f4f9c2..0b7d58d 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2770,12 +2770,6 @@ static int xsetbv_interception(struct vcpu_svm *svm)
>         return 1;
>  }
>
> -static int invalid_op_interception(struct vcpu_svm *svm)
> -{
> -       kvm_queue_exception(&svm->vcpu, UD_VECTOR);
> -       return 1;
> -}
> -
>  static int task_switch_interception(struct vcpu_svm *svm)
>  {
>         u16 tss_selector;
> @@ -3287,6 +3281,24 @@ static int pause_interception(struct vcpu_svm *svm)
>         return 1;
>  }
>
> +static int nop_interception(struct vcpu_svm *svm)
> +{
> +       skip_emulated_instruction(&(svm->vcpu));
> +       return 1;
> +}
> +
> +static int monitor_interception(struct vcpu_svm *svm)
> +{
> +       printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> +       return nop_interception(svm);
> +}
> +
> +static int
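The CPUID dance both sides of the thread rely on: a compliant guest tests CPUID.01H:ECX bit 3 (MONITOR) before touching MONITOR/MWAIT, which is exactly the bit this patch deliberately keeps clear. A guest-side sketch operating on an already-read ECX value, with no real CPUID executed:

```c
#include <assert.h>
#include <stdint.h>

#define CPUID1_ECX_MONITOR (1u << 3)	/* CPUID.01H:ECX.MONITOR */

/* A well-behaved guest consults this before using mwait for idling. */
static int guest_may_use_mwait(uint32_t cpuid1_ecx)
{
	return (cpuid1_ecx & CPUID1_ECX_MONITOR) != 0;
}
```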
Re: [PATCH 4/4] kvm: Implement PEBS virtualization
On Thu, May 29, 2014 at 6:12 PM, Andi Kleen <a...@firstfloor.org> wrote:
> From: Andi Kleen <a...@linux.intel.com>
>
> PEBS (Precise Event Based Sampling) profiling is very powerful,
> allowing improved sampling precision and much additional information,
> like address or TSX abort profiling. cycles:p and :pp uses PEBS.
>
> This patch enables PEBS profiling in KVM guests.
>
> PEBS writes profiling records to a virtual address in memory. Since
> the guest controls the virtual address space, the PEBS record is
> directly delivered to the guest buffer, and we set up the PEBS state
> so that it works correctly. The CPU cannot handle any kinds of faults
> during these guest writes.
>
> To avoid any problems with guest pages being swapped by the host, we
> pin the pages when the PEBS buffer is set up, by intercepting that
> MSR. Typically profilers only set up a single page, so pinning that is
> not a big problem. The pinning is limited to 17 pages currently
> (64K+1).
>
> In theory the guest can change its own page tables after the PEBS
> setup. The host has no way to track that with EPT. But if a guest
> would do that it could only crash itself. It's not expected that
> normal profilers do that.
>
> The patch also adds the basic glue to enable the PEBS CPUIDs and other
> PEBS MSRs, and ask perf to enable PEBS as needed.
>
> Due to various limitations it currently only works on Silvermont
> based systems.
>
> This patch doesn't implement the extended MSRs some CPUs support. For
> example latency profiling on SLM will not work at this point.
>
> Timing:
>
> The emulation is somewhat more expensive than a real PMU. This may
> trigger the expensive PMI detection in the guest. Usually this can be
> disabled with
>
>   echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
>
> Migration:
>
> In theory it should be possible (as long as we migrate to a host with
> the same PEBS event and the same PEBS format), but I'm not sure the
> basic KVM PMU code supports it correctly: no code to save/restore
> state, unless I'm missing something. Once the PMU code grows proper
> migration support it should be straightforward to handle the PEBS
> state too.
>
> Signed-off-by: Andi Kleen <a...@linux.intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h       |   6 ++
>  arch/x86/include/uapi/asm/msr-index.h |   4 +
>  arch/x86/kvm/cpuid.c                  |  10 +-
>  arch/x86/kvm/pmu.c                    | 184 --
>  arch/x86/kvm/vmx.c                    |   6 ++
>  5 files changed, 196 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7de069af..d87cb66 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -319,6 +319,8 @@ struct kvm_pmc {
>         struct kvm_vcpu *vcpu;
>  };
>
> +#define MAX_PINNED_PAGES 17 /* 64k buffer + ds */
> +
>  struct kvm_pmu {
>         unsigned nr_arch_gp_counters;
>         unsigned nr_arch_fixed_counters;
> @@ -335,6 +337,10 @@ struct kvm_pmu {
>         struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
>         struct irq_work irq_work;
>         u64 reprogram_pmi;
> +       u64 pebs_enable;
> +       u64 ds_area;
> +       struct page *pinned_pages[MAX_PINNED_PAGES];
> +       unsigned num_pinned_pages;
>  };
>
>  enum {
> diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
> index fcf2b3a..409a582 100644
> --- a/arch/x86/include/uapi/asm/msr-index.h
> +++ b/arch/x86/include/uapi/asm/msr-index.h
> @@ -72,6 +72,10 @@
>  #define MSR_IA32_PEBS_ENABLE            0x000003f1
>  #define MSR_IA32_DS_AREA                0x00000600
>  #define MSR_IA32_PERF_CAPABILITIES      0x00000345
> +#define PERF_CAP_PEBS_TRAP              (1U << 6)
> +#define PERF_CAP_ARCH_REG               (1U << 7)
> +#define PERF_CAP_PEBS_FORMAT            (0xf << 8)
> +
>  #define MSR_PEBS_LD_LAT_THRESHOLD       0x000003f6
>
>  #define MSR_MTRRfix64K_00000            0x00000250
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index f47a104..c8cc76b 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -260,6 +260,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>         unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
>         unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0;
>         unsigned f_mpx = kvm_x86_ops->mpx_supported() ? F(MPX) : 0;
> +       bool pebs = perf_pebs_virtualization();
> +       unsigned f_ds = pebs ? F(DS) : 0;
> +       unsigned f_pdcm = pebs ? F(PDCM) : 0;
> +       unsigned f_dtes64 = pebs ? F(DTES64) : 0;
>
>         /* cpuid 1.edx */
>         const u32 kvm_supported_word0_x86_features =
> @@ -268,7 +272,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>                 F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) |
>                 F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
>                 F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLUSH) |
> -               0 /* Reserved, DS, ACPI */ | F(MMX) |
> +               f_ds /*
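Where MAX_PINNED_PAGES = 17 comes from: the largest PEBS buffer the patch handles is 64KB, which spans 16 4KB pages, plus one more page for the DS save area. The round-up arithmetic as a standalone sketch:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_4K 4096

/* Pages to pin: the PEBS buffer rounded up to whole pages, plus one
 * page for the DS management area. */
static unsigned pebs_pinned_pages(size_t buffer_bytes)
{
	return (unsigned)((buffer_bytes + PAGE_4K - 1) / PAGE_4K) + 1;
}
```

Even the common single-page profiler buffer pins two pages (the buffer itself plus the DS area), which stays well under the 17-page cap.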
Re: [PATCH v4 00/12] arm/arm64: KVM: host cache maintenance when guest caches are off
On Tue, Feb 18, 2014 at 7:27 AM, Marc Zyngier <marc.zyng...@arm.com> wrote:
> When we run a guest with cache disabled, we don't flush the cache to
> the Point of Coherency, hence possibly missing bits of data that have
> been written in the cache, but have not yet reached memory.
>
> We also have the opposite issue: when a guest enables its cache,
> whatever sits in the cache is suddenly going to become visible,
> shadowing whatever the guest has written into RAM.
>
> There are several approaches to these issues:
>
> - Using the DC bit when caches are off: this breaks guests assuming
>   caches off while doing DMA operations. Bootloaders, for example.
>   It also breaks the I-D coherency.
>
> - Fetch the memory attributes on translation fault, and flush the
>   cache while handling the fault. This relies on using the PAR_EL1
>   register to obtain the Stage-1 memory attributes, and tends to be
>   slow.
>
> - Detecting the translation faults occurring with MMU off (and
>   performing a cache clean), and trapping SCTLR_EL1 to detect the
>   moment when the guest is turning its caches on (and performing a
>   cache invalidation). Trapping of SCTLR_EL1 is then disabled to
>   ensure the best performance.

This will preclude noticing the 2nd .. Nth cache off -> on cycles,
right? Will any guests care - doesn't kexec go through a caches-off
state?

> This patch series implements the last solution, for both arm and
> arm64. Tested on TC2 (ARMv7) and FVP model (ARMv8).
>
> From v3 (http://www.spinics.net/lists/arm-kernel/msg305211.html):
> - Dropped the LPAE-specific pmd_addr_end
> - Added kvm_p[gum]d_addr_end to deal with 40bit IPAs, and fixed the
>   callers of p[gum]d_addr_end with IPA parameters
> - Added patch #12 which, while not strictly related, felt a bit
>   lonely on the mailing list
>
> From v2 (http://www.spinics.net/lists/arm-kernel/msg302472.html):
> - Addressed most (hopefully all) of Christoffer's comments
> - Added a new LPAE pmd_addr_end to deal with 40bit IPAs
>
> From v1 (http://www.spinics.net/lists/kvm/msg99404.html):
> - Fixed AArch32 VM handling on arm64 (Reported by Anup)
> - Added ARMv7 support:
>   * Fixed a couple of issues regarding handling of 64bit cp15 regs
>   * Per-vcpu HCR
>   * Switching of AMAIR0 and AMAIR1
>
> Marc Zyngier (12):
>   arm64: KVM: force cache clean on page fault when caches are off
>   arm64: KVM: allows discrimination of AArch32 sysreg access
>   arm64: KVM: trap VM system registers until MMU and caches are ON
>   ARM: KVM: introduce kvm_p*d_addr_end
>   arm64: KVM: flush VM pages before letting the guest enable caches
>   ARM: KVM: force cache clean on page fault when caches are off
>   ARM: KVM: fix handling of trapped 64bit coprocessor accesses
>   ARM: KVM: fix ordering of 64bit coprocessor accesses
>   ARM: KVM: introduce per-vcpu HYP Configuration Register
>   ARM: KVM: add world-switch for AMAIR{0,1}
>   ARM: KVM: trap VM system registers until MMU and caches are ON
>   ARM: KVM: fix warning in mmu.c
>
>  arch/arm/include/asm/kvm_arm.h   |   4 +-
>  arch/arm/include/asm/kvm_asm.h   |   4 +-
>  arch/arm/include/asm/kvm_host.h  |   9 ++--
>  arch/arm/include/asm/kvm_mmu.h   |  29 +--
>  arch/arm/kernel/asm-offsets.c    |   1 +
>  arch/arm/kvm/coproc.c            |  84 +++---
>  arch/arm/kvm/coproc.h            |  14 +++--
>  arch/arm/kvm/coproc_a15.c        |   2 +-
>  arch/arm/kvm/coproc_a7.c         |   2 +-
>  arch/arm/kvm/guest.c             |   1 +
>  arch/arm/kvm/interrupts_head.S   |  21 +---
>  arch/arm/kvm/mmu.c               | 110 ---
>  arch/arm64/include/asm/kvm_arm.h |   3 +-
>  arch/arm64/include/asm/kvm_asm.h |   3 +-
>  arch/arm64/include/asm/kvm_mmu.h |  21 ++--
>  arch/arm64/kvm/sys_regs.c        |  99 ++-
>  arch/arm64/kvm/sys_regs.h        |   2 +
>  17 files changed, 341 insertions(+), 68 deletions(-)
>
> --
> 1.8.3.4
Re: [PATCH V12 3/5] kvm : Fold pv_unhalt flag into GET_MP_STATE ioctl to aid migration
On Tue, Aug 6, 2013 at 11:23 AM, Raghavendra K T
<raghavendra...@linux.vnet.ibm.com> wrote:
> kvm : Fold pv_unhalt flag into GET_MP_STATE ioctl to aid migration
>
> From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
>
> During migration, any vcpu that got kicked but did not become runnable
> (still in halted state) should be runnable after migration.

If this is about migration correctness, could it get folded into the
previous patch 2/5, so that there's not a broken commit which could
hurt bisection?

> Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
> Acked-by: Gleb Natapov <g...@redhat.com>
> Acked-by: Ingo Molnar <mi...@kernel.org>
> ---
>  arch/x86/kvm/x86.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index dae4575..1e73dab 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6284,7 +6284,12 @@ int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
>                                     struct kvm_mp_state *mp_state)
>  {
>         kvm_apic_accept_events(vcpu);
> -       mp_state->mp_state = vcpu->arch.mp_state;
> +       if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED &&
> +                                       vcpu->arch.pv.pv_unhalted)
> +               mp_state->mp_state = KVM_MP_STATE_RUNNABLE;
> +       else
> +               mp_state->mp_state = vcpu->arch.mp_state;
> +
>         return 0;
>  }
>
> ___
> Virtualization mailing list
> virtualizat...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
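The patch's logic in isolation: a vcpu that was kicked (pv_unhalted set) while HALTED must be reported as RUNNABLE by GET_MP_STATE, so that it resumes runnable on the destination host. A sketch with numeric stand-ins for the KVM_MP_STATE_* constants:

```c
#include <assert.h>

enum { MP_STATE_RUNNABLE = 0, MP_STATE_HALTED = 3 };	/* illustrative values */

/* What GET_MP_STATE should report after the fix. */
static int reported_mp_state(int mp_state, int pv_unhalted)
{
	if (mp_state == MP_STATE_HALTED && pv_unhalted)
		return MP_STATE_RUNNABLE;	/* pending kick: stay runnable */
	return mp_state;
}
```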
Re: [RFC PATCH 1/2] Hyper-V reference counter
On Mon, May 13, 2013 at 4:45 AM, Vadim Rozenfeld vroze...@redhat.com wrote: Signed-off: Peter Lieven p...@dlh.net Signed-off: Gleb Natapov g...@redhat.com Signed-off: Vadim Rozenfeld vroze...@redhat.com The following patch allows to activate Hyper-V reference time counter --- arch/x86/include/asm/kvm_host.h| 2 ++ arch/x86/include/uapi/asm/hyperv.h | 3 +++ arch/x86/kvm/x86.c | 25 - 3 files changed, 29 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3741c65..f0fee35 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -575,6 +575,8 @@ struct kvm_arch { /* fields used by HYPER-V emulation */ u64 hv_guest_os_id; u64 hv_hypercall; + u64 hv_ref_count; + u64 hv_tsc_page; #ifdef CONFIG_KVM_MMU_AUDIT int audit_point; diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h index b80420b..9711819 100644 --- a/arch/x86/include/uapi/asm/hyperv.h +++ b/arch/x86/include/uapi/asm/hyperv.h @@ -136,6 +136,9 @@ /* MSR used to read the per-partition time reference counter */ #define HV_X64_MSR_TIME_REF_COUNT 0x4020 +/* A partition's reference time stamp counter (TSC) page */ +#define HV_X64_MSR_REFERENCE_TSC 0x4021 + /* Define the virtual APIC registers */ #define HV_X64_MSR_EOI 0x4070 #define HV_X64_MSR_ICR 0x4071 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 094b5d9..1a4036d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -843,7 +843,7 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc); static u32 msrs_to_save[] = { MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW, - HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, + HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_KVM_PV_EOI_EN, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, @@ -1764,6 +1764,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr) 
switch (msr) {
	case HV_X64_MSR_GUEST_OS_ID:
	case HV_X64_MSR_HYPERCALL:
+	case HV_X64_MSR_REFERENCE_TSC:
+	case HV_X64_MSR_TIME_REF_COUNT:
		r = true;
		break;
	}
@@ -1803,6 +1805,21 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
		if (__copy_to_user((void __user *)addr, instructions, 4))
			return 1;
		kvm->arch.hv_hypercall = data;
+		kvm->arch.hv_ref_count = get_kernel_ns();
+		break;
+	}
+	case HV_X64_MSR_REFERENCE_TSC: {
+		u64 gfn;
+		unsigned long addr;
+		u32 tsc_ref;
+		gfn = data >> HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT;
+		addr = gfn_to_hva(kvm, gfn);
+		if (kvm_is_error_hva(addr))
+			return 1;
+		tsc_ref = 0;
+		if (__copy_to_user((void __user *)addr, &tsc_ref, sizeof(tsc_ref)))

Does this do the right thing when we're migrating? How does usermode learn that the guest page has been dirtied?

+			return 1;
+		kvm->arch.hv_tsc_page = data;
		break;
	}
	default:
@@ -2229,6 +2246,12 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
	case HV_X64_MSR_HYPERCALL:
		data = kvm->arch.hv_hypercall;
		break;
+	case HV_X64_MSR_TIME_REF_COUNT:
+		data = div_u64(get_kernel_ns() - kvm->arch.hv_ref_count, 100);
+		break;
+	case HV_X64_MSR_REFERENCE_TSC:
+		data = kvm->arch.hv_tsc_page;
+		break;
	default:
		vcpu_unimpl(vcpu, "Hyper-V unhandled rdmsr: 0x%x\n", msr);
		return 1;
--
1.8.1.2
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
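As a side note on the arithmetic: HV_X64_MSR_TIME_REF_COUNT is defined to tick in 100 ns units, which is why the read handler divides the elapsed kernel time by 100. A minimal user-space sketch of that conversion (the function name and the idea of passing both timestamps in are illustrative; in the patch the reference point lives in kvm->arch.hv_ref_count):

```c
#include <stdint.h>

/* Convert an elapsed interval in nanoseconds into Hyper-V reference
 * counter ticks (100 ns units). 'now_ns' and 'ref_ns' stand in for
 * get_kernel_ns() and the stored hv_ref_count. */
static uint64_t hv_ref_count_read(uint64_t now_ns, uint64_t ref_ns)
{
    return (now_ns - ref_ns) / 100;   /* ns -> 100 ns ticks */
}
```

Note the truncation: sub-100 ns remainders are dropped on each read, which is fine because the subtraction always starts from the same reference point.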
Re: [PATCH] kvm tools: virtio-net mergable rx buffers
Do you care about guests with drivers that don't negotiate VIRTIO_NET_F_MRG_RXBUF? On Mon, Apr 22, 2013 at 5:32 PM, Sasha Levin sasha.le...@oracle.com wrote: Support mergable rx buffers for virtio-net. This helps reduce the amount of memory the guest kernel has to allocate per rx vq. Signed-off-by: Sasha Levin sasha.le...@oracle.com --- tools/kvm/include/kvm/uip.h | 4 ++-- tools/kvm/include/kvm/util.h | 3 +++ tools/kvm/net/uip/core.c | 54 +--- tools/kvm/net/uip/tcp.c | 2 +- tools/kvm/net/uip/udp.c | 2 +- tools/kvm/util/util.c| 15 tools/kvm/virtio/net.c | 37 ++ 7 files changed, 55 insertions(+), 62 deletions(-) diff --git a/tools/kvm/include/kvm/uip.h b/tools/kvm/include/kvm/uip.h index ac248d2..fa82f10 100644 --- a/tools/kvm/include/kvm/uip.h +++ b/tools/kvm/include/kvm/uip.h @@ -252,7 +252,7 @@ struct uip_tcp_socket { }; struct uip_tx_arg { - struct virtio_net_hdr *vnet; + struct virtio_net_hdr_mrg_rxbuf *vnet; struct uip_info *info; struct uip_eth *eth; int vnet_len; @@ -332,7 +332,7 @@ static inline u16 uip_eth_hdrlen(struct uip_eth *eth) } int uip_tx(struct iovec *iov, u16 out, struct uip_info *info); -int uip_rx(struct iovec *iov, u16 in, struct uip_info *info); +int uip_rx(unsigned char *buffer, u32 length, struct uip_info *info); int uip_init(struct uip_info *info); int uip_tx_do_ipv4_udp_dhcp(struct uip_tx_arg *arg); diff --git a/tools/kvm/include/kvm/util.h b/tools/kvm/include/kvm/util.h index 0df9f0d..6f8ac83 100644 --- a/tools/kvm/include/kvm/util.h +++ b/tools/kvm/include/kvm/util.h @@ -22,6 +22,7 @@ #include sys/param.h #include sys/types.h #include linux/types.h +#include sys/uio.h #ifdef __GNUC__ #define NORETURN __attribute__((__noreturn__)) @@ -94,4 +95,6 @@ struct kvm; void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size); void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *hugetlbfs_path, u64 size); +int memcpy_toiovecend(const struct iovec *iov, int iovlen, unsigned char *kdata, size_t len); + #endif /* KVM__UTIL_H */ 
diff --git a/tools/kvm/net/uip/core.c b/tools/kvm/net/uip/core.c index 4e5bb82..d9e9993 100644 --- a/tools/kvm/net/uip/core.c +++ b/tools/kvm/net/uip/core.c @@ -7,7 +7,7 @@ int uip_tx(struct iovec *iov, u16 out, struct uip_info *info) { - struct virtio_net_hdr *vnet; + struct virtio_net_hdr_mrg_rxbuf *vnet; struct uip_tx_arg arg; int eth_len, vnet_len; struct uip_eth *eth; @@ -74,63 +74,21 @@ int uip_tx(struct iovec *iov, u16 out, struct uip_info *info) return vnet_len + eth_len; } -int uip_rx(struct iovec *iov, u16 in, struct uip_info *info) +int uip_rx(unsigned char *buffer, u32 length, struct uip_info *info) { - struct virtio_net_hdr *vnet; - struct uip_eth *eth; struct uip_buf *buf; - int vnet_len; - int eth_len; - char *p; int len; - int cnt; - int i; /* * Sleep until there is a buffer for guest */ buf = uip_buf_get_used(info); - /* -* Fill device to guest buffer, vnet hdr fisrt -*/ - vnet_len = iov[0].iov_len; - vnet = iov[0].iov_base; - if (buf-vnet_len vnet_len) { - len = -1; - goto out; - } - memcpy(vnet, buf-vnet, buf-vnet_len); - - /* -* Then, the real eth data -* Note: Be sure buf-eth_len is not bigger than the buffer len that guest provides -*/ - cnt = buf-eth_len; - p = buf-eth; - for (i = 1; i in; i++) { - eth_len = iov[i].iov_len; - eth = iov[i].iov_base; - if (cnt eth_len) { - memcpy(eth, p, eth_len); - cnt -= eth_len; - p += eth_len; - } else { - memcpy(eth, p, cnt); - cnt -= cnt; - break; - } - } - - if (cnt) { - pr_warning(uip_rx error); - len = -1; - goto out; - } + memcpy(buffer, buf-vnet, buf-vnet_len); + memcpy(buffer + buf-vnet_len, buf-eth, buf-eth_len); len = buf-vnet_len + buf-eth_len; -out: uip_buf_set_free(info, buf); return len; } @@ -172,8 +130,8 @@ int uip_init(struct uip_info *info) } list_for_each_entry(buf, buf_head, list) { - buf-vnet = malloc(sizeof(struct virtio_net_hdr)); - buf-vnet_len = sizeof(struct virtio_net_hdr); + buf-vnet = malloc(sizeof(struct virtio_net_hdr_mrg_rxbuf)); + buf-vnet_len = sizeof(struct 
virtio_net_hdr_mrg_rxbuf);
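The patch adds a memcpy_toiovecend() helper to tools/kvm/util. A self-contained sketch of that kind of helper, scattering a flat buffer across an iovec array, is below (the name copy_to_iovec and the exact return-value convention are assumptions, not the tools/kvm version):

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Copy 'len' bytes of kdata into the iovec array, filling each
 * element in turn. Returns 0 on success, -1 if the iovec has less
 * total space than 'len' (assumed semantics). */
static int copy_to_iovec(const struct iovec *iov, int iovlen,
                         const unsigned char *kdata, size_t len)
{
    int i;

    for (i = 0; i < iovlen && len; i++) {
        size_t chunk = len < iov[i].iov_len ? len : iov[i].iov_len;

        memcpy(iov[i].iov_base, kdata, chunk);
        kdata += chunk;
        len -= chunk;
    }
    return len ? -1 : 0;
}
```

This is essentially the loop that the old uip_rx() did by hand for the eth data, generalized so both the vnet header and payload can be copied with one call.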
Re: [PATCHv2] KVM: x86: Fix memory leak in vmx.c
On Wed, Apr 17, 2013 at 10:54 AM, Andrew Honig aho...@google.com wrote: If userspace creates and destroys multiple VMs within the same process we leak 20k of memory in the userspace process context per VM. This patch frees the memory in kvm_arch_destroy_vm. If the process exits without closing the VM file descriptor or the file descriptor has been shared with another process then we don't need to free the memory.

Technically, I think there's still a (temporary) leak in the case where the last close happened from the wrong process: f_op->release() gets called from a context where it won't whack the kvm memory regions. However, that's a perverse case not expected in practice -- it will get cleaned up when the original process exits and has its mm cleaned up. Since the one affected (the original open()ing process of /dev/kvm) is also the one who misbehaved (shared its file descriptor), I don't know that it's worth trying to nail that case down as long as the host kernel isn't compromised (it won't be). Perhaps comment it though, at least in the changelog entry?

Signed-off-by: Andrew Honig aho...@google.com
---
arch/x86/kvm/x86.c | 17 +
1 file changed, 17 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e172132..e93e16b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6811,6 +6811,23 @@ void kvm_arch_sync_events(struct kvm *kvm)
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+	if (current->mm == kvm->mm) {
+		/*
+		 * Free memory regions allocated on behalf of userspace,
+		 * unless the memory map has changed due to process exit
+		 * or fd copying.
+		 */
+		struct kvm_userspace_memory_region mem;
+		memset(&mem, 0, sizeof(mem));
+		mem.slot = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT;
+		kvm_set_memory_region(kvm, &mem, 0);
+
+		mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT;
+		kvm_set_memory_region(kvm, &mem, 0);
+
+		mem.slot = TSS_PRIVATE_MEMSLOT;
+		kvm_set_memory_region(kvm, &mem, 0);
+	}
 	kvm_iommu_unmap_guest(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
--
1.7.10.4
Re: Is this really a CVE? - buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796)
On Tue, Apr 2, 2013 at 11:05 PM, Florian Beck beckfloria...@gmail.com wrote: The CVE-2013-1796 (https://git.kernel.org/cgit/virt/kvm/kvm.git/commit/?id=c300aa64ddf57d9c5d9c898a64b36877345dd4a9) reports a possibility of host memory corruption. I see that this could lead to corruption of guest kernel memory, but how could the wrongly aligned address reported by the guest corrupt host kernel memory?

If the region crosses a page boundary.

Regards, Florian

--
This was the posted fix for CVE-2013-1796:
--
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f7c850b..2ade60c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1959,6 +1959,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		/* ...but clean it before doing the actual write */
 		vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);
+		/* Check that the address is 32-byte aligned. */
+		if (vcpu->arch.time_offset &
+		    (sizeof(struct pvclock_vcpu_time_info) - 1))
+			break;
+
 		vcpu->arch.time_page = gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
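The reason the alignment test is sufficient: the pvclock structure is padded to 32 bytes, so if its page offset is a multiple of 32 it cannot straddle a 4 KiB page, and the host-side write stays within the single pinned page. A standalone sketch of that predicate (function name is mine):

```c
#include <stdint.h>

/* Returns non-zero if a structure of 'size' bytes (size assumed to be
 * a power of two, 32 for pvclock_vcpu_time_info) placed at page
 * offset 'offset' is misaligned -- i.e. could cross into the next
 * page, which is the host-memory-corruption case. Mirrors the fix's
 * "offset & (size - 1)" check. */
static int pvclock_area_misaligned(uint32_t offset, uint32_t size)
{
    return (offset & (size - 1)) != 0;
}
```

For example, offset 4064 (= 127 * 32) is accepted and ends exactly at the page boundary, while offset 4090 would spill 26 bytes into the following page and is rejected.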
Re: Best way to busy-wait for a virtio queue?
On Fri, Mar 29, 2013 at 4:12 PM, H. Peter Anvin h...@zytor.com wrote: Is there any preferred way to busy-wait on a virtio event? As in: the guest doesn't have anything useful to do until something is plopped down on the virtio queue, but would like to proceed as quickly as possible after that. Passing through an interrupt handler seems like unnecessary overhead.

How much information do you have about the host? It is possible that leaving the vCPU running is displacing execution from whatever host thread(s) would be involved in making progress towards the event you want delivered - in that case, the interrupt overhead might be balanced out by lower latency of the event delivery.

Right now I have a poll loop looking like (pseudocode):

outw(0, trigger);
while (readl(ring->output pointer) != final output pointer)
	cpu_relax();	/* x86 PAUSE instruction */

... but I have no idea how much sense that makes. The cleanest expression of the desired semantic I can think of would be MONITOR/MWAIT, except that KVM doesn't allow those instructions in the guest.

For the case of a 100% non-overcommitted host (including host i/o processing), there's no reason not to allow the guest to run those instructions. Lacking that, I think the above busy-loop w/PAUSE in it will end up causing a pause-loop exit - so it has largely the same effect but also works on current hosts.
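For reference, the pseudocode loop above can be made concrete as a bounded PAUSE-based spin (a user-space sketch; a real guest driver would fall back to the virtqueue interrupt rather than time out, and poll_ring/max_spins are names I made up):

```c
#include <stdint.h>

static void cpu_relax(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("pause");   /* hints the CPU we are spinning */
#endif
}

/* Busy-wait until *ptr reaches 'want' or max_spins iterations expire.
 * Returns 0 on success, -1 on timeout. The PAUSE in the loop is what
 * makes a pause-loop exit possible on an overcommitted host. */
static int poll_ring(volatile const uint32_t *ptr, uint32_t want,
                     unsigned long max_spins)
{
    while (max_spins--) {
        if (*ptr == want)
            return 0;
        cpu_relax();
    }
    return -1;
}
```

The volatile read is what keeps the compiler from hoisting the load out of the loop; in real driver code a readl()-style accessor plays the same role.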
Re: [PATCH 5/5] KVM: MMU: fast invalid all mmio sptes
On Fri, Mar 15, 2013 at 8:29 AM, Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote: This patch tries to introduce a very simple and scale way to invalid all mmio sptes - it need not walk any shadow pages and hold mmu-lock KVM maintains a global mmio invalid generation-number which is stored in kvm-arch.mmio_invalid_gen and every mmio spte stores the current global generation-number into his available bits when it is created When KVM need zap all mmio sptes, it just simply increase the global generation-number. When guests do mmio access, KVM intercepts a MMIO #PF then it walks the shadow page table and get the mmio spte. If the generation-number on the spte does not equal the global generation-number, it will go to the normal #PF handler to update the mmio spte Since 19 bits are used to store generation-number on mmio spte, the generation-number can be round after 33554432 times. It is large enough for nearly all most cases, but making the code be more strong, we zap all shadow pages when the number is round Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/include/asm/kvm_host.h |2 + arch/x86/kvm/mmu.c | 61 +-- arch/x86/kvm/mmutrace.h | 17 +++ arch/x86/kvm/paging_tmpl.h |7 +++- arch/x86/kvm/vmx.c |4 ++ arch/x86/kvm/x86.c |6 +-- 6 files changed, 82 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ef7f4a5..572398e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -529,6 +529,7 @@ struct kvm_arch { unsigned int n_requested_mmu_pages; unsigned int n_max_mmu_pages; unsigned int indirect_shadow_pages; + unsigned int mmio_invalid_gen; Could this get initialized to something close to the wrap-around value, so that the wrap-around case gets more real-world coverage? struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; /* * Hash table of struct kvm_mmu_page. 
@@ -765,6 +766,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn_offset, unsigned long mask); +void kvm_mmu_invalid_mmio_spte(struct kvm *kvm); void kvm_mmu_zap_all(struct kvm *kvm); unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm); void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 13626f4..7093a92 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -234,12 +234,13 @@ static unsigned int get_mmio_spte_generation(u64 spte) static void mark_mmio_spte(struct kvm *kvm, u64 *sptep, u64 gfn, unsigned access) { - u64 mask = generation_mmio_spte_mask(0); + unsigned int gen = ACCESS_ONCE(kvm-arch.mmio_invalid_gen); + u64 mask = generation_mmio_spte_mask(gen); access = ACC_WRITE_MASK | ACC_USER_MASK; mask |= shadow_mmio_mask | access | gfn PAGE_SHIFT; - trace_mark_mmio_spte(sptep, gfn, access, 0); + trace_mark_mmio_spte(sptep, gfn, access, gen); mmu_spte_set(sptep, mask); } @@ -269,6 +270,34 @@ static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn, return false; } +static bool check_mmio_spte(struct kvm *kvm, u64 spte) +{ + return get_mmio_spte_generation(spte) == + ACCESS_ONCE(kvm-arch.mmio_invalid_gen); +} + +/* + * The caller should protect concurrent access on + * kvm-arch.mmio_invalid_gen. Currently, it is used by + * kvm_arch_commit_memory_region and protected by kvm-slots_lock. + */ +void kvm_mmu_invalid_mmio_spte(struct kvm *kvm) +{ + /* Ensure update memslot has been completed. */ + smp_mb(); + +trace_kvm_mmu_invalid_mmio_spte(kvm); + + /* +* The very rare case: if the generation-number is round, +* zap all shadow pages. 
+*/ + if (unlikely(kvm-arch.mmio_invalid_gen++ == MAX_GEN)) { + kvm-arch.mmio_invalid_gen = 0; + return kvm_mmu_zap_all(kvm); + } +} + static inline u64 rsvd_bits(int s, int e) { return ((1ULL (e - s + 1)) - 1) s; @@ -3183,9 +3212,12 @@ static u64 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr) } /* - * If it is a real mmio page fault, return 1 and emulat the instruction - * directly, return 0 to let CPU fault again on the address, -1 is - * returned if bug is detected. + * Return value: + * 2: invalid spte is detected then let the real page fault path + *update the mmio spte. + * 1: it is a real mmio page fault, emulate the instruction directly. + * 0: let CPU fault again on the
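The mechanism under discussion - stash a generation number in the spare bits of each mmio spte and compare it against a global counter - can be sketched in isolation. For simplicity this sketch assumes one contiguous spare-bit field (the actual patch splits the bits across the spte), and the field position and width here are made up:

```c
#include <stdint.h>

#define MMIO_GEN_SHIFT 52u                           /* assumed spare bits */
#define MMIO_GEN_BITS  11u
#define MMIO_GEN_MASK  ((1ull << MMIO_GEN_BITS) - 1)

/* Stamp the current generation into an spte's spare bits. */
static uint64_t mark_gen(uint64_t spte, uint32_t gen)
{
    spte &= ~(MMIO_GEN_MASK << MMIO_GEN_SHIFT);
    return spte | (((uint64_t)gen & MMIO_GEN_MASK) << MMIO_GEN_SHIFT);
}

static uint32_t get_gen(uint64_t spte)
{
    return (uint32_t)((spte >> MMIO_GEN_SHIFT) & MMIO_GEN_MASK);
}

/* An spte is stale when its stored generation no longer matches the
 * global one. The compare must mask the global counter: once the
 * counter exceeds the field width, old and new generations alias --
 * which is exactly why the patch zaps all shadow pages on wrap. */
static int spte_is_stale(uint64_t spte, uint32_t global_gen)
{
    return get_gen(spte) != (global_gen & MMIO_GEN_MASK);
}
```

The last test below shows the aliasing on wrap-around: an old spte looks "fresh" again after the counter wraps, motivating the kvm_mmu_zap_all() fallback.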
[PATCH] virtio_scsi: fix memory leak on full queue condition.
virtscsi_queuecommand was leaking memory when the virtio queue was full.

Tested: Guest operates correctly even with very small queue sizes, validated we're not leaking kmalloc-192 sized allocations anymore.

Signed-off-by: Eric Northup digitale...@google.com
---
drivers/scsi/virtio_scsi.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 595af1a..dd8dc27 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -469,6 +469,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
 		      sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
 		      GFP_ATOMIC) >= 0)
 		ret = 0;
+	else
+		mempool_free(cmd, virtscsi_cmd_pool);
 out:
 	return ret;
--
1.7.7.3
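The bug class is generic: an object is allocated per request, the enqueue can fail (queue full), and the failure path forgets to release the object. A stripped-down illustration of the pattern and the fix, using malloc/free as stand-ins for the mempool and a toy queue in place of the virtqueue (all names here are illustrative):

```c
#include <stdlib.h>

struct queue { int used, capacity; };

/* Stand-in for virtqueue add: fails with -1 when the queue is full. */
static int queue_add(struct queue *q, void *item)
{
    (void)item;
    if (q->used >= q->capacity)
        return -1;
    q->used++;
    return 0;
}

/* Returns 0 if the command was queued, -1 otherwise. The fix in the
 * patch is the free on the failure path: without it, every full-queue
 * rejection leaks one ~192-byte allocation. (In the real driver a
 * successfully queued cmd is freed on completion; omitted here.) */
static int submit_cmd(struct queue *q)
{
    void *cmd = malloc(192);          /* ~sizeof the per-command struct */

    if (!cmd)
        return -1;
    if (queue_add(q, cmd) < 0) {
        free(cmd);                    /* the missing mempool_free() */
        return -1;
    }
    return 0;
}
```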
Re: [PATCH 0/8] use jump labels to streamline common APIC configuration
On Sun, Aug 5, 2012 at 5:58 AM, Gleb Natapov g...@redhat.com wrote: APIC code has a lot of checks for apic presence and apic HW/SW enable state. Most common configuration is when each vcpu has in kernel apic and it is fully enabled. This path series uses jump labels to turn checks to nops in the common case. What is the target workload and how does the performance compare? As a naive question, how different is it than just using gcc branch hints? [...] -- Typing one-handed, please don't mistake brevity for rudeness. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump
On Mon, May 21, 2012 at 8:53 PM, Yanfei Zhang zhangyan...@cn.fujitsu.com wrote: 于 2012年05月22日 02:58, Eric Northup 写道: [...] So you can have the VMCS offset dumping be a manually-loaded module. Build a database mapping from (CPUID, microcode revision) - (VMCSINFO). There's no need for anything beyond the (CPUID, microcode revision) to be put in the kdump, since your offline processing of a kdump can then look up the rest. [...] We have considered this way, but there are two issues: 1) vmx resource is unique for a single cpu, and it's risky to grab it forcibly on the environment where kvm module is used, in particular on customer's environment. To do this safely, kvm support is needed. It's not risky: you just have to make sure that no one else is going to use the VMCS on your CPU while you're running. You can disable preemption and then save the old VMCS pointer from the CPU (see the VMPTRST instructions). Load your temporary VMCS pointer, discover the fields, then restore the original VMCS pointer. Then re-enable preemption and you're done. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump
On Wed, May 16, 2012 at 12:50 AM, zhangyanfei zhangyan...@cn.fujitsu.com wrote: This patch set exports offsets of VMCS fields as note information for kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve runtime state of guest machine image, such as registers, in host machine's crash dump as VMCS format. The problem is that VMCS internal is hidden by Intel in its specification. So, we slove this problem by reverse engineering implemented in this patch set. The VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO. Perhaps I'm wrong, but this solution seems much, much more dynamic than it needs to be. The VMCS offsets aren't going to change between different boots on the same CPU, unless perhaps the microcode has been updated. So you can have the VMCS offset dumping be a manually-loaded module. Build a database mapping from (CPUID, microcode revision) - (VMCSINFO). There's no need for anything beyond the (CPUID, microcode revision) to be put in the kdump, since your offline processing of a kdump can then look up the rest. It means you don't have to interact with the vmx module at all, and no extra modules or code have to be loaded on the millions of Linux machines that won't need the functionality. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
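The proposed database could be as simple as a static table keyed by (CPUID signature, microcode revision), consulted by the offline kdump tool. A sketch with entirely hypothetical rows and field offsets (a real database would be generated once per CPU/microcode pair by the probing module):

```c
#include <stddef.h>
#include <stdint.h>

struct vmcsinfo {
    uint32_t cpuid_signature;   /* EAX of CPUID leaf 1 */
    uint32_t microcode_rev;     /* IA32_BIOS_SIGN_ID */
    uint16_t rip_offset;        /* example VMCS field offsets (made up) */
    uint16_t rsp_offset;
};

static const struct vmcsinfo vmcs_db[] = {
    { 0x000206c2, 0x1b, 0x2a0, 0x2a8 },
    { 0x000306a9, 0x12, 0x330, 0x338 },
};

/* Offline lookup: NULL means the dump came from an unknown
 * (CPU, microcode) pair and the VMCS cannot be decoded. */
static const struct vmcsinfo *vmcsinfo_lookup(uint32_t sig, uint32_t rev)
{
    size_t i;

    for (i = 0; i < sizeof(vmcs_db) / sizeof(vmcs_db[0]); i++)
        if (vmcs_db[i].cpuid_signature == sig &&
            vmcs_db[i].microcode_rev == rev)
            return &vmcs_db[i];
    return NULL;
}
```

The point of this design is that only the two key values need to appear in the kdump note; everything else lives outside the crashing host.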
Re: [PATCH] kvm: don't call mmu_shrinker w/o used_mmu_pages
On Sun, Apr 22, 2012 at 2:16 AM, Avi Kivity a...@redhat.com wrote: On 04/21/2012 05:15 AM, Mike Waychison wrote: [...] There is no mmu_list_lock. Do you mean kvm_lock or kvm->mmu_lock? If the former, then we could easily fix this by dropping kvm_lock while the work is being done. If the latter, then it's more difficult. (kvm_lock being contended implies that mmu_shrink is called concurrently?)

On a 32-core system experiencing memory pressure, mmu_shrink was often being called concurrently (before we turned it off). With just one, or a small number of VMs on a host, when the mmu_shrinker contends on the kvm_lock, that's just a proxy for the contention on kvm->mmu_lock. It is the one that gets reported, though, since it gets acquired first. The contention on mmu_lock would indeed be difficult to remove.

Our case was perhaps unusual, because of the use of memory containers. So some cgroups were under memory pressure (thus calling the shrinker) but the various VCPU threads (whose guest page tables were being evicted by the shrinker) could immediately turn around and successfully re-allocate them. That made the kvm->mmu_lock really hot.
Re: Linux Crash Caused By KVM?
On Wed, Apr 11, 2012 at 7:45 AM, Avi Kivity a...@redhat.com wrote: On 04/11/2012 05:11 AM, Peijie Yu wrote: For this problem, i found that panic is caused by BUG_ON(in_nmi()) which means NMI happened during another NMI Context; But i check the Intel Technical Manual and found While an NMI interrupt handler is executing, the processor disables additional calls to the NMI handler until the next IRET instruction is executed. So, how this happen? The NMI path for kvm is different; the processor exits from the guest with NMIs blocked, then executes kvm code until it issues int $2 in vmx_complete_interrupts(). If an IRET is executed in this path, then NMIs will be unblocked and nested NMIs may occur. One way this can happen is if we access the vmap area and incur a fault, between the VMEXIT and invoking the NMI handler. Or perhaps the NMI handler itself generates a fault. Or we have a debug exception in that path. Is this reproducible? As an FYI, there have been BIOSes whose SMI handlers ran IRETs. So the NMI blocking can go away surprisingly. See 29.8 NMI handling while in SMM in the Intel SDM vol 3. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Wed, Mar 28, 2012 at 10:47 AM, Jan Kiszka jan.kis...@siemens.com wrote: [...] +4.61 KVM_SET_MSI + +Capability: KVM_CAP_SET_MSI +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_msi (in) +Returns: 0 on success, -1 on error Is this the actual behavior? It looked to me like the successful return value ended up getting set by __apic_accept_irq(), which claims to Return 1 if successfully added and 0 if discarded. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2 v3] kvm: notify host when guest panicked
On Wed, Mar 14, 2012 at 6:25 AM, Gleb Natapov g...@redhat.com wrote: On Wed, Mar 14, 2012 at 03:16:05PM +0200, Avi Kivity wrote: On 03/14/2012 03:14 PM, Gleb Natapov wrote: On Wed, Mar 14, 2012 at 03:07:46PM +0200, Avi Kivity wrote: On 03/14/2012 01:11 PM, Wen Congyang wrote: I don't think we want to use the driver. Instead, have a small piece of code that resets the device and pushes out a string (the panic message?) without any interrupts etc. It's still going to be less reliable than a hypercall, I agree. Do you still want to use complicated and less reliable way? Are you willing to try it out and see how complicated it really is? While it's more complicated, it's also more flexible. You can communicate the panic message, whether the guest is attempting a kdump and its own recovery or whether it wants the host to do it, etc., you can communicate less severe failures like oopses. hypercall can take arguments to achieve the same. It has to be designed in advance; and every time we notice something's missing we have to update the host kernel. We and in the designed stage now. Not to late to design something flexible :) Panic hypercall can take GPA of a buffer where host puts panic info as a parameter. This buffer can be read by QEMU and passed to management. If a host kernel change is in the works, I think it might be cleanest to have the host kernel export a new kind of VCPU exit for unhandled-by-KVM hypercalls. Then usermode can respond to the hypercall as appropriate. This would permit adding or changing future hypercalls without host kernel changes. Guest panic is almost the definition of not-a-fast-path, and so what's the reason to handle it in the host kernel. Punting the functionality to user-space isn't a magic bullet for getting a good interface designed, but in my opinion it is a better place to be doing this. 
Re: [RFC] Next gen kvm api
On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity a...@redhat.com wrote: [...] Moving to syscalls avoids these problems, but introduces new ones: - adding new syscalls is generally frowned upon, and kvm will need several - syscalls into modules are harder and rarer than into core kernel code - will need to add a vcpu pointer to task_struct, and a kvm pointer to mm_struct - Lost a good place to put access control (permissions on /dev/kvm) for which user-mode processes can use KVM. How would the ability to use sys_kvm_* be regulated? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] KVM MMU: improve large munmap efficiency
Flush the shadow MMU instead of iterating over each host VA when doing a large invalidate range callback. The previous code is O(N) in the number of virtual pages being invalidated, while holding both the MMU spinlock and the mmap_sem. Large unmaps can cause significant delay, during which the process is unkillable. Worse, all page allocation could be delayed if there's enough memory pressure that mmu_shrink gets called.

Signed-off-by: Eric Northup digitale...@google.com
---
We have seen delays of over 30 seconds doing a large (128GB) unmap. It'd be nicer to check if the amount of work to be done by the entire flush is less than the work to be done iterating over each HVA page, but that information isn't currently available to the arch-independent part of KVM. Better ideas would be most welcome ;-)

Tested by attaching a debugger to a running qemu w/kvm and running "call munmap(0, 1UL << 46)".

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7287bf5..9fe303a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -61,6 +61,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>
+#define MMU_NOTIFIER_FLUSH_THRESHOLD_PAGES (1024u*1024u*1024u)
+
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
@@ -332,8 +334,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_notifier_count++;
-	for (; start < end; start += PAGE_SIZE)
-		need_tlb_flush |= kvm_unmap_hva(kvm, start);
+	if (end - start < MMU_NOTIFIER_FLUSH_THRESHOLD_PAGES)
+		for (; start < end; start += PAGE_SIZE)
+			need_tlb_flush |= kvm_unmap_hva(kvm, start);
+	else
+		kvm_arch_flush_shadow(kvm);
+
 	need_tlb_flush |= kvm->tlbs_dirty;
 	spin_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, idx);
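The core decision in the RFC reduces to one predicate: if the invalidated range is small, pay O(N) per-page unmaps; if it is large, pay one full shadow flush. A standalone sketch of that cutoff, assuming a 1 GiB byte threshold (the RFC's constant, compared here against the byte length of the range; where the "right" cutoff lies really depends on how much shadow state a full flush throws away):

```c
#include <stdint.h>

#define FLUSH_THRESHOLD_BYTES (1ull << 30)   /* assumed 1 GiB cutoff */

/* Returns 1 when the invalidate range [start, end) is large enough
 * that one full shadow-MMU flush is expected to be cheaper than
 * per-page kvm_unmap_hva() calls. */
static int should_flush_whole_shadow(uint64_t start, uint64_t end)
{
    return (end - start) >= FLUSH_THRESHOLD_BYTES;
}
```

With this cutoff, the 128 GB unmap from the test case takes the single-flush path instead of ~32 million per-page invalidations under mmu_lock.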
Re: [PATCH 06/13] x86/ticketlock: add slowpath logic
On Thu, Sep 1, 2011 at 5:54 PM, Jeremy Fitzhardinge jer...@goop.org wrote: From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Maintain a flag in both LSBs of the ticket lock which indicates whether anyone is in the lock slowpath and may need kicking when the current holder unlocks. The flags are set when the first locker enters the slowpath, and cleared when unlocking to an empty queue. Are there actually two flags maintained? I only see the one in the ticket tail getting set/cleared/tested. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
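The trick being questioned - reserving the low bit of the ticket as a "someone is in the slowpath" flag, and advancing tickets by 2 so the flag bit is never clobbered - can be shown in a few lines. This is an illustrative layout, not the actual x86 ticketlock structure:

```c
#include <stdint.h>

#define TICKET_SLOWPATH_FLAG 0x1u
#define TICKET_INC           0x2u   /* tickets advance by 2: LSB is reserved */

static uint32_t tail_set_slowpath(uint32_t tail)
{
    return tail | TICKET_SLOWPATH_FLAG;
}

static uint32_t tail_clear_slowpath(uint32_t tail)
{
    return tail & ~TICKET_SLOWPATH_FLAG;
}

static int tail_in_slowpath(uint32_t tail)
{
    return tail & TICKET_SLOWPATH_FLAG;
}

/* Taking the next ticket adds 2, leaving the flag bit untouched. */
static uint32_t tail_next_ticket(uint32_t tail)
{
    return tail + TICKET_INC;
}
```

Note that in this sketch only the tail carries the flag, which is exactly the question raised above: the changelog says "both LSBs", but only one flag bit appears to be set, cleared, and tested.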
Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest
Just FYI, one issue that I found with exposing host memory regions as a PCI BAR (including via a very old version of the ivshmem driver... haven't tried a newer one) is that x86's pci_mmap_page_range doesn't want to set up a write-back cacheable mapping of a BAR. It may not matter for your requirements, but the uncached access reduced guest-host bandwidth via the shared memory driver by a lot. If you need the physical address to be fixed, you might be better off by reserving a memory region in the e820 map rather than a PCI BAR, since BARs can move around. On Thu, Aug 25, 2011 at 8:08 AM, David Evensky even...@dancer.ca.sandia.gov wrote: Adding in the rest of what ivshmem does shouldn't affect our use, *I think*. I hadn't intended this to do everything that ivshmem does, but I can see how that would be useful. It would be cool if it could grow into that. Our requirements for the driver in kvm tool are that another program on the host can create a shared segment (anonymous, non-file backed) with a specified handle, size, and contents. That this segment is available to the guest at boot time at a specified address and that no driver will change the contents of the memory except under direct user action. Also, when the guest goes away the shared memory segment shouldn't be affected (e.g. contents changed). Finally, we cannot change the lightweight nature of kvm tool. This is the feature of ivshmem that I need to check today. I did some testing a month ago, but it wasn't detailed enough to check this out. \dae On Thu, Aug 25, 2011 at 02:25:48PM +0300, Sasha Levin wrote: On Thu, 2011-08-25 at 11:59 +0100, Stefan Hajnoczi wrote: On Thu, Aug 25, 2011 at 11:37 AM, Pekka Enberg penb...@kernel.org wrote: Hi Stefan, On Thu, Aug 25, 2011 at 1:31 PM, Stefan Hajnoczi stefa...@gmail.com wrote: It's obviously not competing. One thing you might want to consider is making the guest interface compatible with ivshmem. Is there any reason we shouldn't do that? 
I don't consider that a requirement, just nice to have. The point of implementing the same interface as ivshmem is that users don't need to rejig guests or applications in order to switch between hypervisors. A different interface also prevents same-to-same benchmarks. There is little benefit to creating another virtual device interface when a perfectly good one already exists. The question should be: how is this shmem device different and better than ivshmem? If there is no justification then implement the ivshmem interface. So which interface are we actually taking about? Userspace/kernel in the guest or hypervisor/guest kernel? The hardware interface. Same PCI BAR layout and semantics. Either way, while it would be nice to share the interface but it's not a *requirement* for tools/kvm unless ivshmem is specified in the virtio spec or the driver is in mainline Linux. We don't intend to require people to implement non-standard and non-Linux QEMU interfaces. OTOH, ivshmem would make the PCI ID problem go away. Introducing yet another non-standard and non-Linux interface doesn't help though. If there is no significant improvement over ivshmem then it makes sense to let ivshmem gain critical mass and more users instead of fragmenting the space. I support doing it ivshmem-compatible, though it doesn't have to be a requirement right now (that is, use this patch as a base and build it towards ivshmem - which shouldn't be an issue since this patch provides the PCI+SHM parts which are required by ivshmem anyway). ivshmem is a good, documented, stable interface backed by a lot of research and testing behind it. Looking at the spec it's obvious that Cam had KVM in mind when designing it and thats exactly what we want to have in the KVM tool. David, did you have any plans to extend it to become ivshmem-compatible? If not, would turning it into such break any code that depends on it horribly? -- Sasha. 
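The host-side half of the stated requirement - another program creates a shared segment with a specified handle, size, and contents, before the guest maps the same region through the PCI BAR - can be sketched with POSIX shared memory as a stand-in for the patch's anonymous segments (the function name and NULL-on-error convention are mine):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create (or open) a named shared segment of 'size' bytes and copy
 * 'len' bytes of initial contents into it. Returns the mapped address
 * or NULL on error. A peer process opening the same handle sees the
 * same bytes, which is the sharing model the BAR exposes to guests. */
static void *shm_create_filled(const char *handle, size_t size,
                               const void *contents, size_t len)
{
    int fd = shm_open(handle, O_CREAT | O_RDWR, 0600);
    void *p;

    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping keeps the segment alive */
    if (p == MAP_FAILED)
        return NULL;
    if (contents)
        memcpy(p, contents, len);
    return p;
}
```

This also matches the lifetime requirement above: the segment persists independently of any one guest, so a guest going away cannot change its contents.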
Re: [RFC v5 86/86] 440fx: fix PAM, PCI holes
On Wed, Jul 20, 2011 at 9:50 AM, Avi Kivity a...@redhat.com wrote:

[...]

@@ -130,7 +137,13 @@ static void pc_init1(MemoryRegion *system_memory,

     if (pci_enabled) {
         pci_bus = i440fx_init(i440fx_state, piix3_devfn, isa_irq,
-                              system_memory, system_io, ram_size);
+                              system_memory, system_io, ram_size,
+                              0xe000, 0x1fe0,
+                              0x1 + above_4g_mem_size,
+                              (sizeof(target_phys_addr_t) == 4
+                               ? 0
+                               : ((uint64_t)1 << 63)),
+                              pci_memory, ram_memory);
     } else {
         pci_bus = NULL;
         i440fx_state = NULL;
Re: [PATCH] kvm: log directly from the guest to the host kvm buffer
On Thu, May 12, 2011 at 8:42 AM, Avi Kivity a...@redhat.com wrote:

On 05/12/2011 06:39 PM, Dhaval Giani wrote:

I think that one hypercall per trace is too expensive. Tracing is meant to be lightweight! I think the guest can log to a buffer, which is flushed on overflow or when a vmexit occurs. That gives us automatic serialization between a vcpu and the cpu it runs on, but not between a vcpu and a different host cpu.

hmm. So, basically, log all of these events, and then send them to the host either on an exit, or when the buffer fills up. There is one problem with this approach though. One of the reasons I wanted this approach was because I wanted to correlate the guest and the host times (which is why I kept it synchronous). I lose that information with what you say. However, I see your point about the overhead. I will think about this a bit more.

You might use kvmclock to get a zero-exit (but not zero-cost) time which can be correlated. Another option is to use xadd on a shared memory area to have a global counter incremented. However, that can be slow on large machines, and is hard to do securely with multiple guests.

If the guest puts the guest TSC into the buffer with each event, KVM can convert guest to host time when it drains the buffers on the next vmexit. That's enough information to do an offline correlation of guest and host events.
Re: [PATCH RFC] KVM MMU: fix hashing for TDP and non-paging modes
On Mon, Apr 26, 2010 at 2:46 PM, Marcelo Tosatti mtosa...@redhat.com wrote:

Doh, and your patch does not. But it does not apply to kvm.git -next branch, can you regenerate please?

--

For TDP mode, avoid creating multiple page table roots for the single guest-to-host physical address map by fixing the inputs used for the shadow page table hash in mmu_alloc_roots().

Signed-off-by: Eric Northup digitale...@google.com
---
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ddfa865..9696d65 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2059,10 +2059,12 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
 		hpa_t root = vcpu->arch.mmu.root_hpa;

 		ASSERT(!VALID_PAGE(root));
-		if (tdp_enabled)
-			direct = 1;
 		if (mmu_check_root(vcpu, root_gfn))
 			return 1;
+		if (tdp_enabled) {
+			direct = 1;
+			root_gfn = 0;
+		}
 		sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
 				      PT64_ROOT_LEVEL, direct,
 				      ACC_ALL, NULL);
@@ -2072,8 +2074,6 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 	direct = !is_paging(vcpu);
-	if (tdp_enabled)
-		direct = 1;
 	for (i = 0; i < 4; ++i) {
 		hpa_t root = vcpu->arch.mmu.pae_root[i];

@@ -2089,6 +2089,10 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
 			root_gfn = 0;
 		if (mmu_check_root(vcpu, root_gfn))
 			return 1;
+		if (tdp_enabled) {
+			direct = 1;
+			root_gfn = i << 30;
+		}
 		sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
 				      PT32_ROOT_LEVEL, direct,
 				      ACC_ALL, NULL);
--
[PATCH RFC] KVM MMU: fix hashing for TDP and non-paging modes
I've been reading the x86 mmu.c recently and had been wondering about something. Avi's recent mmu documentation (thanks!) seems to have confirmed my understanding of how the shadow paging is supposed to be working.

In TDP mode, when mmu_alloc_roots() calls kvm_mmu_get_page(), why does it pass (vcpu->arch.cr3 >> PAGE_SHIFT) or (vcpu->arch.mmu.pae_root[i]) as gfn? It seems to me that in TDP mode, gfn should be either zero for the root page table, or 0/1GB/2GB/3GB (for PAE page tables).

The existing behavior can lead to multiple, semantically-identical TDP roots being created by mmu_alloc_roots, depending on the VCPU's CR3 at the time that mmu_alloc_roots was called. But the nested page tables should be* independent of the VCPU state. That wastes some memory and causes extra page faults while populating the extra copies of the page tables.

*assuming that we aren't modeling per-VCPU state that might change the physical address map as seen by that VCPU, such as setting the APIC base to an address overlapping RAM.

All feedback would be welcome, since I'm new to this system! A strawman patch follows.

thanks,
-Eric

--

For TDP mode, avoid creating multiple page table roots for the single guest-to-host physical address map by fixing the inputs used for the shadow page table hash in mmu_alloc_roots().
Signed-off-by: Eric Northup digitale...@google.com
---
 arch/x86/kvm/mmu.c | 12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ddfa865..9696d65 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2059,10 +2059,12 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
 		hpa_t root = vcpu->arch.mmu.root_hpa;

 		ASSERT(!VALID_PAGE(root));
-		if (tdp_enabled)
-			direct = 1;
 		if (mmu_check_root(vcpu, root_gfn))
 			return 1;
+		if (tdp_enabled) {
+			direct = 1;
+			root_gfn = 0;
+		}
 		sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
 				      PT64_ROOT_LEVEL, direct,
 				      ACC_ALL, NULL);
@@ -2072,8 +2074,6 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 	direct = !is_paging(vcpu);
-	if (tdp_enabled)
-		direct = 1;
 	for (i = 0; i < 4; ++i) {
 		hpa_t root = vcpu->arch.mmu.pae_root[i];

@@ -2089,6 +2089,10 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
 			root_gfn = 0;
 		if (mmu_check_root(vcpu, root_gfn))
 			return 1;
+		if (tdp_enabled) {
+			direct = 1;
+			root_gfn = i << 30;
+		}
 		sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
 				      PT32_ROOT_LEVEL, direct,
 				      ACC_ALL, NULL);
--