Re: [PATCH 01/12] KVM: x86: Collect information for setting TSC scaling ratio

2015-09-28 Thread Eric Northup
On Sun, Sep 27, 2015 at 10:38 PM, Haozhong Zhang
 wrote:
>
> The number of bits of the fractional part of the 64-bit TSC scaling
> ratio in VMX and SVM is different. This patch makes the architecture
> code to collect the number of fractional bits and other related
> information into variables that can be accessed in the common code.
>
> Signed-off-by: Haozhong Zhang 
> ---
>  arch/x86/include/asm/kvm_host.h | 8 ++++++++
>  arch/x86/kvm/svm.c  | 5 +++++
>  arch/x86/kvm/x86.c  | 8 ++++++++
>  3 files changed, 21 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2beee03..5b9b86e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -965,6 +965,14 @@ extern bool kvm_has_tsc_control;
>  extern u32  kvm_min_guest_tsc_khz;
>  /* maximum supported tsc_khz for guests */
>  extern u32  kvm_max_guest_tsc_khz;
> +/* number of bits of the fractional part of the TSC scaling ratio */
> +extern u8   kvm_tsc_scaling_ratio_frac_bits;
> +/* reserved bits of TSC scaling ratio (SBZ) */
> +extern u64  kvm_tsc_scaling_ratio_rsvd;
> +/* default TSC scaling ratio (= 1.0) */
> +extern u64  kvm_default_tsc_scaling_ratio;
> +/* maximum allowed value of TSC scaling ratio */
> +extern u64  kvm_max_tsc_scaling_ratio;

Do we need all 3 of kvm_max_guest_tsc_khz, kvm_max_tsc_scaling_ratio,
and kvm_tsc_scaling_ratio_rsvd?  Only SVM has reserved bits, and they
are used just for complaining if the high bits are set, which can
already be expressed by kvm_max_tsc_scaling_ratio.

kvm_max_tsc_scaling_ratio seems to be write-only.

>
>  enum emulation_result {
> EMULATE_DONE, /* no further processing */
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 94b7d15..eff7db7 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -963,7 +963,12 @@ static __init int svm_hardware_setup(void)
> max = min(0x7fffULL, __scale_tsc(tsc_khz, TSC_RATIO_MAX));
>
> kvm_max_guest_tsc_khz = max;
> +
> +   kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX;
> +   kvm_tsc_scaling_ratio_frac_bits = 32;
> +   kvm_tsc_scaling_ratio_rsvd = TSC_RATIO_RSVD;
> }
> +   kvm_default_tsc_scaling_ratio = TSC_RATIO_DEFAULT;
>
> if (nested) {
> printk(KERN_INFO "kvm: Nested Virtualization enabled\n");
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 991466b..f888225 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -106,6 +106,14 @@ bool kvm_has_tsc_control;
>  EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
>  u32  kvm_max_guest_tsc_khz;
>  EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz);
> +u8   kvm_tsc_scaling_ratio_frac_bits;
> +EXPORT_SYMBOL_GPL(kvm_tsc_scaling_ratio_frac_bits);
> +u64  kvm_tsc_scaling_ratio_rsvd;
> +EXPORT_SYMBOL_GPL(kvm_tsc_scaling_ratio_rsvd);
> +u64  kvm_default_tsc_scaling_ratio;
> +EXPORT_SYMBOL_GPL(kvm_default_tsc_scaling_ratio);
> +u64  kvm_max_tsc_scaling_ratio;
> +EXPORT_SYMBOL_GPL(kvm_max_tsc_scaling_ratio);
>
>  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold 
> */
>  static u32 tsc_tolerance_ppm = 250;
> --
> 2.4.8
>


Re: [PATCH 00/12] KVM: x86: add support for VMX TSC scaling

2015-09-28 Thread Eric Northup
On Sun, Sep 27, 2015 at 10:37 PM, Haozhong Zhang
<haozhong.zh...@intel.com> wrote:
> This patchset adds support for VMX TSC scaling feature which is
> available on Intel Skylake CPU. The specification of VMX TSC scaling
> can be found at
> http://www.intel.com/content/www/us/en/processors/timestamp-counter-scaling-virtualization-white-paper.html
>
> VMX TSC scaling allows the guest TSC, which is read by guest rdtsc(p)
> instructions, to increase at a rate that is customized by the hypervisor
> and can differ from the host TSC rate. Basically, VMX TSC scaling adds a
> 64-bit field called the TSC multiplier to the VMCS so that, if VMX TSC
> scaling is enabled, the TSC read by guest rdtsc(p) instructions will be
> calculated by the following formula:
>
>   guest EDX:EAX = ((Host TSC * TSC multiplier) >> 48) + VMX TSC Offset
>
> where, Host TSC = Host MSR_IA32_TSC + Host MSR_IA32_TSC_ADJUST.
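
As a rough illustration of the arithmetic above (not the patchset's code; the
helper name and the use of unsigned __int128 are mine), a 48-bit fractional
multiplier scales the raw TSC like this:

    /* Illustration only: scale a raw host TSC by a 16.48 fixed-point
     * multiplier, then apply the VMX TSC offset. */
    static inline unsigned long long scale_guest_tsc(unsigned long long host_tsc,
                                                     unsigned long long multiplier,
                                                     long long tsc_offset)
    {
            unsigned __int128 product = (unsigned __int128)host_tsc * multiplier;

            return (unsigned long long)(product >> 48) + tsc_offset;
    }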
>
> This patchset, when cooperating with another QEMU patchset (sent in
> another email "target-i386: save/restore vcpu's TSC rate during
> migration"), allows guest programs to observe a consistent TSC rate even
> though they are migrated among machines with different host TSC rates.
>
> VMX TSC scaling shares some common logic with SVM TSC scaling, which is
> already supported by KVM. Patches 1 ~ 8 move that common logic from the
> SVM code to the common code. On top of them, patches 9 ~ 12 add
> VMX-specific support for VMX TSC scaling.

Reviewed-by: Eric Northup <digitale...@google.com>

>
> Haozhong Zhang (12):
>   KVM: x86: Collect information for setting TSC scaling ratio
>   KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
>   KVM: x86: Add a common TSC scaling function
>   KVM: x86: Replace call-back set_tsc_khz() with a common function
>   KVM: x86: Replace call-back compute_tsc_offset() with a common function
>   KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
>   KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
>   KVM: x86: Use the correct vcpu's TSC rate to compute time scale
>   KVM: VMX: Enable and initialize VMX TSC scaling
>   KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
>   KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
>   KVM: VMX: Dump TSC multiplier in dump_vmcs()
>
>  arch/x86/include/asm/kvm_host.h |  24 +++
>  arch/x86/include/asm/vmx.h  |   4 +-
>  arch/x86/kvm/lapic.c|   5 +-
>  arch/x86/kvm/svm.c  | 113 +++--
>  arch/x86/kvm/vmx.c  |  60 
>  arch/x86/kvm/x86.c  | 154 
> +---
>  include/linux/kvm_host.h|  21 +-
>  7 files changed, 221 insertions(+), 160 deletions(-)
>
> --
> 2.4.8
>


Re: [PATCH] vhost: support upto 509 memory regions

2015-02-17 Thread Eric Northup
On Tue, Feb 17, 2015 at 4:32 AM, Michael S. Tsirkin m...@redhat.com wrote:
 On Tue, Feb 17, 2015 at 11:59:48AM +0100, Paolo Bonzini wrote:


 On 17/02/2015 10:02, Michael S. Tsirkin wrote:
   Increasing VHOST_MEMORY_MAX_NREGIONS from 65 to 509
   to match KVM_USER_MEM_SLOTS fixes issue for vhost-net.
  
   Signed-off-by: Igor Mammedov imamm...@redhat.com
 
  This scares me a bit: each region is 32byte, we are talking
  a 16K allocation that userspace can trigger.

 What's bad with a 16K allocation?

 It fails when memory is fragmented.

  How does kvm handle this issue?

 It doesn't.

 Paolo

 I'm guessing kvm doesn't do memory scans on data path,
 vhost does.

 qemu is just doing things that kernel didn't expect it to need.

 Instead, I suggest reducing number of GPA-HVA mappings:

 you have GPA 1,5,7
 map them at HVA 11,15,17
 then you can have 1 slot: 1-11

 To avoid libc reusing the memory holes, reserve them with MAP_NORESERVE
 or something like this.

This works beautifully when host virtual address bits are more
plentiful than guest physical address bits.  Not all architectures
have that property, though.

 We can discuss smarter lookup algorithms but I'd rather
 userspace didn't do things that we then have to
 work around in kernel.
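
For reference, a rough userspace sketch of the hole-reservation idea described
above, assuming a POSIX host (the function name and the idea of mapping real
RAM over the reservation with MAP_FIXED are mine, for illustration only):

    #include <stddef.h>
    #include <sys/mman.h>

    /*
     * Reserve one contiguous HVA span big enough to cover all guest RAM
     * regions, so a single memslot can describe the whole GPA->HVA
     * mapping.  The reservation is PROT_NONE + MAP_NORESERVE, so the
     * gaps consume no commit/swap; real RAM chunks would later be
     * mapped over it with MAP_FIXED at offsets matching their GPAs.
     */
    static void *reserve_guest_ram_span(size_t span_bytes)
    {
            return mmap(NULL, span_bytes, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    }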


 --
 MST


Re: [PATCH] KVM: x86 emulator: emulate MOVNTDQ

2014-07-11 Thread Eric Northup
On Fri, Jul 11, 2014 at 10:56 AM, Alex Williamson
alex.william...@redhat.com wrote:
 Windows 8.1 guest with NVIDIA driver and GPU fails to boot with an
 emulation failure.  The KVM spew suggests the fault is with lack of
 movntdq emulation (courtesy of Paolo):

 Code=02 00 00 b8 08 00 00 00 f3 0f 6f 44 0a f0 f3 0f 6f 4c 0a e0 66 0f e7 
 41 f0 66 0f e7 49 e0 48 83 e9 40 f3 0f 6f 44 0a 10 f3 0f 6f 0c 0a 66 0f e7 41 
 10

 $ as -o a.out
 .section .text
 .byte 0x66, 0x0f, 0xe7, 0x41, 0xf0
 .byte 0x66, 0x0f, 0xe7, 0x49, 0xe0
 $ objdump -d a.out
 0:  66 0f e7 41 f0  movntdq %xmm0,-0x10(%rcx)
 5:  66 0f e7 49 e0  movntdq %xmm1,-0x20(%rcx)

 Add the necessary emulation.

 Signed-off-by: Alex Williamson alex.william...@redhat.com
 Cc: Paolo Bonzini pbonz...@redhat.com
 ---

 Hope I got all the flags correct from copying similar MOV ops, but it
 allows the guest to boot, so I suspect it's ok.

  arch/x86/kvm/emulate.c |7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 index e4e833d..ae39f08 100644
 --- a/arch/x86/kvm/emulate.c
 +++ b/arch/x86/kvm/emulate.c
 @@ -3681,6 +3681,10 @@ static const struct gprefix pfx_0f_28_0f_29 = {
 I(Aligned, em_mov), I(Aligned, em_mov), N, N,
  };

 +static const struct gprefix pfx_0f_e7 = {
 +   N, I(Sse, em_mov), N, N,

I think you need 'Aligned' flag in here - from my reading of the
manual, this instruction will #GP if the memory operand isn't aligned.

 +};
 +
  static const struct escape escape_d9 = { {
 N, N, N, N, N, N, N, I(DstMem, em_fnstcw),
  }, {
 @@ -3951,7 +3955,8 @@ static const struct opcode twobyte_table[256] = {
 /* 0xD0 - 0xDF */
 N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,
 /* 0xE0 - 0xEF */
 -   N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,
 +   N, N, N, N, N, N, N, GP(SrcReg | DstMem | ModRM | Mov, pfx_0f_e7),
 +   N, N, N, N, N, N, N, N,
 /* 0xF0 - 0xFF */
 N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N
  };
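
For what it's worth, a sketch of what the table entry might look like with the
Aligned flag suggested in the review comment above (hypothetical, not the
committed fix; whether Aligned belongs here or on the twobyte_table entry is a
judgment call):

    /* Hypothetical variant of pfx_0f_e7 with alignment checking enabled,
     * so a misaligned memory operand raises #GP as the manual requires. */
    static const struct gprefix pfx_0f_e7 = {
            N, I(Aligned | Sse, em_mov), N, N,
    };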



Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation

2014-06-18 Thread Eric Northup
On Wed, Jun 18, 2014 at 7:19 AM, Nadav Amit na...@cs.technion.ac.il wrote:
 mwait and monitor are currently handled as nop. Considering this
 behavior, they should still be handled correctly, i.e., check execution
 conditions and generate exceptions when required. mwait and monitor may
 also be executed in real-mode and are not handled in that case.  This
 patch performs the emulation of monitor-mwait according to the Intel SDM
 (other than checking whether an interrupt can be used as a break event).

 Signed-off-by: Nadav Amit na...@cs.technion.ac.il
 ---
  arch/x86/kvm/emulate.c | 41 +++--
  arch/x86/kvm/svm.c | 22 ++
  arch/x86/kvm/vmx.c | 27 +++
  3 files changed, 52 insertions(+), 38 deletions(-)

 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 index ef7a5a0..424b58d 100644
 --- a/arch/x86/kvm/emulate.c
 +++ b/arch/x86/kvm/emulate.c
 @@ -3344,6 +3344,43 @@ static int em_bswap(struct x86_emulate_ctxt *ctxt)
 return X86EMUL_CONTINUE;
  }

 +static int em_monitor(struct x86_emulate_ctxt *ctxt)
 +{
 +   int rc;
 +   struct segmented_address addr;
 +   u64 rcx = reg_read(ctxt, VCPU_REGS_RCX);
 +   u64 rax = reg_read(ctxt, VCPU_REGS_RAX);
 +   u8 byte;

I'd request:

u32 ebx, ecx, edx, eax = 1;
ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx);
if (!(ecx & FFL(MWAIT)))
        return emulate_ud(ctxt);

and also in em_mwait.

 +
 +   if (ctxt->mode != X86EMUL_MODE_PROT64)
 +   rcx = (u32)rcx;
 +   if (rcx != 0)
 +   return emulate_gp(ctxt, 0);
 +
 +   addr.seg = seg_override(ctxt);
 +   addr.ea = ctxt->ad_bytes == 8 ? rax : (u32)rax;
 +   rc = segmented_read(ctxt, addr, &byte, 1);
 +   if (rc != X86EMUL_CONTINUE)
 +   return rc;
 +
 +   printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
 +   return X86EMUL_CONTINUE;
 +}
 +
 +static int em_mwait(struct x86_emulate_ctxt *ctxt)
 +{
 +   u64 rcx = reg_read(ctxt, VCPU_REGS_RCX);
 +
 +   if (ctxt->mode != X86EMUL_MODE_PROT64)
 +   rcx = (u32)rcx;
 +   if ((rcx & ~1UL) != 0)
 +   return emulate_gp(ctxt, 0);
 +
 +   /* Accepting interrupt as break event regardless to cpuid */
 +   printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n");
 +   return X86EMUL_CONTINUE;
 +}
 +
  static bool valid_cr(int nr)
  {
 switch (nr) {
 @@ -3557,8 +3594,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 F2bv(((_f)  ~Lock) | DstAcc | SrcImm, _e)

  static const struct opcode group7_rm1[] = {
 -   DI(SrcNone | Priv, monitor),
 -   DI(SrcNone | Priv, mwait),
 +   II(SrcNone | Priv | NoBigReal | UDOnPriv, em_monitor, monitor),
 +   II(SrcNone | Priv | NoBigReal | UDOnPriv, em_mwait, mwait),
 N, N, N, N, N, N,
  };

 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index ec8366c..a524e04 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -3274,24 +3274,6 @@ static int pause_interception(struct vcpu_svm *svm)
 return 1;
  }

 -static int nop_interception(struct vcpu_svm *svm)
 -{
 -   skip_emulated_instruction((svm-vcpu));
 -   return 1;
 -}
 -
 -static int monitor_interception(struct vcpu_svm *svm)
 -{
 -   printk_once(KERN_WARNING kvm: MONITOR instruction emulated as 
 NOP!\n);
 -   return nop_interception(svm);
 -}
 -
 -static int mwait_interception(struct vcpu_svm *svm)
 -{
 -   printk_once(KERN_WARNING kvm: MWAIT instruction emulated as NOP!\n);
 -   return nop_interception(svm);
 -}
 -
  static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
 [SVM_EXIT_READ_CR0] = cr_interception,
 [SVM_EXIT_READ_CR3] = cr_interception,
 @@ -3349,8 +3331,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm 
 *svm) = {
 [SVM_EXIT_CLGI] = clgi_interception,
 [SVM_EXIT_SKINIT]   = skinit_interception,
 [SVM_EXIT_WBINVD]   = emulate_on_interception,
 -   [SVM_EXIT_MONITOR]  = monitor_interception,
 -   [SVM_EXIT_MWAIT]= mwait_interception,
 +   [SVM_EXIT_MONITOR]  = emulate_on_interception,
 +   [SVM_EXIT_MWAIT]= emulate_on_interception,
 [SVM_EXIT_XSETBV]   = xsetbv_interception,
 [SVM_EXIT_NPF]  = pf_interception,
  };
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 801332e..7023e71 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -5672,22 +5672,17 @@ static int handle_pause(struct kvm_vcpu *vcpu)
 return 1;
  }

 -static int handle_nop(struct kvm_vcpu *vcpu)
 +static int handle_emulate(struct kvm_vcpu *vcpu)
  {
 -   skip_emulated_instruction(vcpu);
 -   return 

Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation

2014-06-18 Thread Eric Northup
Quoting Gabriel's post http://www.spinics.net/lists/kvm/msg103792.html :

[...]

 E.g., OS X 10.5 *does* check CPUID, and panics if it doesn't find it.
 It needs the MONITOR cpuid flag to be on, *and* the actual
 instructions to work.




On Wed, Jun 18, 2014 at 11:23 AM, Nadav Amit nadav.a...@gmail.com wrote:
 On 6/18/14, 8:59 PM, Eric Northup wrote:

 On Wed, Jun 18, 2014 at 7:19 AM, Nadav Amit na...@cs.technion.ac.il
 wrote:

 mwait and monitor are currently handled as nop. Considering this
 behavior, they
 should still be handled correctly, i.e., check execution conditions and
 generate
 exceptions when required. mwait and monitor may also be executed in
 real-mode
 and are not handled in that case.  This patch performs the emulation of
 monitor-mwait according to Intel SDM (other than checking whether
 interrupt can
 be used as a break event).

 Signed-off-by: Nadav Amit na...@cs.technion.ac.il
 ---
   arch/x86/kvm/emulate.c | 41 +++--
   arch/x86/kvm/svm.c | 22 ++
   arch/x86/kvm/vmx.c | 27 +++
   3 files changed, 52 insertions(+), 38 deletions(-)

 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 index ef7a5a0..424b58d 100644
 --- a/arch/x86/kvm/emulate.c
 +++ b/arch/x86/kvm/emulate.c
 @@ -3344,6 +3344,43 @@ static int em_bswap(struct x86_emulate_ctxt *ctxt)
  return X86EMUL_CONTINUE;
   }

 +static int em_monitor(struct x86_emulate_ctxt *ctxt)
 +{
 +   int rc;
 +   struct segmented_address addr;
 +   u64 rcx = reg_read(ctxt, VCPU_REGS_RCX);
 +   u64 rax = reg_read(ctxt, VCPU_REGS_RAX);
 +   u8 byte;


 I'd request:

  u32 ebx, ecx, edx, eax = 1;
  ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx);
  if (!(ecx & FFL(MWAIT)))
   return emulate_ud(ctxt);

 and also in em_mwait.


  I had a similar implementation in a previous version, which also checked
  on mwait whether "interrupt as break event" matches the ECX value.
  However, I was under the impression that it was decided that MWAIT will
  always be emulated as NOP to avoid misbehaving VMs that ignore CPUID (see
  the discussion at http://www.spinics.net/lists/kvm/msg102766.html ).

 Nadav


Re: [PATCH v2] kvm: x86: emulate monitor and mwait instructions as nop

2014-06-05 Thread Eric Northup
On Wed, May 7, 2014 at 1:52 PM, Gabriel L. Somlo gso...@gmail.com wrote:
 Treat monitor and mwait instructions as nop, which is architecturally
 correct (but inefficient) behavior. We do this to prevent misbehaving
 guests (e.g. OS X <= 10.7) from crashing after they fail to check for
 monitor/mwait availability via cpuid.

 Since mwait-based idle loops relying on these nop-emulated instructions
 would keep the host CPU pegged at 100%, do NOT advertise their presence
 via cpuid, to prevent compliant guests from using them inadvertently.

If it's going to peg the host CPU at 100% anyway, why bother emulating
it?  Just let the guest run the mwait instruction!  Have a condition
that controls whether CPU_BASED_MWAIT_EXITING gets set in the VMCS
processor execution controls.  Go ahead and put it in CPUID, since
it's actually allowed.
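
A hedged sketch of that condition on the VMX side; CPU_BASED_MWAIT_EXITING and
CPU_BASED_MONITOR_EXITING are the existing execution-control bits, while the
enable_guest_mwait knob and the helper are invented here for illustration:

    /* Hypothetical: only request MWAIT/MONITOR vmexits when we do not
     * want the guest executing them natively.  With both bits clear the
     * guest's mwait idle loop runs on the hardware, so no nop emulation
     * is needed. */
    static u32 adjust_mwait_exiting(u32 exec_control, bool enable_guest_mwait)
    {
            if (!enable_guest_mwait)
                    exec_control |= CPU_BASED_MWAIT_EXITING |
                                    CPU_BASED_MONITOR_EXITING;
            return exec_control;
    }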


 Signed-off-by: Gabriel L. Somlo so...@cmu.edu
 ---

 New in v2: remove invalid_op handler functions which were only used to
handle exits caused by monitor and mwait

 On Wed, May 07, 2014 at 08:31:27PM +0200, Alexander Graf wrote:
 On 05/07/2014 08:15 PM, Michael S. Tsirkin wrote:
 If we really want to be paranoid and worry about guests
 that use this strange way to trigger invalid opcode,
 we can make it possible for userspace to enable/disable
 this hack, and teach qemu to set it.
 
 That would make it even safer than it was.
 
 Not sure it's worth it, just a thought.

 Since we don't trap on non-exposed other instructions (new SSE and
 whatdoiknow) I don't think it's really bad to just expose
 MONITOR/MWAIT as nops.

 So AFAICT, linux prefers to use mwait for idling if cpuid tells it that
 it's available. If we keep telling everyone that we do NOT have monitor
 and mwait available, compliant guests will never end up using them, and
 this hack would remain completely invisible to them, which is good
 (better to use hlt-based idle loops when you're a vm guest, that would
 actually allow the host to relax while you're halted :)

 So the only time anyone would be able to tell we have this hack would be
 when they're about to receive an invalid opcode for using monitor/mwait
 in violation of what CPUID (would have) told them. That's what happens
 to OS X prior to 10.8, which is when I'm hypothesizing the Apple devs
 began to seriously think about their OS running as a vm guest (on fusion
 and parallels)...

 Instead of killing the misbehaving guest with an invalid opcode, we'd
 allow them to peg the host CPU with their monitor == mwait == nop idle
 loop instead, which, at least on OS X, should be tolerable long enough
 to run 'rm -rf System/Library/Extensions/AppleIntelCPUPowerManagement.kext'
 and reboot the guest, after which things would settle down by reverting
 the guest to a hlt-based idle loop.

 The only reason I can think of to add functionality for enabling/disabling
 this hack would be to protect against a malicious guest which would use
 mwait *on purpose* to peg the host CPU. But a malicious guest could just
 run for(;;); in ring 0 and accomplish the same goal, so we wouldn't
 really gain anything in exchange for the added complexity...

 Thanks,
   Gabriel

  arch/x86/kvm/cpuid.c |  2 ++
  arch/x86/kvm/svm.c   | 28 
  arch/x86/kvm/vmx.c   | 20 
  3 files changed, 38 insertions(+), 12 deletions(-)

 diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
 index f47a104..d094fc6 100644
 --- a/arch/x86/kvm/cpuid.c
 +++ b/arch/x86/kvm/cpuid.c
 @@ -283,6 +283,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
 *entry, u32 function,
 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
 /* cpuid 1.ecx */
 const u32 kvm_supported_word4_x86_features =
 +   /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
 +* but *not* advertised to guests via CPUID ! */
 F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
 0 /* DS-CPL, VMX, SMX, EST */ |
 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index 7f4f9c2..0b7d58d 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -2770,12 +2770,6 @@ static int xsetbv_interception(struct vcpu_svm *svm)
 return 1;
  }

 -static int invalid_op_interception(struct vcpu_svm *svm)
 -{
 -   kvm_queue_exception(svm-vcpu, UD_VECTOR);
 -   return 1;
 -}
 -
  static int task_switch_interception(struct vcpu_svm *svm)
  {
 u16 tss_selector;
 @@ -3287,6 +3281,24 @@ static int pause_interception(struct vcpu_svm *svm)
 return 1;
  }

 +static int nop_interception(struct vcpu_svm *svm)
 +{
 +   skip_emulated_instruction((svm-vcpu));
 +   return 1;
 +}
 +
 +static int monitor_interception(struct vcpu_svm *svm)
 +{
 +   printk_once(KERN_WARNING kvm: MONITOR instruction emulated as 
 NOP!\n);
 +   return nop_interception(svm);
 +}
 +
 +static int 

Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-02 Thread Eric Northup
On Thu, May 29, 2014 at 6:12 PM, Andi Kleen a...@firstfloor.org wrote:
 From: Andi Kleen a...@linux.intel.com

 PEBS (Precise Event Bases Sampling) profiling is very powerful,
 allowing improved sampling precision and much additional information,
 like address or TSX abort profiling. cycles:p and :pp uses PEBS.

 This patch enables PEBS profiling in KVM guests.

 PEBS writes profiling records to a virtual address in memory. Since
 the guest controls the virtual address space the PEBS record
 is directly delivered to the guest buffer. We set up the PEBS state
 that is works correctly.The CPU cannot handle any kinds of faults during
 these guest writes.

 To avoid any problems with guest pages being swapped by the host we
 pin the pages when the PEBS buffer is setup, by intercepting
 that MSR.

 Typically profilers only set up a single page, so pinning that is not
 a big problem. The pinning is limited to 17 pages currently (64K+1)

 In theory the guest can change its own page tables after the PEBS
 setup. The host has no way to track that with EPT. But if a guest
 would do that it could only crash itself. It's not expected
 that normal profilers do that.

 The patch also adds the basic glue to enable the PEBS CPUIDs
 and other PEBS MSRs, and ask perf to enable PEBS as needed.

 Due to various limitations it currently only works on Silvermont
 based systems.

 This patch doesn't implement the extended MSRs some CPUs support.
 For example latency profiling on SLM will not work at this point.

 Timing:

 The emulation is somewhat more expensive than a real PMU. This
 may trigger the expensive PMI detection in the guest.
 Usually this can be disabled with
 echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent

 Migration:

 In theory it should be possible (as long as we migrate to
 a host with the same PEBS event and the same PEBS format), but I'm not
 sure the basic KVM PMU code supports it correctly: no code to
 save/restore state, unless I'm missing something. Once the PMU
 code grows proper migration support it should be straight forward
 to handle the PEBS state too.

 Signed-off-by: Andi Kleen a...@linux.intel.com
 ---
  arch/x86/include/asm/kvm_host.h   |   6 ++
  arch/x86/include/uapi/asm/msr-index.h |   4 +
  arch/x86/kvm/cpuid.c  |  10 +-
  arch/x86/kvm/pmu.c| 184 
 --
  arch/x86/kvm/vmx.c|   6 ++
  5 files changed, 196 insertions(+), 14 deletions(-)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 7de069af..d87cb66 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -319,6 +319,8 @@ struct kvm_pmc {
 struct kvm_vcpu *vcpu;
  };

 +#define MAX_PINNED_PAGES 17 /* 64k buffer + ds */
 +
  struct kvm_pmu {
 unsigned nr_arch_gp_counters;
 unsigned nr_arch_fixed_counters;
 @@ -335,6 +337,10 @@ struct kvm_pmu {
 struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
 struct irq_work irq_work;
 u64 reprogram_pmi;
 +   u64 pebs_enable;
 +   u64 ds_area;
 +   struct page *pinned_pages[MAX_PINNED_PAGES];
 +   unsigned num_pinned_pages;
  };

  enum {
 diff --git a/arch/x86/include/uapi/asm/msr-index.h 
 b/arch/x86/include/uapi/asm/msr-index.h
 index fcf2b3a..409a582 100644
 --- a/arch/x86/include/uapi/asm/msr-index.h
 +++ b/arch/x86/include/uapi/asm/msr-index.h
 @@ -72,6 +72,10 @@
  #define MSR_IA32_PEBS_ENABLE   0x03f1
  #define MSR_IA32_DS_AREA   0x0600
  #define MSR_IA32_PERF_CAPABILITIES 0x0345
 +#define PERF_CAP_PEBS_TRAP (1U << 6)
 +#define PERF_CAP_ARCH_REG  (1U << 7)
 +#define PERF_CAP_PEBS_FORMAT   (0xf << 8)
 +
  #define MSR_PEBS_LD_LAT_THRESHOLD  0x03f6

  #define MSR_MTRRfix64K_0   0x0250
 diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
 index f47a104..c8cc76b 100644
 --- a/arch/x86/kvm/cpuid.c
 +++ b/arch/x86/kvm/cpuid.c
 @@ -260,6 +260,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
 *entry, u32 function,
 unsigned f_rdtscp = kvm_x86_ops-rdtscp_supported() ? F(RDTSCP) : 0;
 unsigned f_invpcid = kvm_x86_ops-invpcid_supported() ? F(INVPCID) : 
 0;
 unsigned f_mpx = kvm_x86_ops-mpx_supported() ? F(MPX) : 0;
 +   bool pebs = perf_pebs_virtualization();
 +   unsigned f_ds = pebs ? F(DS) : 0;
 +   unsigned f_pdcm = pebs ? F(PDCM) : 0;
 +   unsigned f_dtes64 = pebs ? F(DTES64) : 0;

 /* cpuid 1.edx */
 const u32 kvm_supported_word0_x86_features =
 @@ -268,7 +272,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
 *entry, u32 function,
 F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) |
 F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
 F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLUSH) |
 -   0 /* Reserved, DS, ACPI */ | F(MMX) |
 +   f_ds /* 

Re: [PATCH v4 00/12] arm/arm64: KVM: host cache maintenance when guest caches are off

2014-02-18 Thread Eric Northup
On Tue, Feb 18, 2014 at 7:27 AM, Marc Zyngier marc.zyng...@arm.com wrote:

 When we run a guest with cache disabled, we don't flush the cache to
 the Point of Coherency, hence possibly missing bits of data that have
 been written in the cache, but have not yet reached memory.

 We also have the opposite issue: when a guest enables its cache,
 whatever sits in the cache is suddenly going to become visible,
 shadowing whatever the guest has written into RAM.

 There are several approaches to these issues:
 - Using the DC bit when caches are off: this breaks guests assuming
   caches off while doing DMA operations. Bootloaders, for example.
   It also breaks the I-D coherency.
 - Fetch the memory attributes on translation fault, and flush the
   cache while handling the fault. This relies on using the PAR_EL1
   register to obtain the Stage-1 memory attributes, and tends to be
   slow.
  - Detecting the translation faults occurring with MMU off (and
   performing a cache clean), and trapping SCTLR_EL1 to detect the
   moment when the guest is turning its caches on (and performing a
   cache invalidation). Trapping of SCTLR_EL1 is then disabled to
   ensure the best performance.

This will preclude noticing the 2nd .. Nth cache off->on cycles,
right?  Will any guests care - doesn't kexec go through a caches-off
state?


 This patch series implements the last solution, for both arm and
 arm64. Tested on TC2 (ARMv7) and FVP model (ARMv8).

 From v3 (http://www.spinics.net/lists/arm-kernel/msg305211.html)
 - Dropped the LPAE-specific pmd_addr_end
 - Added kvm_p[gum]d_addr_end to deal with 40bit IPAs, and fixed the
   callers of p[gum]d_addr_end with IPA parameters
 - Added patch #12 which, while not strictly related, felt a bit lonely
   on the mailing list

 From v2 (http://www.spinics.net/lists/arm-kernel/msg302472.html):
 - Addressed most (hopefully all) of Christoffer's comments
 - Added a new LPAE pmd_addr_end to deal with 40bit IPAs

 From v1 (http://www.spinics.net/lists/kvm/msg99404.html):
 - Fixed AArch32 VM handling on arm64 (Reported by Anup)
 - Added ARMv7 support:
   * Fixed a couple of issues regarding handling of 64bit cp15 regs
   * Per-vcpu HCR
   * Switching of AMAIR0 and AMAIR1

 Marc Zyngier (12):
   arm64: KVM: force cache clean on page fault when caches are off
   arm64: KVM: allows discrimination of AArch32 sysreg access
   arm64: KVM: trap VM system registers until MMU and caches are ON
   ARM: KVM: introduce kvm_p*d_addr_end
   arm64: KVM: flush VM pages before letting the guest enable caches
   ARM: KVM: force cache clean on page fault when caches are off
   ARM: KVM: fix handling of trapped 64bit coprocessor accesses
   ARM: KVM: fix ordering of 64bit coprocessor accesses
   ARM: KVM: introduce per-vcpu HYP Configuration Register
   ARM: KVM: add world-switch for AMAIR{0,1}
   ARM: KVM: trap VM system registers until MMU and caches are ON
   ARM: KVM: fix warning in mmu.c

  arch/arm/include/asm/kvm_arm.h   |   4 +-
  arch/arm/include/asm/kvm_asm.h   |   4 +-
  arch/arm/include/asm/kvm_host.h  |   9 ++--
  arch/arm/include/asm/kvm_mmu.h   |  29 +--
  arch/arm/kernel/asm-offsets.c|   1 +
  arch/arm/kvm/coproc.c|  84 +++---
  arch/arm/kvm/coproc.h|  14 +++--
  arch/arm/kvm/coproc_a15.c|   2 +-
  arch/arm/kvm/coproc_a7.c |   2 +-
  arch/arm/kvm/guest.c |   1 +
  arch/arm/kvm/interrupts_head.S   |  21 +---
  arch/arm/kvm/mmu.c   | 110 
 ---
  arch/arm64/include/asm/kvm_arm.h |   3 +-
  arch/arm64/include/asm/kvm_asm.h |   3 +-
  arch/arm64/include/asm/kvm_mmu.h |  21 ++--
  arch/arm64/kvm/sys_regs.c|  99 ++-
  arch/arm64/kvm/sys_regs.h|   2 +
  17 files changed, 341 insertions(+), 68 deletions(-)

 --
 1.8.3.4



Re: [PATCH V12 3/5] kvm : Fold pv_unhalt flag into GET_MP_STATE ioctl to aid migration

2013-08-06 Thread Eric Northup
On Tue, Aug 6, 2013 at 11:23 AM, Raghavendra K T
raghavendra...@linux.vnet.ibm.com wrote:
 kvm : Fold pv_unhalt flag into GET_MP_STATE ioctl to aid migration

 From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

 During migration, any vcpu that got kicked but did not become runnable
 (still in halted state) should be runnable after migration.

If this is about migration correctness, could it get folded into the
previous patch 2/5, so that there's not a broken commit which could
hurt bisection?


 Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 Acked-by: Gleb Natapov g...@redhat.com
 Acked-by: Ingo Molnar mi...@kernel.org
 ---
  arch/x86/kvm/x86.c |7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index dae4575..1e73dab 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -6284,7 +6284,12 @@ int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu 
 *vcpu,
 struct kvm_mp_state *mp_state)
  {
 kvm_apic_accept_events(vcpu);
 -   mp_state->mp_state = vcpu->arch.mp_state;
 +   if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED &&
 +   vcpu->arch.pv.pv_unhalted)
 +   mp_state->mp_state = KVM_MP_STATE_RUNNABLE;
 +   else
 +   mp_state->mp_state = vcpu->arch.mp_state;
 +
 return 0;
  }




Re: [RFC PATCH 1/2] Hyper-H reference counter

2013-05-13 Thread Eric Northup
On Mon, May 13, 2013 at 4:45 AM, Vadim Rozenfeld vroze...@redhat.com wrote:
 Signed-off: Peter Lieven p...@dlh.net
 Signed-off: Gleb Natapov g...@redhat.com
 Signed-off: Vadim Rozenfeld vroze...@redhat.com

 The following patch allows to activate Hyper-V
 reference time counter
 ---
  arch/x86/include/asm/kvm_host.h|  2 ++
  arch/x86/include/uapi/asm/hyperv.h |  3 +++
  arch/x86/kvm/x86.c | 25 -
  3 files changed, 29 insertions(+), 1 deletion(-)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 3741c65..f0fee35 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -575,6 +575,8 @@ struct kvm_arch {
 /* fields used by HYPER-V emulation */
 u64 hv_guest_os_id;
 u64 hv_hypercall;
 +   u64 hv_ref_count;
 +   u64 hv_tsc_page;

 #ifdef CONFIG_KVM_MMU_AUDIT
 int audit_point;
 diff --git a/arch/x86/include/uapi/asm/hyperv.h 
 b/arch/x86/include/uapi/asm/hyperv.h
 index b80420b..9711819 100644
 --- a/arch/x86/include/uapi/asm/hyperv.h
 +++ b/arch/x86/include/uapi/asm/hyperv.h
 @@ -136,6 +136,9 @@
  /* MSR used to read the per-partition time reference counter */
  #define HV_X64_MSR_TIME_REF_COUNT  0x4020

 +/* A partition's reference time stamp counter (TSC) page */
 +#define HV_X64_MSR_REFERENCE_TSC   0x4021
 +
  /* Define the virtual APIC registers */
  #define HV_X64_MSR_EOI 0x4070
  #define HV_X64_MSR_ICR 0x4071
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 094b5d9..1a4036d 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -843,7 +843,7 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  static u32 msrs_to_save[] = {
 MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
 -   HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
 +   HV_X64_MSR_GUEST_OS_ID, 
 HV_X64_MSR_HYPERCALL,HV_X64_MSR_TIME_REF_COUNT,
 HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
 MSR_KVM_PV_EOI_EN,
 MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 @@ -1764,6 +1764,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
 switch (msr) {
 case HV_X64_MSR_GUEST_OS_ID:
 case HV_X64_MSR_HYPERCALL:
 +   case HV_X64_MSR_REFERENCE_TSC:
 +   case HV_X64_MSR_TIME_REF_COUNT:
 r = true;
 break;
 }
 @@ -1803,6 +1805,21 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, 
 u32 msr, u64 data)
 if (__copy_to_user((void __user *)addr, instructions, 4))
 return 1;
 kvm-arch.hv_hypercall = data;
 +   kvm-arch.hv_ref_count = get_kernel_ns();
 +   break;
 +   }
 +   case HV_X64_MSR_REFERENCE_TSC: {
 +   u64 gfn;
 +   unsigned long addr;
 +   u32 tsc_ref;
 +   gfn = data >> HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT;
 +   addr = gfn_to_hva(kvm, gfn);
 +   if (kvm_is_error_hva(addr))
 +   return 1;
 +   tsc_ref = 0;
 +   if (__copy_to_user((void __user *)addr, &tsc_ref,
 +   sizeof(tsc_ref)))

Does this do the right thing when we're migrating?  How does usermode
learn that the guest page has been dirtied?

 +   return 1;
 +   kvm-arch.hv_tsc_page = data;
 break;
 }
 default:
 @@ -2229,6 +2246,12 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, 
 u32 msr, u64 *pdata)
 case HV_X64_MSR_HYPERCALL:
 data = kvm-arch.hv_hypercall;
 break;
 +   case HV_X64_MSR_TIME_REF_COUNT:
 +   data = div_u64(get_kernel_ns() - kvm->arch.hv_ref_count, 100);
 +   break;
 +   case HV_X64_MSR_REFERENCE_TSC:
 +   data = kvm-arch.hv_tsc_page;
 +   break;
 default:
 vcpu_unimpl(vcpu, Hyper-V unhandled rdmsr: 0x%x\n, msr);
 return 1;
 --
 1.8.1.2



Re: [PATCH] kvm tools: virtio-net mergable rx buffers

2013-04-23 Thread Eric Northup
Do you care about guests with drivers that don't negotiate
VIRTIO_NET_F_MRG_RXBUF?

On Mon, Apr 22, 2013 at 5:32 PM, Sasha Levin sasha.le...@oracle.com wrote:

 Support mergable rx buffers for virtio-net. This helps reduce the amount
 of memory the guest kernel has to allocate per rx vq.

 Signed-off-by: Sasha Levin sasha.le...@oracle.com
 ---
  tools/kvm/include/kvm/uip.h  |  4 ++--
  tools/kvm/include/kvm/util.h |  3 +++
  tools/kvm/net/uip/core.c | 54 
 +---
  tools/kvm/net/uip/tcp.c  |  2 +-
  tools/kvm/net/uip/udp.c  |  2 +-
  tools/kvm/util/util.c| 15 
  tools/kvm/virtio/net.c   | 37 ++
  7 files changed, 55 insertions(+), 62 deletions(-)

 diff --git a/tools/kvm/include/kvm/uip.h b/tools/kvm/include/kvm/uip.h
 index ac248d2..fa82f10 100644
 --- a/tools/kvm/include/kvm/uip.h
 +++ b/tools/kvm/include/kvm/uip.h
 @@ -252,7 +252,7 @@ struct uip_tcp_socket {
  };

  struct uip_tx_arg {
 -   struct virtio_net_hdr *vnet;
 +   struct virtio_net_hdr_mrg_rxbuf *vnet;
 struct uip_info *info;
 struct uip_eth *eth;
 int vnet_len;
 @@ -332,7 +332,7 @@ static inline u16 uip_eth_hdrlen(struct uip_eth *eth)
  }

  int uip_tx(struct iovec *iov, u16 out, struct uip_info *info);
 -int uip_rx(struct iovec *iov, u16 in, struct uip_info *info);
 +int uip_rx(unsigned char *buffer, u32 length, struct uip_info *info);
  int uip_init(struct uip_info *info);

  int uip_tx_do_ipv4_udp_dhcp(struct uip_tx_arg *arg);
 diff --git a/tools/kvm/include/kvm/util.h b/tools/kvm/include/kvm/util.h
 index 0df9f0d..6f8ac83 100644
 --- a/tools/kvm/include/kvm/util.h
 +++ b/tools/kvm/include/kvm/util.h
 @@ -22,6 +22,7 @@
  #include sys/param.h
  #include sys/types.h
  #include linux/types.h
 +#include sys/uio.h

  #ifdef __GNUC__
  #define NORETURN __attribute__((__noreturn__))
 @@ -94,4 +95,6 @@ struct kvm;
  void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *hugetlbfs_path, 
 u64 size);

 +int memcpy_toiovecend(const struct iovec *iov, int iovlen, unsigned char 
 *kdata, size_t len);
 +
  #endif /* KVM__UTIL_H */
 diff --git a/tools/kvm/net/uip/core.c b/tools/kvm/net/uip/core.c
 index 4e5bb82..d9e9993 100644
 --- a/tools/kvm/net/uip/core.c
 +++ b/tools/kvm/net/uip/core.c
 @@ -7,7 +7,7 @@

  int uip_tx(struct iovec *iov, u16 out, struct uip_info *info)
  {
 -   struct virtio_net_hdr *vnet;
 +   struct virtio_net_hdr_mrg_rxbuf *vnet;
 struct uip_tx_arg arg;
 int eth_len, vnet_len;
 struct uip_eth *eth;
 @@ -74,63 +74,21 @@ int uip_tx(struct iovec *iov, u16 out, struct uip_info 
 *info)
 return vnet_len + eth_len;
  }

 -int uip_rx(struct iovec *iov, u16 in, struct uip_info *info)
 +int uip_rx(unsigned char *buffer, u32 length, struct uip_info *info)
  {
 -   struct virtio_net_hdr *vnet;
 -   struct uip_eth *eth;
 struct uip_buf *buf;
 -   int vnet_len;
 -   int eth_len;
 -   char *p;
 int len;
 -   int cnt;
 -   int i;

 /*
  * Sleep until there is a buffer for guest
  */
 buf = uip_buf_get_used(info);

 -   /*
 -* Fill device to guest buffer, vnet hdr fisrt
 -*/
 -   vnet_len = iov[0].iov_len;
 -   vnet = iov[0].iov_base;
 -   if (buf-vnet_len  vnet_len) {
 -   len = -1;
 -   goto out;
 -   }
 -   memcpy(vnet, buf-vnet, buf-vnet_len);
 -
 -   /*
 -* Then, the real eth data
 -* Note: Be sure buf-eth_len is not bigger than the buffer len that 
 guest provides
 -*/
 -   cnt = buf-eth_len;
 -   p = buf-eth;
 -   for (i = 1; i  in; i++) {
 -   eth_len = iov[i].iov_len;
 -   eth = iov[i].iov_base;
 -   if (cnt  eth_len) {
 -   memcpy(eth, p, eth_len);
 -   cnt -= eth_len;
 -   p += eth_len;
 -   } else {
 -   memcpy(eth, p, cnt);
 -   cnt -= cnt;
 -   break;
 -   }
 -   }
 -
 -   if (cnt) {
 -   pr_warning(uip_rx error);
 -   len = -1;
 -   goto out;
 -   }
 +   memcpy(buffer, buf-vnet, buf-vnet_len);
 +   memcpy(buffer + buf-vnet_len, buf-eth, buf-eth_len);

 len = buf-vnet_len + buf-eth_len;

 -out:
 uip_buf_set_free(info, buf);
 return len;
  }
 @@ -172,8 +130,8 @@ int uip_init(struct uip_info *info)
 }

 list_for_each_entry(buf, buf_head, list) {
 -   buf-vnet   = malloc(sizeof(struct virtio_net_hdr));
 -   buf-vnet_len   = sizeof(struct virtio_net_hdr);
 +   buf-vnet   = malloc(sizeof(struct 
 virtio_net_hdr_mrg_rxbuf));
 +   buf-vnet_len   = sizeof(struct virtio_net_hdr_mrg_rxbuf);

Re: [PATCHv2] KVM: x86: Fix memory leak in vmx.c

2013-04-17 Thread Eric Northup
On Wed, Apr 17, 2013 at 10:54 AM, Andrew Honig aho...@google.com wrote:

 If userspace creates and destroys multiple VMs within the same process
 we leak 20k of memory in the userspace process context per VM.  This
 patch frees the memory in kvm_arch_destroy_vm.  If the process exits
 without closing the VM file descriptor or the file descriptor has been
 shared with another process then we don't need to free the memory.

Technically, I think there's still a (temporary) leak in the case
where the last close happened from the wrong process: f_op->release()
gets called from a context where it won't whack the kvm memory
regions.  However, that's a perverse case not expected in practice --
it will get cleaned up when the original process exits and has its mm
cleaned up.  Since the one affected (the original open()ing process of
/dev/kvm) is also the one who misbehaved (shared its file descriptor),
I don't know that it's worth trying to nail that case down as long as
the host kernel isn't compromised (it won't be).  Perhaps comment it
though, at least in the changelog entry?


 Signed-off-by: Andrew Honig aho...@google.com
 ---
  arch/x86/kvm/x86.c |   17 +
  1 file changed, 17 insertions(+)

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index e172132..e93e16b 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -6811,6 +6811,23 @@ void kvm_arch_sync_events(struct kvm *kvm)

  void kvm_arch_destroy_vm(struct kvm *kvm)
  {
  +   if (current->mm == kvm->mm) {
  +   /*
  +* Free memory regions allocated on behalf of userspace,
  +* unless the memory map has changed due to process exit
  +* or fd copying.
  +*/
  +   struct kvm_userspace_memory_region mem;
  +   memset(&mem, 0, sizeof(mem));
  +   mem.slot = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT;
  +   kvm_set_memory_region(kvm, &mem, 0);
  +
  +   mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT;
  +   kvm_set_memory_region(kvm, &mem, 0);
  +
  +   mem.slot = TSS_PRIVATE_MEMSLOT;
  +   kvm_set_memory_region(kvm, &mem, 0);
 +   }
 kvm_iommu_unmap_guest(kvm);
 kfree(kvm-arch.vpic);
 kfree(kvm-arch.vioapic);
 --
 1.7.10.4



Re: Is this really a CVE? - buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796)

2013-04-03 Thread Eric Northup
On Tue, Apr 2, 2013 at 11:05 PM, Florian Beck beckfloria...@gmail.com wrote:
 The CVE-2013-1796
 (https://git.kernel.org/cgit/virt/kvm/kvm.git/commit/?id=c300aa64ddf57d9c5d9c898a64b36877345dd4a9)
 reports a possibility of host memory corruption.
  I see that this could lead to corruption of guest kernel memory,
  but how could a wrongly aligned address reported by the guest corrupt
  host kernel memory?

If the region crosses a page boundary: the host writes the pvclock
structure through a mapping of a single guest page, so an unaligned
address lets the write run past the end of that page into whatever host
memory happens to follow it.



 Regards, Florian

 --
 This was the posted fix for CVE-2013-1796:
 --
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index f7c850b..2ade60c 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1959,6 +1959,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu,
 struct msr_data *msr_info)
 /* ...but clean it before doing the actual write */
  vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);
  + /* Check that the address is 32-byte aligned. */
  + if (vcpu->arch.time_offset &
  + (sizeof(struct pvclock_vcpu_time_info) - 1))
  + break;
  +
  vcpu->arch.time_page =
  gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
 --


Re: Best way to busy-wait for a virtio queue?

2013-04-01 Thread Eric Northup
On Fri, Mar 29, 2013 at 4:12 PM, H. Peter Anvin h...@zytor.com wrote:

 Is there any preferred way to busy-wait on a virtio event?  As in: the
 guest doesn't have anything useful to do until something is plopped down
 on the virtio queue, but would like to proceed as quickly as possible
 after that.  Passing through an interrupt handler seems like unnecessary
 overhead.

How much information do you have about the host?  It is possible that
leaving the vCPU running is displacing execution from whatever host
thread(s) would be involved in making progress towards the event you
want delivered - in that case, the interrupt overhead might be
balanced out by lower latency of the event delivery.

 Right now I have a poll loop looking like (pseudocode):

  outw(0, trigger);
  while (readl(ring->output pointer) != final output pointer)
  cpu_relax();    /* x86 PAUSE instruction */

 ... but I have no idea how much sense that makes.

The cleanest expression of the desired semantic I can think of would
be MONITOR/MWAIT, except that KVM doesn't allow those instructions in
the guest.  For the case of a 100% non-overcommitted host (including
host i/o processing), there's no reason not to allow the guest to run
those instructions.

Lacking that, I think the above busy-loop w/PAUSE in it will end up
causing a pause-loop exit - so it has largely the same effect but also
works on current hosts.


Re: [PATCH 5/5] KVM: MMU: fast invalid all mmio sptes

2013-03-18 Thread Eric Northup
On Fri, Mar 15, 2013 at 8:29 AM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 This patch tries to introduce a very simple and scale way to invalid all
 mmio sptes - it need not walk any shadow pages and hold mmu-lock

 KVM maintains a global mmio invalid generation-number which is stored in
 kvm-arch.mmio_invalid_gen and every mmio spte stores the current global
 generation-number into his available bits when it is created

 When KVM need zap all mmio sptes, it just simply increase the global
 generation-number. When guests do mmio access, KVM intercepts a MMIO #PF
 then it walks the shadow page table and get the mmio spte. If the
 generation-number on the spte does not equal the global generation-number,
 it will go to the normal #PF handler to update the mmio spte

 Since 19 bits are used to store generation-number on mmio spte, the
 generation-number can be round after 33554432 times. It is large enough
 for nearly all most cases, but making the code be more strong, we zap all
 shadow pages when the number is round

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/include/asm/kvm_host.h |2 +
  arch/x86/kvm/mmu.c  |   61 
 +--
  arch/x86/kvm/mmutrace.h |   17 +++
  arch/x86/kvm/paging_tmpl.h  |7 +++-
  arch/x86/kvm/vmx.c  |4 ++
  arch/x86/kvm/x86.c  |6 +--
  6 files changed, 82 insertions(+), 15 deletions(-)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index ef7f4a5..572398e 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -529,6 +529,7 @@ struct kvm_arch {
 unsigned int n_requested_mmu_pages;
 unsigned int n_max_mmu_pages;
 unsigned int indirect_shadow_pages;
 +   unsigned int mmio_invalid_gen;

Could this get initialized to something close to the wrap-around
value, so that the wrap-around case gets more real-world coverage?

 struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 /*
  * Hash table of struct kvm_mmu_page.
 @@ -765,6 +766,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
 int slot);
  void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
  struct kvm_memory_slot *slot,
  gfn_t gfn_offset, unsigned long mask);
 +void kvm_mmu_invalid_mmio_spte(struct kvm *kvm);
  void kvm_mmu_zap_all(struct kvm *kvm);
  unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
  void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int 
 kvm_nr_mmu_pages);
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 13626f4..7093a92 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -234,12 +234,13 @@ static unsigned int get_mmio_spte_generation(u64 spte)
  static void mark_mmio_spte(struct kvm *kvm, u64 *sptep, u64 gfn,
unsigned access)
  {
 -   u64 mask = generation_mmio_spte_mask(0);
 +   unsigned int gen = ACCESS_ONCE(kvm->arch.mmio_invalid_gen);
 +   u64 mask = generation_mmio_spte_mask(gen);

  access &= ACC_WRITE_MASK | ACC_USER_MASK;
  mask |= shadow_mmio_mask | access | gfn << PAGE_SHIFT;

 -   trace_mark_mmio_spte(sptep, gfn, access, 0);
 +   trace_mark_mmio_spte(sptep, gfn, access, gen);
 mmu_spte_set(sptep, mask);
  }

 @@ -269,6 +270,34 @@ static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, 
 gfn_t gfn,
 return false;
  }

 +static bool check_mmio_spte(struct kvm *kvm, u64 spte)
 +{
 +   return get_mmio_spte_generation(spte) ==
 +   ACCESS_ONCE(kvm->arch.mmio_invalid_gen);
 +}
 +
 +/*
 + * The caller should protect concurrent access on
 + * kvm-arch.mmio_invalid_gen. Currently, it is used by
 + * kvm_arch_commit_memory_region and protected by kvm-slots_lock.
 + */
 +void kvm_mmu_invalid_mmio_spte(struct kvm *kvm)
 +{
 +   /* Ensure update memslot has been completed. */
 +   smp_mb();
 +
 +trace_kvm_mmu_invalid_mmio_spte(kvm);
 +
 +   /*
 +* The very rare case: if the generation-number is round,
 +* zap all shadow pages.
 +*/
 +   if (unlikely(kvm-arch.mmio_invalid_gen++ == MAX_GEN)) {
 +   kvm-arch.mmio_invalid_gen = 0;
 +   return kvm_mmu_zap_all(kvm);
 +   }
 +}
 +
  static inline u64 rsvd_bits(int s, int e)
  {
  return ((1ULL << (e - s + 1)) - 1) << s;
 @@ -3183,9 +3212,12 @@ static u64 walk_shadow_page_get_mmio_spte(struct 
 kvm_vcpu *vcpu, u64 addr)
  }

  /*
 - * If it is a real mmio page fault, return 1 and emulat the instruction
 - * directly, return 0 to let CPU fault again on the address, -1 is
 - * returned if bug is detected.
 + * Return value:
 + * 2: invalid spte is detected then let the real page fault path
 + *update the mmio spte.
 + * 1: it is a real mmio page fault, emulate the instruction directly.
 + * 0: let CPU fault again on the 

[PATCH] virtio_scsi: fix memory leak on full queue condition.

2012-11-08 Thread Eric Northup
virtscsi_queuecommand was leaking memory when the virtio queue was full.

Tested: Guest operates correctly even with very small queue sizes, validated
we're not leaking kmalloc-192 sized allocations anymore.

Signed-off-by: Eric Northup digitale...@google.com
---
 drivers/scsi/virtio_scsi.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 595af1a..dd8dc27 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -469,6 +469,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, 
struct scsi_cmnd *sc)
   sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
   GFP_ATOMIC) >= 0)
ret = 0;
+   else
+   mempool_free(cmd, virtscsi_cmd_pool);
 
 out:
return ret;
-- 
1.7.7.3



Re: [PATCH 0/8] use jump labels to streamline common APIC configuration

2012-08-05 Thread Eric Northup
On Sun, Aug 5, 2012 at 5:58 AM, Gleb Natapov g...@redhat.com wrote:
 APIC code has a lot of checks for apic presence and apic HW/SW enable
 state.  The most common configuration is when each vcpu has an in-kernel
 apic and it is fully enabled. This patch series uses jump labels to turn
 the checks into nops in the common case.

What is the target workload and how does the performance compare?  As
a naive question, how different is it than just using gcc branch
hints?

[...]
-- 
Typing one-handed, please don't mistake brevity for rudeness.


Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

2012-05-22 Thread Eric Northup
On Mon, May 21, 2012 at 8:53 PM, Yanfei Zhang
zhangyan...@cn.fujitsu.com wrote:
 On 2012-05-22 02:58, Eric Northup wrote:
[...]
 So you can have the VMCS offset dumping be a manually-loaded module.
  Build a database mapping from (CPUID, microcode revision) -> (VMCSINFO).
 There's no need for anything beyond the (CPUID, microcode revision) to
 be put in the kdump, since your offline processing of a kdump can then
 look up the rest.
[...]

 We have considered this way, but there are two issues:
  1) The vmx resource is unique to a single cpu, and it's risky to grab it
  forcibly in an environment where the kvm module is in use, in particular
  in a customer's environment.
  To do this safely, kvm support is needed.

It's not risky: you just have to make sure that no one else is going
to use the VMCS on your CPU while you're running.  You can disable
preemption and then save the old VMCS pointer from the CPU (see the
VMPTRST instructions).  Load your temporary VMCS pointer, discover
the fields, then restore the original VMCS pointer.  Then re-enable
preemption and you're done.
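
Roughly, the sequence described above (the vmptrst()/vmptrld() wrappers and
discover_vmcs_field_offsets() are hypothetical names for illustration, not
existing kernel helpers):

    static void probe_vmcs_layout(u64 scratch_vmcs_pa)
    {
            u64 saved_vmcs_pa;

            preempt_disable();
            vmptrst(&saved_vmcs_pa);        /* save the currently loaded VMCS pointer */
            vmptrld(scratch_vmcs_pa);       /* switch to a temporary VMCS */
            discover_vmcs_field_offsets();  /* probe field offsets on the temp VMCS */
            vmptrld(saved_vmcs_pa);         /* restore the original VMCS pointer */
            preempt_enable();
    }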


Re: [PATCH v2 0/5] Export offsets of VMCS fields as note information for kdump

2012-05-21 Thread Eric Northup
On Wed, May 16, 2012 at 12:50 AM, zhangyanfei
zhangyan...@cn.fujitsu.com wrote:

 This patch set exports offsets of VMCS fields as note information for
 kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
 runtime state of guest machine image, such as registers, in host
 machine's crash dump as VMCS format. The problem is that VMCS internal
  is hidden by Intel in its specification. So, we solve this problem
 by reverse engineering implemented in this patch set. The VMCSINFO
 is exported via sysfs to kexec-tools just like VMCOREINFO.

Perhaps I'm wrong, but this solution seems much, much more dynamic
than it needs to be.

The VMCS offsets aren't going to change between different boots on the
same CPU, unless perhaps the microcode has been updated.

So you can have the VMCS offset dumping be a manually-loaded module.
Build a database mapping from (CPUID, microcode revision) -> (VMCSINFO).
There's no need for anything beyond the (CPUID, microcode revision) to
be put in the kdump, since your offline processing of a kdump can then
look up the rest.

It means you don't have to interact with the vmx module at all, and
no extra modules or code have to be loaded on the millions of Linux
machines that won't need the functionality.
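
A sketch of the offline lookup this implies; every name here is invented for
illustration, and the table itself would be generated ahead of time per CPU
model and microcode revision:

    #include <stddef.h>
    #include <stdint.h>

    struct vmcs_field_offset { uint32_t field; uint16_t offset; };

    struct vmcsinfo_entry {
            uint32_t cpuid_signature;   /* CPUID.1.EAX recorded in the dump */
            uint32_t microcode_rev;     /* MSR_IA32_UCODE_REV recorded in the dump */
            const struct vmcs_field_offset *offsets;
            size_t nr_offsets;
    };

    static const struct vmcsinfo_entry *
    vmcsinfo_lookup(const struct vmcsinfo_entry *db, size_t n,
                    uint32_t cpuid_signature, uint32_t microcode_rev)
    {
            for (size_t i = 0; i < n; i++)
                    if (db[i].cpuid_signature == cpuid_signature &&
                        db[i].microcode_rev == microcode_rev)
                            return &db[i];
            return NULL;    /* unknown CPU/microcode: fall back to probing */
    }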


Re: [PATCH] kvm: don't call mmu_shrinker w/o used_mmu_pages

2012-04-22 Thread Eric Northup
On Sun, Apr 22, 2012 at 2:16 AM, Avi Kivity a...@redhat.com wrote:
 On 04/21/2012 05:15 AM, Mike Waychison wrote:
[...]
 There is no mmu_list_lock.  Do you mean kvm_lock or kvm-mmu_lock?

 If the former, then we could easily fix this by dropping kvm_lock while
 the work is being done.  If the latter, then it's more difficult.

 (kvm_lock being contended implies that mmu_shrink is called concurrently?)

On a 32-core system experiencing memory pressure, mmu_shrink was often
being called concurrently (before we turned it off).

With just one, or a small number of VMs on a host, when the
mmu_shrinker contends on the kvm_lock, that's just a proxy for the
contention on kvm->mmu_lock.  It is the one that gets reported,
though, since it gets acquired first.

The contention on mmu_lock would indeed be difficult to remove.  Our
case was perhaps unusual, because of the use of memory containers.  So
some cgroups were under memory pressure (thus calling the shrinker)
but the various VCPU threads (whose guest page tables were being
evicted by the shrinker) could immediately turn around and
successfully re-allocate them.  That made the kvm->mmu_lock really
hot.


Re: Linux Crash Caused By KVM?

2012-04-11 Thread Eric Northup
On Wed, Apr 11, 2012 at 7:45 AM, Avi Kivity a...@redhat.com wrote:
 On 04/11/2012 05:11 AM, Peijie Yu wrote:
      For this problem, I found that the panic is caused by
 BUG_ON(in_nmi()), which means an NMI happened during another NMI context;
 but I checked the Intel Technical Manual and found "While an NMI
 interrupt handler is executing, the processor disables additional
 calls to the NMI handler until the next IRET instruction is executed."
 So, how did this happen?


 The NMI path for kvm is different; the processor exits from the guest
 with NMIs blocked, then executes kvm code until it issues int $2 in
 vmx_complete_interrupts(). If an IRET is executed in this path, then
 NMIs will be unblocked and nested NMIs may occur.

 One way this can happen is if we access the vmap area and incur a fault,
 between the VMEXIT and invoking the NMI handler. Or perhaps the NMI
 handler itself generates a fault. Or we have a debug exception in that path.

 Is this reproducible?

As an FYI, there have been BIOSes whose SMI handlers ran IRETs.  So
the NMI blocking can go away surprisingly.

See 29.8 "NMI Handling While in SMM" in the Intel SDM vol 3.
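
For reference, a simplified sketch of the re-injection step Avi describes
above (roughly what the VMX exit path did at the time; not an exact copy of
the kernel source): after a VM exit caused by an NMI, KVM invokes the host
NMI handler by hand, and any IRET executed before this point re-enables NMIs
and makes nesting possible.

static void kvm_reinject_host_nmi(void)
{
	/* call the host NMI handler through its IDT gate */
	asm volatile("int $2");
}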


Re: [PATCH v2] KVM: Introduce direct MSI message injection for in-kernel irqchips

2012-03-28 Thread Eric Northup
On Wed, Mar 28, 2012 at 10:47 AM, Jan Kiszka jan.kis...@siemens.com wrote:
[...]
 +4.61 KVM_SET_MSI
 +
 +Capability: KVM_CAP_SET_MSI
 +Architectures: x86
 +Type: vm ioctl
 +Parameters: struct kvm_msi (in)
 +Returns: 0 on success, -1 on error

Is this the actual behavior?  It looked to me like the successful
return value ended up getting set by __apic_accept_irq(), which claims
to "Return 1 if successfully added and 0 if discarded."


Re: [PATCH 0/2 v3] kvm: notify host when guest panicked

2012-03-14 Thread Eric Northup
On Wed, Mar 14, 2012 at 6:25 AM, Gleb Natapov g...@redhat.com wrote:
 On Wed, Mar 14, 2012 at 03:16:05PM +0200, Avi Kivity wrote:
 On 03/14/2012 03:14 PM, Gleb Natapov wrote:
  On Wed, Mar 14, 2012 at 03:07:46PM +0200, Avi Kivity wrote:
   On 03/14/2012 01:11 PM, Wen Congyang wrote:

  I don't think we want to use the driver.  Instead, have a small piece of
  code that resets the device and pushes out a string (the panic message?)
  without any interrupts etc.

 It's still going to be less reliable than a hypercall, I agree.
   
Do you still want to use the complicated and less reliable way?
  
   Are you willing to try it out and see how complicated it really is?
  
   While it's more complicated, it's also more flexible.  You can
   communicate the panic message, whether the guest is attempting a kdump
   and its own recovery or whether it wants the host to do it, etc., you
   can communicate less severe failures like oopses.
  
  hypercall can take arguments to achieve the same.

 It has to be designed in advance; and every time we notice something's
 missing we have to update the host kernel.


 We are in the design stage now. Not too late to design something flexible
 :) The panic hypercall can take the GPA of a buffer where host puts panic info
 as a parameter.  This buffer can be read by QEMU and passed to management.

If a host kernel change is in the works, I think it might be cleanest
to have the host kernel export a new kind of VCPU exit for
unhandled-by-KVM hypercalls.  Then usermode can respond to the
hypercall as appropriate.  This would permit adding or changing future
hypercalls without host kernel changes.

Guest panic is almost the definition of not-a-fast-path, and so
what's the reason to handle it in the host kernel?

Punting the functionality to user-space isn't a magic bullet for
getting a good interface designed, but in my opinion it is a better
place to be doing this.
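
A rough sketch of what the userspace side of that could look like, assuming
unhandled hypercalls were reflected to the VMM as a hypercall exit; the exit
plumbing and the HC_GUEST_PANIC number are assumptions, not an interface that
existed at the time of this thread:

#include <linux/kvm.h>
#include <stdio.h>

#define HC_GUEST_PANIC	0x1000	/* hypothetical hypercall number */

static int handle_exit(struct kvm_run *run)
{
	switch (run->exit_reason) {
	case KVM_EXIT_HYPERCALL:
		if (run->hypercall.nr == HC_GUEST_PANIC) {
			/* args[0] would carry the GPA of the guest's panic buffer */
			fprintf(stderr, "guest panicked, info at gpa 0x%llx\n",
				(unsigned long long)run->hypercall.args[0]);
			return 0;	/* stop this vcpu and let management decide */
		}
		return 1;	/* unknown hypercall: keep running */
	default:
		return 1;
	}
}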


Re: [RFC] Next gen kvm api

2012-02-03 Thread Eric Northup
On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity a...@redhat.com wrote:
[...]

 Moving to syscalls avoids these problems, but introduces new ones:

 - adding new syscalls is generally frowned upon, and kvm will need several
 - syscalls into modules are harder and rarer than into core kernel code
 - will need to add a vcpu pointer to task_struct, and a kvm pointer to
 mm_struct
- We lose a good place to put access control (permissions on /dev/kvm)
over which user-mode processes can use KVM.

How would the ability to use sys_kvm_* be regulated?


[RFC] KVM MMU: improve large munmap efficiency

2012-01-26 Thread Eric Northup
Flush the shadow MMU instead of iterating over each host VA when doing
a large invalidate range callback.

The previous code is O(N) in the number of virtual pages being
invalidated, while holding both the MMU spinlock and the mmap_sem.
Large unmaps can cause significant delay, during which the process is
unkillable.  Worse, all page allocation could be delayed if there's
enough memory pressure that mmu_shrink gets called.

Signed-off-by: Eric Northup digitale...@google.com

---

We have seen delays of over 30 seconds doing a large (128GB) unmap.

It'd be nicer to check if the amount of work to be done by the entire
flush is less than the work to be done iterating over each HVA page,
but that information isn't currently available to the arch-
independent part of KVM.

Better ideas would be most welcome ;-)


Tested by attaching a debugger to a running qemu w/kvm and running
call munmap(0, 1UL << 46).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7287bf5..9fe303a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -61,6 +61,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>

+#define MMU_NOTIFIER_FLUSH_THRESHOLD_PAGES (1024u*1024u*1024u)
+
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");

@@ -332,8 +334,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 * count is also read inside the mmu_lock critical section.
 */
kvm->mmu_notifier_count++;
-   for (; start < end; start += PAGE_SIZE)
-   need_tlb_flush |= kvm_unmap_hva(kvm, start);
+   if (end - start < MMU_NOTIFIER_FLUSH_THRESHOLD_PAGES)
+   for (; start < end; start += PAGE_SIZE)
+   need_tlb_flush |= kvm_unmap_hva(kvm, start);
+   else
+   kvm_arch_flush_shadow(kvm);
+
need_tlb_flush |= kvm->tlbs_dirty;
spin_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, idx);
--


Re: [PATCH 06/13] x86/ticketlock: add slowpath logic

2011-09-02 Thread Eric Northup
On Thu, Sep 1, 2011 at 5:54 PM, Jeremy Fitzhardinge jer...@goop.org wrote:
 From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com

 Maintain a flag in both LSBs of the ticket lock which indicates whether
 anyone is in the lock slowpath and may need kicking when the current
 holder unlocks.  The flags are set when the first locker enters
 the slowpath, and cleared when unlocking to an empty queue.

Are there actually two flags maintained?  I only see the one in the
ticket tail getting set/cleared/tested.
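
For context, a layout sketch of the scheme being asked about, assumed from
the patch description rather than copied from it: with slowpath support,
tickets advance by 2 so bit 0 of the counters is free to mark "someone is
waiting in the slowpath"; the question is whether the head copy of that flag
is ever actually used.

#include <stdint.h>

#define TICKET_SLOWPATH_FLAG	1u
#define TICKET_LOCK_INC		2u	/* head/tail step by 2 instead of 1 */

union ticketlock {
	uint32_t head_tail;
	struct {
		uint16_t head;	/* ticket currently being served */
		uint16_t tail;	/* next ticket to hand out; bit 0 = slowpath flag */
	} tickets;
};

static inline int in_slowpath(union ticketlock lock)
{
	return lock.tickets.tail & TICKET_SLOWPATH_FLAG;
}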


Re: [PATCH] kvm tools: adds a PCI device that exports a host shared segment as a PCI BAR in the guest

2011-08-25 Thread Eric Northup
Just FYI, one issue that I found with exposing host memory regions as
a PCI BAR (including via a very old version of the ivshmem driver...
haven't tried a newer one) is that x86's pci_mmap_page_range doesn't
want to set up a write-back cacheable mapping of a BAR.

It may not matter for your requirements, but the uncached access
reduced guest-host bandwidth via the shared memory driver by a lot.


If you need the physical address to be fixed, you might be better off
by reserving a memory region in the e820 map rather than a PCI BAR,
since BARs can move around.
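
A guest-side sketch of that e820 alternative, assuming the region was carved
out of the guest's RAM map (for example by booting the guest with
memmap=16M$0x80000000) and using the current memremap() interface; the
address, size, and module are made up for illustration:

#include <linux/errno.h>
#include <linux/io.h>
#include <linux/module.h>

#define SHMEM_GPA	0x80000000UL	/* hypothetical fixed guest-physical address */
#define SHMEM_SIZE	(16UL << 20)

static void *shmem;

static int __init shmem_init(void)
{
	/* MEMREMAP_WB asks for a write-back cacheable mapping, unlike the
	 * uncached mapping pci_mmap_page_range() sets up for a PCI BAR */
	shmem = memremap(SHMEM_GPA, SHMEM_SIZE, MEMREMAP_WB);
	return shmem ? 0 : -ENOMEM;
}

static void __exit shmem_exit(void)
{
	memunmap(shmem);
}

module_init(shmem_init);
module_exit(shmem_exit);
MODULE_LICENSE("GPL");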


On Thu, Aug 25, 2011 at 8:08 AM, David Evensky
even...@dancer.ca.sandia.gov wrote:

 Adding in the rest of what ivshmem does shouldn't affect our use, *I
 think*.  I hadn't intended this to do everything that ivshmem does,
 but I can see how that would be useful. It would be cool if it could
 grow into that.

 Our requirements for the driver in kvm tool are that another program
 on the host can create a shared segment (anonymous, non-file backed)
 with a specified handle, size, and contents. That this segment is
 available to the guest at boot time at a specified address and that no
 driver will change the contents of the memory except under direct user
 action. Also, when the guest goes away the shared memory segment
 shouldn't be affected (e.g. contents changed). Finally, we cannot
 change the lightweight nature of kvm tool.

 This is the feature of ivshmem that I need to check today. I did some
 testing a month ago, but it wasn't detailed enough to check this out.

 \dae




 On Thu, Aug 25, 2011 at 02:25:48PM +0300, Sasha Levin wrote:
  On Thu, 2011-08-25 at 11:59 +0100, Stefan Hajnoczi wrote:
   On Thu, Aug 25, 2011 at 11:37 AM, Pekka Enberg penb...@kernel.org wrote:
Hi Stefan,
   
On Thu, Aug 25, 2011 at 1:31 PM, Stefan Hajnoczi stefa...@gmail.com 
wrote:
 It's obviously not competing. One thing you might want to consider is
 making the guest interface compatible with ivshmem. Is there any reason
 we shouldn't do that? I don't consider that a requirement, just nice to
 have.
   
The point of implementing the same interface as ivshmem is that users
don't need to rejig guests or applications in order to switch between
hypervisors.  A different interface also prevents same-to-same
benchmarks.
   
There is little benefit to creating another virtual device interface
when a perfectly good one already exists.  The question should be: how
is this shmem device different and better than ivshmem?  If there is
no justification then implement the ivshmem interface.
   
So which interface are we actually talking about? Userspace/kernel in the
guest or hypervisor/guest kernel?
  
   The hardware interface.  Same PCI BAR layout and semantics.
  
Either way, while it would be nice to share the interface, it's not a
*requirement* for tools/kvm unless ivshmem is specified in the virtio
spec or the driver is in mainline Linux. We don't intend to require 
people
to implement non-standard and non-Linux QEMU interfaces. OTOH,
ivshmem would make the PCI ID problem go away.
  
   Introducing yet another non-standard and non-Linux interface doesn't
   help though.  If there is no significant improvement over ivshmem then
   it makes sense to let ivshmem gain critical mass and more users
   instead of fragmenting the space.
 
  I support doing it ivshmem-compatible, though it doesn't have to be a
  requirement right now (that is, use this patch as a base and build it
  towards ivshmem - which shouldn't be an issue since this patch provides
  the PCI+SHM parts which are required by ivshmem anyway).
 
  ivshmem is a good, documented, stable interface backed by a lot of
  research and testing. Looking at the spec it's obvious that
  Cam had KVM in mind when designing it, and that's exactly what we want to
  have in the KVM tool.
 
  David, did you have any plans to extend it to become ivshmem-compatible?
  If not, would turning it into such break any code that depends on it
  horribly?
 
  --
 
  Sasha.
 


Re: [RFC v5 86/86] 440fx: fix PAM, PCI holes

2011-07-25 Thread Eric Northup
On Wed, Jul 20, 2011 at 9:50 AM, Avi Kivity a...@redhat.com wrote:
[...]
 @@ -130,7 +137,13 @@ static void pc_init1(MemoryRegion *system_memory,

     if (pci_enabled) {
         pci_bus = i440fx_init(i440fx_state, piix3_devfn, isa_irq,
 -                              system_memory, system_io, ram_size);
 +                              system_memory, system_io, ram_size,
 +                              0xe000, 0x1fe0,
 +                              0x1 + above_4g_mem_size,
 +                              (sizeof(target_phys_addr_t) == 32

sizeof(target_phys_addr_t) == 8 ?

 +                               ? 0
 +                               : ((uint64_t)1 << 63)),
 +                              pci_memory, ram_memory);
     } else {
         pci_bus = NULL;
         i440fx_state = NULL;


Re: [PATCH] kvm: log directly from the guest to the host kvm buffer

2011-05-12 Thread Eric Northup
On Thu, May 12, 2011 at 8:42 AM, Avi Kivity a...@redhat.com wrote:

 On 05/12/2011 06:39 PM, Dhaval Giani wrote:

 
   I think that one hypercall per trace is too expensive.  Tracing is meant to
   be lightweight!  I think the guest can log to a buffer, which is flushed on
   overflow or when a vmexit occurs.  That gives us automatic serialization
   between a vcpu and the cpu it runs on, but not between a vcpu and a
   different host cpu.
 

  hmm. So, basically, log all of these events, and then send them to the
  host either on an exit, or when your buffer fills up. There is one
  problem with that approach though. One of the reasons I wanted this
  approach was because I wanted to correlate the guest and the host
  times (which is why I kept it synchronous). I lose that
  information with what you say. However I see your point about the
  overhead. I will think about this a bit more.

 You might use kvmclock to get a zero-exit (but not zero-cost) time which can 
 be correlated.

 Another option is to use xadd on a shared memory area to have a global 
 counter incremented.  However that can be slow on large machines, and is hard 
 to do securely with multiple guests.

If the guest puts guest TSC into the buffer with each event, KVM can
convert guest->host time when it drains the buffers on the next
vmexit.  That's enough information to do an offline correlation of
guest and host events.
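
A small sketch of that offline conversion, assuming the per-vcpu TSC offset
is recorded alongside the drained buffer (which the thread does not specify)
and that TSC scaling is not in use, so VMX computes
guest_tsc = host_tsc + tsc_offset:

#include <stdint.h>

static inline uint64_t guest_tsc_to_host_tsc(uint64_t guest_tsc, int64_t tsc_offset)
{
	/* invert guest_tsc = host_tsc + tsc_offset */
	return guest_tsc - (uint64_t)tsc_offset;
}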


Re: [PATCH RFC] KVM MMU: fix hashing for TDP and non-paging modes

2010-04-26 Thread Eric Northup
On Mon, Apr 26, 2010 at 2:46 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 Doh, and your patch does not. But it does not apply to kvm.git -next
 branch, can you regenerate please?

--

For TDP mode, avoid creating multiple page table roots for the single
guest-to-host physical address map by fixing the inputs used for the
shadow page table hash in mmu_alloc_roots().

Signed-off-by: Eric Northup digitale...@google.com
---

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ddfa865..9696d65 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2059,10 +2059,12 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
hpa_t root = vcpu->arch.mmu.root_hpa;

ASSERT(!VALID_PAGE(root));
-   if (tdp_enabled)
-   direct = 1;
if (mmu_check_root(vcpu, root_gfn))
return 1;
+   if (tdp_enabled) {
+   direct = 1;
+   root_gfn = 0;
+   }
sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
  PT64_ROOT_LEVEL, direct,
  ACC_ALL, NULL);
@@ -2072,8 +2074,6 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
return 0;
}
direct = !is_paging(vcpu);
-   if (tdp_enabled)
-   direct = 1;
for (i = 0; i < 4; ++i) {
hpa_t root = vcpu->arch.mmu.pae_root[i];

@@ -2089,6 +2089,10 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
root_gfn = 0;
if (mmu_check_root(vcpu, root_gfn))
return 1;
+   if (tdp_enabled) {
+   direct = 1;
+   root_gfn = i << 30;
+   }
sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
  PT32_ROOT_LEVEL, direct,
  ACC_ALL, NULL);
--


[PATCH RFC] KVM MMU: fix hashing for TDP and non-paging modes

2010-04-22 Thread Eric Northup
I've been reading the x86's mmu.c recently and had been wondering
about something.  Avi's recent mmu documentation (thanks!) seems to
have confirmed my understanding of how the shadow paging is supposed
to be working.  In TDP mode, when mmu_alloc_roots() calls
kvm_mmu_get_page(), why does it pass (vcpu->arch.cr3 >> PAGE_SHIFT) or
(vcpu->arch.mmu.pae_root[i]) as gfn?

It seems to me that in TDP mode, gfn should be either zero for the
root page table, or 0/1GB/2GB/3GB (for PAE page tables).

The existing behavior can lead to multiple, semantically-identical TDP
roots being created by mmu_alloc_roots, depending on the VCPU's CR3 at
the time that mmu_alloc_roots was called.  But the nested page tables
should be* independent of the VCPU state. That wastes some memory and
causes extra page faults while populating the extra copies of the page
tables.

*assuming that we aren't modeling per-VCPU state that might change the
physical address map as seen by that VCPU, such as setting the APIC
base to an address overlapping RAM.

All feedback would be welcome, since I'm new to this system!  A
strawman patch follows.

thanks,
-Eric

--

For TDP mode, avoid creating multiple page table roots for the single
guest-to-host physical address map by fixing the inputs used for the
shadow page table hash in mmu_alloc_roots().

Signed-off-by: Eric Northup digitale...@google.com
---
 arch/x86/kvm/mmu.c |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ddfa865..9696d65 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2059,10 +2059,12 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
hpa_t root = vcpu->arch.mmu.root_hpa;

ASSERT(!VALID_PAGE(root));
-   if (tdp_enabled)
-   direct = 1;
if (mmu_check_root(vcpu, root_gfn))
return 1;
+   if (tdp_enabled) {
+   direct = 1;
+   root_gfn = 0;
+   }
sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
  PT64_ROOT_LEVEL, direct,
  ACC_ALL, NULL);
@@ -2072,8 +2074,6 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
return 0;
}
direct = !is_paging(vcpu);
-   if (tdp_enabled)
-   direct = 1;
for (i = 0; i < 4; ++i) {
hpa_t root = vcpu->arch.mmu.pae_root[i];

@@ -2089,6 +2089,10 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
root_gfn = 0;
if (mmu_check_root(vcpu, root_gfn))
return 1;
+   if (tdp_enabled) {
+   direct = 1;
+   root_gfn = i << 30;
+   }
sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
  PT32_ROOT_LEVEL, direct,
  ACC_ALL, NULL);
--