Re: [PATCH v4] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-15 Thread Arthur Chunqi Li
On Sat, Sep 14, 2013 at 3:44 PM, Jan Kiszka jan.kis...@web.de wrote:
 On 2013-09-13 19:15, Paolo Bonzini wrote:
 On 06/09/2013 04:04, Arthur Chunqi Li wrote:
 +preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
 +if (preempt_val_l2 <= preempt_val_l1)
 +preempt_val_l2 = 0;
 +else
 +preempt_val_l2 -= preempt_val_l1;
 +vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2);

 Did you test that a value of 0 triggers an immediate exit, rather than
 counting down by 2^32?  Perhaps it's safer to limit the value to 1
 instead of 0.

 In my experience, 0 triggers an immediate exit when the preemption timer
 is enabled.
Yes, L2 VM will exit immediately when the value is 0 with my patch.
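For readers following the arithmetic being discussed, here is a tiny self-contained illustration of the scaling the patch performs; all numbers are invented, and on real hardware preempt_scale comes from bits 4:0 of MSR_IA32_VMX_MISC:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            /* Illustrative values only -- not read from real MSRs or the VMCS. */
            uint64_t delta_tsc_l1 = 64000;   /* TSC cycles L0 spent handling the exit */
            uint32_t preempt_scale = 5;      /* timer ticks once per 2^5 TSC cycles */
            uint32_t preempt_val_l2 = 10000; /* L2's remaining preemption timer ticks */

            uint32_t preempt_val_l1 = (uint32_t)(delta_tsc_l1 >> preempt_scale); /* 2000 ticks elapsed */

            if (preempt_val_l2 <= preempt_val_l1)
                    preempt_val_l2 = 0;      /* 0 means an immediate exit, as confirmed above */
            else
                    preempt_val_l2 -= preempt_val_l1;

            printf("timer value written back: %u\n", preempt_val_l2); /* prints 8000 */
            return 0;
    }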

Arthur

 Jan




[PATCH] KVM: VMX: set blocked by NMI flag if EPT violation happens during IRET from NMI

2013-09-15 Thread Gleb Natapov
Set the blocked-by-NMI flag if an EPT violation happens during IRET from NMI,
otherwise NMIs can be delivered recursively, causing stack corruption.

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1f1da43..fa1984c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5339,6 +5339,15 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
return 0;
}
 
+   /*
+* EPT violation happened while executing iret from NMI,
+* blocked by NMI bit has to be set before next VM entry.
+* There are errata that may cause this bit to not be set:
+* AAK134, BY25. 
+*/
+   if (exit_qualification & INTR_INFO_UNBLOCK_NMI)
+   vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, GUEST_INTR_STATE_NMI);
+
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
trace_kvm_page_fault(gpa, exit_qualification);
 
--
Gleb.


[PATCH kvm-unit-tests] Test fault during IRET from NMI.

2013-09-15 Thread Gleb Natapov
This test checks that NMI window opens only after IRET from NMI is
executed without a fault.

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/lib/x86/processor.h b/lib/x86/processor.h
index e46d8d0..de1dc47 100644
--- a/lib/x86/processor.h
+++ b/lib/x86/processor.h
@@ -62,6 +62,13 @@ static inline u16 read_gs(void)
 return val;
 }
 
+static inline unsigned long read_rflags(void)
+{
+   unsigned long f;
+   asm ("pushf; pop %0\n\t" : "=rm"(f));
+   return f;
+}
+
 static inline void write_ds(unsigned val)
 {
 asm ("mov %0, %%ds" : : "rm"(val) : "memory");
diff --git a/x86/eventinj.c b/x86/eventinj.c
index 7bed7c5..c171e30 100644
--- a/x86/eventinj.c
+++ b/x86/eventinj.c
@@ -125,6 +125,42 @@ static void nmi_isr(struct ex_regs *r)
 printf("After nested NMI to itself\n");
 }
 
+unsigned long after_iret_addr;
+
+static void nested_nmi_iret_isr(struct ex_regs *r)
+{
+   printf("Nested NMI isr running rip=%x\n", r->rip);
+
+   if (r->rip == after_iret_addr)
+   test_count++;
+}
+static void nmi_iret_isr(struct ex_regs *r)
+{
+   unsigned long *s = alloc_page();
+   test_count++;
+   printf("NMI isr running %p stack %p\n", &&after_iret, s);
+   handle_exception(2, nested_nmi_iret_isr);
+   printf("Try send nested NMI to itself\n");
+   apic_self_nmi();
+   printf("After nested NMI to itself\n");
+   s[4] = read_ss();
+   s[3] = 0; /* rsp */
+   s[2] = read_rflags();
+   s[1] = read_cs();
+   s[0] = after_iret_addr = (unsigned long)&&after_iret;
+   asm ("mov %%rsp, %0\n\t"
+        "mov %1, %%rsp\n\t"
+        "outl %2, $0xe4\n\t" /* flush stack page */
+#ifdef __x86_64__
+        "iretq\n\t"
+#else
+        "iretl\n\t"
+#endif
+        : "=m"(s[3]) : "rm"(s[0]), "a"((unsigned int)virt_to_phys(s)) : "memory");
+after_iret:
+   printf("After iret\n");
+}
+
 static void tirq0(isr_regs_t *r)
 {
 printf("irq0 running\n");
@@ -300,6 +336,20 @@ int main()
irq_disable();
 report("NMI", test_count == 2);
 
+   /* generate NMI that will fault on IRET */
+   printf("Before NMI IRET test\n");
+   test_count = 0;
+   handle_exception(2, nmi_iret_isr);
+   printf("Try send NMI to itself\n");
+   apic_self_nmi();
+   /* this is needed on VMX without NMI window notificatoin.
+  Interrupt windows is used instead, so let pending NMI
+  to be injected */
+   irq_enable();
+   asm volatile ("nop");
+   irq_disable();
+   printf("After NMI to itself\n");
+   report("NMI", test_count == 2);
 #ifndef __x86_64__
stack_phys = (ulong)virt_to_phys(alloc_page());
stack_va = alloc_vpage();
--
Gleb.


Re: [PATCH] [RFC] x86: kvm: remove KVM_SOFT_MAX_VCPUS

2013-09-15 Thread Gleb Natapov
On Sat, Sep 14, 2013 at 02:16:51PM +0200, Andrew Jones wrote:
 This patch removes KVM_SOFT_MAX_VCPUS and uses num_online_cpus() for
 KVM_CAP_NR_VCPUS instead, as ARM does. While the API doc simply says
 KVM_CAP_NR_VCPUS should return the recommended maximum number of vcpus,
 it has been returning KVM_SOFT_MAX_VCPUS, which was defined as the
 maximum tested number of vcpus. As that concept could be
 distro-specific, this patch uses the other recommended maximum, the
 number of physical cpus, as we never recommend configuring a guest that
 has more vcpus than the host has pcpus. Of course a guest can always
 still be configured with up to KVM_CAP_MAX_VCPUS though anyway.
 
 I've put RFC on this patch because I'm not sure if there are any gotchas
 lurking with this change. The change now means hosts no longer return
 the same number for KVM_CAP_NR_VCPUS, and that number is likely going to
 generally be quite a bit less than what KVM_SOFT_MAX_VCPUS was (160). I
 can't think of anything other than generating more warnings[1] from qemu
 with guests that configure more vcpus than pcpus though.
 
Another gotcha is that on a host with more than 160 cpus the recommended
value will grow, which is not a good idea without appropriate testing.

 [1] Actually, until 972fc544b6034a in uq/master is merged there won't be
 any warnings either.
 
 Signed-off-by: Andrew Jones drjo...@redhat.com
 ---
  arch/x86/include/asm/kvm_host.h | 1 -
  arch/x86/kvm/x86.c  | 2 +-
  2 files changed, 1 insertion(+), 2 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index c76ff74a98f2e..9236c63315a9b 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -32,7 +32,6 @@
  #include asm/asm.h
  
  #define KVM_MAX_VCPUS 255
 -#define KVM_SOFT_MAX_VCPUS 160
  #define KVM_USER_MEM_SLOTS 125
  /* memory slots that are not exposed to userspace */
  #define KVM_PRIVATE_MEM_SLOTS 3
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index e5ca72a5cdb6d..d9d3e2ed68ee9 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2604,7 +2604,7 @@ int kvm_dev_ioctl_check_extension(long ext)
  r = !kvm_x86_ops->cpu_has_accelerated_tpr();
   break;
   case KVM_CAP_NR_VCPUS:
 - r = KVM_SOFT_MAX_VCPUS;
 + r = min(num_online_cpus(), KVM_MAX_VCPUS);
s/KVM_MAX_VCPUS/KVM_SOFT_MAX_VCPUS/.  Also what about hotplug cpus?
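For clarity, Gleb's substitution gives roughly the line below; the comment about num_possible_cpus() is only an illustration of one hotplug-aware possibility, not something proposed in this thread:

        r = min(num_online_cpus(), KVM_SOFT_MAX_VCPUS);
        /* a hotplug-aware variant might instead use num_possible_cpus() */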

   break;
   case KVM_CAP_MAX_VCPUS:
   r = KVM_MAX_VCPUS;
 -- 
 1.8.1.4

--
Gleb.


Re: [PATCH] x86: kvm: introduce CONFIG_KVM_MAX_VCPUS

2013-09-15 Thread Gleb Natapov
On Sat, Sep 14, 2013 at 02:18:49PM +0200, Andrew Jones wrote:
 Take CONFIG_KVM_MAX_VCPUS from arm32, but set the default to 255.
 
 Signed-off-by: Andrew Jones drjo...@redhat.com
 ---
  arch/x86/include/asm/kvm_host.h |  5 +++--
  arch/x86/kvm/Kconfig| 10 ++
  2 files changed, 13 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index c76ff74a98f2e..e7e9b523a8f7e 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -31,8 +31,9 @@
  #include asm/msr-index.h
  #include asm/asm.h
  
 -#define KVM_MAX_VCPUS 255
 -#define KVM_SOFT_MAX_VCPUS 160
 +#define KVM_MAX_VCPUS CONFIG_KVM_MAX_VCPUS
 +#define KVM_SOFT_MAX_VCPUS min(160, KVM_MAX_VCPUS)
 +
  #define KVM_USER_MEM_SLOTS 125
  /* memory slots that are not exposed to userspace */
  #define KVM_PRIVATE_MEM_SLOTS 3
 diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
 index a47a3e54b964b..e9532c33527ee 100644
 --- a/arch/x86/kvm/Kconfig
 +++ b/arch/x86/kvm/Kconfig
 @@ -52,6 +52,16 @@ config KVM
  
 If unsure, say N.
  
 +config KVM_MAX_VCPUS
 + int "Number maximum supported virtual CPUs per VM"
 + depends on KVM
 + default 255
 + help
 +   Static number of max supported virtual CPUs per VM.
 +
 +   Set to a lower number to save some resources. Set to a higher
 +   number to test scalability.
 +
The maximum this can save is around 2K per VM (roughly the 255-entry vcpu
pointer array in struct kvm). That is pretty insignificant considering the
overall memory footprint even the smallest VM has.

  config KVM_INTEL
   tristate "KVM for Intel processors support"
   depends on KVM
 -- 
 1.8.1.4

--
Gleb.


Re: [PATCH 19/23] KVM: PPC: Book3S: Select PR vs HV separately for each guest

2013-09-15 Thread Aneesh Kumar K.V
Alexander Graf ag...@suse.de writes:

 On 14.09.2013 at 13:33, Aneesh Kumar K.V
 aneesh.ku...@linux.vnet.ibm.com wrote:

 Benjamin Herrenschmidt b...@kernel.crashing.org writes:
 
 On Fri, 2013-09-13 at 10:17 +1000, Paul Mackerras wrote:
 
 Aneesh and I are currently investigating an alternative approach,
 which is much more like the x86 way of doing things.  We are looking
 at splitting the code into three modules: a kvm_pr.ko module with the
 PR-specific bits, a kvm_hv.ko module with the HV-specific bits, and a
 core kvm.ko module with the generic bits (basically book3s.c,
 powerpc.c, stuff from virt/kvm/, plus the stuff that both PR and HV
 use).  Basically the core module would have a pointer to a struct
 full of function pointers for the various ops that book3s_pr.c and
 book3s_hv.c both provide.  You would only be able to have one of
 kvm_pr and kvm_hv loaded at any one time.  If they were built in, you
 could have them both built in but only one could register its function
 pointer struct with the core.  Obviously the kvm_hv module would only
 load and register its struct on a machine that had hypervisor mode
 available.  If they were both built in I would think we would give HV
 the first chance to register itself, and let PR register if we can't
 do HV.
 
 How does that sound?
 
 As long as we can force-load the PR one on a machine that normally runs
 HV for the sake of testing ...
 
 This is what I currently have
 
 [root@llmp24l02 kvm]# insmod ./kvm-hv.ko 
 [root@llmp24l02 kvm]# insmod ./kvm-pr.ko 
 insmod: ERROR: could not insert module ./kvm-pr.ko: File exists

 The reason this model makes sense for x86 is that you never have SVM and VMX 
 in the cpu at the same time. Either it is an AMD chip or an Intel chip.

 PR and HV however are not mutually exclusive in hardware. What you really 
 want is

 1) distro can force HV/PR
 2) admin can force HV/PR
 3) user can force HV/PR
 4) by default things just work

 1 can be done through kernel config options.
 2 can be done through modules that get loaded or not
 3 can be done through a vm ioctl
 4 only works if you allow hv and pr to be available at the same time

 I can assume who you talked to about this to make these design decisions, but 
 it definitely was not me.

I haven't had much discussion around the design with anybody yet. What
you saw above was me changing/moving code around madly to get
something working in a day. I was hoping to get something that I can post as an RFC
early and let others comment. Good to get the feedback early.

-aneesh
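
For readers skimming the thread: the module-split scheme Paul describes above (a core kvm.ko that dispatches through a struct of function pointers, with exactly one of kvm_pr.ko / kvm_hv.ko registering it) can be modelled by the standalone sketch below. All names are made up for illustration; this is not the actual KVM code, it only mirrors the behaviour of the insmod output quoted earlier.

    #include <stdio.h>

    struct kvmppc_ops {                      /* hypothetical per-backend ops table */
            const char *name;
            int (*vcpu_run)(void *vcpu);
    };

    static struct kvmppc_ops *kvm_ppc_ops;   /* the core's single registered backend */

    static int kvmppc_register_ops(struct kvmppc_ops *ops)
    {
            if (kvm_ppc_ops)
                    return -1;               /* mirrors "insmod: ... File exists":
                                                only one of PR/HV may register */
            kvm_ppc_ops = ops;
            return 0;
    }

    static int hv_run(void *vcpu) { (void)vcpu; puts("run guest in HV mode"); return 0; }
    static int pr_run(void *vcpu) { (void)vcpu; puts("run guest in PR mode"); return 0; }

    static struct kvmppc_ops kvm_hv_ops = { "kvm-hv", hv_run };
    static struct kvmppc_ops kvm_pr_ops = { "kvm-pr", pr_run };

    int main(void)
    {
            int cpu_has_hv_mode = 1;         /* pretend the machine offers hypervisor mode */

            /* HV gets the first chance to register itself, as Paul suggests. */
            if (cpu_has_hv_mode && kvmppc_register_ops(&kvm_hv_ops) == 0)
                    printf("%s registered\n", kvm_hv_ops.name);
            /* PR then fails to register, like the second insmod above. */
            if (kvmppc_register_ops(&kvm_pr_ops) != 0)
                    printf("%s: a backend is already registered\n", kvm_pr_ops.name);

            kvm_ppc_ops->vcpu_run(NULL);     /* the core dispatches via the ops table */
            return 0;
    }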



Re: [PATCH kvm-unit-tests] Test fault during IRET from NMI.

2013-09-15 Thread Paolo Bonzini
On 15/09/2013 10:17, Gleb Natapov wrote:
 This test checks that NMI window opens only after IRET from NMI is
 executed without a fault.
 
 Signed-off-by: Gleb Natapov g...@redhat.com

Reviewed-by: Paolo Bonzini pbonz...@redhat.com

... with a couple English nits (see inline).

 diff --git a/lib/x86/processor.h b/lib/x86/processor.h
 index e46d8d0..de1dc47 100644
 --- a/lib/x86/processor.h
 +++ b/lib/x86/processor.h
 @@ -62,6 +62,13 @@ static inline u16 read_gs(void)
  return val;
  }
  
 +static inline unsigned long read_rflags(void)
 +{
 + unsigned long f;
 + asm (pushf; pop %0\n\t : =rm(f));
 + return f;
 +}
 +
  static inline void write_ds(unsigned val)
  {
  asm (mov %0, %%ds : : rm(val) : memory);
 diff --git a/x86/eventinj.c b/x86/eventinj.c
 index 7bed7c5..c171e30 100644
 --- a/x86/eventinj.c
 +++ b/x86/eventinj.c
 @@ -125,6 +125,42 @@ static void nmi_isr(struct ex_regs *r)
   printf(After nested NMI to itself\n);
  }
  
 +unsigned long after_iret_addr;
 +
 +static void nested_nmi_iret_isr(struct ex_regs *r)
 +{
 + printf(Nested NMI isr running rip=%x\n, r-rip);
 +
 + if (r-rip == after_iret_addr)
 + test_count++;
 +}
 +static void nmi_iret_isr(struct ex_regs *r)
 +{
 + unsigned long *s = alloc_page();
 + test_count++;
 + printf(NMI isr running %p stack %p\n, after_iret, s);
 + handle_exception(2, nested_nmi_iret_isr);
 + printf(Try send nested NMI to itself\n);

s/Try send/Sending/

 + apic_self_nmi();
 + printf(After nested NMI to itself\n);
 + s[4] = read_ss();
 + s[3] = 0; /* rsp */
 + s[2] = read_rflags();
 + s[1] = read_cs();
 + s[0] = after_iret_addr = (unsigned long)after_iret;
 + asm (mov %%rsp, %0\n\t
 +  mov %1, %%rsp\n\t
 +  outl %2, $0xe4\n\t /* flush stack page */
 +#ifdef __x86_64__
 +  iretq\n\t
 +#else
 +  iretl\n\t
 +#endif
 +  : =m(s[3]) : rm(s[0]), a((unsigned int)virt_to_phys(s)) : 
 memory);
 +after_iret:
 + printf(After iret\n);
 +}
 +
  static void tirq0(isr_regs_t *r)
  {
   printf(irq0 running\n);
 @@ -300,6 +336,20 @@ int main()
   irq_disable();
   report(NMI, test_count == 2);
  
 + /* generate NMI that will fault on IRET */
 + printf(Before NMI IRET test\n);
 + test_count = 0;
 + handle_exception(2, nmi_iret_isr);
 + printf(Try send NMI to itself\n);

s/Try send/Sending/

 + apic_self_nmi();
 + /* this is needed on VMX without NMI window notificatoin.

s/notificatoin/notification/

 +Interrupt windows is used instead, so let pending NMI
 +to be injected */
 + irq_enable();
 + asm volatile (nop);
 + irq_disable();
 + printf(After NMI to itself\n);
 + report(NMI, test_count == 2);
  #ifndef __x86_64__
   stack_phys = (ulong)virt_to_phys(alloc_page());
   stack_va = alloc_vpage();
 --
   Gleb.
 



Re: [PATCH kvm-unit-tests] Test fault during IRET from NMI.

2013-09-15 Thread Gleb Natapov
On Sun, Sep 15, 2013 at 11:54:15AM +0200, Paolo Bonzini wrote:
 On 15/09/2013 10:17, Gleb Natapov wrote:
  This test checks that NMI window opens only after IRET from NMI is
  executed without a fault.
  
  Signed-off-by: Gleb Natapov g...@redhat.com
 
 Reviewed-by: Paolo Bonzini pbonz...@redhat.com
 
 ... with a couple English nits (see inline).
 
Yeah, those are copy/paste from other places. The file is full of it :)

  diff --git a/lib/x86/processor.h b/lib/x86/processor.h
  index e46d8d0..de1dc47 100644
  --- a/lib/x86/processor.h
  +++ b/lib/x86/processor.h
  @@ -62,6 +62,13 @@ static inline u16 read_gs(void)
   return val;
   }
   
  +static inline unsigned long read_rflags(void)
  +{
  +   unsigned long f;
  +   asm (pushf; pop %0\n\t : =rm(f));
  +   return f;
  +}
  +
   static inline void write_ds(unsigned val)
   {
   asm (mov %0, %%ds : : rm(val) : memory);
  diff --git a/x86/eventinj.c b/x86/eventinj.c
  index 7bed7c5..c171e30 100644
  --- a/x86/eventinj.c
  +++ b/x86/eventinj.c
  @@ -125,6 +125,42 @@ static void nmi_isr(struct ex_regs *r)
  printf(After nested NMI to itself\n);
   }
   
  +unsigned long after_iret_addr;
  +
  +static void nested_nmi_iret_isr(struct ex_regs *r)
  +{
  +   printf(Nested NMI isr running rip=%x\n, r-rip);
  +
  +   if (r-rip == after_iret_addr)
  +   test_count++;
  +}
  +static void nmi_iret_isr(struct ex_regs *r)
  +{
  +   unsigned long *s = alloc_page();
  +   test_count++;
  +   printf(NMI isr running %p stack %p\n, after_iret, s);
  +   handle_exception(2, nested_nmi_iret_isr);
  +   printf(Try send nested NMI to itself\n);
 
 s/Try send/Sending/
 
  +   apic_self_nmi();
  +   printf(After nested NMI to itself\n);
  +   s[4] = read_ss();
  +   s[3] = 0; /* rsp */
  +   s[2] = read_rflags();
  +   s[1] = read_cs();
  +   s[0] = after_iret_addr = (unsigned long)after_iret;
  +   asm (mov %%rsp, %0\n\t
  +mov %1, %%rsp\n\t
  +outl %2, $0xe4\n\t /* flush stack page */
  +#ifdef __x86_64__
  +iretq\n\t
  +#else
  +iretl\n\t
  +#endif
  +: =m(s[3]) : rm(s[0]), a((unsigned int)virt_to_phys(s)) : 
  memory);
  +after_iret:
  +   printf(After iret\n);
  +}
  +
   static void tirq0(isr_regs_t *r)
   {
  printf(irq0 running\n);
  @@ -300,6 +336,20 @@ int main()
  irq_disable();
  report(NMI, test_count == 2);
   
  +   /* generate NMI that will fault on IRET */
  +   printf(Before NMI IRET test\n);
  +   test_count = 0;
  +   handle_exception(2, nmi_iret_isr);
  +   printf(Try send NMI to itself\n);
 
 s/Try send/Sending/
 
  +   apic_self_nmi();
  +   /* this is needed on VMX without NMI window notificatoin.
 
 s/notificatoin/notification/
 
  +  Interrupt windows is used instead, so let pending NMI
  +  to be injected */
  +   irq_enable();
  +   asm volatile (nop);
  +   irq_disable();
  +   printf(After NMI to itself\n);
  +   report(NMI, test_count == 2);
   #ifndef __x86_64__
  stack_phys = (ulong)virt_to_phys(alloc_page());
  stack_va = alloc_vpage();
  --
  Gleb.
  

--
Gleb.


Re: [PATCH 2/3] arm32: kvm: rename CONFIG_KVM_ARM_MAX_VCPUS

2013-09-15 Thread Gleb Natapov
On Sat, Sep 14, 2013 at 02:10:55PM +0200, Andrew Jones wrote:
 Drop the _ARM_ part of the name. We can then introduce a config option
 like this to aarch64 and other arches using the same name - allowing
 grep to show them all. Also update the help text to describe the option
 more completely.
 
 Signed-off-by: Andrew Jones drjo...@redhat.com
 ---
  arch/arm/include/asm/kvm_host.h | 4 ++--
  arch/arm/kvm/Kconfig| 8 
  2 files changed, 6 insertions(+), 6 deletions(-)
 
 diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
 index 7d22517d80711..c614d3eb176c6 100644
 --- a/arch/arm/include/asm/kvm_host.h
 +++ b/arch/arm/include/asm/kvm_host.h
 @@ -25,8 +25,8 @@
  #include asm/fpstate.h
  #include kvm/arm_arch_timer.h
  
 -#if defined(CONFIG_KVM_ARM_MAX_VCPUS)
 -#define KVM_MAX_VCPUS CONFIG_KVM_ARM_MAX_VCPUS
 +#if defined(CONFIG_KVM_MAX_VCPUS)
 +#define KVM_MAX_VCPUS CONFIG_KVM_MAX_VCPUS
  #else
  #define KVM_MAX_VCPUS 0
  #endif
 diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
 index ebf5015508b52..de63bfccb3eb5 100644
 --- a/arch/arm/kvm/Kconfig
 +++ b/arch/arm/kvm/Kconfig
 @@ -40,16 +40,16 @@ config KVM_ARM_HOST
   ---help---
 Provides host support for ARM processors.
  
 -config KVM_ARM_MAX_VCPUS
 +config KVM_MAX_VCPUS
  int "Number maximum supported virtual CPUs per VM"
   depends on KVM_ARM_HOST
   default 4
   help
 Static number of max supported virtual CPUs per VM.
  
 -   If you choose a high number, the vcpu structures will be quite
 -   large, so only choose a reasonable number that you expect to
 -   actually use.
I do not see why the vcpu structure size on ARM depends on KVM_ARM_MAX_VCPUS.
Can somebody point me to it?

 +   The default is set to the highest number of vcpus that
 +   current hardware supports. Set to a lower number to save
 +   some resources. Set to a higher number to test scalability.
  
  config KVM_ARM_VGIC
  bool "KVM support for Virtual GIC"
 -- 
 1.8.1.4

--
Gleb.


Re: [PATCH v2 00/15] KVM: MMU: locklessly write-protect

2013-09-15 Thread Gleb Natapov
Marcelo do you feel your comments are addressed in patches 3 and 5 of
this series?

On Thu, Sep 05, 2013 at 06:29:03PM +0800, Xiao Guangrong wrote:
 Changelog v2:
 
 - the changes from Gleb's review:
   1) fix calculating the number of spte in the pte_list_add()
   2) set iter->desc to NULL if we meet a nulls desc, to clean up the code of
  rmap_get_next()
   3) fix hlist corruption due to accessing sp->hlist outside of mmu-lock
   4) use rcu functions to access the rcu protected pointer
   5) spte will be missed in lockless walker if the spte is moved in a desc
  (remove a spte from the rmap using only one desc). Fix it by bottom-up
  walking the desc
 
 - the changes from Paolo's review
   1) make the order and memory barriers between update spte / add spte into
  rmap and dirty-log more clear
   
 - the changes from Marcelo's review:
   1) let fast page fault only fix the sptes on the last level (level = 1)
   2) improve some changelogs and comments
 
 - the changes from Takuya's review:
   move the patch "flush tlb if the spte can be locklessly modified" forward
   to make it more easily merged
 
 Thank all of you very much for your time and patience on this patchset!
   
 Since we use rcu_assign_pointer() to update the pointers in desc even if dirty
 log is disabled, I have measured the performance:
 Host: Intel(R) Xeon(R) CPU   X5690  @ 3.47GHz * 12 + 36G memory
 
 - migrate-perf (benchmark the time of get-dirty-log)
   before: Run 10 times, Avg time:9009483 ns.
   after: Run 10 times, Avg time:4807343 ns.
 
 - kernbench
   Guest: 12 VCPUs + 8G memory
   before:
 EPT is enabled:
 # cat 09-05-origin-ept | grep real   
 real 85.58
 real 83.47
 real 82.95
 
 EPT is disabled:
 # cat 09-05-origin-shadow | grep real
 real 138.77
 real 138.99
 real 139.55
 
   after:
 EPT is enabled:
 # cat 09-05-lockless-ept | grep real
 real 83.40
 real 82.81
 real 83.39
 
 EPT is disabled:
 # cat 09-05-lockless-shadow | grep real
 real 138.91
 real 139.71
 real 138.94
 
 No performance regression!
 
 
 
 Background
 ==
 Currently, when marking a memslot dirty-logged or getting its dirty pages, we
 need to write-protect a large amount of guest memory. This is heavy work, and
 in particular we need to hold mmu-lock, which is also required by vcpus to fix
 their page table faults and by the mmu-notifier when a host page is changed.
 On a guest with extreme cpu / memory usage, this becomes a scalability issue.
 
 This patchset introduces a way to locklessly write-protect guest memory.
 
 Idea
 ==
 These are the challenges we met and the ideas to resolve them.
 
 1) How to locklessly walk rmap?
 The first idea we had to prevent a desc from being freed while we are walking
 the rmap was to use RCU. But when a vcpu runs in shadow page mode or nested mmu
 mode, it updates the rmap really frequently.
 
 So we use SLAB_DESTROY_BY_RCU to manage descs instead; it allows the objects
 to be reused more quickly. We also store a nulls value in the last desc
 (desc->more), which helps us detect whether the desc has been moved to another
 rmap, in which case we re-walk the rmap. I learned this idea from
 nulls-list.
 
 Another issue is that when an spte is deleted from a desc, another spte in the
 last desc is moved into its position to replace the deleted one. If the
 deleted one has already been visited and we have not visited the replacement,
 the replacement is missed by the lockless walk.
 To fix this case, we do not move the spte backward; instead, we move the
 entry forward: when an spte is deleted, we move the entry in the first desc to
 that position.
 
 2) How to locklessly access shadow page table?
 It is easy if the handler is in vcpu context; in that case we can use
 walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
 disable interrupts to stop shadow pages from being freed. But we are in ioctl
 context, and the paths we are optimizing for have a heavy workload, so
 disabling interrupts is not good for system performance.
 
 We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page), then
 use call_rcu() to free the shadow page if that indicator is set. Setting and
 clearing the indicator are protected by slot-lock, so it need not be atomic
 and does not hurt performance or scalability.
 
 3) How to locklessly write-protect guest memory?
 Currently, there are two behaviors when we write-protect guest memory: one is
 clearing the Writable bit on the spte, and the other is dropping the spte when
 it points to a large page. The former is easy, we only need to atomically
 clear a bit, but the latter is hard since we need to remove the spte from the
 rmap. So we unify these two behaviors to only make the spte readonly. Making a
 large spte readonly instead of nonpresent is also good for reducing jitter.
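The unification described above (write protection means atomically clearing the Writable bit, never dropping the spte) can be illustrated by the small userspace model below; the spte value and PT_WRITABLE_MASK are simplified stand-ins, not KVM's real definitions:

    #include <stdint.h>
    #include <stdio.h>

    #define PT_WRITABLE_MASK (1ULL << 1)     /* placeholder bit layout */

    /* Retry loop so a concurrent update cannot be lost; the spte stays present. */
    static void spte_write_protect_lockless(uint64_t *sptep)
    {
            uint64_t old = __atomic_load_n(sptep, __ATOMIC_RELAXED);

            while (!__atomic_compare_exchange_n(sptep, &old, old & ~PT_WRITABLE_MASK,
                                                0, __ATOMIC_RELEASE, __ATOMIC_RELAXED))
                    ;   /* 'old' was refreshed by the failed CAS; just retry */
    }

    int main(void)
    {
            uint64_t spte = 0x80000000000000f7ULL | PT_WRITABLE_MASK;

            spte_write_protect_lockless(&spte);
            printf("spte after write-protect: %#llx\n", (unsigned long long)spte);
            return 0;
    }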
 
 And we need to pay more attention to the order of making the spte writable,
 adding the spte to the rmap and setting the corresponding bit in the dirty
 bitmap, since kvm_vm_ioctl_get_dirty_log() write-protects the spte based on
 the dirty bitmap,
 we 

Re: [PATCH v2 0/2] KVM: s390: add floating irq controller

2013-09-15 Thread Gleb Natapov
On Fri, Sep 06, 2013 at 03:30:38PM +0200, Christian Borntraeger wrote:
 On 06/09/13 14:19, Jens Freimann wrote: This series adds a kvm_device that
 acts as an irq controller for floating
  interrupts.  As a first step it implements functionality to retrieve and 
  inject
  interrupts for the purpose of migration and for hardening the reset code by
  allowing user space to explicitly remove all pending floating interrupts.
  
  PFAULT patches will also use this device for enabling/disabling pfault, 
  therefore
  the pfault patch series will be reworked to use this device.
  
  * Patch 1/2 adds a new data structure to hold interrupt information. The 
  current
one (struct kvm_s390_interrupt) does not allow to inject every kind of 
  interrupt,
e.g. some data for program interrupts and machine check interruptions were
missing.
  
  * Patch 2/2 adds a kvm_device which supports getting/setting currently 
  pending
floating interrupts as well as deleting all currently pending interrupts
  
  
  Jens Freimann (2):
KVM: s390: add and extend interrupt information data structs
KVM: s390: add floating irq controller
  
   Documentation/virtual/kvm/devices/s390_flic.txt |  36 +++
   arch/s390/include/asm/kvm_host.h|  35 +--
   arch/s390/include/uapi/asm/kvm.h|   5 +
   arch/s390/kvm/interrupt.c   | 304 
  
   arch/s390/kvm/kvm-s390.c|   1 +
   include/linux/kvm_host.h|   1 +
   include/uapi/linux/kvm.h|  65 +
   virt/kvm/kvm_main.c |   5 +
   8 files changed, 368 insertions(+), 84 deletions(-)
   create mode 100644 Documentation/virtual/kvm/devices/s390_flic.txt
  
 
 
 Gleb, Paolo,
 
 since the qemu part relies on a kernel header file, it makes sense to not 
 only let the kernel
 part go via the kvm tree, but also the qemu part. I want Alex to Ack the 
 interface, and if he
 agrees then I am fine with applying the whole series.
 
Still waiting for Alex's ACK.

 If nothing else comes up, feel free to apply the small change request from 
 Peter yourself or
 ask Jens for a resend.
 
 --snip
 
 --- a/include/uapi/linux/kvm.h
 +++ b/include/uapi/linux/kvm.h
 @@ -908,7 +908,7 @@ struct kvm_device_attr {
  #define KVM_DEV_TYPE_FSL_MPIC_20   1
  #define KVM_DEV_TYPE_FSL_MPIC_42   2
  #define KVM_DEV_TYPE_XICS  3
 -#define KVM_DEV_TYPE_FLIC  4
 +#define KVM_DEV_TYPE_FLIC  5
  
  /*
   * ioctls for VM fds
 
 --snip

--
Gleb.


Re: [PATCH v3 1/6] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state

2013-09-15 Thread Gleb Natapov
On Tue, Sep 10, 2013 at 09:14:14PM +0800, Arthur Chunqi Li wrote:
 On Mon, Sep 2, 2013 at 4:21 PM, Gleb Natapov g...@redhat.com wrote:
  On Thu, Aug 08, 2013 at 04:26:28PM +0200, Jan Kiszka wrote:
  Likely a typo, but a fatal one as kvm_set_cr0 performs checks on the
  Not a typo :) That's what Avi asked us to do during the initial nested VMX
  review: http://markmail.org/message/hhidqyhbo2mrgxxc
 
  But there is at least one transition check that kvm_set_cr0() does that
  should not be done during vmexit emulation, namely CS.L bit check, so I
  tend to agree that kvm_set_cr0() is not appropriate here, at lest not as
  it is. But can we skip other checks kvm_set_cr0() does? For instance
  what prevents us from loading CR0.PG = 1 EFER.LME = 1 and CR4.PAE = 0
  during nested vmexit?  What _should_ prevent it is vmentry check from
  26.2.4
 
  If the host address-space size VM-exit control is 1, the following
  must hold:
   - Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.
 Hi Jan and Gleb,
 Our nested VMX testing framework may not support such testing modes.
 Here we need to catch the failed result (ZF flag) close to vmresume,
 but vmresume/vmlaunch is well encapsulated in our framework. If we
 simply write a vmresume inline function, the VMX will act unexpectedly
 when it doesn't cause a vmresume failure.
 
 Do you have any ideas about this?
 
I am not sure what you mean. The framework does capture failed vmentry
flags, but it handles the failure internally in vmx_run(). If you want
the framework to be able to provide a vmentry failure handler, do what Paolo
suggests.

--
Gleb.


[Bug 60850] BUG: Bad page state in process libvirtd pfn:76000

2013-09-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60850

Alexander Y. Tiurin alexande...@gmail.com changed:

   What|Removed |Added

Version|unspecified |2.5
Product|Virtualization  |Drivers
  Component|kvm |Network



Re: [PATCH 19/23] KVM: PPC: Book3S: Select PR vs HV separately for each guest

2013-09-15 Thread Alexander Graf


On 15.09.2013 at 04:16, Aneesh Kumar K.V
aneesh.ku...@linux.vnet.ibm.com wrote:

 Alexander Graf ag...@suse.de writes:
 
 On 14.09.2013 at 13:33, Aneesh Kumar K.V
 aneesh.ku...@linux.vnet.ibm.com wrote:
 
 Benjamin Herrenschmidt b...@kernel.crashing.org writes:
 
 On Fri, 2013-09-13 at 10:17 +1000, Paul Mackerras wrote:
 
 Aneesh and I are currently investigating an alternative approach,
 which is much more like the x86 way of doing things.  We are looking
 at splitting the code into three modules: a kvm_pr.ko module with the
 PR-specific bits, a kvm_hv.ko module with the HV-specific bits, and a
 core kvm.ko module with the generic bits (basically book3s.c,
 powerpc.c, stuff from virt/kvm/, plus the stuff that both PR and HV
 use).  Basically the core module would have a pointer to a struct
 full of function pointers for the various ops that book3s_pr.c and
 book3s_hv.c both provide.  You would only be able to have one of
 kvm_pr and kvm_hv loaded at any one time.  If they were built in, you
 could have them both built in but only one could register its function
 pointer struct with the core.  Obviously the kvm_hv module would only
 load and register its struct on a machine that had hypervisor mode
 available.  If they were both built in I would think we would give HV
 the first chance to register itself, and let PR register if we can't
 do HV.
 
 How does that sound?
 
 As long as we can force-load the PR one on a machine that normally runs
 HV for the sake of testing ...
 
 This is what I currently have
 
 [root@llmp24l02 kvm]# insmod ./kvm-hv.ko 
 [root@llmp24l02 kvm]# insmod ./kvm-pr.ko 
 insmod: ERROR: could not insert module ./kvm-pr.ko: File exists
 
 The reason this model makes sense for x86 is that you never have SVM and VMX 
 in the cpu at the same time. Either it is an AMD chip or an Intel chip.
 
 PR and HV however are not mutually exclusive in hardware. What you really 
 want is
 
 1) distro can force HV/PR
 2) admin can force HV/PR
 3) user can force HV/PR
 4) by default things just work
 
 1 can be done through kernel config options.
 2 can be done through modules that get loaded or not
 3 can be done through a vm ioctl
 4 only works if you allow hv and pr to be available at the same time
 
 I can assume who you talked to about this to make these design decisions, 
 but it definitely was not me.
 
 I haven't had much discussion around the design with anybody yet. What
 you saw above was me changing/moving code around madly to get
 something working in a day. I was hoping to get something that I can post as
 an RFC early and let others comment. Good to get the feedback early.

Heh ok :). I think we want to be flexible here unless complexity grows too much 
of a maintenance burden and/or slows things down.

Alex

 
 -aneesh
 


Re: [PATCH v4] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-15 Thread Gleb Natapov
On Fri, Sep 06, 2013 at 10:04:51AM +0800, Arthur Chunqi Li wrote:
 This patch contains the following two changes:
 1. Fix the bug in nested preemption timer support. If a vmexit L2->L0
 occurs for reasons not emulated by L1, the preemption timer value should
 be saved across such exits.
 2. Add support for the "Save VMX-preemption timer value" VM-Exit control
 to nVMX.
 
 With this patch, nested VMX preemption timer features are fully
 supported.
 
 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
 ChangeLog to v3:
 Move nested_adjust_preemption_timer to the latest place just before vmenter.
 Some minor changes.
 
  arch/x86/include/uapi/asm/msr-index.h |1 +
  arch/x86/kvm/vmx.c|   49 
 +++--
  2 files changed, 48 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/include/uapi/asm/msr-index.h 
 b/arch/x86/include/uapi/asm/msr-index.h
 index bb04650..b93e09a 100644
 --- a/arch/x86/include/uapi/asm/msr-index.h
 +++ b/arch/x86/include/uapi/asm/msr-index.h
 @@ -536,6 +536,7 @@
  
  /* MSR_IA32_VMX_MISC bits */
  #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29)
 +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
  /* AMD-V MSRs */
  
  #define MSR_VM_CR   0xc0010114
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 1f1da43..f364d16 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -374,6 +374,8 @@ struct nested_vmx {
*/
   struct page *apic_access_page;
   u64 msr_ia32_feature_control;
 + /* Set if vmexit is L2->L1 */
 + bool nested_vmx_exit;
I do not see why it is needed, see below.

  };
  
  #define POSTED_INTR_ON  0
 @@ -2204,7 +2206,17 @@ static __init void nested_vmx_setup_ctls_msrs(void)
  #ifdef CONFIG_X86_64
   VM_EXIT_HOST_ADDR_SPACE_SIZE |
  #endif
 - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
 + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
 + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
 + if (!(nested_vmx_pinbased_ctls_high &
 + PIN_BASED_VMX_PREEMPTION_TIMER) ||
 + !(nested_vmx_exit_ctls_high &
 + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
 + nested_vmx_exit_ctls_high &=
 + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
 + nested_vmx_pinbased_ctls_high &=
 + (~PIN_BASED_VMX_PREEMPTION_TIMER);
 + }
   nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 VM_EXIT_LOAD_IA32_EFER);
  
 @@ -6707,6 +6719,24 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, 
 u64 *info1, u64 *info2)
   *info2 = vmcs_read32(VM_EXIT_INTR_INFO);
  }
  
 +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu)
 +{
 + u64 delta_tsc_l1;
 + u32 preempt_val_l1, preempt_val_l2, preempt_scale;
 +
 + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
 + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
 + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
 + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu,
 + native_read_tsc()) - vcpu->arch.last_guest_tsc;
 + preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
 + if (preempt_val_l2 <= preempt_val_l1)
 + preempt_val_l2 = 0;
 + else
 + preempt_val_l2 -= preempt_val_l1;
 + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2);
 +}
 +
  /*
   * The guest has exited.  See if we can fix it or if we need userspace
   * assistance.
 @@ -6736,9 +6766,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
   vmx->nested.nested_run_pending = 0;
  
   if (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) {
 + vmx->nested.nested_vmx_exit = true;
   nested_vmx_vmexit(vcpu);
   return 1;
   }
 + vmx->nested.nested_vmx_exit = false;
  
   if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
   vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
 @@ -7132,6 +7164,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
 *vcpu)
   debugctlmsr = get_debugctlmsr();
  
   vmx->__launched = vmx->loaded_vmcs->launched;
 + if (is_guest_mode(vcpu) && !(vmx->nested.nested_vmx_exit))
How can is_guest_mode() and nested_vmx_exit both be true? The only place
nested_vmx_exit is set to true is just before the call to
nested_vmx_vmexit(). The first thing nested_vmx_vmexit() does is make
is_guest_mode() false. To enter guest mode again at least one other
vmexit from L1 to L0 is needed, at which point nested_vmx_exit will be
reset to false again.

If you want to avoid calling nested_adjust_preemption_timer() during 
vmlaunch/vmresume emulation (and it looks like this is what you are
trying to achieve here) you can check nested_run_pending.
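A sketch of that suggestion, using only identifiers that already appear in the patch above (untested, for illustration only):

        if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending)
                nested_adjust_preemption_timer(vcpu);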


 + nested_adjust_preemption_timer(vcpu);
   asm(
   /* Store host registers */
   push %% _ASM_DX ; push %% 

Re: [RFC PATCH 3/3] kvm: Add VFIO device for handling IOMMU cache coherency

2013-09-15 Thread Alexey Kardashevskiy
On 09/14/2013 02:25 AM, Alex Williamson wrote:
 On Fri, 2013-09-13 at 18:49 +1000, Alexey Kardashevskiy wrote:
 On 09/13/2013 07:23 AM, Alex Williamson wrote:
 So far we've succeeded at making KVM and VFIO mostly unaware of each
 other, but there's any important point where that breaks down.  Intel
 VT-d hardware may or may not support snoop control.  When snoop
 control is available, intel-iommu promotes No-Snoop transactions on
 PCIe to be cache coherent.  That allows KVM to handle things like the
 x86 WBINVD opcode as a nop.  When the hardware does not support this,
 KVM must implement a hardware visible WBINVD for the guest.

 We could simply let userspace tell KVM how to handle WBINVD, but it's
 privileged for a reason.  Allowing an arbitrary user to enable
  physical WBINVD gives them more access to the hardware.  Previously,
 this has only been enabled for guests supporting legacy PCI device
 assignment.  In such cases it's necessary for proper guest execution.
 We therefore create a new KVM-VFIO virtual device.  The user can add
 and remove VFIO groups to this device via file descriptors.  KVM
 makes use of the VFIO external user interface to validate that the
 user has access to physical hardware and gets the coherency state of
 the IOMMU from VFIO.  This provides equivalent functionality to
 legacy KVM assignment, while keeping (nearly) all the bits isolated.

 The one intrusion is the resulting flag indicating the coherency
 state.  For this RFC it's placed on the x86 kvm_arch struct, however
 I know POWER has interest in using the VFIO external user interface,
 and I'm hoping we can share a common KVM-VFIO device.  Perhaps they
 care about No-Snoop handling as well or the code can be #ifdef'd.


  POWER does not support (at least book3s - server, not sure about others)
 this cache-non-coherent stuff at all.
 
 Then it's easy for your IOMMU API interface to return always cache
 coherent or never cache coherent or whatever ;)
 
 Regarding reusing this device with external API for POWER - I posted a
 patch which introduces KVM device to link KVM with IOMMU but besides the
 list of groups registered in KVM, it also provides the way to find a group
 by LIOBN (logical bus number) which is used in DMA map/unmap hypercalls. So
 in my case kvm_vfio_group struct needs LIOBN and it would be nice to have
 there window_size too (for a quick boundary check). I am not sure we want
 to mix everything here.

 It is in [PATCH v10 12/13] KVM: PPC: Add support for IOMMU in-kernel
 handling if you are interested (kvmppc_spapr_tce_iommu_device).
 
 Yes, I stole the code to get the vfio symbols from your code.  The
 convergence I was hoping to achieve is that KVM doesn't really want to
 know about VFIO and vica versa.  We can therefore at least limit the
 intrusion by sharing a common device.  Obviously for you it will need
 some extra interfaces to associate an LIOBN to a group, but we keep both
 the kernel an userspace cleaner by avoiding duplication where we can.
 Is this really not extensible to your usage?

Well, I do not have a good taste for this kind of stuff so I cannot tell
for sure. I can reuse this device and hack it to do whatever I need but how?

There are 2 things I need from KVM device:
1. associate IOMMU group with LIOBN
2. get a pointer to an IOMMU group by LIOBN (used to get ppc's IOMMU table
pointer which is an IOMMU data of an IOMMU group so we could take a
shortcut here).

There are 3 possible interfaces to use:
A. set/get attributes
B. ioctl
C. additional API

What I posted a week ago uses A for 1 and C for 2. Now we move this to
virt/kvm/vfio.c.
A for 1 is more or less ok but how exactly? Yet another attribute? Platform
specific bus ID? In your patch attr->addr is not really an address (to be
accessed with get_user()) but just an fd.

For 2 - there are already A and B interfaces so we do not want C, right?
What kind of a good device has a backdoor? :) But A and B do not have
access control to prevent the user space from receiving a IOMMU group
pointer (which it would not be able to use anyway but still). Do we care
about this (I do not)? And using B (ioctls) within the kernel - I cannot
say I saw many usages of this.

I am pretty sure we will spend some time (not much) arguing about these
things and when we agree on something, then some of KVM maintainers will
step in and say that there is 1001st way of doing this and - goto start.
And after all, this still won't be a device as it does not emulate anything
present in the real hardware, this is just yet another interface to KVM and
some ways of using it won't be natural for somebody. /me sighs.



-- 
Alexey


Re: [PATCH 20/23] KVM: PPC: Book3S PR: Better handling of host-side read-only pages

2013-09-15 Thread Paul Mackerras
On Sat, Sep 14, 2013 at 03:23:53PM -0500, Alexander Graf wrote:
 
 
 On 14.09.2013 at 00:24, Paul Mackerras pau...@samba.org wrote:
 
  Bootup (F19 guest, 3 runs):
  
  Without the patch: average 20.12 seconds, st. dev. 0.17 seconds
  With the patch: 20.47 seconds, st. dev. 0.19 seconds
  
  Delta: 0.35 seconds, or 1.7%.
  
  time for i in $(seq 1000); do /bin/echo $i > /dev/null; done:
  
  Without the patch: average 7.27 seconds, st. dev. 0.23 seconds
  With the patch: average 7.55 seconds, st. dev. 0.39 seconds
  
  Delta: 0.28 seconds, or 3.8%.
  
  So there appears to be a small effect, of a few percent.
 
 So in the normal case it slows us down, but allows ksm to be effective. Do we 
 actually want this change then?

I was a bit puzzled why there was a measurable slowdown until I
remembered that this patch was intended to go along with the patch
"powerpc: Implement __get_user_pages_fast()", which Ben took and which
is now upstream in Linus' tree (1f7bf028).  So, I applied that patch
on top of this "Better handling of host-side read-only pages" patch,
and did the same measurements.  The results were:

Bootup (F19 guest, 3 runs): average 20.05 seconds, st. dev. 0.53s

1000 /bin/echo (4 runs): average 7.27 seconds, st. dev. 0.032s

So with both patches applied there is no slowdown at all, and KSM
works properly.  I think we want this patch.

Paul.


Re: [PATCH v4] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-15 Thread Arthur Chunqi Li
On Sat, Sep 14, 2013 at 1:15 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 06/09/2013 04:04, Arthur Chunqi Li wrote:
 This patch contains the following two changes:
 1. Fix the bug in nested preemption timer support. If vmexit L2-L0
 with some reasons not emulated by L1, preemption timer value should
 be save in such exits.
 2. Add support of Save VMX-preemption timer value VM-Exit controls
 to nVMX.

 With this patch, nested VMX preemption timer features are fully
 supported.

 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
 ChangeLog to v3:
 Move nested_adjust_preemption_timer to the latest place just before vmenter.
 Some minor changes.

  arch/x86/include/uapi/asm/msr-index.h |1 +
  arch/x86/kvm/vmx.c|   49 
 +++--
  2 files changed, 48 insertions(+), 2 deletions(-)

 diff --git a/arch/x86/include/uapi/asm/msr-index.h 
 b/arch/x86/include/uapi/asm/msr-index.h
 index bb04650..b93e09a 100644
 --- a/arch/x86/include/uapi/asm/msr-index.h
 +++ b/arch/x86/include/uapi/asm/msr-index.h
 @@ -536,6 +536,7 @@

  /* MSR_IA32_VMX_MISC bits */
  #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL  29)
 +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
  /* AMD-V MSRs */

  #define MSR_VM_CR   0xc0010114
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 1f1da43..f364d16 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -374,6 +374,8 @@ struct nested_vmx {
*/
   struct page *apic_access_page;
   u64 msr_ia32_feature_control;
 + /* Set if vmexit is L2-L1 */
 + bool nested_vmx_exit;
  };

  #define POSTED_INTR_ON  0
 @@ -2204,7 +2206,17 @@ static __init void nested_vmx_setup_ctls_msrs(void)
  #ifdef CONFIG_X86_64
   VM_EXIT_HOST_ADDR_SPACE_SIZE |
  #endif
 - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
 + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
 + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
 + if (!(nested_vmx_pinbased_ctls_high 
 + PIN_BASED_VMX_PREEMPTION_TIMER) ||
 + !(nested_vmx_exit_ctls_high 
 + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {

 Align this under the other !.  Also, I prefer to have one long line
 for the whole !(... & ...) || (and likewise below), but I don't know
 if Gleb agrees

 + nested_vmx_exit_ctls_high =
 + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);

 Please remove parentheses around ~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, and
 likewise elsewhere in the patch.

 + nested_vmx_pinbased_ctls_high =
 + (~PIN_BASED_VMX_PREEMPTION_TIMER);
 + }
   nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 VM_EXIT_LOAD_IA32_EFER);

 @@ -6707,6 +6719,24 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, 
 u64 *info1, u64 *info2)
   *info2 = vmcs_read32(VM_EXIT_INTR_INFO);
  }

 +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu)
 +{
 + u64 delta_tsc_l1;
 + u32 preempt_val_l1, preempt_val_l2, preempt_scale;

 Should this exit immediately if the preemption timer pin-based control
 is disabled?
Hi Paolo,
How can I get pin-based control here from struct kvm_vcpu *vcpu?

Arthur

 + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) 
 + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
 + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
 + delta_tsc_l1 = kvm_x86_ops-read_l1_tsc(vcpu,
 + native_read_tsc()) - vcpu-arch.last_guest_tsc;

 Please format this like:

 delta_tsc_l1 =
  kvm_x86_ops->read_l1_tsc(vcpu, native_read_tsc())
  - vcpu->arch.last_guest_tsc;

 + preempt_val_l1 = delta_tsc_l1  preempt_scale;
 + if (preempt_val_l2 = preempt_val_l1)
 + preempt_val_l2 = 0;
 + else
 + preempt_val_l2 -= preempt_val_l1;
 + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2);

 Did you test that a value of 0 triggers an immediate exit, rather than
 counting down by 2^32?  Perhaps it's safer to limit the value to 1
 instead of 0.

 +}
 +
  /*
   * The guest has exited.  See if we can fix it or if we need userspace
   * assistance.
 @@ -6736,9 +6766,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
   vmx-nested.nested_run_pending = 0;

   if (is_guest_mode(vcpu)  nested_vmx_exit_handled(vcpu)) {
 + vmx-nested.nested_vmx_exit = true;

 I think this assignment should be in nested_vmx_vmexit, since it is
 called from other places as well.

   nested_vmx_vmexit(vcpu);
   return 1;
   }
 + vmx-nested.nested_vmx_exit = false;

   if (exit_reason  VMX_EXIT_REASONS_FAILED_VMENTRY) {
   vcpu-run-exit_reason = KVM_EXIT_FAIL_ENTRY;
 @@ -7132,6 +7164,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
 *vcpu)
 

Re: [PATCH v4] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-15 Thread Arthur Chunqi Li
On Sun, Sep 15, 2013 at 8:31 PM, Gleb Natapov g...@redhat.com wrote:
 On Fri, Sep 06, 2013 at 10:04:51AM +0800, Arthur Chunqi Li wrote:
 This patch contains the following two changes:
 1. Fix the bug in nested preemption timer support. If vmexit L2-L0
 with some reasons not emulated by L1, preemption timer value should
 be save in such exits.
 2. Add support of Save VMX-preemption timer value VM-Exit controls
 to nVMX.

 With this patch, nested VMX preemption timer features are fully
 supported.

 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
 ChangeLog to v3:
 Move nested_adjust_preemption_timer to the latest place just before vmenter.
 Some minor changes.

  arch/x86/include/uapi/asm/msr-index.h |1 +
  arch/x86/kvm/vmx.c|   49 
 +++--
  2 files changed, 48 insertions(+), 2 deletions(-)

 diff --git a/arch/x86/include/uapi/asm/msr-index.h 
 b/arch/x86/include/uapi/asm/msr-index.h
 index bb04650..b93e09a 100644
 --- a/arch/x86/include/uapi/asm/msr-index.h
 +++ b/arch/x86/include/uapi/asm/msr-index.h
 @@ -536,6 +536,7 @@

  /* MSR_IA32_VMX_MISC bits */
  #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL  29)
 +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
  /* AMD-V MSRs */

  #define MSR_VM_CR   0xc0010114
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 1f1da43..f364d16 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -374,6 +374,8 @@ struct nested_vmx {
*/
   struct page *apic_access_page;
   u64 msr_ia32_feature_control;
 + /* Set if vmexit is L2-L1 */
 + bool nested_vmx_exit;
 Do not see why it is needed, see bellow.

  };

  #define POSTED_INTR_ON  0
 @@ -2204,7 +2206,17 @@ static __init void nested_vmx_setup_ctls_msrs(void)
  #ifdef CONFIG_X86_64
   VM_EXIT_HOST_ADDR_SPACE_SIZE |
  #endif
 - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
 + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
 + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
 + if (!(nested_vmx_pinbased_ctls_high 
 + PIN_BASED_VMX_PREEMPTION_TIMER) ||
 + !(nested_vmx_exit_ctls_high 
 + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
 + nested_vmx_exit_ctls_high =
 + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
 + nested_vmx_pinbased_ctls_high =
 + (~PIN_BASED_VMX_PREEMPTION_TIMER);
 + }
   nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 VM_EXIT_LOAD_IA32_EFER);

 @@ -6707,6 +6719,24 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, 
 u64 *info1, u64 *info2)
   *info2 = vmcs_read32(VM_EXIT_INTR_INFO);
  }

 +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu)
 +{
 + u64 delta_tsc_l1;
 + u32 preempt_val_l1, preempt_val_l2, preempt_scale;
 +
 + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) 
 + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
 + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
 + delta_tsc_l1 = kvm_x86_ops-read_l1_tsc(vcpu,
 + native_read_tsc()) - vcpu-arch.last_guest_tsc;
 + preempt_val_l1 = delta_tsc_l1  preempt_scale;
 + if (preempt_val_l2 = preempt_val_l1)
 + preempt_val_l2 = 0;
 + else
 + preempt_val_l2 -= preempt_val_l1;
 + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2);
 +}
 +
  /*
   * The guest has exited.  See if we can fix it or if we need userspace
   * assistance.
 @@ -6736,9 +6766,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
   vmx-nested.nested_run_pending = 0;

   if (is_guest_mode(vcpu)  nested_vmx_exit_handled(vcpu)) {
 + vmx-nested.nested_vmx_exit = true;
   nested_vmx_vmexit(vcpu);
   return 1;
   }
 + vmx-nested.nested_vmx_exit = false;

   if (exit_reason  VMX_EXIT_REASONS_FAILED_VMENTRY) {
   vcpu-run-exit_reason = KVM_EXIT_FAIL_ENTRY;
 @@ -7132,6 +7164,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
 *vcpu)
   debugctlmsr = get_debugctlmsr();

   vmx-__launched = vmx-loaded_vmcs-launched;
 + if (is_guest_mode(vcpu)  !(vmx-nested.nested_vmx_exit))
 How is_guest_mode() and nested_vmx_exi can be both true? The only place
 nested_vmx_exit is set to true is just before call to
 nested_vmx_vmexit(). The firs thing nested_vmx_vmexit() does is makes
 is_guest_mode() false. To enter guest mode again at least one other
 vmexit from L1 to L0 is needed at which point nested_vmx_exit will be
 reset to false again.

 If you want to avoid calling nested_adjust_preemption_timer() during
 vmlaunch/vmresume emulation (and it looks like this is what you are
 trying to achieve here) you can check nested_run_pending.
Besides vmlaunch/vmresume emulation, every exit from L2->L1 should not
call 
