[PATCH] KVM: PPC: Implement extension to report number of memslots

2015-10-15 Thread Nikunj A Dadhania
QEMU assumes 32 memslots if this extension is not implemented. Although
the current value of KVM_USER_MEM_SLOTS is 32, QEMU would silently pick
up a wrong value once KVM_USER_MEM_SLOTS changes.

Signed-off-by: Nikunj A Dadhania 
---
 arch/powerpc/kvm/powerpc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2e51289..6fd2405 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -559,6 +559,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
else
r = num_online_cpus();
break;
+   case KVM_CAP_NR_MEMSLOTS:
+   r = KVM_USER_MEM_SLOTS;
+   break;
case KVM_CAP_MAX_VCPUS:
r = KVM_MAX_VCPUS;
break;
-- 
2.4.3
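For context, userspace consumption looks roughly like this (a hedged
sketch, not QEMU's actual code; KVM_CHECK_EXTENSION returns 0 when a
capability is absent):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    int main(void)
    {
            int kvm = open("/dev/kvm", O_RDWR);
            int slots;

            if (kvm < 0)
                    return 1;

            slots = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_NR_MEMSLOTS);
            if (slots <= 0)         /* extension not implemented */
                    slots = 32;     /* the fallback this patch makes unnecessary */

            printf("memslots: %d\n", slots);
            return 0;
    }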



Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-11 Thread Nikunj A Dadhania
On Wed, 10 Oct 2012 09:24:55 -0500, Andrew Theurer  wrote:
> 
> Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
> scheduling patches.  While I am not recommending gang scheduling, I
> think it's a good data point.  The performance is 3.88x the PLE result.
> 
> https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M

That looks pretty good and serves the purpose. And the result says it all.

> Note that the task switching intervals of 4ms are quite obvious again,
> and this time all vCPUs from same VM run at the same time.  It
> represents the best possible outcome.
> 
> 
> Anyway, I thought the bitmaps might help better visualize what's going
> on.
> 
> -Andrew
> 

Regards
Nikunj



Re: [PATCH v4 0/8] KVM paravirt remote flush tlb

2012-09-04 Thread Nikunj A Dadhania
On Tue, 04 Sep 2012 10:51:06 +0300, Avi Kivity  wrote:
> On 09/04/2012 04:30 AM, Nikunj A Dadhania wrote:
> > On Mon, 03 Sep 2012 17:33:46 +0300, Avi Kivity  wrote:
> >> On 08/21/2012 02:25 PM, Nikunj A. Dadhania wrote:
> >> > 
> >> > kernbench (lower is better)
> >> > ===========================
> >> >  base  pvflushv4  %improvement
> >> > 1VM48.5800   46.8513   3.55846
> >> > 2VM   108.1823  104.6410   3.27346
> >> > 3VM   183.2733  163.3547  10.86825
> >> > 
> >> > ebizzy (higher is better)
> >> > =========================
> >> >  base pvflushv4  %improvement
> >> > 1VM 2414.5000 2089.8750 -13.44481
> >> > 2VM 2167.6250 2371.7500  9.41699
> >> > 3VM 1600.1111 2102.5556 31.40060
> >> > 
> >> 
> >> The regression is worrying.  We're improving the contended case at the
> >> cost of the non-contended case, this is usually the wrong thing to do.
> >> Do we have any clear idea of the cause of the regression?
> >> 
> > Previous perf numbers suggest that in the 1VM scenario flush_tlb_others_ipi
> > is around 2%, while in the contended case it is around 10%. That is what is
> > helping the contended case.
> 
> But what is causing the regression for the uncontended case?
> 
I haven't been able to nail that down; any pointers on how to profile
it would help.

Regards
Nikunj



Re: [PATCH v4 0/8] KVM paravirt remote flush tlb

2012-09-03 Thread Nikunj A Dadhania
On Mon, 03 Sep 2012 17:33:46 +0300, Avi Kivity  wrote:
> On 08/21/2012 02:25 PM, Nikunj A. Dadhania wrote:
> > 
> > kernbench (lower is better)
> > ===========================
> >  base  pvflushv4  %improvement
> > 1VM48.5800   46.8513   3.55846
> > 2VM   108.1823  104.6410   3.27346
> > 3VM   183.2733  163.3547  10.86825
> > 
> > ebizzy (higher is better)
> > =========================
> >  base pvflushv4  %improvement
> > 1VM 2414.5000 2089.8750 -13.44481
> > 2VM 2167.6250 2371.7500  9.41699
> > 3VM 1600.1111 2102.5556 31.40060
> > 
> 
> The regression is worrying.  We're improving the contended case at the
> cost of the non-contended case, this is usually the wrong thing to do.
> Do we have any clear idea of the cause of the regression?
> 
Previous perf numbers suggest that in the 1VM scenario flush_tlb_others_ipi
is around 2%, while in the contended case it is around 10%. That is what is
helping the contended case.

Regards,
Nikunj



Re: [PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest

2012-08-26 Thread Nikunj A Dadhania
On Fri, 24 Aug 2012 12:02:27 -0300, Marcelo Tosatti  wrote:
> On Fri, Aug 24, 2012 at 11:09:39AM +0530, Nikunj A Dadhania wrote:
> > On Thu, 23 Aug 2012 06:36:43 -0300, Marcelo Tosatti  wrote:
> > > On Tue, Aug 21, 2012 at 04:56:35PM +0530, Nikunj A. Dadhania wrote:
[...]

> > > > @@ -469,6 +500,11 @@ void __init kvm_guest_init(void)
> > > > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > > > apic_set_eoi_write(kvm_guest_apic_eoi_write);
> > > >  
> > > > +#ifdef CONFIG_PARAVIRT_TLB_FLUSH
> > > > +   if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
> > > > +   has_vcpu_state = 1;
> > > > +#endif
> > > 
> > > Why only this hunk guarded by CONFIG_PARAVIRT_TLB_FLUSH and not
> > > the rest of the code?
> > > 
> > The guest should be compiled with CONFIG_PARAVIRT_TLB_FLUSH, as that
> > config also brings the HAVE_RCU_TABLE_FREE code into the picture. We
> > should not enable this code without HAVE_RCU_TABLE_FREE.
> > 
> > I did not want to spray this across all the code, as the compiler will
> > take care of throwing out kvm_tlb_flush_others.
> > 
> > > Is there a switch to enable/disable this feature on the kernel
> > > command line? 
> > >
> > No, I haven't added it.
> > 
> > > Grep for early_param in kvm.c.
> > > 
> > Let me know if that is required.
> 
> Yes, please add it. Its useful.
> 
Done, will send it in my next version.

Nikunj



Re: [PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest

2012-08-23 Thread Nikunj A Dadhania
On Thu, 23 Aug 2012 06:36:43 -0300, Marcelo Tosatti  wrote:
> On Tue, Aug 21, 2012 at 04:56:35PM +0530, Nikunj A. Dadhania wrote:
> >  
> > +void kvm_disable_vcpu_state(void)
> > +{
> > +   if (!has_vcpu_state)
> > +   return;
> > +
> > +   wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
> 
> wrmsrl (to be consistent).
>
Sure, will change
 
> > +}
> > +
> >  #ifdef CONFIG_SMP
> >  static void __init kvm_smp_prepare_boot_cpu(void)
> >  {
> > @@ -410,6 +440,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
> >  
> >  static void kvm_guest_cpu_offline(void *dummy)
> >  {
> > +   kvm_disable_vcpu_state();
> 
> Should disable MSR at kvm_pv_guest_cpu_reboot.
> 
Sure, can you explain the difference for my understanding?

> > kvm_disable_steal_time();
> > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > wrmsrl(MSR_KVM_PV_EOI_EN, 0);
> > @@ -469,6 +500,11 @@ void __init kvm_guest_init(void)
> > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > apic_set_eoi_write(kvm_guest_apic_eoi_write);
> >  
> > +#ifdef CONFIG_PARAVIRT_TLB_FLUSH
> > +   if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
> > +   has_vcpu_state = 1;
> > +#endif
> 
> Why only this hunk guarded by CONFIG_PARAVIRT_TLB_FLUSH and not
> the rest of the code?
> 
The guest should be compiled with CONFIG_PARAVIRT_TLB_FLUSH, as that
config also brings the HAVE_RCU_TABLE_FREE code into the picture. We
should not enable this code without HAVE_RCU_TABLE_FREE.

I did not want to spray this across all the code, as the compiler will
take care of throwing out kvm_tlb_flush_others.

> Is there a switch to enable/disable this feature on the kernel
> command line? 
>
No, I haven't added it.

> Grep for early_param in kvm.c.
> 
Let me know if that is required.

Regards
Nikunj



Re: [PATCH v4 4/8] KVM-HV: Add VCPU running/pre-empted state for guest

2012-08-23 Thread Nikunj A Dadhania
On Thu, 23 Aug 2012 08:46:22 -0300, Marcelo Tosatti  wrote:
> On Tue, Aug 21, 2012 at 04:56:43PM +0530, Nikunj A. Dadhania wrote:
> > From: Nikunj A. Dadhania 
> > 
> > Hypervisor code to indicate guest running/pre-empted status through
> > an MSR. The page is now pinned at MSR write time, and
> > kmap_atomic/kunmap_atomic are used to access the shared vcpu_state area.
> > 
> > Suggested-by: Marcelo Tosatti 
> > Signed-off-by: Nikunj A. Dadhania 
> > ---

[...]

> > +
> > +static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
> > +{
> > +   struct kvm_vcpu_state *vs;
> > +   char *kaddr;
> > +
> > +   if (!((vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) &&
> > +   vcpu->arch.v_state.vs_page))
> > +   return;
> 
> It was agreed it was necessary to have valid vs_page only if MSR was
> enabled? Or was that a misunderstanding?
>
There is a case where the MSR is enabled but vs_page is NULL; this is
guarding against that case. The check is now:

if (!(msr_enabled && vs_page))
   return;

I had proposed that here:
http://www.spinics.net/lists/kvm/msg77147.html

Regards
Nikunj



[PATCH v4 8/8] KVM-doc: Add paravirt tlb flush document

2012-08-21 Thread Nikunj A. Dadhania

Signed-off-by: Nikunj A. Dadhania 
---
 Documentation/virtual/kvm/msr.txt|4 ++
 Documentation/virtual/kvm/paravirt-tlb-flush.txt |   53 ++
 2 files changed, 57 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/virtual/kvm/paravirt-tlb-flush.txt

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 7304710..92a6af6 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -256,3 +256,7 @@ MSR_KVM_EOI_EN: 0x4b564d04
guest must both read the least significant bit in the memory area and
clear it using a single CPU instruction, such as test and clear, or
compare and exchange.
+
+MSR_KVM_VCPU_STATE: 0x4b564d05
+
+Refer: Documentation/virtual/kvm/paravirt-tlb-flush.txt
diff --git a/Documentation/virtual/kvm/paravirt-tlb-flush.txt b/Documentation/virtual/kvm/paravirt-tlb-flush.txt
new file mode 100644
index 0000000..0eaabd7
--- /dev/null
+++ b/Documentation/virtual/kvm/paravirt-tlb-flush.txt
@@ -0,0 +1,53 @@
+KVM - Paravirt TLB Flush
+Nikunj A Dadhania , IBM, 2012
+
+
+The remote flushing APIs do a busy wait, which is fine in a bare-metal
+scenario. But within a guest, the vcpus might have been pre-empted or
+blocked. In this scenario, the initiator vcpu would end up busy-waiting
+for a long time.
+
+Deciding whether to wait requires knowing, inside the guest, whether a
+vcpu is running or not. The following MSR exposes vcpu running-state
+information.
+
+Using this MSR we have implemented para-virt TLB flushes that do not
+wait for vcpus that are not running; TLB flushing for them is deferred
+and performed on guest enter.
+
+MSR_KVM_VCPU_STATE: 0x4b564d05
+
+   data: 64-byte aligned physical address of a memory area which must be
+   in guest RAM, plus an enable bit in bit 0. This memory is expected to
+   hold a copy of the following structure:
+
+   struct kvm_vcpu_state {
+   __u64 state;
+   __u32 pad[14];
+   }
+
+   whose data will be filled in by the hypervisor/guest. Only one
+   write, or registration, is needed for each VCPU.  The interval
+   between updates of this structure is arbitrary and
+   implementation-dependent.  The hypervisor may update this
+   structure at any time it sees fit until anything with bit0 ==
+   0 is written to it. The guest is required to make sure this
+   structure is initialized to zero.
+
+   This enables a VCPU to know the running status of sibling
+   VCPUs. The information can further be used to decide whether an
+   IPI needs to be sent to a non-running VCPU at all, instead of
+   waiting for it unnecessarily, e.g. in flush_tlb_others_ipi.
+
+   Fields have the following meanings:
+
+   state: has the following bit fields:
+
+   Bit 0 - vcpu running state. The hypervisor sets this to
+   indicate whether the vcpu is running: value 1
+   means the vcpu is running, value 0 means it is
+   pre-empted out.
+
+   Bit 1 - when set, the hypervisor should flush the tlb
+   during guest enter/exit.
+
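A minimal guest-side sketch of the registration protocol documented
above (mirroring kvm_register_vcpu_state() from patch 3/8 of this
series; the helper name here is illustrative, kernel context assumed):

    static DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);

    static void vcpu_state_register(void)
    {
            struct kvm_vcpu_state *vs = &get_cpu_var(vcpu_state);

            /* the guest must zero the area before registering it */
            memset(vs, 0, sizeof(*vs));
            /* bit 0 = enable, upper bits = 64-byte aligned physical address */
            wrmsrl(MSR_KVM_VCPU_STATE, __pa(vs) | KVM_MSR_ENABLED);
            put_cpu_var(vcpu_state);
    }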



[PATCH v4 7/8] Enable HAVE_RCU_TABLE_FREE for kvm when PARAVIRT_TLB_FLUSH is enabled

2012-08-21 Thread Nikunj A. Dadhania

Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/Kconfig |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c70684f..354160d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -612,6 +612,17 @@ config PARAVIRT_SPINLOCKS
 
  If you are unsure how to answer this question, answer N.
 
+config PARAVIRT_TLB_FLUSH
+   bool "Paravirtualization layer for TLB Flush"
+   depends on PARAVIRT && SMP && EXPERIMENTAL
+   select HAVE_RCU_TABLE_FREE
+   ---help---
+ Paravirtualized TLB flush replaces the native implementation
+ with something virtualization-friendly (for example, set a
+ flag for a sleeping vcpu and do not wait for it).
+
+ If you are unsure how to answer this question, answer N.
+
 config PARAVIRT_CLOCK
bool
 



[PATCH v4 6/8] KVM-HV: Add flush_on_enter before guest enter

2012-08-21 Thread Nikunj A. Dadhania
A PV-flush guest indicates that a flush is pending via its vcpu state;
flush the TLB before entering and after exiting the guest.

Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/kvm/x86.c |   28 
 1 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 43f2c19..07fdb0f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1557,20 +1557,9 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
 }
 
-static void kvm_set_atomic(u64 *addr, u64 old, u64 new)
-{
-   int loop = 100;
-   while (1) {
-   if (cmpxchg(addr, old, new) == old)
-   break;
-   loop--;
-   if (!loop) {
-   pr_info("atomic cur: %lx old: %lx new: %lx\n",
-   *addr, old, new);
-   break;
-   }
-   }
-}
+#define VS_NOT_IN_GUEST  (0)
+#define VS_IN_GUEST  (1 << KVM_VCPU_STATE_IN_GUEST_MODE)
+#define VS_SHOULD_FLUSH  (1 << KVM_VCPU_STATE_SHOULD_FLUSH)
 
 static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
 {
@@ -1584,7 +1573,13 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
kaddr += vcpu->arch.v_state.vs_offset;
vs = kaddr;
-   kvm_set_atomic(&vs->state, 0, 1 << KVM_VCPU_STATE_IN_GUEST_MODE);
+   if (xchg(&vs->state, VS_IN_GUEST) == VS_SHOULD_FLUSH) {
+   /*
+* Do the TLB flush before entering the guest; we have
+* already passed the request-checking stage.
+*/
+   kvm_x86_ops->tlb_flush(vcpu);
+   }
kunmap_atomic(kaddr);
 }
 
@@ -1600,7 +1595,8 @@ static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
kaddr += vcpu->arch.v_state.vs_offset;
vs = kaddr;
-   kvm_set_atomic(&vs->state, 1 << KVM_VCPU_STATE_IN_GUEST_MODE, 0);
+   if (xchg(&vs->state, VS_NOT_IN_GUEST) == VS_SHOULD_FLUSH)
+   kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
kunmap_atomic(kaddr);
 }
 



[PATCH v4 5/8] KVM Guest: Add paravirt kvm_flush_tlb_others

2012-08-21 Thread Nikunj A. Dadhania
From: Nikunj A. Dadhania 

flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicate
flush_tlb_others_ipi as kvm_flush_tlb_others so it can be further
adapted for paravirtualization.

Use the vcpu state information inside kvm_flush_tlb_others to avoid
sending IPIs to pre-empted vcpus:

* Do not send IPIs to not-running vcpus; set their flush_on_enter flag instead
* For running vcpus: wait for them to clear the flag

The approach was discussed here: https://lkml.org/lkml/2012/2/20/157

v3:
* use only one state variable for vcpu-running/flush_on_enter
* use cmpxchg to update the state
* adapt to Alex Shi's TLB flush optimization

v2:
* use ACCESS_ONCE so the value is not register cached
* Separate HV and Guest code

Suggested-by: Peter Zijlstra 
Signed-off-by: Nikunj A. Dadhania 

--
Pseudo Algo:

   Hypervisor
   ==========
   guest_exit()
       if (xchg(state, NOT_IN_GUEST) == SHOULD_FLUSH)
           tlb_flush(vcpu);

   guest_enter()
       if (xchg(state, IN_GUEST) == SHOULD_FLUSH)
           tlb_flush(vcpu);

   Guest
   =====
   flushmask = cpumask;
   for_each_cpu(i, flushmask) {
       state = vs->state;
       if (!test_bit(IN_GUEST_MODE, state)) {
           if (cmpxchg(&vs->state, state,
                       state | (1 << SHOULD_FLUSH)) == SUCCESS)
               cpumask_clear_cpu(i, flushmask);
       }
   }

   smp_call_function_many(flushmask, flush_tlb_func)
---
 arch/x86/include/asm/tlbflush.h |   11 +++
 arch/x86/kernel/kvm.c   |4 +++-
 arch/x86/mm/tlb.c   |   36 
 3 files changed, 50 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 74a4433..0a343a1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -119,6 +119,13 @@ static inline void native_flush_tlb_others(const struct cpumask *cpumask,
 {
 }
 
+static inline void kvm_flush_tlb_others(const struct cpumask *cpumask,
+   struct mm_struct *mm,
+   unsigned long start,
+   unsigned long end)
+{
+}
+
 static inline void reset_lazy_tlbstate(void)
 {
 }
@@ -153,6 +160,10 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
struct mm_struct *mm,
unsigned long start, unsigned long end);
 
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+   struct mm_struct *mm, unsigned long start,
+   unsigned long end);
+
 #define TLBSTATE_OK    1
 #define TLBSTATE_LAZY  2
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 37e6599..b538a31 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -501,8 +501,10 @@ void __init kvm_guest_init(void)
apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
 #ifdef CONFIG_PARAVIRT_TLB_FLUSH
-   if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
+   if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE)) {
has_vcpu_state = 1;
+   pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
+   }
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 613cd83..645df99 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -119,6 +120,41 @@ static void flush_tlb_func(void *info)
 
 }
 
+#ifdef CONFIG_KVM_GUEST
+
+DECLARE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+   struct mm_struct *mm, unsigned long start,
+   unsigned long end)
+{
+   struct flush_tlb_info info;
+   struct kvm_vcpu_state *v_state;
+   u64 state;
+   int cpu;
+   cpumask_t flushmask;
+
+   cpumask_copy(&flushmask, cpumask);
+   info.flush_mm = mm;
+   info.flush_start = start;
+   info.flush_end = end;
+   /*
+* We have to flush only the vCPUs that are running, and
+* queue flush_on_enter for the pre-empted ones.
+*/
+   for_each_cpu(cpu, to_cpumask(&flushmask)) {
+   v_state = &per_cpu(vcpu_state, cpu);
+   state = v_state->state;
+   if (!test_bit(KVM_VCPU_STATE_IN_GUEST_MODE, &state)) {
+   if (cmpxchg(&v_state->state, state, state | 1 << KVM_VCPU_STATE_SHOULD_FLUSH))
+   cpumask_clear_cpu(cpu, to_cpumask(&flushmask));
+   }
+   }
+
+   smp_call_function_many(&flushmask, flush_tlb_func, &info, 1);
+}
+#endif /* CONFIG_KVM_GUEST */
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 struct mm_struct *mm, unsigned long start,
unsigned long end)

[PATCH v4 4/8] KVM-HV: Add VCPU running/pre-empted state for guest

2012-08-21 Thread Nikunj A. Dadhania
From: Nikunj A. Dadhania 

Hypervisor code to indicate guest running/pre-empted status through
an MSR. The page is now pinned at MSR write time, and
kmap_atomic/kunmap_atomic are used to access the shared vcpu_state area.

Suggested-by: Marcelo Tosatti 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/include/asm/kvm_host.h |7 +++
 arch/x86/kvm/cpuid.c|1 
 arch/x86/kvm/x86.c  |   88 ++-
 3 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09155d6..441348f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -429,6 +429,13 @@ struct kvm_vcpu_arch {
struct kvm_steal_time steal;
} st;
 
+   /* indicates vcpu is running or preempted */
+   struct {
+   u64 msr_val;
+   struct page *vs_page;
+   unsigned int vs_offset;
+   } v_state;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_host_tsc;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0595f13..37ab364 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -411,6 +411,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 (1 << KVM_FEATURE_CLOCKSOURCE2) |
 (1 << KVM_FEATURE_ASYNC_PF) |
 (1 << KVM_FEATURE_PV_EOI) |
+(1 << KVM_FEATURE_VCPU_STATE) |
 (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
 
if (sched_info_on())
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59b5950..43f2c19 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -806,13 +806,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN    9
+#define KVM_SAVE_MSRS_BEGIN    10
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
-   MSR_KVM_PV_EOI_EN,
+   MSR_KVM_VCPU_STATE, MSR_KVM_PV_EOI_EN,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1557,6 +1557,63 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
 }
 
+static void kvm_set_atomic(u64 *addr, u64 old, u64 new)
+{
+   int loop = 100;
+   while (1) {
+   if (cmpxchg(addr, old, new) == old)
+   break;
+   loop--;
+   if (!loop) {
+   pr_info("atomic cur: %lx old: %lx new: %lx\n",
+   *addr, old, new);
+   break;
+   }
+   }
+}
+
+static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
+{
+   struct kvm_vcpu_state *vs;
+   char *kaddr;
+
+   if (!((vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) &&
+   vcpu->arch.v_state.vs_page))
+   return;
+
+   kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
+   kaddr += vcpu->arch.v_state.vs_offset;
+   vs = kaddr;
+   kvm_set_atomic(&vs->state, 0, 1 << KVM_VCPU_STATE_IN_GUEST_MODE);
+   kunmap_atomic(kaddr);
+}
+
+static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
+{
+   struct kvm_vcpu_state *vs;
+   char *kaddr;
+
+   if (!((vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) &&
+   vcpu->arch.v_state.vs_page))
+   return;
+
+   kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
+   kaddr += vcpu->arch.v_state.vs_offset;
+   vs = kaddr;
+   kvm_set_atomic(&vs->state, 1 << KVM_VCPU_STATE_IN_GUEST_MODE, 0);
+   kunmap_atomic(kaddr);
+}
+
+static void kvm_vcpu_state_reset(struct kvm_vcpu *vcpu)
+{
+   vcpu->arch.v_state.msr_val = 0;
+   vcpu->arch.v_state.vs_offset = 0;
+   if (vcpu->arch.v_state.vs_page) {
+   kvm_release_page_dirty(vcpu->arch.v_state.vs_page);
+   vcpu->arch.v_state.vs_page = NULL;
+   }
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
bool pr = false;
@@ -1676,6 +1733,24 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
return 1;
break;
 
+   case MSR_KVM_VCPU_STATE:
+   kvm_vcpu_state_reset(vcpu);
+
+   if (!(data & KVM_MSR_ENABLED))
+   break;
+
+   vcpu->arch.v_state.vs_page = gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
+
+   i

[PATCH v4 3/8] KVM Guest: Add VCPU running/pre-empted state for guest

2012-08-21 Thread Nikunj A. Dadhania
From: Nikunj A. Dadhania 

The patch adds guest code for the MSR shared between guest and
hypervisor. The MSR exports the vcpu running/pre-empted information
from the host to the guest. This enables the guest to intelligently
send IPIs only to running vcpus and set a flag for pre-empted vcpus,
preventing waits on vcpus that are not running.

Suggested-by: Peter Zijlstra 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/include/asm/kvm_para.h |   13 +
 arch/x86/kernel/kvm.c   |   36 
 2 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 2f7712e..5dfb975 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -23,6 +23,7 @@
 #define KVM_FEATURE_ASYNC_PF   4
 #define KVM_FEATURE_STEAL_TIME 5
 #define KVM_FEATURE_PV_EOI 6
+#define KVM_FEATURE_VCPU_STATE  7
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -39,6 +40,7 @@
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN  0x4b564d04
+#define MSR_KVM_VCPU_STATE  0x4b564d05
 
 struct kvm_steal_time {
__u64 steal;
@@ -51,6 +53,17 @@ struct kvm_steal_time {
 #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
 #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
 
+struct kvm_vcpu_state {
+   __u64 state;
+   __u32 pad[14];
+};
+/* bits in vcpu_state->state */
+#define KVM_VCPU_STATE_IN_GUEST_MODE 0
+#define KVM_VCPU_STATE_SHOULD_FLUSH  1
+
+#define KVM_VCPU_STATE_ALIGN_BITS 5
+#define KVM_VCPU_STATE_VALID_BITS ((-1ULL << (KVM_VCPU_STATE_ALIGN_BITS + 1)))
+
 #define KVM_MAX_MMU_OP_BATCH   32
 
 #define KVM_ASYNC_PF_ENABLED   (1 << 0)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c1d61ee..37e6599 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -66,6 +66,9 @@ static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
 
+DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+static int has_vcpu_state;
+
 /*
  * No need for any "IO delay" on KVM
  */
@@ -302,6 +305,22 @@ static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
apic_write(APIC_EOI, APIC_EOI_ACK);
 }
 
+static void kvm_register_vcpu_state(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_vcpu_state *v_state;
+
+   if (!has_vcpu_state)
+   return;
+
+   v_state = &per_cpu(vcpu_state, cpu);
+   memset(v_state, 0, sizeof(*v_state));
+
+   wrmsrl(MSR_KVM_VCPU_STATE, (__pa(v_state) | KVM_MSR_ENABLED));
+   printk(KERN_INFO "kvm-vcpustate: cpu %d, msr %lx\n",
+   cpu, __pa(v_state));
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
if (!kvm_para_available())
@@ -330,6 +349,9 @@ void __cpuinit kvm_guest_cpu_init(void)
 
if (has_steal_clock)
kvm_register_steal_time();
+
+   if (has_vcpu_state)
+   kvm_register_vcpu_state();
 }
 
 static void kvm_pv_disable_apf(void)
@@ -393,6 +415,14 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+void kvm_disable_vcpu_state(void)
+{
+   if (!has_vcpu_state)
+   return;
+
+   wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -410,6 +440,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
 
 static void kvm_guest_cpu_offline(void *dummy)
 {
+   kvm_disable_vcpu_state();
kvm_disable_steal_time();
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
wrmsrl(MSR_KVM_PV_EOI_EN, 0);
@@ -469,6 +500,11 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
+#ifdef CONFIG_PARAVIRT_TLB_FLUSH
+   if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
+   has_vcpu_state = 1;
+#endif
+
 #ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(&kvm_cpu_notifier);



[PATCH v4 2/8] mm: Add missing TLB invalidate to RCU page-table freeing

2012-08-21 Thread Nikunj A. Dadhania
From: Peter Zijlstra 

For normal systems we need a TLB invalidate before freeing the
page-tables, the generic RCU based page-table freeing code lacked
this.

This is because this code originally came from ppc where the hardware
never walks the linux page-tables and thus this invalidate is not
required.

Others, notably s390 which ran into this problem in cd94154cc6a
("[S390] fix tlb flushing for page table pages"), do very much need
this TLB invalidation.

Therefore add it, with a Kconfig option to disable it so as not to
unduly slow down PPC and SPARC64, neither of which needs it.

Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/n/tip-z32nke0csqopykthsk1zj...@git.kernel.org

Not for inclusion - is part of PeterZ's "Unify TLB gather implementations"
http://mid.gmane.org/20120627211540.459910...@chello.nl

[Fix to check *batch is not NULL]
Signed-off-by: Nikunj A. Dadhania 
---
 arch/Kconfig |3 +++
 arch/powerpc/Kconfig |1 +
 arch/sparc/Kconfig   |1 +
 mm/memory.c  |   43 +--
 4 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8c3d957..fec1c9b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -231,6 +231,9 @@ config HAVE_ARCH_MUTEX_CPU_RELAX
 config HAVE_RCU_TABLE_FREE
bool
 
+config STRICT_TLB_FILL
+   bool
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 9a5d3cd..fb70260 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -127,6 +127,7 @@ config PPC
select GENERIC_IRQ_SHOW_LEVEL
select IRQ_FORCED_THREADING
select HAVE_RCU_TABLE_FREE if SMP
+   select STRICT_TLB_FILL
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_BPF_JIT if PPC64
select HAVE_ARCH_JUMP_LABEL
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e74ff13..126e500 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -52,6 +52,7 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
+   select STRICT_TLB_FILL
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_SYSCALL_WRAPPERS
diff --git a/mm/memory.c b/mm/memory.c
index 91f6945..2ef9ce1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -332,12 +332,47 @@ static void tlb_remove_table_rcu(struct rcu_head *head)
free_page((unsigned long)batch);
 }
 
+#ifdef CONFIG_STRICT_TLB_FILL
+/*
+ * Some architectures (sparc64, ppc) cannot refill TLBs after they've removed
+ * the PTE entries from their hash-table. Their hardware never looks at the
+ * linux page-table structures, so they don't need a hardware TLB invalidate
+ * when tearing down the page-table structure itself.
+ */
+static inline void tlb_table_flush_mmu(struct mmu_gather *tlb) { }
+
+/*
+ * When there's less than two users of this mm there cannot be
+ * a concurrent page-table walk.
+ */
+static inline bool tlb_table_fast(struct mmu_gather *tlb)
+{
+   return atomic_read(&tlb->mm->mm_users) < 2;
+}
+#else
+static inline void tlb_table_flush_mmu(struct mmu_gather *tlb)
+{
+   tlb_flush_mmu(tlb);
+}
+
+/*
+ * Even if there's only a single user, speculative TLB loads can
+ * wreck stuff.
+ */
+static inline bool tlb_table_fast(struct mmu_gather *tlb)
+{
+   return false;
+}
+#endif /* CONFIG_STRICT_TLB_FILL */
+
 void tlb_table_flush(struct mmu_gather *tlb)
 {
struct mmu_table_batch **batch = &tlb->batch;
 
if (*batch) {
-   call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
+   tlb_table_flush_mmu(tlb);
+   if (*batch)
+   call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
*batch = NULL;
}
 }
@@ -348,11 +383,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 
tlb->need_flush = 1;
 
-   /*
-* When there's less then two users of this mm there cannot be a
-* concurrent page-table walk.
-*/
-   if (atomic_read(&tlb->mm->mm_users) < 2) {
+   if (tlb_table_fast(tlb)) {
__tlb_remove_table(table);
return;
}



[PATCH v4 1/8] mm, x86: Add HAVE_RCU_TABLE_FREE support

2012-08-21 Thread Nikunj A. Dadhania
From: Peter Zijlstra 

Implements optional HAVE_RCU_TABLE_FREE support for x86.

This is useful for things like Xen and KVM, where a paravirt TLB flush
means that software page-table walkers like GUP-fast cannot rely on IRQ
disabling the way regular x86 can.

Not for inclusion - is part of PeterZ's "Unify TLB gather implementations"
http://mid.gmane.org/20120627211540.459910...@chello.nl

Cc: Nikunj A Dadhania 
Cc: Jeremy Fitzhardinge 
Cc: Avi Kivity 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/n/tip-r106wg6t7crxxhva55jna...@git.kernel.org
---
 arch/x86/include/asm/tlb.h |1 +
 arch/x86/mm/pgtable.c  |6 +++---
 include/asm-generic/tlb.h  |9 +
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 4fef207..f5489f0 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_TLB_H
 #define _ASM_X86_TLB_H
 
+#define __tlb_remove_table(table) free_page_and_swap_cache(table)
 #define tlb_start_vma(tlb, vma) do { } while (0)
 #define tlb_end_vma(tlb, vma) do { } while (0)
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..34fa168 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
pgtable_page_dtor(pte);
paravirt_release_pte(page_to_pfn(pte));
-   tlb_remove_page(tlb, pte);
+   tlb_remove_table(tlb, pte);
 }
 
 #if PAGETABLE_LEVELS > 2
 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 {
paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
-   tlb_remove_page(tlb, virt_to_page(pmd));
+   tlb_remove_table(tlb, virt_to_page(pmd));
 }
 
 #if PAGETABLE_LEVELS > 3
 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 {
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-   tlb_remove_page(tlb, virt_to_page(pud));
+   tlb_remove_table(tlb, virt_to_page(pud));
 }
 #endif /* PAGETABLE_LEVELS > 3 */
 #endif /* PAGETABLE_LEVELS > 2 */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index ed6642a..d382b22 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -19,6 +19,8 @@
 #include 
 #include 
 
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -60,6 +62,13 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+#else
+
+static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+   tlb_remove_page(tlb, table);
+}
+
 #endif
 
 /*



[PATCH v4 0/8] KVM paravirt remote flush tlb

2012-08-21 Thread Nikunj A. Dadhania
The remote flushing APIs do a busy wait, which is fine in a bare-metal
scenario. But within a guest, the vcpus might have been pre-empted or
blocked. In this scenario, the initiator vcpu would end up busy-waiting
for a long time.

This was discovered in our gang scheduling test; another way to solve
it is by para-virtualizing flush_tlb_others_ipi (which now shows up as
smp_call_function_many after Alex Shi's TLB optimization).

This patch set implements para-virt TLB flushes that do not wait for
vcpus that are sleeping; instead, the sleeping vcpus flush the TLB on
guest enter. The idea was discussed here:
https://lkml.org/lkml/2012/2/20/157

This also adds a dependency for the lock-less page walk performed by
get_user_pages_fast (gup_fast). gup_fast disables interrupts and
assumes that the pages will not be freed during that period. This was
fine as long as flush_tlb_others_ipi waited for all the IPIs to be
processed before returning. With the new approach of not waiting for
sleeping vcpus, that assumption no longer holds. HAVE_RCU_TABLE_FREE
is therefore used to free the pages, which makes sure that all the
cpus process the smp callback at least once before the pages are
freed (see the sketch below).
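A simplified sketch of that race (hypothetical interleaving, not the
actual gup_fast code), written as a C comment:

    /*
     *   vcpu0 (gup_fast)                vcpu1 (munmap)
     *   ----------------                --------------
     *   local_irq_disable();
     *   pte = *ptep;                    free_pgtables()
     *                                     flush_tlb_others()
     *                                       -> vcpu0 is pre-empted: no IPI,
     *                                          flush deferred to guest entry
     *                                     frees the page-table page
     *   ...walks the freed table...     <-- use-after-free
     *   local_irq_enable();
     *
     * With HAVE_RCU_TABLE_FREE the page-table page is freed only after an
     * RCU-sched grace period, i.e. after vcpu0 has re-enabled interrupts
     * and passed through the smp callback path.
     */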

Changelog from v3:
• Add helper for cleaning up vcpu_state information (Marcelo)
• Fix code for checking vs_page and leaking page refs (Marcelo)

Changelog from v2:
• Rebase to 3.5 based linus(commit - f7da9cd) kernel.
• Port PV-Flush to new TLB-Optimization code by Alex Shi
• Use pinned pages to avoid overhead during guest enter/exit (Marcelo)
• Remove kick, as this is not improving much
• Use bit fields in the state(flush_on_enter and vcpu_running) flag to
  avoid smp barriers (Marcelo)

Changelog from v1:
• Race fixes reported by Vatsa
• Address gup_fast dependency using PeterZ's rcu table free patch
• Fix rcu_table_free for hw pagetable walkers

Here are the results from PLE hardware. Setup details:
• 32 CPUs (HT disabled)
• 64-bit VM
   • 32vcpus
   • 8GB RAM

base =  3.6-rc1 + ple handler optimization patch
pvflushv4 =  3.6-rc1 + ple handler optimization patch + pvflushv4 patch

kernbench (lower is better)
===========================
 base  pvflushv4  %improvement
1VM48.5800   46.8513   3.55846
2VM   108.1823  104.6410   3.27346
3VM   183.2733  163.3547  10.86825

ebizzy (higher is better)
=========================
 base pvflushv4  %improvement
1VM 2414.5000 2089.8750 -13.44481
2VM 2167.6250 2371.7500  9.41699
3VM 1600.1111 2102.5556 31.40060
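For reference, %improvement here is (base - new)/base for the
lower-is-better kernbench runs and (new - base)/base for the
higher-is-better ebizzy runs; e.g. kernbench 1VM:
(48.5800 - 46.8513)/48.5800 = 3.55846%, and ebizzy 1VM:
(2089.8750 - 2414.5000)/2414.5000 = -13.44481%.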

Thanks Raghu for running the tests.

[1] http://article.gmane.org/gmane.linux.kernel/1329752

---

Nikunj A. Dadhania (6):
  KVM Guest: Add VCPU running/pre-empted state for guest
  KVM-HV: Add VCPU running/pre-empted state for guest
  KVM Guest: Add paravirt kvm_flush_tlb_others
  KVM-HV: Add flush_on_enter before guest enter
  Enable HAVE_RCU_TABLE_FREE for kvm when PARAVIRT_TLB_FLUSH is enabled
  KVM-doc: Add paravirt tlb flush document

Peter Zijlstra (2):
  mm, x86: Add HAVE_RCU_TABLE_FREE support
  mm: Add missing TLB invalidate to RCU page-table freeing


 Documentation/virtual/kvm/msr.txt|4 +
 Documentation/virtual/kvm/paravirt-tlb-flush.txt |   53 ++
 arch/Kconfig |3 +
 arch/powerpc/Kconfig |1 
 arch/sparc/Kconfig   |1 
 arch/x86/Kconfig |   11 +++
 arch/x86/include/asm/kvm_host.h  |7 ++
 arch/x86/include/asm/kvm_para.h  |   13 +++
 arch/x86/include/asm/tlb.h   |1 
 arch/x86/include/asm/tlbflush.h  |   11 +++
 arch/x86/kernel/kvm.c|   38 ++
 arch/x86/kvm/cpuid.c |1 
 arch/x86/kvm/x86.c   |   84 +-
 arch/x86/mm/pgtable.c|6 +-
 arch/x86/mm/tlb.c|   36 +
 include/asm-generic/tlb.h|9 ++
 mm/memory.c  |   43 ++-
 17 files changed, 311 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/virtual/kvm/paravirt-tlb-flush.txt



Re: [PATCH v3 4/8] KVM-HV: Add VCPU running/pre-empted state for guest

2012-08-04 Thread Nikunj A Dadhania
On Fri, 3 Aug 2012 14:31:22 -0300, Marcelo Tosatti  wrote:
> On Fri, Aug 03, 2012 at 11:25:44AM +0530, Nikunj A Dadhania wrote:
> > On Thu, 2 Aug 2012 16:56:28 -0300, Marcelo Tosatti  wrote:
> > > >  
> > > > +   case MSR_KVM_VCPU_STATE:
> > > > +   vcpu->arch.v_state.vs_page = gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
> > > > +   vcpu->arch.v_state.vs_offset = data & ~(PAGE_MASK | KVM_MSR_ENABLED);
> > > 
> > > Assign vs_offset after success.
> > > 
> > > > +
> > > > +   if (is_error_page(vcpu->arch.v_state.vs_page)) {
> > > > +   kvm_release_page_clean(vcpu->arch.time_page);
> > > > +   vcpu->arch.v_state.vs_page = NULL;
> > > > +   pr_info("KVM: VCPU_STATE - Unable to pin the 
> > > > page\n");
> > > 
> > > Missing break or return;
> > > 
> > > > +   }
> > > > +   vcpu->arch.v_state.msr_val = data;
> > > > +   break;
> > > > +
> > > > case MSR_IA32_MCG_CTL:
> > > 
> > > Please verify this code carefully again.
> > > 
> > > Also leaking the page reference.
> > > 
> > > > vcpu->arch.apf.msr_val = 0;
> > > > vcpu->arch.st.msr_val = 0;
> > > > +   vcpu->arch.v_state.msr_val = 0;
> > > 
> > > Add a newline and comment (or even better a new helper).
> > > >  
> > > > kvmclock_reset(vcpu);
> > > 
> > 
> > How about something like the below. I have tried to look at time_page
> > for reference:
> > 
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 580abcf..c82cc12 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1604,6 +1604,16 @@ static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
> > kunmap_atomic(kaddr);
> >  }
> >  
> > +static void kvm_vcpu_state_reset(struct kvm_vcpu *vcpu)
> > +{
> > +   vcpu->arch.v_state.msr_val = 0;
> > +   vcpu->arch.v_state.vs_offset = 0;
> > +   if (vcpu->arch.v_state.vs_page) {
> > +   kvm_release_page_dirty(vcpu->arch.v_state.vs_page);
> > +   vcpu->arch.v_state.vs_page = NULL;
> > +   }
> > +}
> > +
> >  int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >  {
> > bool pr = false;
> > @@ -1724,14 +1734,17 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> > break;
> >  
> > case MSR_KVM_VCPU_STATE:
> > +   kvm_vcpu_state_reset(vcpu);
> > +
> > vcpu->arch.v_state.vs_page = gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
> > -   vcpu->arch.v_state.vs_offset = data & ~(PAGE_MASK | KVM_MSR_ENABLED);
> 
> Should also fail if its not enabled (KVM_MSR_ENABLED bit).
> 
> What is the point of having non-NULL vs_page pointer if KVM_MSR_ENABLED
> bit is not set?
> 
Yes, will do that.

> The rest is fine, thanks.
> 

Thanks
Nikunj



Re: [PATCH v3 4/8] KVM-HV: Add VCPU running/pre-empted state for guest

2012-08-02 Thread Nikunj A Dadhania
On Thu, 2 Aug 2012 16:56:28 -0300, Marcelo Tosatti  wrote:
> >  
> > +   case MSR_KVM_VCPU_STATE:
> > +   vcpu->arch.v_state.vs_page = gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
> > +   vcpu->arch.v_state.vs_offset = data & ~(PAGE_MASK | KVM_MSR_ENABLED);
> 
> Assign vs_offset after success.
> 
> > +
> > +   if (is_error_page(vcpu->arch.v_state.vs_page)) {
> > +   kvm_release_page_clean(vcpu->arch.time_page);
> > +   vcpu->arch.v_state.vs_page = NULL;
> > +   pr_info("KVM: VCPU_STATE - Unable to pin the page\n");
> 
> Missing break or return;
> 
> > +   }
> > +   vcpu->arch.v_state.msr_val = data;
> > +   break;
> > +
> > case MSR_IA32_MCG_CTL:
> 
> Please verify this code carefully again.
> 
> Also leaking the page reference.
> 
> > vcpu->arch.apf.msr_val = 0;
> > vcpu->arch.st.msr_val = 0;
> > +   vcpu->arch.v_state.msr_val = 0;
> 
> Add a newline and comment (or even better a new helper).
> >  
> > kvmclock_reset(vcpu);
> 

How about something like the below. I have tried to look at time_page
for reference:


diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 580abcf..c82cc12 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1604,6 +1604,16 @@ static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
kunmap_atomic(kaddr);
 }
 
+static void kvm_vcpu_state_reset(struct kvm_vcpu *vcpu)
+{
+   vcpu->arch.v_state.msr_val = 0;
+   vcpu->arch.v_state.vs_offset = 0;
+   if (vcpu->arch.v_state.vs_page) {
+   kvm_release_page_dirty(vcpu->arch.v_state.vs_page);
+   vcpu->arch.v_state.vs_page = NULL;
+   }
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
bool pr = false;
@@ -1724,14 +1734,17 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
break;
 
case MSR_KVM_VCPU_STATE:
+   kvm_vcpu_state_reset(vcpu);
+
vcpu->arch.v_state.vs_page = gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
-   vcpu->arch.v_state.vs_offset = data & ~(PAGE_MASK | KVM_MSR_ENABLED);
 
if (is_error_page(vcpu->arch.v_state.vs_page)) {
-   kvm_release_page_clean(vcpu->arch.time_page);
+   kvm_release_page_clean(vcpu->arch.v_state.vs_page);
vcpu->arch.v_state.vs_page = NULL;
pr_info("KVM: VCPU_STATE - Unable to pin the page\n");
+   break;
}
+   vcpu->arch.v_state.vs_offset = data & ~(PAGE_MASK | KVM_MSR_ENABLED);
vcpu->arch.v_state.msr_val = data;
break;
 
@@ -6053,6 +6066,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 {
kvmclock_reset(vcpu);
+   kvm_vcpu_state_reset(vcpu);
 
free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
fx_free(vcpu);
@@ -6109,9 +6123,9 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
kvm_make_request(KVM_REQ_EVENT, vcpu);
vcpu->arch.apf.msr_val = 0;
vcpu->arch.st.msr_val = 0;
-   vcpu->arch.v_state.msr_val = 0;
 
kvmclock_reset(vcpu);
+   kvm_vcpu_state_reset(vcpu);
 
kvm_clear_async_pf_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);



Re: [PATCH v3 6/8] KVM-HV: Add flush_on_enter before guest enter

2012-08-02 Thread Nikunj A Dadhania
On Thu, 2 Aug 2012 17:16:41 -0300, Marcelo Tosatti  wrote:
> On Thu, Aug 02, 2012 at 05:14:02PM -0300, Marcelo Tosatti wrote:
> > On Tue, Jul 31, 2012 at 04:19:02PM +0530, Nikunj A. Dadhania wrote:
> > > From: Nikunj A. Dadhania 
> > >  
> > >  static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
> > >  {
> > > @@ -1584,7 +1573,8 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
> > >   kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
> > >   kaddr += vcpu->arch.v_state.vs_offset;
> > >   vs = kaddr;
> > > - kvm_set_atomic(&vs->state, 0, 1 << KVM_VCPU_STATE_IN_GUEST_MODE);
> > > + if (xchg(&vs->state, VS_IN_GUEST) == VS_SHOULD_FLUSH)
> > > + kvm_x86_ops->tlb_flush(vcpu);
> > >   kunmap_atomic(kaddr);
> > >  }
> > >  
> > > @@ -1600,7 +1590,8 @@ static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
> > >   kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
> > >   kaddr += vcpu->arch.v_state.vs_offset;
> > >   vs = kaddr;
> > > - kvm_set_atomic(&vs->state, 1 << KVM_VCPU_STATE_IN_GUEST_MODE, 0);
> > > + if (xchg(&vs->state, VS_NOT_IN_GUEST) == VS_SHOULD_FLUSH)
> > > + kvm_x86_ops->tlb_flush(vcpu);
> > >   kunmap_atomic(kaddr);
> > >  }
> > 
> > Nevermind the early comment (the other comments on that message are
> > valid).
I assume the above is related to the kvm_set_atomic comment in [3/8].

> 
> Ah, so the pseudocode mentions flush-on-exit because we can be clearing
> the flag on xchg. Setting KVM_REQ_TLB_FLUSH instead should be enough
> (please verify).
> 
Yes, that will work while exiting. 

In the vm_enter case, we need to do a kvm_x86_ops->tlb_flush(vcpu), as
we have already passed the phase of checking the request.

Nikunj



Re: [PATCH v3 4/8] KVM-HV: Add VCPU running/pre-empted state for guest

2012-08-02 Thread Nikunj A Dadhania
On Thu, 2 Aug 2012 16:56:28 -0300, Marcelo Tosatti  wrote:
> On Tue, Jul 31, 2012 at 04:18:41PM +0530, Nikunj A. Dadhania wrote:
> > From: Nikunj A. Dadhania 
> > 
> > Hypervisor code to indicate guest running/pre-empted status through
> > an MSR. The page is now pinned at MSR write time, and
> > kmap_atomic/kunmap_atomic are used to access the shared vcpu_state area.
> > 
> > Suggested-by: Marcelo Tosatti 
> > Signed-off-by: Nikunj A. Dadhania 
> > ---
> >  arch/x86/include/asm/kvm_host.h |7 
> >  arch/x86/kvm/cpuid.c|1 +
> >  arch/x86/kvm/x86.c  |   71 ++-
> >  3 files changed, 77 insertions(+), 2 deletions(-)

[...]

> > +static void kvm_set_atomic(u64 *addr, u64 old, u64 new)
> > +{
> > +   int loop = 100;
> > +   while (1) {
> > +   if (cmpxchg(addr, old, new) == old)
> > +   break;
> > +   loop--;
> > +   if (!loop) {
> > +   pr_info("atomic cur: %lx old: %lx new: %lx\n",
> > +   *addr, old, new);
> > +   break;
> > +   }
> > +   }
> > +}
> 
> A generic "kvm_set_atomic" would need that loop, but in the particular
> TLB flush case we know that the only information being transmitted is 
> a TLB flush.
> 
Yes, the next patch gets rid of this in a neater way.

> So this idea should work:
> 
> old = *addr;
> if (cmpxchg(addr, old, IN_GUEST_MODE) == FAILURE) 
>   kvm_x86_ops->tlb_flush()
> atomic_set(addr, IN_GUEST_MODE);
> } else if {
> if (old & TLB_SHOULD_FLUSH)
>   kvm_x86_ops->tlb_flush()
> }
> 
> (the actual pseudocode above is pretty ugly and
> must be improved, but it should be enough to transmit
> the idea).
> 
> Of course as long as you make sure the atomic_set does not
> overwrite information.
> 
> 
> > +   char *kaddr;
> > +
> > +   if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) ||
> > +   !vcpu->arch.v_state.vs_page)
> > +   return;
>
> If its not enabled vs_page should be NULL?
> 
Yes, it should be:

if (!(enabled && vs_page))
   return;

> > +
> > +   kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
> > +   kaddr += vcpu->arch.v_state.vs_offset;
> > +   vs = kaddr;
> > +   kvm_set_atomic(&vs->state, 0, 1 << KVM_VCPU_STATE_IN_GUEST_MODE);
> > +   kunmap_atomic(kaddr);
> > +}
> > +
> > +static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
> > +{
> > +   struct kvm_vcpu_state *vs;
> > +   char *kaddr;
> > +
> > +   if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) ||
> > +   !vcpu->arch.v_state.vs_page)
> > +   return;
> 
> Like above.
> 
> > +   kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
> > +   kaddr += vcpu->arch.v_state.vs_offset;
> > +   vs = kaddr;
> > +   kvm_set_atomic(&vs->state, 1 << KVM_VCPU_STATE_IN_GUEST_MODE, 0);
> > +   kunmap_atomic(kaddr);
> > +}
> > +
> >  int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >  {
> > bool pr = false;
> > @@ -1676,6 +1723,18 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> > return 1;
> > break;
> >  
> > +   case MSR_KVM_VCPU_STATE:
> > +   vcpu->arch.v_state.vs_page = gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
> > +   vcpu->arch.v_state.vs_offset = data & ~(PAGE_MASK | KVM_MSR_ENABLED);
> 
> Assign vs_offset after success.
>
Will do that.
 
> > +
> > +   if (is_error_page(vcpu->arch.v_state.vs_page)) {
> > +   kvm_release_page_clean(vcpu->arch.time_page);
C&P error :(
kvm_release_page_clean(vcpu->arch.v_state.vs_page);

> > +   vcpu->arch.v_state.vs_page = NULL;
> > +   pr_info("KVM: VCPU_STATE - Unable to pin the page\n");
> 
> Missing break or return;
> 
Right

> > +   }
> > +   vcpu->arch.v_state.msr_val = data;
> > +   break;
> > +
> > case MSR_IA32_MCG_CTL:
> 
> Please verify this code carefully again.
> 
> Also leaking the page reference.
> 
> > vcpu->arch.apf.msr_val = 0;
> > vcpu->arch.st.msr_val = 0;
> > +   vcpu->arch.v_state.msr_val = 0;
> 
> Add a newline and comment (or even better a new helper).
>
Will do.

Thanks for the detailed review.

Nikunj



Re: [PATCH v2 6/7] kvm,x86: RCU based table free

2012-08-01 Thread Nikunj A Dadhania
Hi Stefano,

On Wed, 1 Aug 2012 12:23:37 +0100, Stefano Stabellini  wrote:
> On Tue, 5 Jun 2012, Stefano Stabellini wrote:
> > On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > > On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > > > PeterZ, is 7/7 alright to be picked?
> > > 
> > > Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> > > though. But these two patches together should make it work for x86.
> > > 
> > 
> > Good. Do you think they are OK for 3.5-rc2? Or is it better to wait for
> > 3.6?
> > 
> 
> Hello Nikunj,
> what happened to this patch series?
> In particular I am interested in the following two patches:
> 
> kvm,x86: RCU based table free
> Flush page-table pages before freeing them
> 
> do you still intend to carry on with the development? Is there anything
> missing that is preventing them from going upstream?
>
I have posted a v3 on the kvm-list:
http://www.spinics.net/lists/kvm/msg76955.html

I am carrying the above two patches (with one fix) in my series as well
for completeness.

I have picked up the patches from PeterZ's "Unify TLB gather
implementations -v3"
http://article.gmane.org/gmane.linux.kernel.mm/81278

Regards
Nikunj



[PATCH v3 8/8] KVM-doc: Add paravirt tlb flush document

2012-07-31 Thread Nikunj A. Dadhania

Signed-off-by: Nikunj A. Dadhania 
---
 Documentation/virtual/kvm/msr.txt|4 ++
 Documentation/virtual/kvm/paravirt-tlb-flush.txt |   53 ++
 2 files changed, 57 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/virtual/kvm/paravirt-tlb-flush.txt

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 7304710..92a6af6 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -256,3 +256,7 @@ MSR_KVM_EOI_EN: 0x4b564d04
guest must both read the least significant bit in the memory area and
clear it using a single CPU instruction, such as test and clear, or
compare and exchange.
+
+MSR_KVM_VCPU_STATE: 0x4b564d05
+
+Refer: Documentation/virtual/kvm/paravirt-tlb-flush.txt
diff --git a/Documentation/virtual/kvm/paravirt-tlb-flush.txt b/Documentation/virtual/kvm/paravirt-tlb-flush.txt
new file mode 100644
index 0000000..0eaabd7
--- /dev/null
+++ b/Documentation/virtual/kvm/paravirt-tlb-flush.txt
@@ -0,0 +1,53 @@
+KVM - Paravirt TLB Flush
+Nikunj A Dadhania , IBM, 2012
+
+
+The remote flushing APIs do a busy wait, which is fine in a bare-metal
+scenario. But within a guest, the vcpus might have been pre-empted or
+blocked. In this scenario, the initiator vcpu would end up busy-waiting
+for a long time.
+
+Deciding whether to wait requires knowing, inside the guest, whether a
+vcpu is running or not. The following MSR exposes vcpu running-state
+information.
+
+Using this MSR we have implemented para-virt TLB flushes that do not
+wait for vcpus that are not running; TLB flushing for them is deferred
+and performed on guest enter.
+
+MSR_KVM_VCPU_STATE: 0x4b564d05
+
+   data: 64-byte aligned physical address of a memory area which must be
+   in guest RAM, plus an enable bit in bit 0. This memory is expected to
+   hold a copy of the following structure:
+
+   struct kvm_vcpu_state {
+   __u64 state;
+   __u32 pad[14];
+   }
+
+   whose data will be filled in by the hypervisor/guest. Only one
+   write, or registration, is needed for each VCPU.  The interval
+   between updates of this structure is arbitrary and
+   implementation-dependent.  The hypervisor may update this
+   structure at any time it sees fit until anything with bit0 ==
+   0 is written to it. The guest is required to make sure this
+   structure is initialized to zero.
+
+   This enables a VCPU to know the running status of sibling
+   VCPUs. The information can further be used to decide whether an
+   IPI needs to be sent to a non-running VCPU at all, instead of
+   waiting for it unnecessarily, e.g. in flush_tlb_others_ipi.
+
+   Fields have the following meanings:
+
+   state: has the following bit fields:
+
+   Bit 0 - vcpu running state. The hypervisor sets this to
+   indicate whether the vcpu is running: value 1
+   means the vcpu is running, value 0 means it is
+   pre-empted out.
+
+   Bit 1 - when set, the hypervisor should flush the tlb
+   during guest enter/exit.
+



[PATCH v3 7/8] Enable HAVE_RCU_TABLE_FREE for kvm when PARAVIRT_TLB_FLUSH is enabled

2012-07-31 Thread Nikunj A. Dadhania

Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/Kconfig |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c70684f..354160d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -612,6 +612,17 @@ config PARAVIRT_SPINLOCKS
 
  If you are unsure how to answer this question, answer N.
 
+config PARAVIRT_TLB_FLUSH
+   bool "Paravirtualization layer for TLB Flush"
+   depends on PARAVIRT && SMP && EXPERIMENTAL
+   select HAVE_RCU_TABLE_FREE
+   ---help---
+ Paravirtualized TLB flush replaces the native implementation
+ with something virtualization-friendly (for example, set a
+ flag for sleeping vcpu and do not wait for it).
+
+ If you are unsure how to answer this question, answer N.
+
 config PARAVIRT_CLOCK
bool
 



[PATCH v3 4/8] KVM-HV: Add VCPU running/pre-empted state for guest

2012-07-31 Thread Nikunj A. Dadhania
From: Nikunj A. Dadhania 

Hypervisor code to indicate guest running/pre-empteded status through
msr. The page is now pinned during MSR write time and use
kmap_atomic/kunmap_atomic to access the shared area vcpu_state area.

Suggested-by: Marcelo Tosatti 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/include/asm/kvm_host.h |7 
 arch/x86/kvm/cpuid.c|1 +
 arch/x86/kvm/x86.c  |   71 ++-
 3 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09155d6..441348f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -429,6 +429,13 @@ struct kvm_vcpu_arch {
struct kvm_steal_time steal;
} st;
 
+   /* indicates vcpu is running or preempted */
+   struct {
+   u64 msr_val;
+   struct page *vs_page;
+   unsigned int vs_offset;
+   } v_state;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_host_tsc;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0595f13..37ab364 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -411,6 +411,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 (1 << KVM_FEATURE_CLOCKSOURCE2) |
 (1 << KVM_FEATURE_ASYNC_PF) |
 (1 << KVM_FEATURE_PV_EOI) |
+(1 << KVM_FEATURE_VCPU_STATE) |
 (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
 
if (sched_info_on())
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59b5950..580abcf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -806,13 +806,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN9
+#define KVM_SAVE_MSRS_BEGIN10
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
-   MSR_KVM_PV_EOI_EN,
+   MSR_KVM_VCPU_STATE, MSR_KVM_PV_EOI_EN,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1557,6 +1557,53 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
 }
 
+static void kvm_set_atomic(u64 *addr, u64 old, u64 new)
+{
+   int loop = 100;
+   while (1) {
+   if (cmpxchg(addr, old, new) == old)
+   break;
+   loop--;
+   if (!loop) {
+   pr_info("atomic cur: %lx old: %lx new: %lx\n",
+   *addr, old, new);
+   break;
+   }
+   }
+}
+
+static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
+{
+   struct kvm_vcpu_state *vs;
+   char *kaddr;
+
+   if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) ||
+   !vcpu->arch.v_state.vs_page)
+   return;
+
+   kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
+   kaddr += vcpu->arch.v_state.vs_offset;
+   vs = kaddr;
+   kvm_set_atomic(&vs->state, 0, 1 << KVM_VCPU_STATE_IN_GUEST_MODE);
+   kunmap_atomic(kaddr);
+}
+
+static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
+{
+   struct kvm_vcpu_state *vs;
+   char *kaddr;
+
+   if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED) ||
+   !vcpu->arch.v_state.vs_page)
+   return;
+
+   kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
+   kaddr += vcpu->arch.v_state.vs_offset;
+   vs = kaddr;
+   kvm_set_atomic(&vs->state, 1 << KVM_VCPU_STATE_IN_GUEST_MODE, 0);
+   kunmap_atomic(kaddr);
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
bool pr = false;
@@ -1676,6 +1723,18 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
return 1;
break;
 
+   case MSR_KVM_VCPU_STATE:
+   vcpu->arch.v_state.vs_page = gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
+   vcpu->arch.v_state.vs_offset = data & ~(PAGE_MASK | KVM_MSR_ENABLED);
+
+   if (is_error_page(vcpu->arch.v_state.vs_page)) {
+   kvm_release_page_clean(vcpu->arch.v_state.vs_page);
+   vcpu->arch.v_state.vs_page = NULL;
+   pr_info("KVM: VCPU_STATE - Unable to pin the page\n");
+   }
+   vcpu->arch.v_state.msr_val = data;
+   break;
+
case MSR_IA32_MC

[PATCH v3 6/8] KVM-HV: Add flush_on_enter before guest enter

2012-07-31 Thread Nikunj A. Dadhania
From: Nikunj A. Dadhania 

A PV-flush guest indicates a pending flush via the shared vcpu state;
the hypervisor then flushes the TLB when entering and exiting the guest.

Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/kvm/x86.c |   23 +++
 1 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 580abcf..a67e971 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1557,20 +1557,9 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
 }
 
-static void kvm_set_atomic(u64 *addr, u64 old, u64 new)
-{
-   int loop = 100;
-   while (1) {
-   if (cmpxchg(addr, old, new) == old)
-   break;
-   loop--;
-   if (!loop) {
-   pr_info("atomic cur: %lx old: %lx new: %lx\n",
-   *addr, old, new);
-   break;
-   }
-   }
-}
+#define VS_NOT_IN_GUEST  (0)
+#define VS_IN_GUEST  (1 << KVM_VCPU_STATE_IN_GUEST_MODE)
+#define VS_SHOULD_FLUSH  (1 << KVM_VCPU_STATE_SHOULD_FLUSH)
 
 static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
 {
@@ -1584,7 +1573,8 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
kaddr += vcpu->arch.v_state.vs_offset;
vs = kaddr;
-   kvm_set_atomic(&vs->state, 0, 1 << KVM_VCPU_STATE_IN_GUEST_MODE);
+   if (xchg(&vs->state, VS_IN_GUEST) == VS_SHOULD_FLUSH)
+   kvm_x86_ops->tlb_flush(vcpu);
kunmap_atomic(kaddr);
 }
 
@@ -1600,7 +1590,8 @@ static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
kaddr = kmap_atomic(vcpu->arch.v_state.vs_page);
kaddr += vcpu->arch.v_state.vs_offset;
vs = kaddr;
-   kvm_set_atomic(&vs->state, 1 << KVM_VCPU_STATE_IN_GUEST_MODE, 0);
+   if (xchg(&vs->state, VS_NOT_IN_GUEST) == VS_SHOULD_FLUSH)
+   kvm_x86_ops->tlb_flush(vcpu);
kunmap_atomic(kaddr);
 }
 



[PATCH v3 5/8] KVM Guest: Add paravirt kvm_flush_tlb_others

2012-07-31 Thread Nikunj A. Dadhania
From: Nikunj A. Dadhania 

flush_tlb_others_ipi depends on a lot of statics in tlb.c.  Replicate
flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt it to
paravirtualization.

Use the vcpu state information inside the kvm_flush_tlb_others to
avoid sending ipi to pre-empted vcpus.

* Do not send IPIs to offline vcpus; set their flush_on_enter flag
* For online vcpus: wait for them to clear the flag

The approach was discussed here: https://lkml.org/lkml/2012/2/20/157

v3:
* use only one state variable for vcpu-running/flush_on_enter
* use cmpxchg to update the state
* adapt to Alex Shi's TLB flush optimization

v2:
* use ACCESS_ONCE so the value is not register cached
* Separate HV and Guest code

Suggested-by: Peter Zijlstra 
Signed-off-by: Nikunj A. Dadhania 

--
Pseudo Algo:

   Hypervisor
   ==
   guest_exit()
   if (xchg(state, NOT_IN_GUEST) == SHOULD_FLUSH)
   tlb_flush(vcpu);

   guest_enter()
   if (xchg(state, IN_GUEST) == SHOULD_FLUSH)
   tlb_flush(vcpu);

Guest
=
flushcpumask = cpumask;
for_each_cpu(i, flushmask) {
state = vs->state;
if(!test_bit(IN_GUEST_MODE, state)) {
if (cmpxchg(&vs->state, state,
state | (1 << SHOULD_FLUSH)) == state) /* success */
   cpumask_clear_cpu(i, flushmask)
}
}
if (!empty(flushmask))
smp_call_function_many(f->flushmask, flush_tlb_func)

---
 arch/x86/include/asm/tlbflush.h |   11 +++
 arch/x86/kernel/kvm.c   |4 +++-
 arch/x86/mm/tlb.c   |   37 +
 3 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 74a4433..0a343a1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -119,6 +119,13 @@ static inline void native_flush_tlb_others(const struct cpumask *cpumask,
 {
 }
 
+static inline void kvm_flush_tlb_others(const struct cpumask *cpumask,
+   struct mm_struct *mm,
+   unsigned long start,
+   unsigned long end)
+{
+}
+
 static inline void reset_lazy_tlbstate(void)
 {
 }
@@ -153,6 +160,10 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
struct mm_struct *mm,
unsigned long start, unsigned long end);
 
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+   struct mm_struct *mm, unsigned long start,
+   unsigned long end);
+
 #define TLBSTATE_OK1
 #define TLBSTATE_LAZY  2
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 37e6599..b538a31 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -501,8 +501,10 @@ void __init kvm_guest_init(void)
apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
 #ifdef CONFIG_PARAVIRT_TLB_FLUSH
-   if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
+   if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE)) {
has_vcpu_state = 1;
+   pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
+   }
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 613cd83..2399013 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -119,6 +120,42 @@ static void flush_tlb_func(void *info)
 
 }
 
+#ifdef CONFIG_KVM_GUEST
+
+DECLARE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+   struct mm_struct *mm, unsigned long start,
+   unsigned long end)
+{
+   struct flush_tlb_info info;
+   struct kvm_vcpu_state *v_state;
+   u64 state;
+   int cpu;
+   cpumask_t flushmask;
+
+   cpumask_copy(&flushmask, cpumask);
+   info.flush_mm = mm;
+   info.flush_start = start;
+   info.flush_end = end;
+   /*
+* We have to call flush only on online vCPUs. And
+* queue flush_on_enter for pre-empted vCPUs
+*/
+   for_each_cpu(cpu, &flushmask) {
+   v_state = &per_cpu(vcpu_state, cpu);
+   state = v_state->state;
+   if (!test_bit(KVM_VCPU_STATE_IN_GUEST_MODE, &state)) {
+   if (cmpxchg(&v_state->state, state, state | 1 << KVM_VCPU_STATE_SHOULD_FLUSH) == state)
+   cpumask_clear_cpu(cpu, &flushmask);
+   }
+   }
+
+   if (!cpumask_empty(&flushmask))
+   smp_call_function_many(&flushmask, flush_tlb_func, &info, 1);
+}
+#endif /* CONFIG_KVM_GUEST */
+
 void native_flush_tlb_others(const struct cpumask *cpumas

[PATCH v3 2/8] mm: Add missing TLB invalidate to RCU page-table freeing

2012-07-31 Thread Nikunj A. Dadhania
From: Peter Zijlstra 

For normal systems we need a TLB invalidate before freeing the
page-tables, the generic RCU based page-table freeing code lacked
this.

This is because this code originally came from ppc where the hardware
never walks the linux page-tables and thus this invalidate is not
required.

Others, notably s390 which ran into this problem in cd94154cc6a
("[S390] fix tlb flushing for page table pages"), do very much need
this TLB invalidation.

Therefore add it, with a Kconfig option to disable it so as to not
unduly slow down PPC and SPARC64 which neither of them need it.

Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/n/tip-z32nke0csqopykthsk1zj...@git.kernel.org

[Fix to check *batch is not NULL]
Signed-off-by: Nikunj A. Dadhania 
---
 arch/Kconfig |3 +++
 arch/powerpc/Kconfig |1 +
 arch/sparc/Kconfig   |1 +
 mm/memory.c  |   43 +--
 4 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8c3d957..fec1c9b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -231,6 +231,9 @@ config HAVE_ARCH_MUTEX_CPU_RELAX
 config HAVE_RCU_TABLE_FREE
bool
 
+config STRICT_TLB_FILL
+   bool
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 9a5d3cd..fb70260 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -127,6 +127,7 @@ config PPC
select GENERIC_IRQ_SHOW_LEVEL
select IRQ_FORCED_THREADING
select HAVE_RCU_TABLE_FREE if SMP
+   select STRICT_TLB_FILL
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_BPF_JIT if PPC64
select HAVE_ARCH_JUMP_LABEL
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e74ff13..126e500 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -52,6 +52,7 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
+   select STRICT_TLB_FILL
select HAVE_MEMBLOCK
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_SYSCALL_WRAPPERS
diff --git a/mm/memory.c b/mm/memory.c
index 91f6945..2ef9ce1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -332,12 +332,47 @@ static void tlb_remove_table_rcu(struct rcu_head *head)
free_page((unsigned long)batch);
 }
 
+#ifdef CONFIG_STRICT_TLB_FILL
+/*
+ * Some architectures (sparc64, ppc) cannot refill TLBs after they've removed
+ * the PTE entries from their hash-table. Their hardware never looks at the
+ * linux page-table structures, so they don't need a hardware TLB invalidate
+ * when tearing down the page-table structure itself.
+ */
+static inline void tlb_table_flush_mmu(struct mmu_gather *tlb) { }
+
+/*
+ * When there's less than two users of this mm there cannot be
+ * a concurrent page-table walk.
+ */
+static inline bool tlb_table_fast(struct mmu_gather *tlb)
+{
+   return atomic_read(&tlb->mm->mm_users) < 2;
+}
+#else
+static inline void tlb_table_flush_mmu(struct mmu_gather *tlb)
+{
+   tlb_flush_mmu(tlb);
+}
+
+/*
+ * Even if there's only a single user, speculative TLB loads can
+ * wreck stuff.
+ */
+static inline bool tlb_table_fast(struct mmu_gather *tlb)
+{
+   return false;
+}
+#endif /* CONFIG_STRICT_TLB_FILL */
+
 void tlb_table_flush(struct mmu_gather *tlb)
 {
struct mmu_table_batch **batch = &tlb->batch;
 
if (*batch) {
-   call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
+   tlb_table_flush_mmu(tlb);
+   if (*batch)
+   call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
*batch = NULL;
}
 }
@@ -348,11 +383,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 
tlb->need_flush = 1;
 
-   /*
-* When there's less then two users of this mm there cannot be a
-* concurrent page-table walk.
-*/
-   if (atomic_read(&tlb->mm->mm_users) < 2) {
+   if (tlb_table_fast(tlb)) {
__tlb_remove_table(table);
return;
}



[PATCH v3 3/8] KVM Guest: Add VCPU running/pre-empted state for guest

2012-07-31 Thread Nikunj A. Dadhania
From: Nikunj A. Dadhania 

The patch adds guest code for the MSR shared between guest and
hypervisor. The MSR exports the vcpu running/pre-empted information to
the guest from the host. This enables the guest to intelligently send
IPIs only to running vcpus and to set a flag for pre-empted vcpus,
preventing waits on vcpus that are not running.

Suggested-by: Peter Zijlstra 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/include/asm/kvm_para.h |   13 +
 arch/x86/kernel/kvm.c   |   36 
 2 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 2f7712e..5dfb975 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -23,6 +23,7 @@
 #define KVM_FEATURE_ASYNC_PF   4
 #define KVM_FEATURE_STEAL_TIME 5
 #define KVM_FEATURE_PV_EOI 6
+#define KVM_FEATURE_VCPU_STATE  7
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -39,6 +40,7 @@
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN  0x4b564d04
+#define MSR_KVM_VCPU_STATE  0x4b564d05
 
 struct kvm_steal_time {
__u64 steal;
@@ -51,6 +53,17 @@ struct kvm_steal_time {
 #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
 #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
 
+struct kvm_vcpu_state {
+   __u64 state;
+   __u32 pad[14];
+};
+/* bits in vcpu_state->state */
+#define KVM_VCPU_STATE_IN_GUEST_MODE 0
+#define KVM_VCPU_STATE_SHOULD_FLUSH  1
+
+#define KVM_VCPU_STATE_ALIGN_BITS 5
+#define KVM_VCPU_STATE_VALID_BITS ((-1ULL << (KVM_VCPU_STATE_ALIGN_BITS + 1)))
+
 #define KVM_MAX_MMU_OP_BATCH   32
 
 #define KVM_ASYNC_PF_ENABLED   (1 << 0)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c1d61ee..37e6599 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -66,6 +66,9 @@ static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
 
+DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+static int has_vcpu_state;
+
 /*
  * No need for any "IO delay" on KVM
  */
@@ -302,6 +305,22 @@ static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
apic_write(APIC_EOI, APIC_EOI_ACK);
 }
 
+static void kvm_register_vcpu_state(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_vcpu_state *v_state;
+
+   if (!has_vcpu_state)
+   return;
+
+   v_state = &per_cpu(vcpu_state, cpu);
+   memset(v_state, 0, sizeof(*v_state));
+
+   wrmsrl(MSR_KVM_VCPU_STATE, (__pa(v_state) | KVM_MSR_ENABLED));
+   printk(KERN_INFO "kvm-vcpustate: cpu %d, msr %lx\n",
+   cpu, __pa(v_state));
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
if (!kvm_para_available())
@@ -330,6 +349,9 @@ void __cpuinit kvm_guest_cpu_init(void)
 
if (has_steal_clock)
kvm_register_steal_time();
+
+   if (has_vcpu_state)
+   kvm_register_vcpu_state();
 }
 
 static void kvm_pv_disable_apf(void)
@@ -393,6 +415,14 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+void kvm_disable_vcpu_state(void)
+{
+   if (!has_vcpu_state)
+   return;
+
+   wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -410,6 +440,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
 
 static void kvm_guest_cpu_offline(void *dummy)
 {
+   kvm_disable_vcpu_state();
kvm_disable_steal_time();
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
wrmsrl(MSR_KVM_PV_EOI_EN, 0);
@@ -469,6 +500,11 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
+#ifdef CONFIG_PARAVIRT_TLB_FLUSH
+   if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
+   has_vcpu_state = 1;
+#endif
+
 #ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(&kvm_cpu_notifier);



[PATCH v3 1/8] mm, x86: Add HAVE_RCU_TABLE_FREE support

2012-07-31 Thread Nikunj A. Dadhania
From: Peter Zijlstra 

Implements optional HAVE_RCU_TABLE_FREE support for x86.

This is useful for things like Xen and KVM where paravirt tlb flush
means the software page table walkers like GUP-fast cannot rely on
IRQs disabling like regular x86 can.

Not for inclusion - is part of PeterZ's "Unify TLB gather implementations"
http://mid.gmane.org/20120627211540.459910...@chello.nl

Cc: Nikunj A Dadhania 
Cc: Jeremy Fitzhardinge 
Cc: Avi Kivity 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/n/tip-r106wg6t7crxxhva55jna...@git.kernel.org
---
 arch/x86/include/asm/tlb.h |1 +
 arch/x86/mm/pgtable.c  |6 +++---
 include/asm-generic/tlb.h  |9 +
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 4fef207..f5489f0 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_TLB_H
 #define _ASM_X86_TLB_H
 
+#define __tlb_remove_table(table) free_page_and_swap_cache(table)
 #define tlb_start_vma(tlb, vma) do { } while (0)
 #define tlb_end_vma(tlb, vma) do { } while (0)
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..34fa168 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
pgtable_page_dtor(pte);
paravirt_release_pte(page_to_pfn(pte));
-   tlb_remove_page(tlb, pte);
+   tlb_remove_table(tlb, pte);
 }
 
 #if PAGETABLE_LEVELS > 2
 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 {
paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
-   tlb_remove_page(tlb, virt_to_page(pmd));
+   tlb_remove_table(tlb, virt_to_page(pmd));
 }
 
 #if PAGETABLE_LEVELS > 3
 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 {
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-   tlb_remove_page(tlb, virt_to_page(pud));
+   tlb_remove_table(tlb, virt_to_page(pud));
 }
 #endif /* PAGETABLE_LEVELS > 3 */
 #endif /* PAGETABLE_LEVELS > 2 */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index ed6642a..d382b22 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -19,6 +19,8 @@
 #include 
 #include 
 
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -60,6 +62,13 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+#else
+
+static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+   tlb_remove_page(tlb, table);
+}
+
 #endif
 
 /*



[PATCH v3 0/8] KVM paravirt remote flush tlb

2012-07-31 Thread Nikunj A. Dadhania

Remote flushing APIs do a busy wait, which is fine in the bare-metal
scenario. But within a guest, the vcpus might have been pre-empted or
blocked. In this scenario, the initiator vcpu would end up busy-waiting
for a long time.

This was discovered in our gang scheduling test, and another way to
solve this is by para-virtualizing flush_tlb_others_ipi (which now
shows up as smp_call_function_many after Alex Shi's TLB optimization).

This patch set implements para-virt TLB flush, making sure the
initiator does not wait for vcpus that are sleeping; instead, all the
sleeping vcpus flush the TLB on guest enter. The idea was discussed here:
https://lkml.org/lkml/2012/2/20/157

This also brings one more dependency for the lock-less page walk
performed by get_user_pages_fast (gup_fast). gup_fast disables
interrupts and assumes that the pages will not be freed during that
period. This was fine, as flush_tlb_others_ipi would wait for all the
IPIs to be processed before returning. With the new approach of not
waiting for the sleeping vcpus, this assumption is no longer valid. So
HAVE_RCU_TABLE_FREE is now used to free the pages, which makes sure
that all the cpus have at least processed the smp callback before the
pages are freed.
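
As a rough user-space model of that invariant (this simplification is
mine, not from the series), the busy wait below stands in for the IPI
wait that gup_fast relies on; it is what call_rcu_sched() replaces:

    /* gup-demo.c: hypothetical model; build: gcc -pthread gup-demo.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    static _Atomic int walking;     /* models IRQs-disabled in gup_fast */
    static _Atomic(int *) table;    /* models a page-table page         */

    static void *walker(void *arg)
    {
            (void)arg;
            atomic_store(&walking, 1);         /* local_irq_save()    */
            int *t = atomic_load(&table);
            if (t)
                    printf("walked entry %d\n", *t);
            atomic_store(&walking, 0);         /* local_irq_restore() */
            return NULL;
    }

    int main(void)
    {
            int *pt = malloc(sizeof(*pt));
            *pt = 42;
            atomic_store(&table, pt);

            pthread_t w;
            pthread_create(&w, NULL, walker, NULL);

            int *old = atomic_exchange(&table, NULL); /* unhook the table */
            while (atomic_load(&walking))      /* the "IPI + wait" that   */
                    ;                          /* the RCU scheme replaces */
            free(old);                         /* only now is this safe   */

            pthread_join(w, NULL);
            return 0;
    }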

Changelog from v2:
• Rebase to 3.5 based linus(commit - f7da9cd) kernel.
• Port PV-Flush to new TLB-Optimization code by Alex Shi
• Use pinned pages to avoid overhead during guest enter/exit (Marcelo)
• Remove kick, as this is not improving much
• Use bit fields in the state(flush_on_enter and vcpu_running) flag to
  avoid smp barriers (Marcelo)
• Add documentation for Paravirt TLB Flush (Marcelo)

Changelog from v1:
• Race fixes reported by Vatsa
• Address gup_fast dependency using PeterZ's rcu table free patch
• Fix rcu_table_free for hw pagetable walkers

Here are the results from PLE hardware. Here is the setup details:
• 32 CPUs (HT disabled)
• 64-bit VM
   • 32vcpus
   • 8GB RAM

Base: f7da9cd (based on 3.5 kernel, includes rik's changes and alex
  shi's changes)
ple-opt: Raghu's PLE improvements [1](in kvm:auto-next now)
pv3flsh: ple-opt + paravirt flush v3

Lower is better
kbench - 1VM

          Avg        Stddev
base      16.714089  1.2471967
pleopt    12.527411  0.15261886
pv3flsh   12.96      0.5041832

kbench - 2VM

          Avg        Stddev
base      28.565933  3.0167804
pleopt    22.7613    1.9046476
pv3flsh   23.034083  2.2192968

Higher is better
ebizzy - 1VM

          Avg      Stddev
base      1091     21.674358
pleopt    2239     45.188494
pv3flsh   2170.7   44.592102

ebizzy - 2VM

          Avg      Stddev
base      1824.7   63.708299
pleopt    2383.2   107.46779
pv3flsh   2328.2   69.359172

Observations:
-
Looking at the results above, ple-opt[1] patches have addressed the
remote-flush-tlb issue that we were trying to address using the
paravirt-tlb-flush approach. 

[1] http://article.gmane.org/gmane.linux.kernel/1329752

---

Nikunj A. Dadhania (6):
  KVM Guest: Add VCPU running/pre-empted state for guest
  KVM-HV: Add VCPU running/pre-empted state for guest
  KVM Guest: Add paravirt kvm_flush_tlb_others
  KVM-HV: Add flush_on_enter before guest enter
  Enable HAVE_RCU_TABLE_FREE for kvm when PARAVIRT_TLB_FLUSH is enabled
  KVM-doc: Add paravirt tlb flush document

Peter Zijlstra (2):
  mm, x86: Add HAVE_RCU_TABLE_FREE support
  mm: Add missing TLB invalidate to RCU page-table freeing


 Documentation/virtual/kvm/msr.txt|4 +
 Documentation/virtual/kvm/paravirt-tlb-flush.txt |   53 +++
 arch/Kconfig |3 +
 arch/powerpc/Kconfig |1 
 arch/sparc/Kconfig   |1 
 arch/x86/Kconfig |   11 
 arch/x86/include/asm/kvm_host.h  |7 ++
 arch/x86/include/asm/kvm_para.h  |   13 +
 arch/x86/include/asm/tlb.h   |1 
 arch/x86/include/asm/tlbflush.h  |   11 
 arch/x86/kernel/kvm.c|   38 +
 arch/x86/kvm/cpuid.c |1 
 arch/x86/kvm/x86.c   |   62 +-
 arch/x86/mm/pgtable.c|6 +-
 arch/x86/mm/tlb.c|   37 +
 include/asm-generic/tlb.h|9 +++
 mm/memory.c  |   43 +--
 17 files changed, 290 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/virtual/kvm/paravirt-tlb-flush.txt


Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit

2012-07-12 Thread Nikunj A Dadhania
On Wed, 11 Jul 2012 16:22:29 +0530, Raghavendra K T 
 wrote:
> On 07/11/2012 02:23 PM, Avi Kivity wrote:
> >
> > This adds some tiny overhead to vcpu entry.  You could remove it by
> > using the vcpu->requests mechanism to clear the flag, since
> > vcpu->requests is already checked on every entry.
> 
> So IIUC,  let's have request bit for indicating PLE,
> 
> pause_interception() /handle_pause()
> {
>   make_request(PLE_REQUEST)
>   vcpu_on_spin()
> 
> }
> 
> check_eligibility()
>   {
>   !test_request(PLE_REQUEST) || ( test_request(PLE_REQUEST)  && 
> dy_eligible())
> .
> .
> }
> 
> vcpu_run()
> {
> 
> check_request(PLE_REQUEST)
>
I know check_request will clear PLE_REQUEST, but you just need a
clear_request here, right?
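
For reference, a sketch of the vcpu->requests pattern being discussed
(KVM_REQ_PLE is a hypothetical request bit for this illustration;
kvm_make_request/kvm_check_request are the existing helpers, the latter
tests and clears the bit in one go):

    /* pause_interception()/handle_pause(): sketch only */
    static int handle_pause(struct kvm_vcpu *vcpu)
    {
            kvm_make_request(KVM_REQ_PLE, vcpu); /* set_bit on vcpu->requests */
            kvm_vcpu_on_spin(vcpu);
            return 1;
    }

    /* vcpu_enter_guest(): sketch only */
    if (vcpu->requests && kvm_check_request(KVM_REQ_PLE, vcpu))
            ; /* test_and_clear_bit: the PLE request bit is gone again */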




Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler

2012-07-12 Thread Nikunj A Dadhania
On Wed, 11 Jul 2012 14:04:03 +0300, Avi Kivity  wrote:
> 
> > So this would probably improve guests that uses cpu_relax, for example
> > stop_machine_run. I have no measurements, though.
> 
> smp_call_function() too (though that can be converted to directed yield
> too).  It seems worthwhile.
> 
With 

https://lkml.org/lkml/2012/6/26/266 in tip:x86/mm

which now uses smp_call_function_many in native_flush_tlb_others, it
will help that too.

Nikunj



Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others

2012-07-06 Thread Nikunj A Dadhania
On Tue, 3 Jul 2012 04:55:35 -0300, Marcelo Tosatti  wrote:
> On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > paravirtualization.
> > 
> > Use the vcpu state information inside the kvm_flush_tlb_others to
> > avoid sending ipi to pre-empted vcpus.
> > 
> > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > * For online vcpus: Wait for them to clear the flag
> > 
> > The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
> > 
> > Suggested-by: Peter Zijlstra 
> > Signed-off-by: Nikunj A. Dadhania 
> > 
> > --
> > Pseudo Algo:
> > 
> >Write()
> >==
> > 
> >guest_exit()
> >flush_on_enter[i]=0;
> >running[i] = 0;
> > 
> >guest_enter()
> >running[i] = 1;
> >smp_mb();
> >if(flush_on_enter[i]) {
> >tlb_flush()
> >flush_on_enter[i]=0;
> >}
> > 
> > 
> >Read()
> >==
> > 
> >GUESTKVM-HV
> > 
> >f->flushcpumask = cpumask - me;
> > 
> > again:
> >for_each_cpu(i, f->flushmask) {
> > 
> >if (!running[i]) {
> >case 1:
> > 
> >running[n]=1
> > 
> >(cpuN does not see
> >flush_on_enter set,
> >guest later finds it
> >running and sends ipi,
> >we are fine here, need
> >to clear the flag on
> >guest_exit)
> > 
> >   flush_on_enter[i] = 1;
> >case2:
> > 
> >running[n]=1
> >(cpuN - will see flush
> >on enter and an IPI as
> >well - addressed in patch-4)
> > 
> >   if (!running[i])
> >  cpu_clear(f->flushmask);  All is well, vm_enter
> >will do the fixup
> >}
> >case 3:
> >running[n] = 0;
> > 
> >(cpuN went to sleep,
> >we saw it as awake,
> >ipi sent, but wait
> >will break without
> >zero_mask and goto
> >again will take care)
> > 
> >}
> >send_ipi(f->flushmask)
> > 
> >wait_a_while_for_zero_mask();
> > 
> >if (!zero_mask)
> >goto again;
> 
> Can you please measure increased vmentry/vmexit overhead? x86/vmexit.c 
> of git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git should 
> help.
> 

Please find below the results (debug patch attached for enabling
registration of kvm_vcpu_state)

I have taken results for 1 and 4 vcpus. Used the following command for
starting the tests:

/usr/libexec/qemu-kvm -smp $i -device testdev,chardev=testlog -chardev
file,id=testlog,path=vmexit.out -serial stdio -kernel ./x86/vmexit.flat

Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU 
  with 32 core, 32 online cpus and 4*64GB RAM.

x  base - unpatched host kernel 
+  wo_vs - patched host kernel, vcpu_state not registered
*  w_vs.txt - patched host kernel and vcpu_state registered

1 vcpu results:
---
cpuid
=
      N    Avg      Stddev
x 10 2135.1  17.8975
+ 10   2188  18.3666
* 10 2448.9  43.9910

vmcall
==
      N    Avg      Stddev
x 10 2025.5  38.1641
+ 10 2047.5  24.8205
* 10 2306.2  40.3066

mov_from_cr8

Re: [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb

2012-07-04 Thread Nikunj A Dadhania
On Wed, 4 Jul 2012 23:37:46 -0300, Marcelo Tosatti  wrote:
> On Tue, Jul 03, 2012 at 01:55:02PM +0530, Nikunj A Dadhania wrote:
> > On Tue, 3 Jul 2012 05:07:13 -0300, Marcelo Tosatti  
> > wrote:
> > > On Mon, Jun 04, 2012 at 10:38:17AM +0530, Nikunj A. Dadhania wrote:
> > > > In place of looping continuously introduce a halt if we do not succeed
> > > > after some time.
> > > > 
> > > > For vcpus that were running an IPI is sent.  In case, it went to sleep
> > > > between this, we will be doing flush_on_enter(harmless). But as a
> > > > flush IPI was already sent, that will be processed in ipi handler,
> > > > this might result into something undesireable, i.e. It might clear the
> > > > flush_mask of a new request.
> > > > 
> > > > So after sending an IPI and waiting for a while, do a halt and wait
> > > > for a kick from the last vcpu.
> > > > 
> > > > Signed-off-by: Srivatsa Vaddagiri 
> > > > Signed-off-by: Nikunj A. Dadhania 
> > > 
> > > Again, was it determined that this is necessary from data of 
> > > benchmarking on the in-guest-mode/out-guest-mode patch?
> > > 
> > No, this is more of a fix wrt algo.
> 
> Please have numbers for the improvement relative to the previous
> patch.
> 
I would consider this more of a correctness fix rather than an
improvement. In this scenario, suppose vcpu1 was pre-empted out before
the delivery of the IPI. After the loop count, we find that vcpu1 did
not respond and discover it pre-empted. We set the flush_on_enter flag
for vcpu1 and proceed. During vcpu1's guest_enter we would do the
flush_on_enter. But vcpu1 will also receive the flush IPI in guest
mode, where it would try to clear the flush_mask and acknowledge the
interrupt; this processing of the IPI would not be correct. So with
this patch, we execute a halt and wait for vcpu1 to clear the
flush_mask through the IPI interrupt.

> It introduces a dependency, these (pvtlbflush and pvspinlocks) are
> separate features. It is useful to switch them on/off individually.
> 
Agreed, we can also separate out the pv kick feature, which can be
useful to both of the approaches, so they can become independent.

Although tests suggest that for best results both these features
should be enabled.

Nikunj



Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others

2012-07-04 Thread Nikunj A Dadhania
On Wed, 4 Jul 2012 23:09:10 -0300, Marcelo Tosatti  wrote:
> On Tue, Jul 03, 2012 at 01:49:49PM +0530, Nikunj A Dadhania wrote:
> > On Tue, 3 Jul 2012 04:55:35 -0300, Marcelo Tosatti  
> > wrote:
> > > > 
> > > >if (!zero_mask)
> > > >goto again;
> > > 
> > > Can you please measure increased vmentry/vmexit overhead? x86/vmexit.c 
> > > of git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git should 
> > > help.
> > >
> > Sure will get back with the result.
> > 
> > > > +   /* 
> > > > +* Guest might have seen us offline and would have set
> > > > +* flush_on_enter. 
> > > > +*/
> > > > +   kvm_read_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
> > > > +   if (vs->flush_on_enter) 
> > > > +   kvm_x86_ops->tlb_flush(vcpu);
> > > 
> > > 
> > > So flush_tlb_page which was an invlpg now flushes the entire TLB. Did
> > > you take that into account?
> > > 
> > When the vcpu is sleeping/pre-empted out, multiple request for flush_tlb
> > could have happened. And now when we are here, it is cleaning up all the
> > TLB.
> 
> Yes, cases where there are sufficient exits transforming one TLB entry
> invalidation into full TLB invalidation should go unnoticed.
> 
> > One other approach would be to queue the addresses, that brings us with
> > the question: how many request to queue? This would require us adding
> > more syncronization between guest and host for updating the area where
> > these addresses is shared.
> 
> Sounds unnecessarily complicated.
> 
Yes, I did give this a try earlier, but did not see much improvement
with the amount of complexity that it was bringing in.

Regards
Nikunj



Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others

2012-07-03 Thread Nikunj A Dadhania
On Tue, 3 Jul 2012 05:11:35 -0300, Marcelo Tosatti  wrote:
> On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:

> >  arch/x86/include/asm/kvm_para.h |3 +-
> >  arch/x86/include/asm/tlbflush.h |9 ++
> >  arch/x86/kernel/kvm.c   |1 +
> >  arch/x86/kvm/x86.c  |   14 -
> >  arch/x86/mm/tlb.c   |   61 
> > +++
> >  5 files changed, 86 insertions(+), 2 deletions(-)
> > 

[...]

> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 264f172..4714a7b 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> 
> Please split guest/host (arch/x86/kernel/kvm.c etc VS arch/x86/kvm/)
> patches.
> 
Ok

> Please document guest/host interface
> (Documentation/virtual/kvm/paravirt-tlb-flush.txt, add a pointer to it
> from msr.txt).
> 
Sure.



Re: [PATCH v2 5/7] KVM: Introduce PV kick in flush tlb

2012-07-03 Thread Nikunj A Dadhania
On Tue, 3 Jul 2012 05:07:13 -0300, Marcelo Tosatti  wrote:
> On Mon, Jun 04, 2012 at 10:38:17AM +0530, Nikunj A. Dadhania wrote:
> > In place of looping continuously introduce a halt if we do not succeed
> > after some time.
> > 
> > For vcpus that were running an IPI is sent.  In case, it went to sleep
> > between this, we will be doing flush_on_enter(harmless). But as a
> > flush IPI was already sent, that will be processed in ipi handler,
> > this might result into something undesireable, i.e. It might clear the
> > flush_mask of a new request.
> > 
> > So after sending an IPI and waiting for a while, do a halt and wait
> > for a kick from the last vcpu.
> > 
> > Signed-off-by: Srivatsa Vaddagiri 
> > Signed-off-by: Nikunj A. Dadhania 
> 
> Again, was it determined that this is necessary from data of 
> benchmarking on the in-guest-mode/out-guest-mode patch?
> 
No, this is more of a fix wrt algo.



Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others

2012-07-03 Thread Nikunj A Dadhania
On Tue, 3 Jul 2012 04:55:35 -0300, Marcelo Tosatti  wrote:
> > 
> >if (!zero_mask)
> >goto again;
> 
> Can you please measure increased vmentry/vmexit overhead? x86/vmexit.c 
> of git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git should 
> help.
>
Sure will get back with the result.

> > +   /* 
> > +* Guest might have seen us offline and would have set
> > +* flush_on_enter. 
> > +*/
> > +   kvm_read_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
> > +   if (vs->flush_on_enter) 
> > +   kvm_x86_ops->tlb_flush(vcpu);
> 
> 
> So flush_tlb_page which was an invlpg now flushes the entire TLB. Did
> you take that into account?
> 
When the vcpu is sleeping/pre-empted out, multiple requests for
flush_tlb could have happened. By the time we get here, it is cleaning
up the whole TLB.

One other approach would be to queue the addresses, but that brings up
the question: how many requests to queue? This would also require
adding more synchronization between guest and host for updating the
area where these addresses are shared.
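
For context, the queueing alternative would look roughly like this: a
per-vcpu ring of addresses with a fall-back to a full flush on overflow
(entirely a hypothetical sketch, not from any posted patch):

    #define FLUSH_QUEUE_LEN 16              /* arbitrary for the sketch */

    struct flush_queue {
            unsigned long addr[FLUSH_QUEUE_LEN];
            unsigned int  n;                /* n > FLUSH_QUEUE_LEN: flush all */
    };

    static void queue_flush(struct flush_queue *q, unsigned long addr)
    {
            if (q->n < FLUSH_QUEUE_LEN)
                    q->addr[q->n] = addr;
            q->n++;                         /* overflow degrades to full flush */
    }

Publishing and draining such a ring is exactly the extra guest/host
synchronization referred to above.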

> > +again:
> > +   for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
> > +   v_state = &per_cpu(vcpu_state, cpu);
> > +
> > +   if (!v_state->state) {
> 
> Should use ACCESS_ONCE to make sure the value is not register cached.
> \
> > +   v_state->flush_on_enter = 1;
> > +   smp_mb();
> > +   if (!v_state->state)
> 
> And here.
> 
Sure will add this check for both in my next version.

> > +   cpumask_clear_cpu(cpu, 
> > to_cpumask(f->flush_cpumask));
> > +   }
> > +   }
> > +
> > +   if (cpumask_empty(to_cpumask(f->flush_cpumask)))
> > +   goto out;
> > +
> > +   apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
> > +   INVALIDATE_TLB_VECTOR_START + sender);
> > +
> > +   loop = 1000;
> > +   while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> > +   cpu_relax();
> > +
> > +   if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
> > +   goto again;
> 
> Is this necessary in addition to the in-guest-mode/out-guest-mode
> detection? If so, why?
> 
This handles "case 3", where we initially saw the vcpu running and a
flush IPI was sent to it. During this time the vcpu might be
pre-empted, so we come out of the loop=1000 with a non-empty flushmask.
We then re-verify the flushmask against the currently running vcpus and
make sure that the vcpu that was pre-empted is un-marked, so we can
proceed out of kvm_flush_tlb_others_ipi without waiting for
sleeping/pre-empted vcpus.
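
To illustrate, a compressed timeline of that case (my rendering of the
flow above, not from the patch):

    /*
     * initiator vcpu                     target vcpuN
     * --------------                     ------------
     * sees running[N] == 1
     * sends flush IPI
     *                                    pre-empted before the IPI lands
     * loop=1000 expires with a
     * non-empty flushmask
     * goto again: sees running[N] == 0,
     * sets flush_on_enter[N], clears
     * N from the flushmask
     *                                    guest enter: flush_on_enter set,
     *                                    flushes its own TLB
     * flushmask empty -> return without
     * busy-waiting on the sleeper
     */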

Regards
Nikunj



Re: [PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others

2012-06-18 Thread Nikunj A Dadhania
Hi Marcelo,

Thanks for the review.

On Tue, 12 Jun 2012 20:02:18 -0300, Marcelo Tosatti  wrote:
> On Mon, Jun 04, 2012 at 10:37:24AM +0530, Nikunj A. Dadhania wrote:
> > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > paravirtualization.
> > 
> > Use the vcpu state information inside the kvm_flush_tlb_others to
> > avoid sending ipi to pre-empted vcpus.
> > 
> > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > * For online vcpus: Wait for them to clear the flag
> > 
> > The approach was discussed here: https://lkml.org/lkml/2012/2/20/157
> > 
> > Suggested-by: Peter Zijlstra 
> > Signed-off-by: Nikunj A. Dadhania 
> 
> Why not reintroduce the hypercall to flush TLBs? 
Sure, I will get a version with this.

> No waiting, no entry/exit trickery.
> 
Are you also suggesting to get rid of vcpu-running information?
We would at least need this to raise a flush-TLB hypercall on the
sleeping vcpu.

> This is half-way-there paravirt with all the downsides. 
Some more details on what the downsides are would help us reach a
better solution.

> Even though the guest running information might be useful in other
> cases.
> 
Yes, that was one of the things on the mind.

> > Pseudo Algo:
> > 
> >Write()
> >==
> > 
> >guest_exit()
> >flush_on_enter[i]=0;
> >running[i] = 0;
> > 
> >guest_enter()
> >running[i] = 1;
> >smp_mb();
> >if(flush_on_enter[i]) {
> >tlb_flush()
> >flush_on_enter[i]=0;
> >}
> > 
> > 
> >Read()
> >==
> > 
> >GUESTKVM-HV
> > 
> >f->flushcpumask = cpumask - me;
> > 
> > again:
> >for_each_cpu(i, f->flushmask) {
> > 
> >if (!running[i]) {
> >case 1:
> > 
> >running[n]=1
> > 
> >(cpuN does not see
> >flush_on_enter set,
> >guest later finds it
> >running and sends ipi,
> >we are fine here, need
> >to clear the flag on
> >guest_exit)
> > 
> >   flush_on_enter[i] = 1;
> >case2:
> > 
> >running[n]=1
> >(cpuN - will see flush
> >on enter and an IPI as
> >well - addressed in patch-4)
> > 
> >   if (!running[i])
> >  cpu_clear(f->flushmask);  All is well, vm_enter
> >will do the fixup
> >}
> >case 3:
> >running[n] = 0;
> > 
> >(cpuN went to sleep,
> >we saw it as awake,
> >ipi sent, but wait
> >will break without
> >zero_mask and goto
> >again will take care)
> > 
> >}
> >send_ipi(f->flushmask)
> > 
> >wait_a_while_for_zero_mask();
> > 
> >if (!zero_mask)
> >goto again;
> > ---
> >  arch/x86/include/asm/kvm_para.h |3 +-
> >  arch/x86/include/asm/tlbflush.h |9 ++
> >  arch/x86/kernel/kvm.c   |1 +
> >  arch/x86/kvm/x86.c  |   14 -
> >  arch/x86/mm/tlb.c   |   61 
> > +++
> >  5 files changed, 86 insertions(+), 2 deletions(-)
> > 
> 
> 



Re: [PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest

2012-06-18 Thread Nikunj A Dadhania
On Tue, 12 Jun 2012 19:43:10 -0300, Marcelo Tosatti  wrote:
> On Mon, Jun 04, 2012 at 10:36:05AM +0530, Nikunj A. Dadhania wrote:
> > The patch adds guest code for msr between guest and hypervisor. The
> > msr will export the vcpu running/pre-empted information to the guest
> > from host. This will enable guest to intelligently send ipi to running
> > vcpus and set flag for pre-empted vcpus. This will prevent waiting for
> > vcpus that are not running.
> > 
> > Suggested-by: Peter Zijlstra 
> > Signed-off-by: Nikunj A. Dadhania 

[...]

> > @@ -433,6 +464,8 @@ void __init kvm_guest_init(void)
> > pv_time_ops.steal_clock = kvm_steal_clock;
> > }
> >  
> > +   has_vcpu_state = 1;
> > +
> 
> Should be checking for a feature bit, see kvm_para_has_feature() 
> examples above in the function.
>
Sure, will take care of this in my next version.




Re: [PATCH v2 6/7] kvm,x86: RCU based table free

2012-06-05 Thread Nikunj A Dadhania
On Tue, 05 Jun 2012 15:08:08 +0200, Peter Zijlstra  wrote:
> On Tue, 2012-06-05 at 18:34 +0530, Nikunj A Dadhania wrote:
> > PeterZ, is 7/7 alright to be picked?
> 
> Yeah, I guess it is.. I haven't had time to rework my tlb series yet
> though. But these two patches together should make it work for x86.
>
I haven't added your SOB to this yet, though I have your name
mentioned in "From". Should I add your SOB to this? I have added a
minor fix for the !CONFIG_HAVE_RCU_TABLE_FREE case.

Regards
Nikunj



Re: [PATCH v2 6/7] kvm,x86: RCU based table free

2012-06-05 Thread Nikunj A Dadhania
On Tue, 5 Jun 2012 12:58:32 +0100, Stefano Stabellini 
 wrote:
> On Tue, 5 Jun 2012, Nikunj A Dadhania wrote:
> > On Tue, 5 Jun 2012 11:48:02 +0100, Stefano Stabellini 
> >  wrote:
> > > 
> > > I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
> > > Maybe we can pull our efforts together :-)
> > > 
> > > Giving a look at this patch, it doesn't look like it is introducing
> > > CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
> > > How is the user supposed to set it?
> > >
> > I am doing that in the next patch only for KVM-ParavirtTLB flush, as
> > there is a bug in this implementation that patch [7/7] fixes.
> > 
> > Refer following thread for details:
> > http://mid.gmane.org/1337254086.4281.26.camel@twins
> > http://mid.gmane.org/1337273959.4281.62.camel@twins
> 
> Thanks, somehow I missed the 7/7 patch.
> 
> From the Xen POV, your patch is fine because we'll just select
> PARAVIRT_TLB_FLUSH on CONFIG_XEN (see appended patch for completeness).
> 
Selecting ARCH_HW_WALKS_PAGE_TABLE in place of PARAVIRT_TLB_FLUSH should
suffice.

> The main difference between the two approaches is that a kernel with
> PARAVIRT_TLB_FLUSH and/or CONFIG_XEN enabled is going to have
> HAVE_RCU_TABLE_FREE even when running on native.
> 
> Are you proposing this series for 3.5?
> If not (because it depends on ticketlocks and KVM Paravirt Spinlock
> patches), 
>
3.6 I suppose as the merge window is already closed and we are having
some discussions on PLE results.

> could you extract patch 6/7 and 7/7 and send them out
> separately?
>
> I am saying this because Xen needs the HAVE_RCU_TABLE_FREE fix even if
> pv ticketlock are not accepted. This is an outstanding bug for us
> unfortunately.
> 
PeterZ has a patch in his tlb-unify:

mm, x86: Add HAVE_RCU_TABLE_FREE support

Implements optional HAVE_RCU_TABLE_FREE support for x86.

    This is useful for things like Xen and KVM where paravirt tlb flush
means the software page table walkers like GUP-fast cannot rely on
IRQs disabling like regular x86 can.

Cc: Nikunj A Dadhania 
Cc: Jeremy Fitzhardinge 
Cc: Avi Kivity 
Signed-off-by: Peter Zijlstra 

http://git.kernel.org/?p=linux/kernel/git/peterz/mmu.git;a=commit;h=8a7e6fa5be9d2645c3394892c870113e6e5d9309

PeterZ, is 7/7 alright to be picked?

Regards
Nikunj




Re: [PATCH v2 6/7] kvm,x86: RCU based table free

2012-06-05 Thread Nikunj A Dadhania
On Tue, 5 Jun 2012 11:48:02 +0100, Stefano Stabellini 
 wrote:
> 
> I am also interested in introducing HAVE_RCU_TABLE_FREE on x86 for Xen.
> Maybe we can pull our efforts together :-)
> 
> Giving a look at this patch, it doesn't look like it is introducing
> CONFIG_HAVE_RCU_TABLE_FREE anywhere under arch/x86.
> How is the user supposed to set it?
>
I am doing that in the next patch only for KVM-ParavirtTLB flush, as
there is a bug in this implementation that patch [7/7] fixes.

Refer following thread for details:
http://mid.gmane.org/1337254086.4281.26.camel@twins
http://mid.gmane.org/1337273959.4281.62.camel@twins

Regards
Nikunj



[PATCH v2 1/7] KVM Guest: Add VCPU running/pre-empted state for guest

2012-06-03 Thread Nikunj A. Dadhania
The patch adds guest code for the MSR shared between guest and
hypervisor. The MSR exports the vcpu running/pre-empted information to
the guest from the host. This enables the guest to intelligently send
IPIs only to running vcpus and to set a flag for pre-empted vcpus,
preventing waits on vcpus that are not running.

Suggested-by: Peter Zijlstra 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/include/asm/kvm_para.h |   10 ++
 arch/x86/kernel/kvm.c   |   33 +
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 77266d3..f57b5cc 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -24,6 +24,7 @@
 #define KVM_FEATURE_ASYNC_PF   4
 #define KVM_FEATURE_STEAL_TIME 5
 #define KVM_FEATURE_PV_UNHALT  6
+#define KVM_FEATURE_VCPU_STATE  7
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -39,6 +40,7 @@
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
+#define MSR_KVM_VCPU_STATE  0x4b564d04
 
 struct kvm_steal_time {
__u64 steal;
@@ -51,6 +53,14 @@ struct kvm_steal_time {
 #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
 #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
 
+struct kvm_vcpu_state {
+   __u32 state;
+   __u32 pad[15];
+};
+
+#define KVM_VCPU_STATE_ALIGN_BITS 5
+#define KVM_VCPU_STATE_VALID_BITS ((-1ULL << (KVM_VCPU_STATE_ALIGN_BITS + 1)))
+
 #define KVM_MAX_MMU_OP_BATCH   32
 
 #define KVM_ASYNC_PF_ENABLED   (1 << 0)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 98f0378..bb686a6 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -64,6 +64,9 @@ static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
 
+DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+static int has_vcpu_state;
+
 /*
  * No need for any "IO delay" on KVM
  */
@@ -291,6 +294,22 @@ static void kvm_register_steal_time(void)
cpu, __pa(st));
 }
 
+static void kvm_register_vcpu_state(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_vcpu_state *v_state;
+
+   if (!has_vcpu_state)
+   return;
+
+   v_state = &per_cpu(vcpu_state, cpu);
+   memset(v_state, 0, sizeof(*v_state));
+
+   wrmsrl(MSR_KVM_VCPU_STATE, (__pa(v_state) | KVM_MSR_ENABLED));
+   printk(KERN_INFO "kvm-vcpustate: cpu %d, msr %lu\n",
+   cpu, __pa(v_state));
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
if (!kvm_para_available())
@@ -310,6 +329,9 @@ void __cpuinit kvm_guest_cpu_init(void)
 
if (has_steal_clock)
kvm_register_steal_time();
+
+   if (has_vcpu_state)
+   kvm_register_vcpu_state();
 }
 
 static void kvm_pv_disable_apf(void *unused)
@@ -361,6 +383,14 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+void kvm_disable_vcpu_state(void)
+{
+   if (!has_vcpu_state)
+   return;
+
+   wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -379,6 +409,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
 
 static void kvm_guest_cpu_offline(void *dummy)
 {
+   kvm_disable_vcpu_state();
kvm_disable_steal_time();
kvm_pv_disable_apf(NULL);
apf_task_wake_all();
@@ -433,6 +464,8 @@ void __init kvm_guest_init(void)
pv_time_ops.steal_clock = kvm_steal_clock;
}
 
+   has_vcpu_state = 1;
+
 #ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(&kvm_cpu_notifier);



[PATCH v2 6/7] kvm,x86: RCU based table free

2012-06-03 Thread Nikunj A. Dadhania
From: Peter Zijlstra 

get_user_pages_fast() depends on the IPI to hold off page table
teardown while they are locklessly walked with interrupts disabled.
If a vcpu were to be preempted while in this critical section, another
vcpu tearing down page tables would go ahead and destroy them.  when
the preempted vcpu resumes it then touches the freed pages.

Using HAVE_RCU_TABLE_FREE:

By using call_rcu_sched() to free the page-tables you'd need to
receive and process at least one tick on the woken up cpu after the
freeing, but since the in-progress gup_fast() will have IRQs disabled
this will be delayed.

http://article.gmane.org/gmane.linux.kernel/1290539

Tested-by: Nikunj A. Dadhania 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/powerpc/include/asm/pgalloc.h  |1 +
 arch/s390/mm/pgtable.c  |1 +
 arch/sparc/include/asm/pgalloc_64.h |1 +
 arch/x86/mm/pgtable.c   |6 +++---
 include/asm-generic/tlb.h   |9 +
 mm/memory.c |7 +++
 6 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index bf301ac..c33ae79 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -49,6 +49,7 @@ static inline void __tlb_remove_table(void *_table)
 
pgtable_free(table, shift);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6e765bf..7029ed7 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -730,6 +730,7 @@ void __tlb_remove_table(void *_table)
else
free_pages((unsigned long) table, ALLOC_ORDER);
 }
+#define __tlb_remove_table __tlb_remove_table
 
 static void tlb_remove_table_smp_sync(void *arg)
 {
diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 40b2d7a..d10913a 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -106,6 +106,7 @@ static inline void __tlb_remove_table(void *_table)
is_page = true;
pgtable_free(table, is_page);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, bool is_page)
 {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..34fa168 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
pgtable_page_dtor(pte);
paravirt_release_pte(page_to_pfn(pte));
-   tlb_remove_page(tlb, pte);
+   tlb_remove_table(tlb, pte);
 }
 
 #if PAGETABLE_LEVELS > 2
 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 {
paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
-   tlb_remove_page(tlb, virt_to_page(pmd));
+   tlb_remove_table(tlb, virt_to_page(pmd));
 }
 
 #if PAGETABLE_LEVELS > 3
 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 {
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-   tlb_remove_page(tlb, virt_to_page(pud));
+   tlb_remove_table(tlb, virt_to_page(pud));
 }
 #endif /* PAGETABLE_LEVELS > 3 */
 #endif /* PAGETABLE_LEVELS > 2 */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f96a5b5..9ac30f7 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -19,6 +19,8 @@
 #include 
 #include 
 
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -60,6 +62,13 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+#else
+
+static inline void tlb_remove_table(struct mmu_gather *tlb, void *page)
+{
+   tlb_remove_page(tlb, page);
+}
+
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 6105f47..c12685d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -297,6 +297,13 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
  * See the comment near struct mmu_table_batch.
  */
 
+#ifndef __tlb_remove_table
+static void __tlb_remove_table(void *table)
+{
+   free_page_and_swap_cache(table);
+}
+#endif
+
 static void tlb_remove_table_smp_sync(void *arg)
 {
/* Simply deliver the interrupt */



[PATCH v2 7/7] Flush page-table pages before freeing them

2012-06-03 Thread Nikunj A. Dadhania
From: Nikunj A. Dadhania 

Certain architectures (viz. x86, arm, s390) have hardware page-table
walkers (#PF). So during the RCU page-table teardown process, make
sure we do a TLB flush of the page-table pages on all relevant CPUs to
synchronize against hardware walkers, and then free the pages.

Moreover, the (mm_users < 2) condition does not hold for the above
architectures, as the hardware engine is one of the users.
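
Schematically, the ordering the patch enforces during teardown is
(pseudo-code, names invented):

	unlink_page_table(P)        /* no new walks can reach P          */
	flush_tlb_on_all_cpus(P)    /* hw walkers drop any cached refs   */
	rcu_defer_free(P)           /* sw (gup_fast) walkers drain later */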

Suggested-by: Peter Zijlstra 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/Kconfig |3 +++
 arch/x86/Kconfig |   12 
 mm/memory.c  |   24 ++--
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 684eb5a..abc3739 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -196,6 +196,9 @@ config HAVE_ARCH_MUTEX_CPU_RELAX
 config HAVE_RCU_TABLE_FREE
bool
 
+config ARCH_HW_WALKS_PAGE_TABLE
+   bool
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a9ec0da..b0a9f11 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -617,6 +617,18 @@ config PARAVIRT_SPINLOCKS
 
  If you are unsure how to answer this question, answer N.
 
+config PARAVIRT_TLB_FLUSH
+   bool "Paravirtualization layer for TLB Flush"
+   depends on PARAVIRT && SMP && EXPERIMENTAL
+   select HAVE_RCU_TABLE_FREE
+   select ARCH_HW_WALKS_PAGE_TABLE
+   ---help---
+ Paravirtualized Flush TLB replaces the native implementation
+ with something virtualization-friendly (for example, set a
+ flag for sleeping vcpu and do not wait for it).
+
+ If you are unsure how to answer this question, answer N.
+
 config PARAVIRT_CLOCK
bool
 
diff --git a/mm/memory.c b/mm/memory.c
index c12685d..acfadb8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -335,11 +335,27 @@ static void tlb_remove_table_rcu(struct rcu_head *head)
free_page((unsigned long)batch);
 }
 
+#ifdef CONFIG_ARCH_HW_WALKS_PAGE_TABLE
+/*
+ * On some architectures (x86, arm, s390) the HW walks the page tables
+ * while page-table teardown might be happening. So make sure that,
+ * before freeing the page-table pages, we flush their TLB entries
+ */
+static inline void tlb_table_flush_mmu(struct mmu_gather *tlb)
+{
+   tlb_flush_mmu(tlb);
+}
+
+#else
+#define tlb_table_flush_mmu(tlb) do {} while (0)
+#endif
+
 void tlb_table_flush(struct mmu_gather *tlb)
 {
struct mmu_table_batch **batch = &tlb->batch;
 
if (*batch) {
+   tlb_table_flush_mmu(tlb);
call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
*batch = NULL;
}
@@ -351,18 +367,22 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 
tlb->need_flush = 1;
 
+#ifndef CONFIG_ARCH_HW_WALKS_PAGE_TABLE
/*
-* When there's less then two users of this mm there cannot be a
-* concurrent page-table walk.
+* When there's less than two users of this mm there cannot be
+* a concurrent page-table walk for architectures that do not
+* have hardware page-table walkers.
 */
if (atomic_read(&tlb->mm->mm_users) < 2) {
__tlb_remove_table(table);
return;
}
+#endif
 
if (*batch == NULL) {
*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
if (*batch == NULL) {
+   tlb_table_flush_mmu(tlb);
tlb_remove_table_one(table);
return;
}



[PATCH v2 5/7] KVM: Introduce PV kick in flush tlb

2012-06-03 Thread Nikunj A. Dadhania
In place of looping continuously, introduce a halt if we do not
succeed after some time.

For vcpus that were running, an IPI is sent. If a vcpu went to sleep
in the meantime, it will do flush_on_enter (harmless). But as a flush
IPI was already sent, it will be processed in the IPI handler; this
might result in something undesirable, i.e. it might clear the
flush_mask of a new request.

So after sending an IPI and waiting for a while, do a halt and wait
for a kick from the last vcpu.
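
A userspace analogue of the resulting handshake, with a condition
variable standing in for halt()/KVM_HC_KICK_CPU (toy code, not the
kernel implementation):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t kicked = PTHREAD_COND_INITIALIZER;
static atomic_int pending = 1;     /* ~ bits left in flush_cpumask */
static atomic_int need_kick;       /* ~ f->need_kick */

static void *responder(void *arg)
{
	(void)arg;
	/* ... perform the TLB flush ... */
	atomic_store(&pending, 0);     /* ~ clear our bit in flush_cpumask */
	if (atomic_load(&need_kick)) { /* last one out wakes the sender */
		pthread_mutex_lock(&lock);
		pthread_cond_signal(&kicked);   /* ~ kvm_kick_cpu(sender) */
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	int loop = 1000;

	pthread_create(&t, NULL, responder, NULL);
	while (atomic_load(&pending) && --loop)
		;                          /* bounded busy-wait, ~ cpu_relax() */
	pthread_mutex_lock(&lock);
	atomic_store(&need_kick, 1);
	while (atomic_load(&pending))      /* ~ halt() until kicked */
		pthread_cond_wait(&kicked, &lock);
	pthread_mutex_unlock(&lock);
	pthread_join(t, NULL);
	printf("flush complete\n");
	return 0;
}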

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/mm/tlb.c |   25 +
 1 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f5dacdd..2c686bf 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -43,6 +43,8 @@ union smp_flush_state {
struct {
struct mm_struct *flush_mm;
unsigned long flush_va;
+   int sender_cpu;
+   unsigned int need_kick;
raw_spinlock_t tlbstate_lock;
DECLARE_BITMAP(flush_cpumask, NR_CPUS);
};
@@ -167,6 +169,11 @@ out:
smp_mb__before_clear_bit();
cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
smp_mb__after_clear_bit();
+   if (f->need_kick && cpumask_empty(to_cpumask(f->flush_cpumask))) {
+   f->need_kick = 0;
+   smp_wmb();
+   kvm_kick_cpu(f->sender_cpu);
+   }
inc_irq_stat(irq_tlb_count);
 }
 
@@ -222,15 +229,17 @@ void kvm_flush_tlb_others(const struct cpumask *cpumask,
if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
raw_spin_lock(&f->tlbstate_lock);
 
+   cpu = smp_processor_id();
f->flush_mm = mm;
f->flush_va = va;
-   if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+   f->sender_cpu = cpu;
+   f->need_kick = 0;
+   if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(cpu))) {
/*
 * We have to send the IPI only to online vCPUs
 * affected. And queue flush_on_enter for pre-empted
 * vCPUs
 */
-again:
for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
v_state = &per_cpu(vcpu_state, cpu);
 
@@ -242,9 +251,6 @@ again:
}
}
 
-   if (cpumask_empty(to_cpumask(f->flush_cpumask)))
-   goto out;
-
apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
INVALIDATE_TLB_VECTOR_START + sender);
 
@@ -252,10 +258,13 @@ again:
while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
cpu_relax();
 
-   if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
-   goto again;
+   if (!loop) {
+   f->need_kick = 1;
+   smp_mb();
+   while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
+   halt();
+   }
}
-out:
f->flush_mm = NULL;
f->flush_va = 0;
if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)



[PATCH v2 4/7] KVM: export kvm_kick_vcpu for pv_flush

2012-06-03 Thread Nikunj A. Dadhania

Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/include/asm/kvm_para.h |4 
 arch/x86/kernel/kvm.c   |   18 +-
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 684a285..651a305 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -206,6 +206,7 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
+void kvm_kick_cpu(int cpu);
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 void __init kvm_spinlock_init(void);
@@ -229,6 +230,9 @@ static inline void kvm_disable_steal_time(void)
 {
return;
 }
+
+#define kvm_kick_cpu(T) do {} while (0)
+
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 66db54e..5943285 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -487,6 +487,15 @@ static __init int activate_jump_labels(void)
 }
 arch_initcall(activate_jump_labels);
 
+/* Kick a cpu */
+void kvm_kick_cpu(int cpu)
+{
+   int apicid;
+
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
+}
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 enum kvm_contention_stat {
@@ -695,15 +704,6 @@ out:
 }
 PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
 
-/* Kick a cpu by its apicid*/
-static inline void kvm_kick_cpu(int cpu)
-{
-   int apicid;
-
-   apicid = per_cpu(x86_cpu_to_apicid, cpu);
-   kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
-}
-
 /* Kick vcpu waiting on @lock->head to reach value @ticket */
 static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 {



[PATCH v2 3/7] KVM: Add paravirt kvm_flush_tlb_others

2012-06-03 Thread Nikunj A. Dadhania
flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicate
flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt it to
paravirtualization.

Use the vcpu state information inside kvm_flush_tlb_others to avoid
sending IPIs to pre-empted vcpus:

* Do not send IPIs to offline vcpus; set their flush_on_enter flag instead
* For online vcpus: wait for them to clear the flag

The approach was discussed here: https://lkml.org/lkml/2012/2/20/157

Suggested-by: Peter Zijlstra 
Signed-off-by: Nikunj A. Dadhania 

--
Pseudo Algo:

   Write()
   ==

   guest_exit()
   flush_on_enter[i]=0;
   running[i] = 0;

   guest_enter()
   running[i] = 1;
   smp_mb();
   if(flush_on_enter[i]) {
   tlb_flush()
   flush_on_enter[i]=0;
   }


   Read()
   ==

   GUEST                                KVM-HV

   f->flushcpumask = cpumask - me;

again:
   for_each_cpu(i, f->flushmask) {

   if (!running[i]) {
   case 1:

   running[n]=1

   (cpuN does not see
   flush_on_enter set,
   guest later finds it
   running and sends ipi,
   we are fine here, need
   to clear the flag on
   guest_exit)

  flush_on_enter[i] = 1;
   case2:

   running[n]=1
   (cpuN - will see flush
   on enter and an IPI as
   well - addressed in patch-4)

  if (!running[i])
 cpu_clear(f->flushmask);  All is well, vm_enter
   will do the fixup
   }
   case 3:
   running[n] = 0;

   (cpuN went to sleep,
   we saw it as awake,
   ipi sent, but wait
   will break without
   zero_mask and goto
   again will take care)

   }
   send_ipi(f->flushmask)

   wait_a_while_for_zero_mask();

   if (!zero_mask)
   goto again;
---
 arch/x86/include/asm/kvm_para.h |3 +-
 arch/x86/include/asm/tlbflush.h |9 ++
 arch/x86/kernel/kvm.c   |1 +
 arch/x86/kvm/x86.c  |   14 -
 arch/x86/mm/tlb.c   |   61 +++
 5 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f57b5cc..684a285 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -55,7 +55,8 @@ struct kvm_steal_time {
 
 struct kvm_vcpu_state {
__u32 state;
-   __u32 pad[15];
+   __u32 flush_on_enter;
+   __u32 pad[14];
 };
 
 #define KVM_VCPU_STATE_ALIGN_BITS 5
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index c0e108e..29470bd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -119,6 +119,12 @@ static inline void native_flush_tlb_others(const struct cpumask *cpumask,
 {
 }
 
+static inline void kvm_flush_tlb_others(const struct cpumask *cpumask,
+   struct mm_struct *mm,
+   unsigned long va)
+{
+}
+
 static inline void reset_lazy_tlbstate(void)
 {
 }
@@ -145,6 +151,9 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
 void native_flush_tlb_others(const struct cpumask *cpumask,
 struct mm_struct *mm, unsigned long va);
 
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm, unsigned long va);
+
 #define TLBSTATE_OK 1
 #define TLBSTATE_LAZY  2
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index bb686a6..66db54e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,6 +465,7 @@ void __init kvm_guest_init(void)
}
 
has_vcpu_state = 1;
+   pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
 
 #ifdef CONFIG_SMP
smp_ops.smp

[PATCH v2 2/7] KVM-HV: Add VCPU running/pre-empted state for guest

2012-06-03 Thread Nikunj A. Dadhania
Hypervisor code to indicate the guest's running/pre-empted status
through an MSR.
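
Schematically, the MSR value carries the guest-physical address of the
per-vcpu area with the enable bit in bit 0; the hunks below encode and
decode it roughly as:

	guest:  msr_val  = __pa(vcpu_state) | KVM_MSR_ENABLED;
	host:   area_gpa = msr_val & KVM_VCPU_STATE_VALID_BITS;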

Suggested-by: Peter Zijlstra 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/include/asm/kvm_host.h |7 ++
 arch/x86/kvm/cpuid.c|1 +
 arch/x86/kvm/x86.c  |   45 ++-
 3 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dad475b..12fe3c7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -418,6 +418,13 @@ struct kvm_vcpu_arch {
struct kvm_steal_time steal;
} st;
 
+   /* indicates vcpu is running or preempted */
+   struct {
+   u64 msr_val;
+   struct gfn_to_hva_cache data;
+   struct kvm_vcpu_state vs;
+   } v_state;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_host_tsc;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7c93806..0588984 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -409,6 +409,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 (1 << KVM_FEATURE_CLOCKSOURCE2) |
 (1 << KVM_FEATURE_ASYNC_PF) |
 (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+(1 << KVM_FEATURE_VCPU_STATE) |
 (1 << KVM_FEATURE_PV_UNHALT);
 
if (sched_info_on())
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e5f57b..264f172 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -789,12 +789,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN 9
+#define KVM_SAVE_MSRS_BEGIN 10
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
+   MSR_KVM_VCPU_STATE,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1539,6 +1540,32 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
 }
 
+static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
+{
+   struct kvm_vcpu_state *vs = &vcpu->arch.v_state.vs;
+   struct gfn_to_hva_cache *ghc = &vcpu->arch.v_state.data;
+
+   if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   vs->state = 1;
+   kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+   smp_wmb();
+}
+
+static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
+{
+   struct kvm_vcpu_state *vs = &vcpu->arch.v_state.vs;
+   struct gfn_to_hva_cache *ghc = &vcpu->arch.v_state.data;
+
+   if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   vs->state = 0;
+   kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+   smp_wmb();
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
bool pr = false;
@@ -1654,6 +1681,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 
break;
 
+   case MSR_KVM_VCPU_STATE:
+   if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.v_state.data,
+ data & KVM_VCPU_STATE_VALID_BITS))
+   return 1;
+
+   vcpu->arch.v_state.msr_val = data;
+   break;
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1974,6 +2009,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case MSR_KVM_STEAL_TIME:
data = vcpu->arch.st.msr_val;
break;
+   case MSR_KVM_VCPU_STATE:
+   data = vcpu->arch.v_state.msr_val;
+   break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -5324,6 +5362,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_load_guest_fpu(vcpu);
kvm_load_guest_xcr0(vcpu);
 
+   kvm_set_vcpu_state(vcpu);
+
vcpu->mode = IN_GUEST_MODE;
 
/* We should set ->mode before check ->requests,
@@ -5340,6 +5380,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
local_irq_enable();
preempt_enable();
kvm_x86_ops->cancel_injection(vcpu);
+   kvm_clear_vcpu_state(vcpu);
r = 1;
goto out;
}
@@ -5374,6 +54

[PATCH v2 0/7] KVM paravirt remote flush tlb

2012-06-03 Thread Nikunj A. Dadhania

The remote flushing APIs do a busy-wait, which is fine in the
bare-metal scenario. But within a guest, the vcpus might have been
pre-empted or blocked. In this scenario, the initiator vcpu would end
up busy-waiting for a long amount of time.

This was discovered in our gang scheduling test, and another way to
solve this is by para-virtualizing flush_tlb_others_ipi.

This patch set implements para-virt TLB flushes, making sure the
initiator does not wait for vcpus that are sleeping; instead, all the
sleeping vcpus flush the TLB on guest entry (see the sketch below).
The idea was discussed here:
https://lkml.org/lkml/2012/2/20/157
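
The guest-side decision reduces to the following (self-contained toy;
the field names follow this series, everything else is invented):

#include <stdatomic.h>
#include <stdio.h>

struct vcpu_state {
	atomic_uint state;           /* 1 = running, 0 = pre-empted */
	atomic_uint flush_on_enter;  /* host flushes at next guest entry */
};

static int needs_ipi(struct vcpu_state *vs)
{
	if (!atomic_load(&vs->state)) {
		atomic_store(&vs->flush_on_enter, 1);
		/* re-check: if it woke up meanwhile, fall back to an IPI */
		if (!atomic_load(&vs->state))
			return 0;    /* host flushes on guest entry instead */
	}
	return 1;                    /* running: send the flush IPI */
}

int main(void)
{
	struct vcpu_state vs = { 0, 0 };
	printf("pre-empted vcpu -> ipi? %d\n", needs_ipi(&vs));
	atomic_store(&vs.state, 1);
	printf("running vcpu    -> ipi? %d\n", needs_ipi(&vs));
	return 0;
}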

This also brings one more dependency for the lockless page walk
performed by get_user_pages_fast (gup_fast). gup_fast disables
interrupts and assumes that the pages will not be freed during that
period. That was fine, as flush_tlb_others_ipi would wait for all the
IPIs to be processed before returning. With the new approach of not
waiting for the sleeping vcpus, this assumption is no longer valid. So
now HAVE_RCU_TABLE_FREE is used to free the pages. This makes sure
that every cpu processes the smp callback at least once before the
pages are freed.

The patchset depends on ticketlocks[1] and KVM Paravirt Spinlock
patches[2]

Changelog from v1:
• Race fixes reported by Vatsa
• Address gup_fast dependency using PeterZ's rcu table free patch
• Fix rcu_table_free for hw pagetable walkers
• Increased SPIN_THRESHOLD to 8k to address the baseline numbers
  regression in ebizzy (non-PLE). Raghu is working on tuning the
  threshold value along with the ple_window and ple_gap.

Here are the results from PLE hardware. The setup details:
• 8 CPUs (HT disabled)
• 64-bit VM
   • 8vcpus
   • 1GB RAM

Numbers are % improvement/degradation wrt base kernel 3.4.0-rc4
(commit: af3a3ab2) 

Note: SPIN_THRESHOLD is set to 8192

gang - Base kernel + gang scheduling patches
pvspin - Base kernel + ticketlocks patches + paravirt spinlock patches
pvflush - Base kernel + paravirt tlb flush patches
pvall - pvspin + paravirt tlb flush patches
pvallnople - pvall with PLE disabled (ple_gap = 0)

+-------------+------+--------+---------+-------+------------+
|             | gang | pvspin | pvflush | pvall | pvallnople |
+-------------+------+--------+---------+-------+------------+
| ebizzy-1vm  |    2 |      2 |       3 |   -11 |          4 |
| ebizzy-2vm  |  156 |     15 |     -58 |   343 |        110 |
| ebizzy-4vm  |  238 |     14 |     -42 |    17 |         47 |
+-------------+------+--------+---------+-------+------------+
| specjbb-1vm |    3 |      5 |       3 |     3 |          2 |
| specjbb-2vm |  -10 |      3 |       2 |     2 |          3 |
| specjbb-4vm |    1 |      4 |       3 |     4 |          4 |
+-------------+------+--------+---------+-------+------------+
| hbench-1vm  |  -14 |    -58 |      -1 |     2 |          7 |
| hbench-2vm  |  -35 |     -5 |       7 |    11 |         12 |
| hbench-4vm  |   19 |      8 |      -1 |    14 |         35 |
+-------------+------+--------+---------+-------+------------+
| dbench-1vm  |   -1 |    -17 |     -25 |    -7 |        -18 |
| dbench-2vm  |    3 |     -4 |       1 |     5 |          3 |
| dbench-4vm  |    8 |      6 |      22 |     6 |         -6 |
+-------------+------+--------+---------+-------+------------+
| kbench-1vm  | -100 |      8 |       4 |     5 |          7 |
| kbench-2vm  |    7 |      9 |       0 |    -2 |         -2 |
| kbench-4vm  |   12 |     -1 |       0 |    -6 |        -15 |
+-------------+------+--------+---------+-------+------------+
| sysbnch-1vm |    4 |      1 |       3 |     4 |          5 |
| sysbnch-2vm |   73 |     15 |      29 |    34 |         49 |
| sysbnch-4vm |   22 |      2 |       9 |    17 |         31 |
+-------------+------+--------+---------+-------+------------+

Observations from the above table:
* pvall does well in most of the benchmarks.
* pvall does not do quite well for kernbench 2vm (-2%) and 4vm (-6%)

Another experiment that Vatsa suggested was to disable PLE, as the
paravirt patches provide similar functionality. In those experiments
we did see notable improvements in hackbench and sysbench. Kernbench
degraded further; PLE does help kernbench. This will be addressed by
Raghu's directed-yield approach.

Comments/suggestions welcome.

Regards
Nikunj

---

Nikunj A. Dadhania (6):
  KVM Guest: Add VCPU running/pre-empted state for guest
  KVM-HV: Add VCPU running/pre-empted state for guest
  KVM: Add paravirt kvm_flush_tlb_others
  KVM: export kvm_kick_vcpu for pv_flush
  KVM: Introduce PV kick in flush tlb
  Flush page-table pages

Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-13 Thread Nikunj A Dadhania
On Mon, 14 May 2012 00:15:30 +0530, Raghavendra K T wrote:
> On 05/07/2012 08:22 PM, Avi Kivity wrote:
> 
> I could not come with pv-flush results (also Nikunj had clarified that
> the result was on NOn PLE
> 
Did you see any issues on PLE?

Regards,
Nikunj



Re: KVM: Softlockups in guests while running kernbench

2012-05-11 Thread Nikunj A Dadhania
On Thu, 10 May 2012 20:16:26 +0530, Nikunj A Dadhania wrote:
> On Thu, 10 May 2012 15:39:04 +0530, Nikunj A Dadhania wrote:
> 
> I had a discussion with Avi on IRC, he suggested running a trace on the
> host for the sched: tracepoints
> 
> So when I see the console of the guest, i can see the soft-lockup
> messages. When I start tracing, soft-lockups disappear. 
> 
> Kernbench is started from the host; ps output:
> 13359 pts/1S+ 0:00 perf stat -a -- ssh root@192.168.123.11 cd 
> /root/linux-src; kernbench -f -M -H -o 16 > /root/kbench.log
> 13362 pts/1S+ 0:00 ssh root@192.168.123.11 cd /root/linux-src; 
> kernbench -f -M -H -o 16 > /root/kbench.log
> 13373 pts/1S+ 0:00 perf stat -a -- ssh root@192.168.123.12 cd 
> /root/linux-src; kernbench -f -M -H -o 16 > /root/kbench.log
> 13376 pts/1S+ 0:00 ssh root@192.168.123.12 cd /root/linux-src; 
> kernbench -f -M -H -o 16 > /root/kbench.log
> 13387 pts/1S+ 0:00 perf stat -a -- ssh root@192.168.123.13 cd 
> /root/linux-src; kernbench -f -M -H -o 16 > /root/kbench.log
> 13390 pts/1S+ 0:00 ssh root@192.168.123.13 cd /root/linux-src; 
> kernbench -f -M -H -o 16 > /root/kbench.log
> 13401 pts/1S+ 0:00 perf stat -a -- ssh root@192.168.123.14 cd 
> /root/linux-src; kernbench -f -M -H -o 16 > /root/kbench.log
> 13402 pts/1S+ 0:00 ssh root@192.168.123.14 cd /root/linux-src; 
> kernbench -f -M -H -o 16 > /root/kbench.log
> 
> I have tried following commands trace-cmd:
> 1) trace-cmd record -e kvm -e sched -b 10  -p 13362
> 2) trace-cmd record -e sched -b 10  -p 13362
> 3) trace-cmd record -e kvm -b 10  -p 13362
> 
> I have tried trace-cmd on qemu pid as well, I see similar behaviour in
> that case.
> 
> Peterz/Ingo, Any clue on debugging this would help.
>
With CONFIG_DEBUG_SPINLOCK=y in the guest config, I stop seeing the
soft-lockups.

Nikunj



Re: KVM: Softlockups in guests while running kernbench

2012-05-10 Thread Nikunj A Dadhania
On Thu, 10 May 2012 15:39:04 +0530, Nikunj A Dadhania wrote:
> On Thu, 10 May 2012 12:22:00 +0300, Avi Kivity  wrote:
> > On 05/10/2012 11:15 AM, Nikunj A Dadhania wrote:
> > > I am running a 3.4.0-rc4 based kernel(commit: af3a3ab2), guest config
> > > attached.
> > >
> > > During my tests, I saw few softlockups inside the guests while running
> > > kernbench inside the guest. I can reproduce this repeatedly.
> > >
> > > Test Setup:
> > >
> > > - Create 4 VMs (8vcpu, 1GB RAM)
> > > - Run kernbench inside each guests (kernbench -f -M -H -o 16) in parallel
> > >
> > 
> > How overcommitted are you?  
> 8 Physical CPU and 32 vCPUs
> 
> > What does the host hardware look like?  
> Machine : IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU with 8 core , 
> 64GB RAM
> Threading is disabled.
> 
> > Is there any load on the host?
> No
> 
> > 
> > The traces themselves always point a couple of instructions after
> > interrupts are enabled.  Can you post a few more?
> > 

I had a discussion with Avi on IRC, he suggested running a trace on the
host for the sched: tracepoints

So when I watch the console of the guest, I can see the soft-lockup
messages. When I start tracing, the soft-lockups disappear.

Kernbench is started from the host; ps output:
13359 pts/1S+ 0:00 perf stat -a -- ssh root@192.168.123.11 cd 
/root/linux-src; kernbench -f -M -H -o 16 > /root/kbench.log
13362 pts/1S+ 0:00 ssh root@192.168.123.11 cd /root/linux-src; 
kernbench -f -M -H -o 16 > /root/kbench.log
13373 pts/1S+ 0:00 perf stat -a -- ssh root@192.168.123.12 cd 
/root/linux-src; kernbench -f -M -H -o 16 > /root/kbench.log
13376 pts/1S+ 0:00 ssh root@192.168.123.12 cd /root/linux-src; 
kernbench -f -M -H -o 16 > /root/kbench.log
13387 pts/1S+ 0:00 perf stat -a -- ssh root@192.168.123.13 cd 
/root/linux-src; kernbench -f -M -H -o 16 > /root/kbench.log
13390 pts/1S+ 0:00 ssh root@192.168.123.13 cd /root/linux-src; 
kernbench -f -M -H -o 16 > /root/kbench.log
13401 pts/1S+ 0:00 perf stat -a -- ssh root@192.168.123.14 cd 
/root/linux-src; kernbench -f -M -H -o 16 > /root/kbench.log
13402 pts/1S+ 0:00 ssh root@192.168.123.14 cd /root/linux-src; 
kernbench -f -M -H -o 16 > /root/kbench.log

I have tried the following trace-cmd commands:
1) trace-cmd record -e kvm -e sched -b 10  -p 13362
2) trace-cmd record -e sched -b 10  -p 13362
3) trace-cmd record -e kvm -b 10  -p 13362

I have tried trace-cmd on the qemu pid as well; I see similar
behaviour in that case.

Peterz/Ingo, Any clue on debugging this would help.

Regards
Nikunj



Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-08 Thread Nikunj A Dadhania
On Mon, 7 May 2012 22:42:30 +0200 (CEST), Thomas Gleixner wrote:
> On Mon, 7 May 2012, Ingo Molnar wrote:
> > * Avi Kivity  wrote:
> > 
> > > > PS: Nikunj had experimented that pv-flush tlb + 
> > > > paravirt-spinlock is a win on PLE where only one of them 
> > > > alone could not prove the benefit.
> > > 
I do not have PLE numbers yet for pvflush and pvspinlock.

On non-PLE hardware, with the pvflush and pvspinlock patches,
kernbench, ebizzy, specjbb, hackbench and dbench all improved.

I am currently chasing a race on the pv-flush path; it is causing
file-system corruption. I will post these numbers along with my v2 post.

> > > I'd like to see those numbers, then.
> > > 
> > > Ingo, please hold on the kvm-specific patches, meanwhile.
> > 
> > I'll hold off on the whole thing - frankly, we don't want this 
> > kind of Xen-only complexity. If KVM can make use of PLE then Xen 
> > ought to be able to do it as well.
> > 
> > If both Xen and KVM makes good use of it then that's a different 
> > matter.
> 
> Aside of that, it's kinda strange that a dude named "Nikunj" is
> referenced in the argument chain, but I can't find him on the CC list.
> 
/me waves my hand

Regards
Nikunj



Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

2012-05-06 Thread Nikunj A Dadhania
On Fri, 4 May 2012 17:14:49 +0530, Srivatsa Vaddagiri wrote:
> * Nikunj A. Dadhania  [2012-04-27 21:54:37]:
> 
> > @@ -1549,6 +1549,11 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
> > return;
> > 
> > vs->state = 1;
> > +   if (vs->flush_on_enter) {
> > +   kvm_mmu_flush_tlb(vcpu);
> > +   vs->flush_on_enter = 0;
> > +   }
> > +
> > kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
> 
> Reading flush_on_enter before writing ->state (=1) is racy afaics (and
> may cause vcpu to miss a TLB flush request).
> 
Yes, I see this with sysbench. Here is what I have now; currently I
have tested it with sysbench (50 runs). I will fold this into my v2.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 60546e9..b2ee9fd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1548,9 +1548,20 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
return;
 
+   /* 
+* Let the guest know that we are online, make sure we do not
+* overwrite flush_on_enter, just write the vs->state.
+*/
vs->state = 1;
-   kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+   kvm_write_guest_cached(vcpu->kvm, ghc, vs, 1*sizeof(__u32));
smp_wmb();
+   /* 
+* Guest might have seen us offline and would have set
+* flush_on_enter. 
+*/
+   kvm_read_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+   if (vs->flush_on_enter) 
+   kvm_x86_ops->tlb_flush(vcpu);
 }
 
 static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
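
For reference, the lost-flush interleaving being fixed (illustrative
timeline; vcpu0 is entering the guest on the host side, vcpu1 is
flushing from the guest):

	vcpu0 (host)                        vcpu1 (guest)
	reads flush_on_enter == 0
	                                    sees state == 0
	                                    sets flush_on_enter = 1
	                                    skips the IPI
	writes state = 1                    -> flush request lost

Writing state first and re-reading flush_on_enter afterwards closes
the window.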


Nikunj



Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

2012-05-03 Thread Nikunj A Dadhania
On Wed, 02 May 2012 12:20:40 +0200, Peter Zijlstra  wrote:
[...] 
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index f96a5b5..8ca33e9 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -19,6 +19,8 @@
>  #include 
>  #include 
>  
> +static inline void tlb_remove_page(struct mmu_gather *tlb, struct page 
> *page);
> +
>  #ifdef CONFIG_HAVE_RCU_TABLE_FREE
>  /*
>   * Semi RCU freeing of the page directories.
> @@ -60,6 +62,13 @@ struct mmu_table_batch {
>  extern void tlb_table_flush(struct mmu_gather *tlb);
>  extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
>  
> +#else
> +
> +static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
> +{
> + tlb_remove_page(tlb, page);
>
tlb_remove_page(tlb, table);



Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

2012-05-02 Thread Nikunj A Dadhania
On Wed, 02 May 2012 12:20:40 +0200, Peter Zijlstra  wrote:
> On Wed, 2012-05-02 at 14:21 +0530, Nikunj A Dadhania wrote:
> > [root@krm1 linux]# grep HAVE_RCU_TABLE .config
> > CONFIG_HAVE_RCU_TABLE_FREE=y
> > [root@krm1 linux]# make -j32  -s
> > mm/memory.c: In function ‘tlb_remove_table_one’:
> > mm/memory.c:315: error: implicit declaration of function ‘__tlb_remove_table’
> > 
> > I suppose we need to have __tlb_remove_table. Trying to understand what
> > needs to be done there. 
> 
> Argh, I really should get back to unifying all mmu-gather
> implementations :/
> 
> I think something like the below ought to sort it.
> 
Thanks a lot.

> Completely untested though..
> 

Tested-by: Nikunj A Dadhania 

Here is the comparison with the other version. 

        Gang    pv_spin_flush   pv_spin_flush_rcu
1VM     1.01    0.49            0.49
2VMs    7.07    4.04            4.06
4VMs    9.07    5.27            5.19
8VMs    9.99    7.65            7.80

Will test other use cases as well and report back.

Regards
Nikunj



Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

2012-05-02 Thread Nikunj A Dadhania
On Tue, 01 May 2012 11:39:36 +0200, Peter Zijlstra  wrote:
> On Sun, 2012-04-29 at 15:23 +0300, Avi Kivity wrote:
> > On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > > paravirtualization.
> > >
> > > Use the vcpu state information inside the kvm_flush_tlb_others to
> > > avoid sending ipi to pre-empted vcpus.
> > >
> > > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > 
> > get_user_pages_fast() depends on the IPI to hold off page table teardown
> > while they are locklessly walked with interrupts disabled.  If a vcpu
> > were to be preempted while in this critical section, another vcpu
> > tearing down page tables would go ahead and destroy them.  when the
> > preempted vcpu resumes it then touches the freed pages.
> > 
> > We could try to teach kvm and get_user_pages_fast() about this, but this
> > is intrusive.  Another option is to replace the cpu_relax() loop with
> > something that sleeps and is then woken up by the TLB IPI handler if needed.
> 
> I think something like
> 
>   select HAVE_RCU_TABLE_FREE if PARAVIRT
> 
> or somesuch is just about all it takes.
>
[root@krm1 linux]# grep HAVE_RCU_TABLE .config
CONFIG_HAVE_RCU_TABLE_FREE=y
[root@krm1 linux]# make -j32  -s
mm/memory.c: In function ‘tlb_remove_table_one’:
mm/memory.c:315: error: implicit declaration of function ‘__tlb_remove_table’

I suppose we need to have __tlb_remove_table. Trying to understand what
needs to be done there.

Regards
Nikunj



Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

2012-04-30 Thread Nikunj A Dadhania
On Sun, 29 Apr 2012 15:23:16 +0300, Avi Kivity  wrote:
> On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > paravirtualization.
> >
> > Use the vcpu state information inside the kvm_flush_tlb_others to
> > avoid sending ipi to pre-empted vcpus.
> >
> > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> 
> get_user_pages_fast() depends on the IPI to hold off page table teardown
> while they are locklessly walked with interrupts disabled.  If a vcpu
> were to be preempted while in this critical section, another vcpu
> tearing down page tables would go ahead and destroy them.  when the
> preempted vcpu resumes it then touches the freed pages.
> 
> We could try to teach kvm and get_user_pages_fast() about this, but this
> is intrusive. 


> Another option is to replace the cpu_relax() loop with
> something that sleeps and is then woken up by the TLB IPI handler if needed.
> 
That was the initial implementation that I did, where we were not
looking for sleeping vcpus: just send IPIs to the affected vcpus and
execute halt. The vcpu that sees the flushmask as empty would then
generate the kick to the halted vcpu. I can respin my patches with
just this as well.

Regards
Nikunj



Re: [RFC PATCH v1 1/5] KVM Guest: Add VCPU running/pre-empted state for guest

2012-04-30 Thread Nikunj A Dadhania
On Tue, 01 May 2012 06:33:59 +0530, Raghavendra K T wrote:
> On 04/27/2012 09:53 PM, Nikunj A. Dadhania wrote:
> > The patch adds guest code for msr between guest and hypervisor. The
> > msr will export the vcpu running/pre-empted information to the guest
> > from host. This will enable guest to intelligently send ipi to running
> > vcpus and set flag for pre-empted vcpus. This will prevent waiting for
> > vcpus that are not running.
> >
> > Suggested-by: Peter Zijlstra
> > Signed-off-by: Nikunj A. Dadhania
> > ---
> >   arch/x86/include/asm/kvm_para.h |   10 ++
> >   arch/x86/kernel/kvm.c   |   33 +
> >   2 files changed, 43 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index 77266d3..f57b5cc 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -24,6 +24,7 @@
> >   #define KVM_FEATURE_ASYNC_PF  4
> >   #define KVM_FEATURE_STEAL_TIME5
> >   #define KVM_FEATURE_PV_UNHALT 6
> > +#define KVM_FEATURE_VCPU_STATE  7
> 
> I think you intended to use KVM_FEATURE_VCPU_STATE to address
> guest/host compatibility issue so that host/guest does not break
> when one of them run older kernel?
>
Yes, that's correct.
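
A minimal sketch of the intended guest-side gating, using the existing
kvm_para_has_feature() helper (exact placement in kvm_guest_init() is
illustrative):

	if (kvm_para_has_feature(KVM_FEATURE_VCPU_STATE))
		has_vcpu_state = 1;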

Regards
Nikunj



[RFC PATCH v1 5/5] KVM: Introduce PV kick in flush tlb

2012-04-27 Thread Nikunj A. Dadhania
In place of looping continuously, introduce a halt if we do not
succeed after some time.

For vcpus that were running, an IPI is sent. If a vcpu went to sleep
in the meantime, it will do flush_on_enter (harmless). But as a flush
IPI was already sent, it will be processed in the IPI handler; this
might result in something undesirable, i.e. it might clear the
flush_mask of a new request.

So after sending an IPI and waiting for a while, do a halt and wait
for a kick from the last vcpu.

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/mm/tlb.c |   27 +++
 1 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 91ae34e..2a20e59 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -43,6 +43,8 @@ union smp_flush_state {
struct {
struct mm_struct *flush_mm;
unsigned long flush_va;
+   int sender_cpu;
+   unsigned int need_kick;
raw_spinlock_t tlbstate_lock;
DECLARE_BITMAP(flush_cpumask, NR_CPUS);
};
@@ -71,6 +73,8 @@ void leave_mm(int cpu)
 EXPORT_SYMBOL_GPL(leave_mm);
 
 DECLARE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+extern void kvm_kick_cpu(int cpu);
+
 /*
  *
  * The flush IPI assumes that a thread switch happens in this order:
@@ -168,6 +172,11 @@ out:
smp_mb__before_clear_bit();
cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
smp_mb__after_clear_bit();
+   if (f->need_kick && cpumask_empty(to_cpumask(f->flush_cpumask))) {
+   f->need_kick = 0;
+   smp_wmb();
+   kvm_kick_cpu(f->sender_cpu);
+   }
inc_irq_stat(irq_tlb_count);
 }
 
@@ -219,15 +228,17 @@ void kvm_flush_tlb_others(const struct cpumask *cpumask,
if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
raw_spin_lock(&f->tlbstate_lock);
 
+   cpu = smp_processor_id();
f->flush_mm = mm;
f->flush_va = va;
-   if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+   f->sender_cpu = cpu;
+   f->need_kick = 0;
+   if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(cpu))) {
/*
 * We have to send the IPI only to online vCPUs
 * affected. And queue flush_on_enter for pre-empted
 * vCPUs
 */
-again:
for_each_cpu(cpu, to_cpumask(f->flush_cpumask)) {
v_state = &per_cpu(vcpu_state, cpu);
 
@@ -239,9 +250,6 @@ again:
}
}
 
-   if (cpumask_empty(to_cpumask(f->flush_cpumask)))
-   goto out;
-
apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
INVALIDATE_TLB_VECTOR_START + sender);
 
@@ -249,10 +257,13 @@ again:
while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
cpu_relax();
 
-   if (!cpumask_empty(to_cpumask(f->flush_cpumask)))
-   goto again;
+   if (!loop) {
+   f->need_kick = 1;
+   smp_mb();
+   while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
+   halt();
+   }
}
-out:
f->flush_mm = NULL;
f->flush_va = 0;
if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)



[RFC PATCH v1 4/5] KVM: get kvm_kick_vcpu out for pv_flush

2012-04-27 Thread Nikunj A. Dadhania
Move kvm_kick_cpu out of the CONFIG_PARAVIRT_SPINLOCKS #ifdef.

Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/kernel/kvm.c |   18 +-
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 66db54e..5943285 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -487,6 +487,15 @@ static __init int activate_jump_labels(void)
 }
 arch_initcall(activate_jump_labels);
 
+/* Kick a cpu */
+void kvm_kick_cpu(int cpu)
+{
+   int apicid;
+
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
+}
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 enum kvm_contention_stat {
@@ -695,15 +704,6 @@ out:
 }
 PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
 
-/* Kick a cpu by its apicid*/
-static inline void kvm_kick_cpu(int cpu)
-{
-   int apicid;
-
-   apicid = per_cpu(x86_cpu_to_apicid, cpu);
-   kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
-}
-
 /* Kick vcpu waiting on @lock->head to reach value @ticket */
 static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 {



[RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

2012-04-27 Thread Nikunj A. Dadhania
flush_tlb_others_ipi depends on a lot of statics in tlb.c. Replicate
flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt it to
paravirtualization.

Use the vcpu state information inside kvm_flush_tlb_others to avoid
sending IPIs to pre-empted vcpus:

* Do not send IPIs to offline vcpus; set their flush_on_enter flag instead
* For online vcpus: wait for them to clear the flag

The approach was discussed here: https://lkml.org/lkml/2012/2/20/157

Suggested-by: Peter Zijlstra 
Signed-off-by: Nikunj A. Dadhania 

--
Pseudo Algo:

   Write()
   ==

   guest_exit()
   flush_on_enter[i]=0;
   running[i] = 0;

   guest_enter()
   running[i] = 1;
   if(flush_on_enter[i]) {
   tlb_flush()
   flush_on_enter[i]=0;
   }


   Read()
   ==

   GUEST                                KVM-HV

   f->flushcpumask = cpumask - me;

again:
   for_each_cpu(i, f->flushmask) {

   if (!running[i]) {
   case 1:

   running[n]=1

   (cpuN does not see
   flush_on_enter set,
   guest later finds it
   running and sends ipi,
   we are fine here, need
   to clear the flag on
   guest_exit)

  flush_on_enter[i] = 1;
   case2:

   running[n]=1
   (cpuN - will see flush
   on enter and an IPI as
   well - addressed in patch-4)

  if (!running[i])
 cpu_clear(f->flushmask);  All is well, vm_enter
   will do the fixup
   }
   case 3:
   running[n] = 0;

   (cpuN went to sleep,
   we saw it as awake,
   ipi sent, but wait
   will break without
   zero_mask and goto
   again will take care)

   }
   send_ipi(f->flushmask)

   wait_a_while_for_zero_mask();

   if (!zero_mask)
   goto again;
---
 arch/x86/include/asm/kvm_para.h |3 +-
 arch/x86/include/asm/tlbflush.h |9 ++
 arch/x86/kernel/kvm.c   |1 +
 arch/x86/kvm/x86.c  |6 
 arch/x86/mm/tlb.c   |   57 +++
 5 files changed, 75 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f57b5cc..684a285 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -55,7 +55,8 @@ struct kvm_steal_time {
 
 struct kvm_vcpu_state {
__u32 state;
-   __u32 pad[15];
+   __u32 flush_on_enter;
+   __u32 pad[14];
 };
 
 #define KVM_VCPU_STATE_ALIGN_BITS 5
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index c0e108e..29470bd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -119,6 +119,12 @@ static inline void native_flush_tlb_others(const struct cpumask *cpumask,
 {
 }
 
+static inline void kvm_flush_tlb_others(const struct cpumask *cpumask,
+   struct mm_struct *mm,
+   unsigned long va)
+{
+}
+
 static inline void reset_lazy_tlbstate(void)
 {
 }
@@ -145,6 +151,9 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
 void native_flush_tlb_others(const struct cpumask *cpumask,
 struct mm_struct *mm, unsigned long va);
 
+void kvm_flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm, unsigned long va);
+
 #define TLBSTATE_OK 1
 #define TLBSTATE_LAZY  2
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index bb686a6..66db54e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,6 +465,7 @@ void __init kvm_guest_init(void)
}
 
has_vcpu_state = 1;
+   pv_mmu_ops.flush_tlb_others = kvm_flush_tlb_others;
 
 #ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
diff --g

[RFC PATCH v1 2/5] KVM-HV: Add VCPU running/pre-empted state for guest

2012-04-27 Thread Nikunj A. Dadhania
Hypervisor code to indicate the guest's running/pre-empted status
through an MSR.

Suggested-by: Peter Zijlstra 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/include/asm/kvm_host.h |7 ++
 arch/x86/kvm/cpuid.c|1 +
 arch/x86/kvm/x86.c  |   44 ++-
 3 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dad475b..12fe3c7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -418,6 +418,13 @@ struct kvm_vcpu_arch {
struct kvm_steal_time steal;
} st;
 
+   /* indicates vcpu is running or preempted */
+   struct {
+   u64 msr_val;
+   struct gfn_to_hva_cache data;
+   struct kvm_vcpu_state vs;
+   } v_state;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_host_tsc;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7c93806..0588984 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -409,6 +409,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 (1 << KVM_FEATURE_CLOCKSOURCE2) |
 (1 << KVM_FEATURE_ASYNC_PF) |
 (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+(1 << KVM_FEATURE_VCPU_STATE) |
 (1 << KVM_FEATURE_PV_UNHALT);
 
if (sched_info_on())
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e5f57b..60546e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -789,12 +789,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN 9
+#define KVM_SAVE_MSRS_BEGIN 10
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
+   MSR_KVM_VCPU_STATE,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1539,6 +1540,32 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
 }
 
+static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
+{
+   struct kvm_vcpu_state *vs = &vcpu->arch.v_state.vs;
+   struct gfn_to_hva_cache *ghc = &vcpu->arch.v_state.data;
+
+   if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   vs->state = 1;
+   kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+   smp_wmb();
+}
+
+static void kvm_clear_vcpu_state(struct kvm_vcpu *vcpu)
+{
+   struct kvm_vcpu_state *vs = &vcpu->arch.v_state.vs;
+   struct gfn_to_hva_cache *ghc = &vcpu->arch.v_state.data;
+
+   if (!(vcpu->arch.v_state.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   vs->state = 0;
+   kvm_write_guest_cached(vcpu->kvm, ghc, vs, 2*sizeof(__u32));
+   smp_wmb();
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
bool pr = false;
@@ -1654,6 +1681,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 
break;
 
+   case MSR_KVM_VCPU_STATE:
+   if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.v_state.data,
+ data & KVM_VCPU_STATE_VALID_BITS))
+   return 1;
+
+   vcpu->arch.v_state.msr_val = data;
+   break;
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1974,6 +2009,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case MSR_KVM_STEAL_TIME:
data = vcpu->arch.st.msr_val;
break;
+   case MSR_KVM_VCPU_STATE:
+   data = vcpu->arch.v_state.msr_val;
+   break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -5324,6 +5362,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_load_guest_fpu(vcpu);
kvm_load_guest_xcr0(vcpu);
 
+   kvm_set_vcpu_state(vcpu);
+
vcpu->mode = IN_GUEST_MODE;
 
/* We should set ->mode before check ->requests,
@@ -5374,6 +5414,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 
+   kvm_clear_vcpu_state(vcpu);
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
local_irq_enable();
@@ -6029,6 +6070,7 @@ int kvm_arch_vcpu_reset(struct 

[RFC PATCH v1 1/5] KVM Guest: Add VCPU running/pre-empted state for guest

2012-04-27 Thread Nikunj A. Dadhania
The patch adds the guest code for an MSR shared between guest and
hypervisor. The MSR will export the vcpu running/pre-empted
information from the host to the guest. This will enable the guest to
intelligently send IPIs to running vcpus and set a flag for pre-empted
vcpus, preventing waits on vcpus that are not running.

Suggested-by: Peter Zijlstra 
Signed-off-by: Nikunj A. Dadhania 
---
 arch/x86/include/asm/kvm_para.h |   10 ++
 arch/x86/kernel/kvm.c   |   33 +
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 77266d3..f57b5cc 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -24,6 +24,7 @@
 #define KVM_FEATURE_ASYNC_PF   4
 #define KVM_FEATURE_STEAL_TIME 5
 #define KVM_FEATURE_PV_UNHALT  6
+#define KVM_FEATURE_VCPU_STATE  7
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -39,6 +40,7 @@
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
+#define MSR_KVM_VCPU_STATE  0x4b564d04
 
 struct kvm_steal_time {
__u64 steal;
@@ -51,6 +53,14 @@ struct kvm_steal_time {
 #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
 #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
 
+struct kvm_vcpu_state {
+   __u32 state;
+   __u32 pad[15];
+};
+
+#define KVM_VCPU_STATE_ALIGN_BITS 5
+#define KVM_VCPU_STATE_VALID_BITS ((-1ULL << (KVM_VCPU_STATE_ALIGN_BITS + 1)))
+
 #define KVM_MAX_MMU_OP_BATCH   32
 
 #define KVM_ASYNC_PF_ENABLED   (1 << 0)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 98f0378..bb686a6 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -64,6 +64,9 @@ static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
 
+DEFINE_PER_CPU(struct kvm_vcpu_state, vcpu_state) __aligned(64);
+static int has_vcpu_state;
+
 /*
  * No need for any "IO delay" on KVM
  */
@@ -291,6 +294,22 @@ static void kvm_register_steal_time(void)
cpu, __pa(st));
 }
 
+static void kvm_register_vcpu_state(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_vcpu_state *v_state;
+
+   if (!has_vcpu_state)
+   return;
+
+   v_state = &per_cpu(vcpu_state, cpu);
+   memset(v_state, 0, sizeof(*v_state));
+
+   wrmsrl(MSR_KVM_VCPU_STATE, (__pa(v_state) | KVM_MSR_ENABLED));
+   printk(KERN_INFO "kvm-vcpustate: cpu %d, msr %lu\n",
+   cpu, __pa(v_state));
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
if (!kvm_para_available())
@@ -310,6 +329,9 @@ void __cpuinit kvm_guest_cpu_init(void)
 
if (has_steal_clock)
kvm_register_steal_time();
+
+   if (has_vcpu_state)
+   kvm_register_vcpu_state();
 }
 
 static void kvm_pv_disable_apf(void *unused)
@@ -361,6 +383,14 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+void kvm_disable_vcpu_state(void)
+{
+   if (!has_vcpu_state)
+   return;
+
+   wrmsr(MSR_KVM_VCPU_STATE, 0, 0);
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -379,6 +409,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
 
 static void kvm_guest_cpu_offline(void *dummy)
 {
+   kvm_disable_vcpu_state();
kvm_disable_steal_time();
kvm_pv_disable_apf(NULL);
apf_task_wake_all();
@@ -433,6 +464,8 @@ void __init kvm_guest_init(void)
pv_time_ops.steal_clock = kvm_steal_clock;
}
 
+   has_vcpu_state = 1;
+
 #ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(&kvm_cpu_notifier);



[RFC PATCH v1 0/5] KVM paravirt remote flush tlb

2012-04-27 Thread Nikunj A. Dadhania
The remote flushing APIs do a busy-wait, which is fine in the
bare-metal scenario. But within a guest, the vcpus might have been
pre-empted or blocked. In this scenario, the initiator vcpu would end
up busy-waiting for a long amount of time.

This was discovered in our gang scheduling test, and another way to
solve this is by para-virtualizing flush_tlb_others_ipi.

This patch set implements para-virt TLB flushes, making sure the
initiator does not wait for vcpus that are sleeping; instead, all the
sleeping vcpus flush the TLB on guest entry. The idea was discussed
here:
https://lkml.org/lkml/2012/2/20/157

This patch depends on ticketlocks[1] and KVM Paravirt Spinlock patches[2]
Based to 3.4.0-rc4 (commit: af3a3ab2)

Here are the results from non-PLE hardware, running the ebizzy
workload inside the VMs. The table shows the normalized ebizzy score
with respect to the baseline.

Machine:
8-CPU Intel Xeon, HT disabled, 64-bit VM (8 vcpu, 1GB RAM)

        Gang    pv_spin   pv_flush   pv_spin_flush
1VM     1.01    0.30      1.01       0.49
2VMs    7.07    0.53      0.91       4.04
4VMs    9.07    0.59      0.31       5.27
8VMs    9.99    1.58      0.48       7.65

Perf report from the guest VM:
Base:
41.25%   [k] flush_tlb_others_ipi
41.21%   [k] __bitmap_empty
 7.66%   [k] _raw_spin_unlock_irqrestore
 3.07%   [.] __memcpy_ssse3_back
 1.20%   [k] clear_page

gang:
22.92%   [.] __memcpy_ssse3_back
15.46%   [k] _raw_spin_unlock_irqrestore
 9.82%   [k] clear_page
 6.35%   [k] do_page_fault
 4.57%   [k] down_read_trylock
 3.36%   [k] __mem_cgroup_commit_charge
 3.26%   [k] __x2apic_send_IPI_mask
 3.23%   [k] up_read
 2.87%   [k] __bitmap_empty
 2.78%   [k] flush_tlb_others_ipi

pv_spin:
34.82%   [k] __bitmap_empty
34.75%   [k] flush_tlb_others_ipi
25.10%   [k] _raw_spin_unlock_irqrestore
 1.52%   [.] __memcpy_ssse3_back

pv_flush:
37.34%   [k] _raw_spin_unlock_irqrestore
18.26%   [k] native_halt
11.58%   [.] __memcpy_ssse3_back
 4.83%   [k] clear_page
 3.68%   [k] do_page_fault

pv_spin_flush:
71.13%   [k] _raw_spin_unlock_irqrestore
 8.89%   [.] __memcpy_ssse3_back
 4.68%   [k] native_halt
 3.92%   [k] clear_page
 2.31%   [k] do_page_fault

Looking at the perf output for pv_flush and pv_spin_flush: in both
cases flush_tlb_others_ipi is no longer contending for the cpu, and
instead relinquishes the cpu so others can make progress.

Comments?

Regards
Nikunj

1. https://lkml.org/lkml/2012/4/19/335
2. https://lkml.org/lkml/2012/4/23/123

---

Nikunj A. Dadhania (5):
  KVM Guest: Add VCPU running/pre-empted state for guest
  KVM-HV: Add VCPU running/pre-empted state for guest
  KVM: Add paravirt kvm_flush_tlb_others
  KVM: export kvm_kick_vcpu for pv_flush
  KVM: Introduce PV kick in flush tlb


 arch/x86/include/asm/kvm_host.h |7 
 arch/x86/include/asm/kvm_para.h |   11 ++
 arch/x86/include/asm/tlbflush.h |9 +
 arch/x86/kernel/kvm.c   |   52 +-
 arch/x86/kvm/cpuid.c|1 +
 arch/x86/kvm/x86.c  |   50 -
 arch/x86/mm/tlb.c   |   68 +++
 7 files changed, 188 insertions(+), 10 deletions(-)
