Re: [PATCH 1/3] context_tracking: remove duplicate enabled check

2015-11-09 Thread Rik van Riel
On 10/27/2015 09:39 PM, Paolo Bonzini wrote:
> All calls to context_tracking_enter and context_tracking_exit
> are already checking context_tracking_is_enabled, except the
> context_tracking_user_enter and context_tracking_user_exit
> functions left in for the benefit of assembly calls.
> 
> Pull the check up to those functions, by making them simple
> wrappers around the user_enter and user_exit inline functions.
> 
> Cc: Andy Lutomirski <l...@kernel.org>
> Cc: Frederic Weisbecker <fweis...@gmail.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Rik van Riel <r...@redhat.com>
Tested-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed
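
For readers who want to see the shape of the change, here is a minimal sketch
(not the literal patch) of the wrappers described above: the out-of-line entry
points kept for assembly callers simply call the inline helpers, which already
perform the context_tracking_is_enabled() check, so the duplicate check inside
the tracking code itself can go away.

/* Sketch only -- the real patch may differ in detail. */
void context_tracking_user_enter(void)
{
	user_enter();
}

void context_tracking_user_exit(void)
{
	user_exit();
}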


Re: [PATCH 2/3] context_tracking: avoid irq_save/irq_restore on guest entry and exit

2015-11-09 Thread Rik van Riel
On 10/27/2015 09:39 PM, Paolo Bonzini wrote:
> guest_enter and guest_exit must be called with interrupts disabled,
> since they take the vtime_seqlock with write_seq{lock,unlock}.
> Therefore, it is not necessary to check for exceptions, nor to
> save/restore the IRQ state, when context tracking functions are
> called by guest_enter and guest_exit.
> 
> Split the body of context_tracking_entry and context_tracking_exit
> out to __-prefixed functions, and use them from KVM.
> 
> Rik van Riel has measured this to speed up a tight vmentry/vmexit
> loop by about 2%.
> 
> Cc: Andy Lutomirski <l...@kernel.org>
> Cc: Frederic Weisbecker <fweis...@gmail.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Rik van Riel <r...@redhat.com>
Tested-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed
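
A rough sketch of the split described above (function names follow the commit
message; details of the real patch may differ): the __-prefixed variants assume
interrupts are already disabled, the un-prefixed wrappers keep the irq
save/restore for callers that cannot guarantee that, and KVM's
guest_enter/guest_exit call the __-prefixed variants directly.

/* Sketch only, assuming the naming from the commit message above. */
void __context_tracking_enter(enum ctx_state state)
{
	/* body of the old context_tracking_enter(): update the per-cpu
	 * context state, enter the RCU extended quiescent state, vtime */
}

void context_tracking_enter(enum ctx_state state)
{
	unsigned long flags;

	if (in_interrupt())
		return;

	local_irq_save(flags);
	__context_tracking_enter(state);
	local_irq_restore(flags);
}

/* guest_enter()/guest_exit() already run with interrupts disabled, so
 * they can call __context_tracking_enter()/__context_tracking_exit()
 * and skip the save/restore entirely. */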


Re: [PATCH] context_tracking: avoid extra checks on guest_enter and guest_exit

2015-04-29 Thread Rik van Riel
On 04/29/2015 05:21 AM, Paolo Bonzini wrote:
 guest_enter and guest_exit must be called with interrupts disabled,
 since they take the vtime_seqlock with write_seq{lock,unlock}.
 Therefore, it is not necessary to check for exceptions, nor to
 save/restore the IRQ state, when context tracking functions are
 called by guest_enter and guest_exit.
 
 Split the body of context_tracking_entry and context_tracking_exit
 out to __-prefixed functions, and use them from KVM.
 
 Rik van Riel has measured this to speed up a tight vmentry/vmexit
 loop by about 2%.
 
 Cc: Frederic Weisbecker <fweis...@gmail.com>
 Cc: Rik van Riel <r...@redhat.com>
 Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
 Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Rik van Riel <r...@redhat.com>
Tested-by: Rik van Riel <r...@redhat.com>

... now time to figure out the same thing for user_enter
and/or user_exit; let's see which of those we can call
with irqs already disabled :)

-- 
All rights reversed


Re: [PATCH] context_tracking: remove duplicate enabled check

2015-04-28 Thread Rik van Riel
On 04/28/2015 07:36 AM, Paolo Bonzini wrote:
 All calls to context_tracking_enter and context_tracking_exit
 are already checking context_tracking_is_enabled, except the
 context_tracking_user_enter and context_tracking_user_exit
 functions left in for the benefit of assembly calls.
 
 Pull the check up to those functions, by making them simple
 wrappers around the user_enter and user_exit inline functions.

With this and your other (not yet posted) patch, run time
for the kvm-unit-tests vmexit.flat vmcall test with 10
million iterations drops from 17.8 to 17.5 seconds, which
is about a 1.7% speedup.


-- 
All rights reversed


Re: [PATCH] context_tracking: remove duplicate enabled check

2015-04-28 Thread Rik van Riel
On 04/28/2015 07:36 AM, Paolo Bonzini wrote:
 All calls to context_tracking_enter and context_tracking_exit
 are already checking context_tracking_is_enabled, except the
 context_tracking_user_enter and context_tracking_user_exit
 functions left in for the benefit of assembly calls.
 
 Pull the check up to those functions, by making them simple
 wrappers around the user_enter and user_exit inline functions.
 
 Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed


Re: x86: kvmclock: drop rdtsc_barrier()

2015-04-24 Thread Rik van Riel
On 04/24/2015 09:36 PM, Marcelo Tosatti wrote:
 
 Drop unnecessary rdtsc_barrier(), as has been determined empirically,
 see 057e6a8c660e95c3f4e7162e00e2fee1fc90c50d for details.
 
 Noticed by Andy Lutomirski.
 
 Improves clock_gettime() by approximately 15% on 
 Intel i7-3520M @ 2.90GHz.
 
 Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

Acked-by: Rik van Riel <r...@redhat.com>

-- 
All rights reversed


[PATCH] kvm,x86: load guest FPU context more eagerly

2015-04-23 Thread Rik van Riel
Currently KVM will clear the FPU bits in CR0.TS in the VMCS, and trap to 
re-load them every time the guest accesses the FPU after a switch back into
the guest from the host.

This patch copies the x86 task switch semantics for FPU loading, with the
FPU loaded eagerly after first use if the system uses eager fpu mode,
or if the guest uses the FPU frequently.

In the latter case, after loading the FPU for 255 times, the fpu_counter
will roll over, and we will revert to loading the FPU on demand, until
it has been established that the guest is still actively using the FPU.

This mirrors the x86 task switch policy, which seems to work.

Signed-off-by: Rik van Riel <r...@redhat.com>
---
I still hope to put the larger FPU changes in at some point, but with
all the current changes to the FPU code I am somewhat uncomfortable
causing even more churn. After 4.1 I may send in the changes to defer
loading of user space FPU context to do_notify_resume() - unless people
want them sooner.

 arch/x86/kvm/x86.c   | 15 +--
 include/linux/kvm_host.h |  1 +
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e1a81267f3f6..2cdb2472a633 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7031,14 +7031,25 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 {
kvm_put_guest_xcr0(vcpu);
 
-   if (!vcpu->guest_fpu_loaded)
+   if (!vcpu->guest_fpu_loaded) {
+   vcpu->fpu_counter = 0;
return;
+   }
 
	vcpu->guest_fpu_loaded = 0;
	fpu_save_init(&vcpu->arch.guest_fpu);
__kernel_fpu_end();
	++vcpu->stat.fpu_reload;
-   kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
+   /*
+* If using eager FPU mode, or if the guest is a frequent user
+* of the FPU, just leave the FPU active for next time.
+* Every 255 times fpu_counter rolls over to 0; a guest that uses
+* the FPU in bursts will revert to loading it on demand.
+*/
+   if (!use_eager_fpu()) {
+   if (++vcpu->fpu_counter < 5)
+   kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
+   }
trace_kvm_fpu(0);
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ad45054309a0..f197ad3f6316 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -230,6 +230,7 @@ struct kvm_vcpu {
 
int fpu_active;
int guest_fpu_loaded, guest_xcr0_loaded;
+   unsigned char fpu_counter;
wait_queue_head_t wq;
struct pid *pid;
int sigset_active;



Re: [v6] kvm/fpu: Enable fully eager restore kvm FPU

2015-04-23 Thread Rik van Riel
On 04/23/2015 06:57 PM, Wanpeng Li wrote:
 Cc Rik, who is doing the similar work. :)

Hi Liang,

I posted this patch earlier, which should have the same effect as
your patch on more modern systems, while not loading the FPU context
for guests that barely use it on older systems:

https://lkml.org/lkml/2015/4/23/349

I have to admit the diffstat on your patch looks very nice, but
it might be good to know what impact it has on older systems...

 On Fri, Apr 24, 2015 at 05:13:03AM +0800, Liang Li wrote:
 Remove lazy FPU logic and use eager FPU entirely. Eager FPU does
 not have performance regression, and it can simplify the code.

 When compiling kernel on westmere, the performance of eager FPU
 is about 0.4% faster than lazy FPU.

 Signed-off-by: Liang Li <liang.z...@intel.com>
 Signed-off-by: Xudong Hao <xudong@intel.com>
 ---
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/svm.c  | 22 ++--
 arch/x86/kvm/vmx.c  | 74 
 +++--
 arch/x86/kvm/x86.c  |  8 +
 include/linux/kvm_host.h|  2 --
 5 files changed, 9 insertions(+), 98 deletions(-)

 diff --git a/arch/x86/include/asm/kvm_host.h 
 b/arch/x86/include/asm/kvm_host.h
 index dea2e7e..5d84cc9 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -743,7 +743,6 @@ struct kvm_x86_ops {
  void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
  unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
  void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
 -void (*fpu_deactivate)(struct kvm_vcpu *vcpu);

  void (*tlb_flush)(struct kvm_vcpu *vcpu);

 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index ce741b8..1b3b29b 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -1087,7 +1087,6 @@ static void init_vmcb(struct vcpu_svm *svm)
  struct vmcb_control_area *control = &svm->vmcb->control;
  struct vmcb_save_area *save = &svm->vmcb->save;

 -svm->vcpu.fpu_active = 1;
  svm->vcpu.arch.hflags = 0;

  set_cr_intercept(svm, INTERCEPT_CR0_READ);
 @@ -1529,15 +1528,12 @@ static void update_cr0_intercept(struct vcpu_svm 
 *svm)
  ulong gcr0 = svm->vcpu.arch.cr0;
  u64 *hcr0 = &svm->vmcb->save.cr0;

 -if (!svm->vcpu.fpu_active)
 -*hcr0 |= SVM_CR0_SELECTIVE_MASK;
 -else
 -*hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
 -| (gcr0 & SVM_CR0_SELECTIVE_MASK);
 +*hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
 +| (gcr0 & SVM_CR0_SELECTIVE_MASK);

  mark_dirty(svm->vmcb, VMCB_CR);

 -if (gcr0 == *hcr0 && svm->vcpu.fpu_active) {
 +if (gcr0 == *hcr0) {
  clr_cr_intercept(svm, INTERCEPT_CR0_READ);
  clr_cr_intercept(svm, INTERCEPT_CR0_WRITE);
  } else {
 @@ -1568,8 +1564,6 @@ static void svm_set_cr0(struct kvm_vcpu *vcpu, 
 unsigned long cr0)
  if (!npt_enabled)
  cr0 |= X86_CR0_PG | X86_CR0_WP;

 -if (!vcpu->fpu_active)
 -cr0 |= X86_CR0_TS;
  /*
   * re-enable caching here because the QEMU bios
   * does not do it - this results in some delay at
 @@ -1795,7 +1789,6 @@ static void svm_fpu_activate(struct kvm_vcpu *vcpu)

  clr_exception_intercept(svm, NM_VECTOR);

 -svm->vcpu.fpu_active = 1;
  update_cr0_intercept(svm);
 }

 @@ -4139,14 +4132,6 @@ static bool svm_has_wbinvd_exit(void)
  return true;
 }

 -static void svm_fpu_deactivate(struct kvm_vcpu *vcpu)
 -{
 -struct vcpu_svm *svm = to_svm(vcpu);
 -
 -set_exception_intercept(svm, NM_VECTOR);
 -update_cr0_intercept(svm);
 -}
 -
 #define PRE_EX(exit)  { .exit_code = (exit), \
  .stage = X86_ICPT_PRE_EXCEPT, }
 #define POST_EX(exit) { .exit_code = (exit), \
 @@ -4381,7 +4366,6 @@ static struct kvm_x86_ops svm_x86_ops = {
  .cache_reg = svm_cache_reg,
  .get_rflags = svm_get_rflags,
  .set_rflags = svm_set_rflags,
 -.fpu_deactivate = svm_fpu_deactivate,

  .tlb_flush = svm_flush_tlb,

 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index f5e8dce..811a666 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -1567,7 +1567,7 @@ static void update_exception_bitmap(struct kvm_vcpu 
 *vcpu)
  u32 eb;

  eb = (1u << PF_VECTOR) | (1u << UD_VECTOR) | (1u << MC_VECTOR) |
 - (1u << NM_VECTOR) | (1u << DB_VECTOR);
 + (1u << DB_VECTOR);
  if ((vcpu->guest_debug &
   (KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP)) ==
  (KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP))
 @@ -1576,8 +1576,6 @@ static void update_exception_bitmap(struct kvm_vcpu 
 *vcpu)
  eb = ~0;
  if (enable_ept)
  eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */
 -if (vcpu->fpu_active)
 -eb &= ~(1u << NM_VECTOR);

  /* When we are running a nested L2 guest and L1 specified for it a
   * certain exception bitmap, we must trap the same exceptions and pass
 @@ -1961,9 +1959,6 @@ static void 

Re: [PATCH v15 16/16] unfair qspinlock: a queue based unfair lock

2015-04-09 Thread Rik van Riel
On 04/09/2015 10:13 AM, Peter Zijlstra wrote:
 On Thu, Apr 09, 2015 at 09:16:24AM -0400, Rik van Riel wrote:
 On 04/09/2015 03:01 AM, Peter Zijlstra wrote:
 On Wed, Apr 08, 2015 at 02:32:19PM -0400, Waiman Long wrote:
 For a virtual guest with the qspinlock patch, a simple unfair byte lock
 will be used if PV spinlock is not configured in or the hypervisor
 isn't either KVM or Xen. The byte lock works fine with small guest
 of just a few vCPUs. On a much larger guest, however, byte lock can
 have serious performance problem.

 Who cares?

 There are some people out there running guests with dozens
 of vCPUs. If the code exists to make those setups run better,
 is there a good reason not to use it?
 
 Well use paravirt, !paravirt stuff sucks performance wise anyhow.
 
 The question really is: is the added complexity worth the maintenance
 burden. And I'm just not convinced !paravirt virt is a performance
 critical target.

Fair enough.



Re: [PATCH v15 16/16] unfair qspinlock: a queue based unfair lock

2015-04-09 Thread Rik van Riel
On 04/09/2015 03:01 AM, Peter Zijlstra wrote:
 On Wed, Apr 08, 2015 at 02:32:19PM -0400, Waiman Long wrote:
 For a virtual guest with the qspinlock patch, a simple unfair byte lock
 will be used if PV spinlock is not configured in or the hypervisor
 isn't either KVM or Xen. The byte lock works fine with small guest
 of just a few vCPUs. On a much larger guest, however, byte lock can
 have serious performance problem.
 
 Who cares?

There are some people out there running guests with dozens
of vCPUs. If the code exists to make those setups run better,
is there a good reason not to use it?

Having said that, only KVM and Xen seem to support very
large guests, and PV spinlock is available there.

I believe both VMware and Hyperv have a 32 VCPU limit, anyway.

-- 
All rights reversed


Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL

2015-02-12 Thread Rik van Riel

On 02/12/2015 12:00 PM, Frederic Weisbecker wrote:
 On Thu, Feb 12, 2015 at 10:47:10AM -0500, Rik van Riel wrote:
 
 On 02/12/2015 10:42 AM, Frederic Weisbecker wrote:
 On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote:
 If exception_enter happens when already in IN_KERNEL state,
 the code still calls context_tracking_exit, which ends up in
  rcu_eqs_exit_common, which explodes with a WARN_ON when it
 is called in a situation where dynticks are not enabled.
 
 Fortunately context_tracking_exit() already has a current_state
 == IN_KERNEL check so this shouldn't be a problem.
 
 No, it had a hard-coded current_state == IN_USER check, which
 is very close, but ...
 
 ... I replaced that with a state argument, and forgot to ensure
 that it never gets called with state == IN_KERNEL. This patch
 fixes that.
 
 Ah that's right! Well I'm going to merge this patch to 1/5 then to 
 avoid breaking bisection.

Thank you, Frederic!

-- 
All rights reversed


Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL

2015-02-12 Thread Rik van Riel

On 02/12/2015 10:42 AM, Frederic Weisbecker wrote:
 On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote:
 If exception_enter happens when already in IN_KERNEL state, the 
 code still calls context_tracking_exit, which ends up in 
 rcu_eqs_exit_common, which explodes with a WARN_ON when it is 
 called in a situation where dynticks are not enabled.
 
 Fortunately context_tracking_exit() already has a current_state ==
 IN_KERNEL check so this shouldn't be a problem.

No, it had a hard-coded current_state == IN_USER check,
which is very close, but ...

... I replaced that with a state argument, and forgot to
ensure that it never gets called with state == IN_KERNEL.
This patch fixes that.

 Meanwhile I'll still take the patch, it's better to handle that 
 from the caller.

Thanks.

-- 
All rights reversed


[PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL

2015-02-11 Thread Rik van Riel
If exception_enter happens when already in IN_KERNEL state, the
code still calls context_tracking_exit, which ends up in
rcu_eqs_exit_common, which explodes with a WARN_ON when it is
called in a situation where dynticks are not enabled.

This can be avoided by having exception_enter only switch to
IN_KERNEL state if the current state is not already IN_KERNEL.

Signed-off-by: Rik van Riel <r...@redhat.com>
Reported-by: Luiz Capitulino <lcapitul...@redhat.com>
---
Frederic, you will want this bonus patch, too :)

Thanks to Luiz for finding this one. Whatever I was running did not
trigger this issue...

 include/linux/context_tracking.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b65fd1420e53..9da230406e8c 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -37,7 +37,8 @@ static inline enum ctx_state exception_enter(void)
return 0;
 
prev_ctx = this_cpu_read(context_tracking.state);
-   context_tracking_exit(prev_ctx);
+   if (prev_ctx != IN_KERNEL)
+   context_tracking_exit(prev_ctx);
 
return prev_ctx;
 }



Re: [PATCH 1/6] rcu,nohz: add context_tracking_user_enter/exit wrapper functions

2015-02-10 Thread Rik van Riel
On 02/10/2015 10:28 AM, Frederic Weisbecker wrote:
 On Tue, Feb 10, 2015 at 09:41:45AM -0500, r...@redhat.com wrote:
 From: Rik van Riel r...@redhat.com

 These wrapper functions allow architecture code (eg. ARM) to keep
 calling context_tracking_user_enter & context_tracking_user_exit
 the same way it always has, without error prone tricks like duplicate
 defines of argument values in assembly code.

 Signed-off-by: Rik van Riel r...@redhat.com
 
 This patch alone doesn't make much sense. 

Agreed, my goal was to keep things super easy to review,
to reduce the chance of introducing bugs.

 The changelog says it's about keeping
 context_tracking_user_*() functions as wrappers but fails to explain to what 
 they
 wrap, why and what are the new context_tracking_enter/exit functions for.
 
 Perhaps patches 1 and 2 should be merged together into something like:
 
   context_tracking: Generalize context tracking APIs to support user and 
 guest
 
  Do that because we'll also track guest, etc. And keep the old user
  context tracking APIs for now to avoid painful enum parameter support
  in ARM assembly.

Can do...

Paul, would you like me to resend the whole series, or just
a merged patch that replaces patches 1 & 2?

That is assuming Paul prefers having the patches merged into
one :)



Re: [PATCH 6/6] kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest

2015-02-10 Thread Rik van Riel
On 02/10/2015 02:59 PM, Andy Lutomirski wrote:
 On 02/10/2015 06:41 AM, r...@redhat.com wrote:
 From: Rik van Riel r...@redhat.com

 The host kernel is not doing anything while the CPU is executing
 a KVM guest VCPU, so it can be marked as being in an extended
 quiescent state, identical to that used when running user space
 code.

 The only exception to that rule is when the host handles an
 interrupt, which is already handled by the irq code, which
 calls rcu_irq_enter and rcu_irq_exit.

 The guest_enter and guest_exit functions already switch vtime
 accounting independent of context tracking. Leave those calls
 where they are, instead of moving them into the context tracking
 code.

 Signed-off-by: Rik van Riel r...@redhat.com
 ---
   include/linux/context_tracking.h   | 6 ++
   include/linux/context_tracking_state.h | 1 +
   include/linux/kvm_host.h   | 3 ++-
   3 files changed, 9 insertions(+), 1 deletion(-)

 diff --git a/include/linux/context_tracking.h
 b/include/linux/context_tracking.h
 index 954253283709..b65fd1420e53 100644
 --- a/include/linux/context_tracking.h
 +++ b/include/linux/context_tracking.h
 @@ -80,10 +80,16 @@ static inline void guest_enter(void)
   vtime_guest_enter(current);
   else
  current->flags |= PF_VCPU;
 +
 +if (context_tracking_is_enabled())
 +context_tracking_enter(IN_GUEST);
 
 Why the if statement?
 
 Also, have you checked how much this hurts guest lightweight entry/exit
 latency?  Context tracking is shockingly expensive for reasons I don't
 fully understand, but hopefully most of it is the vtime stuff.

Guest_enter and guest_exit already do the vtime stuff today.

This patch series adds the rcu stuff, and modifies
context_tracking_enter & context_tracking_exit to not
do the vtime stuff twice.

 (Context tracking is *so* expensive that I almost think we should set the
 performance taint flag if we enable it, assuming that flag ended up
 getting merged.  Also, we should make context tracking faster.)

I am all for making it faster :)



Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-09 Thread Rik van Riel

On 02/07/2015 03:06 PM, Paul E. McKenney wrote:
 On Sat, Feb 07, 2015 at 09:30:41AM +0100, Frederic Weisbecker
 wrote:
 On Fri, Feb 06, 2015 at 11:14:53PM -0800, Paul E. McKenney
 wrote:
 On Fri, Feb 06, 2015 at 10:34:21PM -0800, Paul E. McKenney
 wrote:
 On Fri, Feb 06, 2015 at 10:53:34PM -0500, Rik van Riel
 wrote:
 
 On 02/06/2015 06:15 PM, Frederic Weisbecker wrote:
 
 Just a few things then:
 
 1) In this case rename context_tracking_user_enter/exit()
 to context_tracking_enter() and context_tracking_exit(),
 since it's not anymore about user only but about any
 generic context.
 
  2) We have the WARN_ON_ONCE(!current->mm); condition
 that is a debug check specific to userspace transitions
 because kernel threads aren't expected to resume to
 userspace. Can we also expect that we never switch
 to/from guest from a kernel thread? AFAICS this happens
 from an ioctl (thus user task) in x86 for kvm. But I only
 know this case.
 
 3) You might want to update a few comments that assume we
 only deal with userspace transitions.
 
 4) trace_user_enter/exit() should stay user-transitions
 specific.
 
 Paul, would you like me to send follow-up patches with the
 cleanups suggested by Frederic, or would you prefer me to
 send a new series with the cleanups integrated?
 
 I would prefer a new series, in order to prevent possible
 future confusion.
 
 Of course, if Frederic would rather push them himself, I am
 fine with that.  And in that case, you should ask him for his
 preferences, which just might differ from mine.  ;-)
 
 I prefer a new series too. Now whether you or me take the
 patches, I don't mind either way :-)
 
 Also I wonder how this feature is going to be enabled. Will it be
 enabled on full dynticks or should it be a seperate feature
 depending on full dynticks? Or even just CONFIG_RCU_USER_EQS?
 Because I'm still unclear about how and what this is used, if it
 involves full dynticks or only RCU extended quiescent states.
 
 Well, we certainly need it documented.  And validation
 considerations would push for keeping the number of possible
 combinations low, while paranoia about added feature would push for
 having it be separately enabled.  And if distros are going to
 enable this at build time, we either need -serious- validation or a
 way to disable at boot time.
 
 On the desired/required combinations of features, let's see...
 
 If I understand this completely, which I probably don't, we have
 the following considerations:
 
 o NO_HZ_FULL: Needed to get rid of the scheduling-clock interrupt 
 during guest execution, though I am not sure whether we really have
 that completely wired up with this patch set.  Regardless, Rik, for
 your use case, do you care about whether or not the guest gets
 interrupted by the host's scheduling-clock interrupts? (Based on
 discussion in this thread, my guess is yes.)
 
 o RCU_NOCB_CPUS: Implied by NO_HZ_FULL, but only on CPUs actually 
 enabled for NO_HZ_FULL operation, either by NO_HZ_FULL_ALL at build
 time or by nohz_full= at boot time.  Needed to avoid interrupting
 the guest with host RCU callback invocation. Rik, does your use
 case care about guests being interrupted by RCU callback
 invocation?  (Based on discussion in this thread, my guess is
 yes.)
 
 o RCU_USER_EQS: Implied by NO_HZ_FULL, and I would have to go look 
 to see what relation this has to nohz_full=.  Needed for RCU to be 
 able to recognize userspace-execution quiescent states on a given 
 CPU without disturbing that CPU.  Unless I am missing something 
 subtle, you have to have this for this patch series to make sense.
 
 If my guesses are correct, the best approach would be to have this 
 new mode of operation implied by NO_HZ_FULL.

I agree. It makes sense to have all three, and all three are enabled
in the configuration we use. I cannot think of a case where someone
would significantly benefit from just one or two of the above, except
maybe for debugging reasons.

Having NO_HZ_FULL enable all the above, either through a boot time
commandline option, or just by default, would make sense.
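
As a concrete illustration of the combination discussed above (a hypothetical
example only; option names as of the 3.19/4.0 era), the host would be built
and booted with something like:

CONFIG_NO_HZ_FULL=y      # stop the tick on selected CPUs
CONFIG_RCU_NOCB_CPU=y    # offload RCU callbacks (implied for nohz_full CPUs)
CONFIG_RCU_USER_EQS=y    # RCU user extended quiescent state (implied by NO_HZ_FULL)

# runtime enablement on the kernel command line, with an example CPU list:
nohz_full=1-7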

-- 
All rights reversed


Re: [PATCH -v3 0/6] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-09 Thread Rik van Riel

On 02/09/2015 08:15 PM, Will Deacon wrote:
 Hi Rik,
 
 On Mon, Feb 09, 2015 at 04:04:38PM +, r...@redhat.com wrote:
 Apologies to Catalin and Will for not fixing up ARM. I am not 
 familiar with ARM assembly, and not sure how to pass a constant 
 argument to a function from assembly code on ARM :)
 
 It's a bit of a faff getting enum values into asm -- we actually
 have to duplicate the definitions using #defines to get at the
 constants. Perhaps it would be cleaner to leave
 context_tracking_user_{enter,exit} intact as C wrappers around
 context_tracking_{enter,exit} passing the appropriate constant?
 That way we don't actually need to change the arch code at all.

If Paul and Frederic have no objections, I would be happy to do that.

Paul, Frederic?
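
The "bit of a faff" Will describes looks roughly like this (a hypothetical
illustration; CTX_STATE_IN_USER is a made-up name): the enum value used by the
C code has to be mirrored by hand as a #define so the assembly entry code can
pass it, and nothing warns when the two drift apart.

/* C side -- states as used elsewhere in this thread */
enum ctx_state {
	IN_KERNEL = 0,
	IN_USER,
	IN_GUEST,
};

/* asm side -- hand-maintained duplicate of IN_USER for the entry code */
#define CTX_STATE_IN_USER	1

Keeping context_tracking_user_{enter,exit}() as C wrappers avoids having the
constant appear in assembly at all.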

-- 
All rights reversed


Re: [PATCH -v3 0/6] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-09 Thread Rik van Riel

On 02/09/2015 10:01 PM, Paul E. McKenney wrote:
 On Tue, Feb 10, 2015 at 02:44:17AM +0100, Frederic Weisbecker
 wrote:
 On Mon, Feb 09, 2015 at 08:22:59PM -0500, Rik van Riel wrote:
 
 On 02/09/2015 08:15 PM, Will Deacon wrote:
 Hi Rik,
 
 On Mon, Feb 09, 2015 at 04:04:38PM +, r...@redhat.com
 wrote:
 Apologies to Catalin and Will for not fixing up ARM. I am
 not familiar with ARM assembly, and not sure how to pass a
 constant argument to a function from assembly code on ARM
 :)
 
 It's a bit of a faff getting enum values into asm -- we
 actually have to duplicate the definitions using #defines to
 get at the constants. Perhaps it would be cleaner to leave 
 context_tracking_user_{enter,exit} intact as C wrappers
 around context_tracking_{enter,exit} passing the appropriate
 constant? That way we don't actually need to change the arch
 code at all.
 
 If Paul and Frederic have no objections, I would be happy to do
 that.
 
 Paul, Frederic?
 
 Sure, that's fine by me.
 
 And if it is fine by Frederic, it is fine by me!

I'll send a new series tomorrow that addresses Will's concern,
as well as Paolo's latest suggestions.

-- 
All rights reversed


Re: [PATCH 4/6] nohz,kvm: export context_tracking_user_enter/exit

2015-02-09 Thread Rik van Riel

On 02/09/2015 12:05 PM, Paolo Bonzini wrote:
 
 
 On 09/02/2015 17:04, r...@redhat.com wrote:
 From: Rik van Riel r...@redhat.com
 
 Export context_tracking_user_enter/exit so it can be used by
 KVM.
 
 Wrong function name in the commit message...

Oops, indeed.

Paul, could I convince you to fix up the changelog and reorder
patches 5 & 6, if you choose to apply these patches? :)

-- 
All rights reversed


Re: [PATCH 1/2] KVM: x86: extract guest running logic from __vcpu_run

2015-02-09 Thread Rik van Riel

On 02/06/2015 07:16 AM, Paolo Bonzini wrote:
 Rename the old __vcpu_run to vcpu_run, and extract its main logic 
 to a new function.
 
 The next patch will add a new early-exit condition to __vcpu_run, 
 avoid extra indentation.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com

Reviewed-by: Rik van Riel r...@redhat.com

-- 
All rights reversed


Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-07 Thread Rik van Riel

On 02/07/2015 03:30 AM, Frederic Weisbecker wrote:

 I prefer a new series too. Now whether you or me take the patches,
 I don't mind either way :-)

I'll make it, no problem.

 Also I wonder how this feature is going to be enabled. Will it be
 enabled on full dynticks or should it be a seperate feature
 depending on full dynticks? Or even just CONFIG_RCU_USER_EQS?
 Because I'm still unclear about how and what this is used, if it
 involves full dynticks or only RCU extended quiescent states.

It involves full dynticks and CONFIG_RCU_USER_EQS.

-- 
All rights reversed


Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-06 Thread Rik van Riel

On 02/06/2015 01:23 PM, Frederic Weisbecker wrote:
 On Fri, Feb 06, 2015 at 01:20:21PM -0500, Rik van Riel wrote: On
 02/06/2015 12:22 PM, Frederic Weisbecker wrote:
 On Thu, Feb 05, 2015 at 03:23:48PM -0500, r...@redhat.com
 wrote:
 From: Rik van Riel r...@redhat.com
 
 Add the expected ctx_state as a parameter to 
 context_tracking_user_enter and
 context_tracking_user_exit, allowing the same functions to
  not just track kernel & user space switching, but also
  kernel & guest transitions.
  
  Signed-off-by: Rik van Riel <r...@redhat.com>
 
 You should consider using guest_enter() and guest_exit()
 instead. These are context tracking APIs too but specifically
 for guest.
 
 What do you mean instead?  KVM already uses those.
 
 I just wanted to avoid duplicating the code...
 
 I mean you can call rcu_user APIs directly from
 guest_enter/exit. You don't really need to call the
 context_tracking_user functions since guest_enter/guest_exit
 already handle the vtime accounting.

I would still have to modify exception_enter and exception_exit,
and with them context_tracking_user_enter and
context_tracking_user_exit.

We have to re-enable RCU when an exception happens.

I suspect exceptions in a guest just trigger VMEXIT, and we
figure out later why the exception happened. However, if we were
to get an exception during the code where we transition into
or out of guest mode, we would still need exception_enter
and exception_exit...
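
For reference, the exception_enter()/exception_exit() pair under discussion
has roughly this shape (a sketch pieced together from the patches quoted
elsewhere in this archive, using the IN_KERNEL/IN_USER/IN_GUEST states; not
verbatim kernel code):

static inline enum ctx_state exception_enter(void)
{
	enum ctx_state prev_ctx;

	if (!context_tracking_is_enabled())
		return 0;

	/* remember where we came from, and get RCU watching again
	 * while the exception handler runs */
	prev_ctx = this_cpu_read(context_tracking.state);
	if (prev_ctx != IN_KERNEL)
		context_tracking_exit(prev_ctx);

	return prev_ctx;
}

static inline void exception_exit(enum ctx_state prev_ctx)
{
	if (context_tracking_is_enabled()) {
		if (prev_ctx != IN_KERNEL)
			context_tracking_enter(prev_ctx);
	}
}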

-- 
All rights reversed


Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-06 Thread Rik van Riel

On 02/06/2015 06:15 PM, Frederic Weisbecker wrote:

 Just a few things then:
 
 1) In this case rename context_tracking_user_enter/exit() to 
 context_tracking_enter() and context_tracking_exit(), since it's
 not anymore about user only but about any generic context.
 
 2) We have the WARN_ON_ONCE(!current->mm); condition that is a
 debug check specific to userspace transitions because kernel
 threads aren't expected to resume to userspace. Can we also expect
 that we never switch to/from guest from a kernel thread? AFAICS
 this happens from an ioctl (thus user task) in x86 for kvm. But I
 only know this case.
 
 3) You might want to update a few comments that assume we only deal
 with userspace transitions.
 
 4) trace_user_enter/exit() should stay user-transitions specific.

Paul, would you like me to send follow-up patches with the cleanups
suggested by Frederic, or would you prefer me to send a new series
with the cleanups integrated?

Frederic, I will also add the cleanup you suggested for patch 4/5.

-- 
All rights reversed


Re: [PATCH v2 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-06 Thread Rik van Riel

On 02/06/2015 08:46 AM, Frederic Weisbecker wrote:
 On Thu, Feb 05, 2015 at 03:23:47PM -0500, r...@redhat.com wrote:
 When running a KVM guest on a system with NOHZ_FULL enabled
 
 I just need to clarify the motivation first, does the above
 situation really happen? Ok some distros enable NOHZ_FULL to let
 the user stop the tick in userspace. So most of the time,
 CONFIG_NOHZ_FULL=y but nohz full is runtime disabled (we need to
 pass a nohz_full= boot parameter to enable it). And when it is
 runtime disabled, there should be no rcu nocb CPU.
 
 (Although not setting CPUs in nocb mode when nohz full is runtime
 disabled is perhaps a recent change.)
 
 So for the problem to arise, one need to enable nohz_full and run
 KVM guest. And I never heard about such workloads. That said it's
 potentially interesting to turn off the tick on the host when the
 guest runs.
 
 , and the KVM guest running with idle=poll mode, we still get
 wakeups of the rcuos/N threads.
 
 So we need nohz_full on the host and idle=poll mode on the guest.
 Is it likely to happen? (sorry, again I'm just trying to make sure
 we agree on why we do this change).

We have users interested in doing just that, in order to run
KVM guests with the least amount of perturbation to the guest.


-- 
All rights reversed


Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-06 Thread Rik van Riel

On 02/06/2015 12:22 PM, Frederic Weisbecker wrote:
 On Thu, Feb 05, 2015 at 03:23:48PM -0500, r...@redhat.com wrote:
 From: Rik van Riel r...@redhat.com
 
 Add the expected ctx_state as a parameter to
 context_tracking_user_enter and context_tracking_user_exit,
  allowing the same functions to not just track kernel & user
  space switching, but also kernel & guest transitions.
  
  Signed-off-by: Rik van Riel <r...@redhat.com>
 
 You should consider using guest_enter() and guest_exit() instead.
 These are context tracking APIs too but specifically for guest.

What do you mean instead?  KVM already uses those.

I just wanted to avoid duplicating the code...

-- 
All rights reversed


Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Rik van Riel
On 02/05/2015 01:56 PM, Paul E. McKenney wrote:

 The real danger is doing neither.
 
 On tick_nohz_full_cpu() CPUs, the exit-to-userspace code should invoke
 rcu_user_enter(), which sets some per-CPU state telling RCU to ignore
 that CPU, since it cannot possibly do host RCU read-side critical sections
 while running a guest.
 
 In contrast, a non-tick_nohz_full_cpu() CPU doesn't let RCU
 know that it is executing in a guest or in userspace.  So the
 rcu_virt_note_context_switch() does the notification in that case.

Looking at context_tracking.h, I see the
function context_tracking_cpu_is_enabled().

That looks like it should do the right thing
in this case.
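
Concretely, the guard being discussed would sit in kvm_guest_enter() and look
something like this (sketch only; exact placement and naming may differ from
the eventual patch):

static inline void kvm_guest_enter(void)
{
	unsigned long flags;

	local_irq_save(flags);
	guest_enter();
	local_irq_restore(flags);

	/* Only poke RCU directly when context tracking is not already
	 * handling this CPU via rcu_user_enter()/rcu_user_exit(). */
	if (!context_tracking_cpu_is_enabled())
		rcu_virt_note_context_switch(smp_processor_id());
}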


Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Rik van Riel
On 02/05/2015 02:20 PM, Paolo Bonzini wrote:
 
 
 On 05/02/2015 19:55, Jan Kiszka wrote:
 This patch introduces a new module parameter for the KVM module; when it
 is present, KVM attempts a bit of polling on every HLT before scheduling
 itself out via kvm_vcpu_block.

 Wouldn't it be better to tune this on a per-VM basis? Think of mixed
 workloads with some latency-sensitive and some standard VMs.
 
 Yes, but:
 
 1) this turned out to be very cheap, so a per-host tunable is not too bad;
 
 2) it also affects only very few workloads (for example network
 workloads can already do polling in the guest) so it only affects few
 people;
 
 3) long term anyway we want it to auto tune, which is better than tuning
 it per-VM.

We may want to auto tune it per VM.

However, if we make auto tuning work well, I do not
think we want to expose a user visible tunable per
VM, and commit to keeping that kind of interface
around forever.



Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Rik van Riel
On 02/05/2015 11:44 AM, Christian Borntraeger wrote:
 On 05.02.2015 at 17:35, r...@redhat.com wrote:
 From: Rik van Riel r...@redhat.com

 The host kernel is not doing anything while the CPU is executing
 a KVM guest VCPU, so it can be marked as being in an extended
 quiescent state, identical to that used when running user space
 code.

 The only exception to that rule is when the host handles an
 interrupt, which is already handled by the irq code, which
 calls rcu_irq_enter and rcu_irq_exit.

 The guest_enter and guest_exit functions already switch vtime
 accounting independent of context tracking, so leave those calls
 where they are, instead of moving them into the context tracking
 code.

 Signed-off-by: Rik van Riel r...@redhat.com
 ---
  include/linux/context_tracking.h   | 8 +++-
  include/linux/context_tracking_state.h | 1 +
  2 files changed, 8 insertions(+), 1 deletion(-)

 diff --git a/include/linux/context_tracking.h 
 b/include/linux/context_tracking.h
 index bd9f000fc98d..a5d3bb44b897 100644
 --- a/include/linux/context_tracking.h
 +++ b/include/linux/context_tracking.h
 @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
  static inline void exception_exit(enum ctx_state prev_ctx)
  {
  if (context_tracking_is_enabled()) {
 -if (prev_ctx == IN_USER)
 +if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
  context_tracking_user_enter(prev_ctx);
  }
  }
 @@ -78,6 +78,9 @@ static inline void guest_enter(void)
  vtime_guest_enter(current);
  else
   current->flags |= PF_VCPU;
 +
 +if (context_tracking_is_enabled())
 +context_tracking_user_enter(IN_GUEST);
  }
 
 
 Couldn't we make
 rcu_virt_note_context_switch(smp_processor_id());
 conditional in include/linux/kvm_host.h (kvm_guest_enter)
 
 e.g. something like
 if (!context_tracking_is_enabled())
   rcu_virt_note_context_switch(smp_processor_id());

Possibly. I considered the same, but I do not know whether
or not just rcu_user_enter / rcu_user_exit is enough.

I could certainly try it out and see whether anything
explodes, but I am not convinced that is careful enough
when it comes to handling RCU code...

Paul? :)



Re: [PATCH RFC] kvm: x86: add halt_poll module parameter

2015-02-05 Thread Rik van Riel
On 02/05/2015 11:05 AM, Paolo Bonzini wrote:
 This patch introduces a new module parameter for the KVM module; when it
 is present, KVM attempts a bit of polling on every HLT before scheduling
 itself out via kvm_vcpu_block.
 
 This parameter helps a lot for latency-bound workloads---in particular
 I tested it with O_DSYNC writes with a battery-backed disk in the host.
 In this case, writes are fast (because the data doesn't have to go all
 the way to the platters) but they cannot be merged by either the host or
 the guest.  KVM's performance here is usually around 30% of bare metal,
 or 50% if you use cache=directsync or cache=writethrough (these
 parameters avoid that the guest sends pointless flush requests, and
 at the same time they are not slow because of the battery-backed cache).
 The bad performance happens because on every halt the host CPU decides
 to halt itself too.  When the interrupt comes, the vCPU thread is then
 migrated to a new physical CPU, and in general the latency is horrible
 because the vCPU thread has to be scheduled back in.
 
 With this patch performance reaches 60-65% of bare metal and, more
 important, 99% of what you get if you use idle=poll in the guest.  This
 means that the tunable gets rid of this particular bottleneck, and more
 work can be done to improve performance in the kernel or QEMU.
 
 Of course there is some price to pay; every time an otherwise idle vCPUs
 is interrupted by an interrupt, it will poll unnecessarily and thus
 impose a little load on the host.  The above results were obtained with
 a mostly random value of the parameter (200), and the load was around
 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
 
 The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
 that can be used to tune the parameter.  It counts how many HLT
 instructions received an interrupt during the polling period; each
 successful poll avoids that Linux schedules the VCPU thread out and back
 in, and may also avoid a likely trip to C1 and back for the physical CPU.

In the long run, this value should probably be auto-tuned.
However, it seems like a good idea to introduce this kind
of thing one step at a time.

 While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
 Of these halts, almost all are failed polls.  During the benchmark,
 instead, basically all halts end within the polling period, except a more
 or less constant stream of 50 per second coming from vCPUs that are not
 running the benchmark.  The wasted time is thus very low.  Things may
 be slightly different for Windows VMs, which have a ~10 ms timer tick.
 
 The effect is also visible on Marcelo's recently-introduced latency
 test for the TSC deadline timer.  Though of course a non-RT kernel has
  awful latency bounds, the latency of the timer is around 8000-10000 clock
  cycles compared to 20000-120000 without setting halt_poll.  For the TSC
 deadline timer, thus, the effect is both a smaller average latency and
 a smaller variance.
 
  Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Acked-by: Rik van Riel <r...@redhat.com>
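
The mechanism being acked is easy to picture; below is a hedged sketch of the
poll-before-block idea from the commit message. Helper names such as
vcpu_has_pending_event() and vcpu_halt() are placeholders for illustration,
not actual KVM functions; the real change lives in kvm_vcpu_block().

static void vcpu_halt(struct kvm_vcpu *vcpu)
{
	if (halt_poll) {			/* the new module parameter */
		u64 start = get_cycles();

		/* Poll briefly: if the wakeup arrives inside this window we
		 * skip the schedule-out/schedule-in round trip and a likely
		 * C1 entry on the physical CPU. */
		while (get_cycles() - start < halt_poll) {
			if (vcpu_has_pending_event(vcpu)) {	/* placeholder */
				++vcpu->stat.halt_successful_poll;
				return;
			}
			cpu_relax();
		}
	}

	kvm_vcpu_block(vcpu);	/* normal blocking path */
}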



Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Rik van Riel
On 02/05/2015 01:56 PM, Paul E. McKenney wrote:
 On Thu, Feb 05, 2015 at 01:09:19PM -0500, Rik van Riel wrote:
 On 02/05/2015 12:50 PM, Paul E. McKenney wrote:
 On Thu, Feb 05, 2015 at 11:52:37AM -0500, Rik van Riel wrote:
 On 02/05/2015 11:44 AM, Christian Borntraeger wrote:
  On 05.02.2015 at 17:35, r...@redhat.com wrote:
 From: Rik van Riel r...@redhat.com

 The host kernel is not doing anything while the CPU is executing
 a KVM guest VCPU, so it can be marked as being in an extended
 quiescent state, identical to that used when running user space
 code.

 The only exception to that rule is when the host handles an
 interrupt, which is already handled by the irq code, which
 calls rcu_irq_enter and rcu_irq_exit.

 The guest_enter and guest_exit functions already switch vtime
 accounting independent of context tracking, so leave those calls
 where they are, instead of moving them into the context tracking
 code.

 Signed-off-by: Rik van Riel r...@redhat.com
 ---
  include/linux/context_tracking.h   | 8 +++-
  include/linux/context_tracking_state.h | 1 +
  2 files changed, 8 insertions(+), 1 deletion(-)

 diff --git a/include/linux/context_tracking.h 
 b/include/linux/context_tracking.h
 index bd9f000fc98d..a5d3bb44b897 100644
 --- a/include/linux/context_tracking.h
 +++ b/include/linux/context_tracking.h
 @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
  static inline void exception_exit(enum ctx_state prev_ctx)
  {
  if (context_tracking_is_enabled()) {
 -if (prev_ctx == IN_USER)
 +if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
  context_tracking_user_enter(prev_ctx);
  }
  }
 @@ -78,6 +78,9 @@ static inline void guest_enter(void)
  vtime_guest_enter(current);
  else
  current->flags |= PF_VCPU;
 +
 +if (context_tracking_is_enabled())
 +context_tracking_user_enter(IN_GUEST);
  }


  Couldn't we make
 rcu_virt_note_context_switch(smp_processor_id());
 conditional in include/linux/kvm_host.h (kvm_guest_enter)

 e.g. something like
 if (!context_tracking_is_enabled())
   rcu_virt_note_context_switch(smp_processor_id());

 Possibly. I considered the same, but I do not know whether
 or not just rcu_user_enter / rcu_user_exit is enough.

 I could certainly try it out and see whether anything
 explodes, but I am not convinced that is careful enough
 when it comes to handling RCU code...

 Paul? :)

 That can fail for some odd combinations of Kconfig and boot parameters.
 As I understand it, you instead want the following:

 if (!tick_nohz_full_cpu(smp_processor_id()))
 rcu_virt_note_context_switch(smp_processor_id());

 Frederic might know of a better approach.

 Interesting, in what cases would that happen?
 
 My concern, perhaps misplaced, is that context_tracking_is_enabled(),
 but that the current CPU is not a nohz_full= CPU.  In that case, the
 context-tracking code would be within its rights to not tell RCU about
 the transition to the guest, and thus RCU would be unaware that the
 CPU was in a quiescent state.
 
 In most workloads, you would expect the CPU to get interrupted or
 preempted or something at some point, which would eventually inform
 RCU, but too much delay would result in the infamous RCU CPU stall
 warning message.  So let's code a bit more defensively.
 
 If context_tracking_is_enabled() we end up eventually
  calling into rcu_user_enter & rcu_user_exit.

 If it is not enabled, we call rcu_virt_note_context_switch.

 In what cases would we need to call both?

 I don't see exit-to-userspace do the things that
 rcu_virt_note_context_switch does, and do not understand
 why userspace is fine with that...
 
 The real danger is doing neither.
 
 On tick_nohz_full_cpu() CPUs, the exit-to-userspace code should invoke
 rcu_user_enter(), which sets some per-CPU state telling RCU to ignore
 that CPU, since it cannot possibly do host RCU read-side critical sections
 while running a guest.

Ahhh, I understand now.  Thank you for your
explanation, Paul.

I will make sure your suggestion is in version 2 of
this series.



Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Rik van Riel
On 02/05/2015 12:50 PM, Paul E. McKenney wrote:
 On Thu, Feb 05, 2015 at 11:52:37AM -0500, Rik van Riel wrote:
 On 02/05/2015 11:44 AM, Christian Borntraeger wrote:
  On 05.02.2015 at 17:35, r...@redhat.com wrote:
 From: Rik van Riel r...@redhat.com

 The host kernel is not doing anything while the CPU is executing
 a KVM guest VCPU, so it can be marked as being in an extended
 quiescent state, identical to that used when running user space
 code.

 The only exception to that rule is when the host handles an
 interrupt, which is already handled by the irq code, which
 calls rcu_irq_enter and rcu_irq_exit.

 The guest_enter and guest_exit functions already switch vtime
 accounting independent of context tracking, so leave those calls
 where they are, instead of moving them into the context tracking
 code.

 Signed-off-by: Rik van Riel r...@redhat.com
 ---
  include/linux/context_tracking.h   | 8 +++-
  include/linux/context_tracking_state.h | 1 +
  2 files changed, 8 insertions(+), 1 deletion(-)

 diff --git a/include/linux/context_tracking.h 
 b/include/linux/context_tracking.h
 index bd9f000fc98d..a5d3bb44b897 100644
 --- a/include/linux/context_tracking.h
 +++ b/include/linux/context_tracking.h
 @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
  static inline void exception_exit(enum ctx_state prev_ctx)
  {
if (context_tracking_is_enabled()) {
 -  if (prev_ctx == IN_USER)
 +  if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
context_tracking_user_enter(prev_ctx);
}
  }
 @@ -78,6 +78,9 @@ static inline void guest_enter(void)
vtime_guest_enter(current);
else
	current->flags |= PF_VCPU;
 +
 +  if (context_tracking_is_enabled())
 +  context_tracking_user_enter(IN_GUEST);
  }


 Couldn't we make
 rcu_virt_note_context_switch(smp_processor_id());
 conditional in include/linux/kvm_host.h (kvm_guest_enter)

 e.g. something like
 if (!context_tracking_is_enabled())
 rcu_virt_note_context_switch(smp_processor_id());

 Possibly. I considered the same, but I do not know whether
 or not just rcu_user_enter / rcu_user_exit is enough.

 I could certainly try it out and see whether anything
 explodes, but I am not convinced that is careful enough
 when it comes to handling RCU code...

 Paul? :)
 
 That can fail for some odd combinations of Kconfig and boot parameters.
 As I understand it, you instead want the following:
 
   if (!tick_nohz_full_cpu(smp_processor_id()))
   rcu_virt_note_context_switch(smp_processor_id());
 
 Frederic might know of a better approach.

Interesting, in what cases would that happen?

If context_tracking_is_enabled() we end up eventually
calling into rcu_user_enter & rcu_user_exit.

If it is not enabled, we call rcu_virt_note_context_switch.

In what cases would we need to call both?

I don't see exit-to-userspace do the things that
rcu_virt_note_context_switch does, and do not understand
why userspace is fine with that...



Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest

2015-02-05 Thread Rik van Riel
On 02/05/2015 02:27 PM, Paul E. McKenney wrote:
 On Thu, Feb 05, 2015 at 02:02:35PM -0500, Rik van Riel wrote:

 Looking at context_tracking.h, I see the
 function context_tracking_cpu_is_enabled().

 That looks like it should do the right thing
 in this case.
 
 Right you are -- that same check is used to guard the
 context_tracking_user_enter() function's call to rcu_user_enter().

My test suggests it is working.

v2 incoming :)
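
For illustration only, a minimal sketch of the conditional the thread converged on; this is not the actual v2 patch, and it assumes kvm_guest_enter() keeps its current shape and that context_tracking_cpu_is_enabled() / rcu_virt_note_context_switch() keep the semantics Paul described:

static inline void kvm_guest_enter(void)
{
	unsigned long flags;

	local_irq_save(flags);
	guest_enter();
	local_irq_restore(flags);

	/*
	 * Only tell RCU about the context switch when per-CPU context
	 * tracking is NOT already handling the extended quiescent state;
	 * in that case rcu_user_enter() covers guest entry.
	 */
	if (!context_tracking_cpu_is_enabled())
		rcu_virt_note_context_switch(smp_processor_id());
}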



Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq

2015-01-14 Thread Rik van Riel
On 01/14/2015 12:12 PM, Marcelo Tosatti wrote:
 The problem:
 
 On -RT, an emulated LAPIC timer instance has the following path:
 
 1) hard interrupt
 2) ksoftirqd is scheduled
 3) ksoftirqd wakes up vcpu thread
 4) vcpu thread is scheduled
 
 This extra context switch introduces unnecessary latency in the 
 LAPIC path for a KVM guest.

 This patch reduces the average latency in my tests from 14us to 11us.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Acked-by: Rik van Riel r...@redhat.com



Re: [patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe

2015-01-14 Thread Rik van Riel
On 01/14/2015 12:12 PM, Marcelo Tosatti wrote:
 Since lapic timer handler only wakes up a simple waitqueue,
 it can be executed from hardirq context.
 
 Reduces average cyclictest latency by 3us.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Acked-by: Rik van Riel r...@redhat.com



Re: [patch 2/2] KVM: x86: add option to advance tscdeadline hrtimer expiration

2014-12-11 Thread Rik van Riel
On 12/11/2014 04:07 PM, Andy Lutomirski wrote:
 On Thu, Dec 11, 2014 at 12:58 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Thu, Dec 11, 2014 at 12:48:36PM -0800, Andy Lutomirski wrote:
 On 12/10/2014 07:07 PM, Marcelo Tosatti wrote:
 On Thu, Dec 11, 2014 at 12:37:57AM +0100, Paolo Bonzini wrote:


 On 10/12/2014 21:57, Marcelo Tosatti wrote:
 For the hrtimer which emulates the tscdeadline timer in the guest,
 add an option to advance expiration, and busy spin on VM-entry waiting
 for the actual expiration time to elapse.

 This allows achieving low latencies in cyclictest (or any scenario
 which requires strict timing regarding timer expiration).

 Reduces cyclictest avg latency by 50%.

 Note: this option requires tuning to find the appropriate value
 for a particular hardware/guest combination. One method is to measure the
 average delay between apic_timer_fn and VM-entry.
 Another method is to start with 1000ns, and increase the value
 in say 500ns increments until avg cyclictest numbers stop decreasing.

 What values are you using in practice for the parameter?

 7us.


 It takes 7us to get from TSC deadline expiration to the *start* of
 vmresume?  That seems rather extreme.

 Is it possible that almost all of that latency is from deadline
 expiration to C-state exit?  If so, can we teach the timer code to wake
 up early to account for that?  We're supposed to know our idle exit
 latency these days.

 7us includes:

 idle thread wakeup
 idle schedout
 ksoftirqd schedin
 ksoftirqd schedout
 qemu schedin
 vm-entry
 
 Is there some secret way to get perf to profile this?  Like some way
 to tell perf to only record samples after the IRQ fires and before
 vmresume?  Because 7 us seems way slower than it should be for
 this.
 
 Yes, Rik, I know that we're wasting all kinds of time doing dumb
 things with xstate, but that shouldn't be anywhere near 7us on modern
 hardware, unless we're actually taking a device-not-available
 exception in there. :)  There might be a whopping six xstate
 operations, but those should be 60ns each, perhaps, if the state is
 dirty?
 
 Maybe everything misses cache?

I don't expect the xstate stuff to be more than about half
a microsecond, with cache misses and failure to optimize
XSAVEOPT / XRSTOR across vmenter/vmexit.

We get another microsecond or so from not trapping from the
guest to the host every time the guest accesses the FPU for
the first time. Marcelo already has a hack for that in his
tree, and my series merely has something that achieves the
same in an automated (and hopefully upstreamable) way.




Re: [patch 2/2] KVM: x86: add option to advance tscdeadline hrtimer expiration

2014-12-10 Thread Rik van Riel
On 12/10/2014 12:06 PM, Marcelo Tosatti wrote:
 For the hrtimer which emulates the tscdeadline timer in the guest,
 add an option to advance expiration, and busy spin on VM-entry waiting
 for the actual expiration time to elapse.
 
 This allows achieving low latencies in cyclictest (or any scenario 
 which requires strict timing regarding timer expiration).
 
 Reduces cyclictest avg latency by 50%.
 
 Note: this option requires tuning to find the appropriate value 
 for a particular hardware/guest combination. One method is to measure the 
 average delay between apic_timer_fn and VM-entry. 
 Another method is to start with 1000ns, and increase the value
 in say 500ns increments until avg cyclictest numbers stop decreasing.

It would be good to document how this is used, in the changelog.
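
As a purely illustrative sketch of the mechanism the changelog describes (the hrtimer fires early by the tuned amount, and VM-entry busy-waits for the real deadline); the helper names below are made up and the host/guest TSC conversion is omitted:

static void busy_wait_for_guest_deadline(u64 guest_tsc_deadline)
{
	/* The timer fired 'advance' ns early; burn the remaining slack. */
	while (read_guest_tsc() < guest_tsc_deadline)	/* hypothetical helper */
		cpu_relax();
}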



Re: [patch 1/2] KVM: x86: add method to test PIR bitmap vector

2014-12-10 Thread Rik van Riel
On 12/10/2014 12:06 PM, Marcelo Tosatti wrote:
 kvm_x86_ops->test_posted_interrupt() returns true/false depending
 whether 'vector' is set.

Is that good? Bad? How does this patch address the issue?



Re: [patch 1/2] KVM: x86: add method to test PIR bitmap vector

2014-12-10 Thread Rik van Riel
On 12/10/2014 12:27 PM, Marcelo Tosatti wrote:
 On Wed, Dec 10, 2014 at 12:10:04PM -0500, Rik van Riel wrote:
 On 12/10/2014 12:06 PM, Marcelo Tosatti wrote:
 kvm_x86_ops->test_posted_interrupt() returns true/false depending
 whether 'vector' is set.

 Is that good? Bad? How does this patch address the issue?
 
 What issue?

Why is this change being made?


Re: kvm ivy bridge support?

2014-12-05 Thread Rik van Riel

On 12/05/2014 03:17 AM, Thomas Lau wrote:
 virsh cpu-models x86_64 | sort

...
 Penryn SandyBridge Westmere

...

 interesting that it doesn't have Ivy Bridge, any reason?

The reason would be that you are running an older version.

- -- 
All rights reversed


Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq

2014-11-25 Thread Rik van Riel
On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:
 The problem:
 
 On -RT, an emulated LAPIC timer instance has the following path:
 
 1) hard interrupt
 2) ksoftirqd is scheduled
 3) ksoftirqd wakes up vcpu thread
 4) vcpu thread is scheduled
 
 This extra context switch introduces unnecessary latency in the 
 LAPIC path for a KVM guest.
 
 The solution:
 
 Allow waking up vcpu thread from hardirq context,
 thus avoiding the need for ksoftirqd to be scheduled.
 
 Normal waitqueues make use of spinlocks, which on -RT 
 are sleepable locks. Therefore, waking up a waitqueue 
 waiter involves locking a sleeping lock, which 
 is not allowed from hard interrupt context.
 

What are the consequences for the realtime kernel of
making waitqueue traversal non-preemptible?

Is waitqueue traversal something that ever takes so
long we need to care about it being non-preemptible?




Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq

2014-11-25 Thread Rik van Riel
On 11/25/2014 01:57 PM, Rik van Riel wrote:
 On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:
 The problem:

 On -RT, an emulated LAPIC timer instance has the following path:

 1) hard interrupt
 2) ksoftirqd is scheduled
 3) ksoftirqd wakes up vcpu thread
 4) vcpu thread is scheduled

 This extra context switch introduces unnecessary latency in the 
 LAPIC path for a KVM guest.

 The solution:

 Allow waking up vcpu thread from hardirq context,
 thus avoiding the need for ksoftirqd to be scheduled.

 Normal waitqueues make use of spinlocks, which on -RT 
 are sleepable locks. Therefore, waking up a waitqueue 
 waiter involves locking a sleeping lock, which 
 is not allowed from hard interrupt context.

 
 What are the consequences for the realtime kernel of
 making waitqueue traversal non-preemptible?
 
 Is waitqueue traversal something that ever takes so
 long we need to care about it being non-preemptible?

I answered my own question.

This patch only changes the kvm vcpu waitqueue,
which should only ever have the vcpu thread itself
waiting on it. In other words, it is a wait queue
of just one entry long, and the latency of traversing
it will be absolutely minimal.

Unless someone can think of a better way to implement
this patch, it gets my:

Acked-by: Rik van Riel r...@redhat.com



Re: [patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe

2014-11-25 Thread Rik van Riel
On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:
 Since lapic timer handler only wakes up a simple waitqueue,
 it can be executed from hardirq context.
 
 Reduces average cyclictest latency by 3us.

Can this patch be merged in the KVM tree, and go
upstream via Paolo?

 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH] kvm: Fix page ageing bugs

2014-09-22 Thread Rik van Riel

On 09/22/2014 03:57 PM, Andres Lagar-Cavilla wrote:
 1. We were calling clear_flush_young_notify in unmap_one, but we
 are within an mmu notifier invalidate range scope. The spte exists
 no more (due to range_start) and the accessed bit info has already
 been propagated (due to kvm_pfn_set_accessed). Simply call 
 clear_flush_young.
 
 2. We clear_flush_young on a primary MMU PMD, but this may be
 mapped as a collection of PTEs by the secondary MMU (e.g. during
 log-dirty). This required expanding the interface of the
 clear_flush_young mmu notifier, so a lot of code has been trivially
 touched.
 
 3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we
 emulate the access bit by blowing the spte. This requires proper
 synchronizing with MMU notifier consumers, like every other removal
 of spte's does.

Acked-by: Rik van Riel r...@redhat.com


- -- 
All rights reversed


Re: status of reducing context switching overhead in KVM

2014-06-21 Thread Rik van Riel

On 06/21/2014 04:55 AM, Xuekun Hu wrote:
 Hi, Rik
 
 I saw your presentation at last year 2013 kvm submit. 
 (http://www.linux-kvm.org/wiki/images/2/27/Kvm-forum-2013-idle-latency.pdf).

 
You said there will have some patches later, but I didn't find them.
 Are you continuing to do the optimization work? Any updates? :-)

I have gotten side-tracked by the automatic NUMA balancing work,
but Michael Tsirkin has been doing some work on reducing idle
latencies.

- -- 
All rights reversed


Re: [RFC] Implement Batched (group) ticket lock

2014-05-28 Thread Rik van Riel
On 05/28/2014 08:16 AM, Raghavendra K T wrote:

This patch looks very promising.

 TODO:
 - we need an intelligent way to nullify the effect of batching for baremetal
  (because extra cmpxchg is not required).

On (larger?) NUMA systems, the unfairness may be a nice performance
benefit, reducing cache line bouncing through the system, and it
could well outweigh the extra cmpxchg at times.

 - we may have to make batch size as kernel arg to solve above problem
  (to run same kernel for host/guest). Increasing batch size also seem to help
  virtualized guest more, so we will have flexibility of tuning depending on 
 vm size.
 
 - My kernbench/ebizzy test on baremetal (32 cpu +ht sandybridge) did not seem 
 to
   show the impact of extra cmpxchg. but there should be effect of extra 
 cmpxchg.

Canceled out by better NUMA locality?

Or maybe cmpxchg is cheap once you already own the cache line
exclusively?

 - virtualized guest had slight impact on 1x cases of some benchmarks but we 
 have got
  impressive performance for >1x cases. So overall, patch needs exhaustive
 testing.
 
 - we can further add dynamically changing batch_size implementation 
 (inspiration and
   hint by Paul McKenney) as necessary.

I could see a larger batch size being beneficial.

Currently the maximum wait time for a spinlock on a system
with N CPUs is N times the length of the largest critical
section.

Having the batch size set equal to the number of CPUs would only
double that, and better locality (CPUs local to the current
lock holder winning the spinlock operation) might speed things
up enough to cancel that part of that out again...

  I have found that increasing  batch size gives excellent improvements for 
  overcommitted guests. I understand that we need more exhaustive testing.
 
  Please provide your suggestion and comments.

I have only nitpicks so far...

 diff --git a/arch/x86/include/asm/spinlock_types.h 
 b/arch/x86/include/asm/spinlock_types.h
 index 4f1bea1..b04c03d 100644
 --- a/arch/x86/include/asm/spinlock_types.h
 +++ b/arch/x86/include/asm/spinlock_types.h
 @@ -3,15 +3,16 @@
  
  #include linux/types.h
  
 +#define TICKET_LOCK_INC_SHIFT 1
+#define __TICKET_LOCK_TAIL_INC (1<<TICKET_LOCK_INC_SHIFT)
 +
  #ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define __TICKET_LOCK_INC 2
  #define TICKET_SLOWPATH_FLAG ((__ticket_t)1)
  #else
-#define __TICKET_LOCK_INC 1
  #define TICKET_SLOWPATH_FLAG ((__ticket_t)0)
  #endif

For the !CONFIG_PARAVIRT case, TICKET_LOCK_INC_SHIFT used to be 0,
now you are making it one. Probably not an issue, since even people
who compile with 128 < CONFIG_NR_CPUS <= 256 will likely have their
spinlocks padded out to 32 or 64 bits anyway in most data structures.

-#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
+#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_TAIL_INC))
  typedef u8  __ticket_t;
  typedef u16 __ticketpair_t;
  #else
 @@ -19,7 +20,12 @@ typedef u16 __ticket_t;
  typedef u32 __ticketpair_t;
  #endif
  
 -#define TICKET_LOCK_INC  ((__ticket_t)__TICKET_LOCK_INC)
 +#define TICKET_LOCK_TAIL_INC ((__ticket_t)__TICKET_LOCK_TAIL_INC)
 +
 +#define TICKET_LOCK_HEAD_INC ((__ticket_t)1)
+#define TICKET_BATCH 0x4 /* 4 waiters can contend simultaneously */
+#define TICKET_LOCK_BATCH_MASK (~(TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) + \
 +   TICKET_LOCK_TAIL_INC - 1)

I do not see the value in having TICKET_BATCH declared with a
hexadecimal number, and it may be worth making sure the code
does not compile if someone tried a TICKET_BATCH value that
is not a power of 2.
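
One possible way to enforce that at build time, shown only as an illustration (this is not part of the posted patch), using a decimal value as suggested:

#define TICKET_BATCH	4	/* 4 waiters can contend simultaneously */
#if (TICKET_BATCH & (TICKET_BATCH - 1)) != 0
#error "TICKET_BATCH must be a power of 2"
#endif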

-- 
All rights reversed


Re: [RFC] Implement Batched (group) ticket lock

2014-05-28 Thread Rik van Riel
On 05/28/2014 06:19 PM, Linus Torvalds wrote:

 If somebody has a P4 still, that's likely the worst case by far.

I'm sure cmpxchg isn't the only thing making P4 the worst case :)

-- 
All rights reversed


Re: commit 0bf1457f0cfca7b mm: vmscan: do not swap anon pages just because free+file is low causes heavy performance regression on paging

2014-04-22 Thread Rik van Riel
On 04/22/2014 07:57 AM, Christian Borntraeger wrote:
 On 22/04/14 12:55, Christian Borntraeger wrote:
 While preparing/testing some KVM on s390 patches for the next merge window 
 (target is kvm/next which is based on 3.15-rc1) I faced a very severe 
 performance hickup on guest paging (all anonymous memory).

 All memory bound guests are in D state now and the system is barely 
 unusable.

 Reverting commit 0bf1457f0cfca7bc026a82323ad34bcf58ad035d
 "mm: vmscan: do not swap anon pages just because free+file is low" makes the
 problem go away.

 According to /proc/vmstat the system is now in direct reclaim almost all the 
 time for every page fault (more than 10x more direct reclaims than kswap 
 reclaims)
 With the patch being reverted everything is fine again.

 Any ideas?
 
 Here is an idea to tackle my problem and the original problem:
 
 reverting  0bf1457f0cfca7bc026a82323ad34bcf58ad035d + checking against low, 
 also seems to make my system usable.
 
 --- a/mm/vmscan.c
 +++ b/mm/vmscan.c
 @@ -1923,7 +1923,7 @@ static void get_scan_count(struct lruvec *lruvec, 
 struct scan_control *sc,
  */
 if (global_reclaim(sc)) {
 free = zone_page_state(zone, NR_FREE_PAGES);
-   if (unlikely(file + free <= high_wmark_pages(zone))) {
+   if (unlikely(file + free <= low_wmark_pages(zone))) {
 scan_balance = SCAN_ANON;
 goto out;
 }
 

Looks reasonable to me.  Johannes?

-- 
All rights reversed


Re: [PATCH -mm 3/3] move mmu notifier call from change_protection to change_pmd_range

2014-02-18 Thread Rik van Riel
On 02/18/2014 09:24 PM, David Rientjes wrote:
 On Tue, 18 Feb 2014, r...@redhat.com wrote:
 
 From: Rik van Riel r...@redhat.com

 The NUMA scanning code can end up iterating over many gigabytes
 of unpopulated memory, especially in the case of a freshly started
 KVM guest with lots of memory.

 This results in the mmu notifier code being called even when
 there are no mapped pages in a virtual address range. The amount
 of time wasted can be enough to trigger soft lockup warnings
 with very large KVM guests.

 This patch moves the mmu notifier call to the pmd level, which
 represents 1GB areas of memory on x86-64. Furthermore, the mmu
 notifier code is only called from the address in the PMD where
 present mappings are first encountered.

 The hugetlbfs code is left alone for now; hugetlb mappings are
 not relocatable, and as such are left alone by the NUMA code,
 and should never trigger this problem to begin with.

 Signed-off-by: Rik van Riel r...@redhat.com
 Cc: Peter Zijlstra pet...@infradead.org
 Cc: Andrea Arcangeli aarca...@redhat.com
 Reported-by: Xing Gang gang.x...@hp.com
 Tested-by: Chegu Vinod chegu_vi...@hp.com
 
 Acked-by: David Rientjes rient...@google.com
 
 Might have been cleaner to move the 
 mmu_notifier_invalidate_range_{start,end}() to hugetlb_change_protection() 
 as well, though.

I can certainly do that if you want.  Just let me know
and I'll send a v2 of patch 3 :)

-- 
All rights reversed


[PATCH -mm v2 3/3] move mmu notifier call from change_protection to change_pmd_range

2014-02-18 Thread Rik van Riel
On Tue, 18 Feb 2014 18:24:36 -0800 (PST)
David Rientjes rient...@google.com wrote:

 Acked-by: David Rientjes rient...@google.com
 
 Might have been cleaner to move the 
 mmu_notifier_invalidate_range_{start,end}() to hugetlb_change_protection() 
 as well, though.

Way cleaner!  Second version attached :)

Thanks, David.

---8---

Subject: move mmu notifier call from change_protection to change_pmd_range

The NUMA scanning code can end up iterating over many gigabytes
of unpopulated memory, especially in the case of a freshly started
KVM guest with lots of memory.

This results in the mmu notifier code being called even when
there are no mapped pages in a virtual address range. The amount
of time wasted can be enough to trigger soft lockup warnings
with very large KVM guests.

This patch moves the mmu notifier call to the pmd level, which
represents 1GB areas of memory on x86-64. Furthermore, the mmu
notifier code is only called from the address in the PMD where
present mappings are first encountered.

The hugetlbfs code is left alone for now; hugetlb mappings are
not relocatable, and as such are left alone by the NUMA code,
and should never trigger this problem to begin with.

Signed-off-by: Rik van Riel r...@redhat.com
Acked-by: David Rientjes rient...@google.com
---
 mm/hugetlb.c  |  2 ++
 mm/mprotect.c | 15 ---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c0d930f..f0c5dfb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3065,6 +3065,7 @@ unsigned long hugetlb_change_protection(struct 
vm_area_struct *vma,
	BUG_ON(address >= end);
flush_cache_range(vma, address, end);
 
+   mmu_notifier_invalidate_range_start(mm, start, end);
	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
	for (; address < end; address += huge_page_size(h)) {
spinlock_t *ptl;
@@ -3094,6 +3095,7 @@ unsigned long hugetlb_change_protection(struct 
vm_area_struct *vma,
 */
flush_tlb_range(vma, start, end);
	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+   mmu_notifier_invalidate_range_end(mm, start, end);
 
	return pages << h->order;
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d790166..76146fa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -115,9 +115,11 @@ static inline unsigned long change_pmd_range(struct 
vm_area_struct *vma,
pgprot_t newprot, int dirty_accountable, int prot_numa)
 {
pmd_t *pmd;
+   struct mm_struct *mm = vma->vm_mm;
unsigned long next;
unsigned long pages = 0;
unsigned long nr_huge_updates = 0;
+   unsigned long mni_start = 0;
 
pmd = pmd_offset(pud, addr);
do {
@@ -126,6 +128,13 @@ static inline unsigned long change_pmd_range(struct 
vm_area_struct *vma,
next = pmd_addr_end(addr, end);
		if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd))
continue;
+
+   /* invoke the mmu notifier if the pmd is populated */
+   if (!mni_start) {
+   mni_start = addr;
+   mmu_notifier_invalidate_range_start(mm, mni_start, end);
+   }
+
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
split_huge_page_pmd(vma, addr, pmd);
@@ -149,6 +158,9 @@ static inline unsigned long change_pmd_range(struct 
vm_area_struct *vma,
pages += this_pages;
} while (pmd++, addr = next, addr != end);
 
+   if (mni_start)
+   mmu_notifier_invalidate_range_end(mm, mni_start, end);
+
if (nr_huge_updates)
count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
return pages;
@@ -208,15 +220,12 @@ unsigned long change_protection(struct vm_area_struct 
*vma, unsigned long start,
   unsigned long end, pgprot_t newprot,
   int dirty_accountable, int prot_numa)
 {
-   struct mm_struct *mm = vma->vm_mm;
unsigned long pages;
 
-   mmu_notifier_invalidate_range_start(mm, start, end);
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot);
else
pages = change_protection_range(vma, start, end, newprot, 
dirty_accountable, prot_numa);
-   mmu_notifier_invalidate_range_end(mm, start, end);
 
return pages;
 }


Re: [PATCH] mm: hugetlb: initialize PG_reserved for tail pages of gigantig compound pages

2013-10-10 Thread Rik van Riel

On 10/10/2013 12:12 PM, Andrea Arcangeli wrote:

11feeb498086a3a5907b8148bdf1786a9b18fc55 introduced a memory leak when
KVM is run on gigantic compound pages.

11feeb498086a3a5907b8148bdf1786a9b18fc55 depends on the assumption
that PG_reserved is identical for all head and tail pages of a
compound page. So that if get_user_pages returns a tail page, we don't
need to check the head page in order to know if we deal with a
reserved page that requires different refcounting.

The assumption that PG_reserved is the same for head and tail pages is
certainly correct for THP and regular hugepages, but gigantic
hugepages allocated through bootmem don't clear the PG_reserved on the
tail pages (the clearing of PG_reserved is done later only if the
gigantic hugepage is freed).

This patch corrects the gigantic compound page initialization so that
we can retain the optimization in
11feeb498086a3a5907b8148bdf1786a9b18fc55. The cacheline was already
modified in order to set PG_tail so this won't affect the boot time of
large memory systems.

Reported-by: andy123 ajs124.ajs...@gmail.com
Signed-off-by: Andrea Arcangeli aarca...@redhat.com


Acked-by: Rik van Riel r...@redhat.com



Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-13 Thread Rik van Riel

On 05/13/2013 11:16 AM, Michael S. Tsirkin wrote:


However, there's a big question mark: host specifies
inflate, guest says deflate, who wins?


If we're dealing with a NUMA guest, they could both win :)

The host could see reduced memory use of the guest in one
place, while the guest could see increased memory availability
in another place...

I also suspect that having some churn could help sort out
exactly what the working set is.


At some point Google sent patches that gave guest
complete control over the balloon.
This has the advantage that management isn't involved.


I believe the Google patches still included some way for the
host to initiate balloon inflation on the guest side, because
the guest internal state alone is not enough to see when the
host is under memory pressure.

I discussed the project with the Google developers in question
a little over a year ago, but I do not remember whether their
pressure notification went through qemu, or directly from the
host kernel to the guest kernel...


And at some level it seems to make sense: why set
an upper limit on size of the balloon?
The bigger it is, the better.


Response time.

If too much of a guest's memory has been removed, it can take
too long for the guest to react to user requests, be it over
the web or ssh or something else...

--
All rights reversed


Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-13 Thread Rik van Riel

On 05/13/2013 11:35 AM, Michael S. Tsirkin wrote:

On Mon, May 13, 2013 at 11:22:58AM -0400, Rik van Riel wrote:



I believe the Google patches still included some way for the
host to initiate balloon inflation on the guest side, because
the guest internal state alone is not enough to see when the
host is under memory pressure.

I discussed the project with the Google developers in question
a little over a year ago, but I do not remember whether their
pressure notification went through qemu, or directly from the
host kernel to the guest kernel...


So increasing the max number of pages in balloon
indicates host memory pressure to the guest?
Fair enough but I wonder whether there's a way to
make it more explicit in the interface, somehow.


There may be a better way to do this, but I am really not sure
what that would be. What properties would you like to see in
the interface? What kind of behaviour are you looking for?

--
All rights reversed


Re: [RFC 2/2] virtio_balloon: auto-ballooning support

2013-05-12 Thread Rik van Riel

On 05/12/2013 10:30 AM, Michael S. Tsirkin wrote:

On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote:

Automatic ballooning consists of dynamically adjusting the guest's
balloon according to memory pressure in the host and in the guest.

This commit implements the guest side of automatic balloning, which
basically consists of registering a shrinker callback with the kernel,
which will try to deflate the guest's balloon by the amount of pages
being requested. The shrinker callback is only registered if the host
supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.


OK so we have a new feature bit, such that:
- if AUTO_BALLOON is set in host, guest can leak a
   page from a balloon at any time

questions left unanswered
- what meaning does num_pages have now?


as large as we could go


- when will the balloon be re-inflated?


I believe the qemu changes Luiz wrote address that side,
with qemu-kvm getting notifications from the host kernel
when there is memory pressure, and shrinking the guest
in reaction to that notification.


I'd like to see a spec patch addressing these questions.

Would we ever want to mix the two types of ballooning?
If yes possibly when we put a page in the balloon we
might want to give host a hint this page might be
leaked again soon.


It might not be the same page, and the host really does
not care which page it is.

The automatic inflation happens when the host needs to
free up memory.


Automatic inflate is performed by the host.

Here are some numbers. The test-case is to run 35 VMs (1G of RAM each)
in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap.
SWAP IN and SWAP OUT correspond to the number of pages swapped in and
swapped out, respectively.

Auto-ballooning disabled:

RUN  TIME(s)  SWAP IN  SWAP OUT

1634  930980   1588522
2610  627422   1362174
3649  1079847  1616367
4543  953289   1635379
5642  913237   1514000

Auto-ballooning enabled:

RUN  TIME(s)  SWAP IN  SWAP OUT

1629  901  12537
2624  981  18506
3626  573  9085
4631  2250 42534
5627  1610 20808


So what exactly happens here?
Much less swap in/out activity, but no gain
in total runtime. Doesn't this show there's
a bottleneck somewhere? Could be a problem in
the implementation?


It could also be an issue with the benchmark chosen,
which may not have swap as its bottleneck at any point.

However, the reduced swapping is still very promising!


Also, what happened with the balloon?
Did we end up with balloon completely inflated? deflated?

One question to consider: possibly if we are
going to reuse the page in the balloon soon,
we might want to bypass notify before use for it?
Maybe that will help speed things up.



Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---
  drivers/virtio/virtio_balloon.c | 55 +
  include/uapi/linux/virtio_balloon.h |  1 +
  2 files changed, 56 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 9d5fe2b..f9dcae8 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -71,6 +71,9 @@ struct virtio_balloon
/* Memory statistics */
int need_stats_update;
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+   /* Memory shrinker */
+   struct shrinker shrinker;
  };

  static struct virtio_device_id id_table[] = {
@@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page *page)
pfns[i] = page_to_balloon_pfn(page) + i;
  }

+/* This function should be called with vb->balloon_mutex held */
  static void fill_balloon(struct virtio_balloon *vb, size_t num)
  {
	struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
@@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned 
int num)
}
  }

+/* This function should be called with vb->balloon_mutex held */
  static void leak_balloon(struct virtio_balloon *vb, size_t num)
  {
struct page *page;
@@ -285,6 +290,45 @@ static void update_balloon_size(struct virtio_balloon *vb)
  actual, sizeof(actual));
  }

+static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb)
+{
+   return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static int balloon_shrinker(struct shrinker *shrinker,struct shrink_control 
*sc)
+{
+   unsigned int nr_pages, new_target;
+   struct virtio_balloon *vb;
+
+   vb = container_of(shrinker, struct virtio_balloon, shrinker);
+   if (!mutex_trylock(&vb->balloon_lock)) {
+   return -1;
+   }
+
+   nr_pages = balloon_get_nr_pages(vb);
+   if (!sc->nr_to_scan || !nr_pages) {
+   goto out;
+   }
+
+   /*
+* If the current balloon size is greater than the number of
+* pages being reclaimed by the kernel, deflate only the needed
+* amount. 

Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 07:51 AM, Peter Zijlstra wrote:

On Sun, 2013-04-21 at 17:12 -0400, Rik van Riel wrote:


If we always incremented the ticket number by 2 (instead of 1), then
we could use the lower bit of the ticket number as the spinlock.


ISTR that paravirt ticket locks already do that and use the lsb to
indicate the unlock needs to perform wakeups.

Also, since all of this is virt nonsense, shouldn't it live in the
paravirt ticket lock code and leave the native code as is?


Sure, but that is still no reason not to have the virt
implementation be as fast as possible, and share the same
data type as the non-virt implementation.

Also, is it guaranteed that the native spin_lock code has
not been called yet before we switch over to the paravirt
functions?

If the native spin_lock code has been called already at
that time, the native code would still need to be modified
to increment the ticket number by 2, so we end up with a
compatible value in each spin lock's .tickets field, and
prevent a deadlock after we switch over to the paravirt
variant.

--
All rights reversed


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 03:49 PM, Peter Zijlstra wrote:

On Mon, 2013-04-22 at 08:52 -0400, Rik van Riel wrote:



If the native spin_lock code has been called already at
that time, the native code would still need to be modified
to increment the ticket number by 2, so we end up with a
compatible value in each spin lock's .tickets field, and
prevent a deadlock after we switch over to the paravirt
variant.


I thought the stuff already made it upstream, but apparently not; the
 latest posting I'm aware of is here:

   https://lkml.org/lkml/2012/5/2/105

That stuff changes the normal ticket increment as well..


Jiannan,

It looks like the patch above could make a good patch
1 (or 2) in your patch series :)

--
All rights reversed


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 04:08 PM, Peter Zijlstra wrote:

On Mon, 2013-04-22 at 15:56 -0400, Rik van Riel wrote:

On 04/22/2013 03:49 PM, Peter Zijlstra wrote:

On Mon, 2013-04-22 at 08:52 -0400, Rik van Riel wrote:



If the native spin_lock code has been called already at
that time, the native code would still need to be modified
to increment the ticket number by 2, so we end up with a
compatible value in each spin lock's .tickets field, and
prevent a deadlock after we switch over to the paravirt
variant.


I thought the stuff already made it upstream, but apparently not; the
latest posting I'm aware of is here:

https://lkml.org/lkml/2012/5/2/105

That stuff changes the normal ticket increment as well..


Jiannan,

It looks like the patch above could make a good patch
1 (or 2) in your patch series :)


I much prefer the entire series from Jeremy since it maintains the
ticket semantics and doesn't degrade the lock to unfair under
contention.

Now I suppose there's a reason its not been merged yet and I suspect
its !paravirt hotpath impact which wasn't rightly justified or somesuch
so maybe someone can work on that or so.. dunno.


IIRC one of the reasons was that the performance improvement wasn't
as obvious.  Rescheduling VCPUs takes a fair amount of time, quite
probably more than the typical hold time of a spinlock.

--
All rights reversed


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 04:46 PM, Jiannan Ouyang wrote:


It would still be very interesting to conduct more experiments to
compare these two, to see if the fairness enforced by pv_lock is
mandatory, and if preemptable-lock outperforms pv_lock in most cases,
and how do they work with PLE.


Given the fairly high cost of rescheduling a VCPU (which is likely
to include an IPI), versus the short hold time of most spinlocks,
I have the strong suspicion that your approach would win.

The fairness is only compromised in a limited way and in certain
circumstances, so I am not too worried about that.

--
All rights reversed


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 04:48 PM, Peter Zijlstra wrote:


Hmm.. it looked like under light overcommit the paravirt ticket lock
still had some gain (~10%) and of course it brings the fairness thing
which is always good.

I can only imagine the mess unfair + vcpu preemption can bring to guest
tasks.


If you think unfairness + vcpu preemption is bad, you haven't
tried full fairness + vcpu preemption :)

--
All rights reversed


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 04:55 PM, Peter Zijlstra wrote:

On Mon, 2013-04-22 at 16:46 -0400, Jiannan Ouyang wrote:



- pv-preemptable-lock has much less performance variance compare to
pv_lock, because it adapts to preemption within  VM,
   other than using rescheduling that increase VM interference


I would say it has a _much_ worse worst case (and thus worse variance)
than the paravirt ticket implementation from Jeremy. While full
paravirt ticket lock results in vcpu scheduling it does maintain
fairness.

If you drop strict fairness you can end up in unbounded starvation
cases and those are very ugly indeed.


If needed, Jiannan's scheme could easily be bounded to prevent
infinite starvation. For example, we could allow only the first
8 CPUs in line to jump the queue.

However, given the way that virtual CPUs get scheduled in and
out all the time, I suspect starvation is not a worry, and we
will not need the additional complexity to deal with it.

You may want to play around with virtualization a bit, to get
a feel for how things work in virt land.

--
All rights reversed


Re: Preemptable Ticket Spinlock

2013-04-22 Thread Rik van Riel

On 04/22/2013 05:56 PM, Andi Kleen wrote:

Rik van Riel r...@redhat.com writes:


If we always incremented the ticket number by 2 (instead of 1), then
we could use the lower bit of the ticket number as the spinlock.


Spinning on a single bit is very inefficient, as you need to do
try lock in a loop which is very unfriendly to the MESI state protocol.
It's much better to have at least three states and allow
spinning-while-reading-only.

This is typically very visible on systems with 2S.


Absolutely, the spinning should be read-only, until the CPU
sees that the desired bit is clear.  MESI-friendly spinning
is essential.
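
A standalone userspace sketch (C11 atomics, not kernel code) of the read-mostly spinning described above: the atomic read-modify-write is only attempted once the lock bit has been observed clear, so the cache line can stay in the Shared state while waiting:

#include <stdatomic.h>

static void spin_acquire_bit(_Atomic unsigned int *word)
{
	for (;;) {
		/* Read-only spin: generates no invalidations for other waiters. */
		while (atomic_load_explicit(word, memory_order_relaxed) & 1)
			;
		/* Bit looks clear: try to grab it with a single atomic RMW. */
		if (!(atomic_fetch_or_explicit(word, 1, memory_order_acquire) & 1))
			return;
	}
}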

--
All rights reversed


Re: Preemptable Ticket Spinlock

2013-04-21 Thread Rik van Riel

On 04/20/2013 06:12 PM, Jiannan Ouyang wrote:

Hello Everyone,

I recently came up with a spinlock algorithm that can adapt to
preemption, which you may be interested in. The intuition is to
downgrade a fair lock to an unfair lock automatically upon preemption,
and preserve the fairness otherwise. It is a guest side optimization,
and can be used as a complementary technique to host side optimizations
like co-scheduling and Pause-Loop Exiting.

In my experiments, it improves VM performance by 5.32X on average, when
running on a non-paravirtual VMM, and by 7.91X when running on a VMM
that supports a paravirtual locking interface (using a pv preemptable
ticket spinlock), when executing a set of microbenchmarks as well as a
realistic e-commerce benchmark.

A detailed algorithm description can be found in my VEE 2013 paper,
Preemptable Ticket Spinlocks: Improving Consolidated Performance in the
Cloud
Jiannan Ouyang, John R. Lange
ouyang,jackla...@cs.pitt.edu
University of Pittsburgh
http://people.cs.pitt.edu/~ouyang/files/publication/preemptable_lock-ouyang-vee13.pdf


Your algorithm is very clever, and very promising.

However, it does increase the size of the struct spinlock, and adds
an additional atomic operation to spin_unlock, neither of which I
suspect are necessary.

If we always incremented the ticket number by 2 (instead of 1), then
we could use the lower bit of the ticket number as the spinlock.

If we do NOT run virtualized, we simply increment the ticket by 2
in spin_unlock, and the code can remain otherwise the same.

If we do run virtualized, we take that spinlock after acquiring
the ticket (or timing out), just like in your code. In the
virtualized spin_unlock, we can then release the spinlock and
increment the ticket in one operation: by simply increasing the
ticket by 1.

In other words, we should be able to keep the overhead of this
to an absolute minimum, and keep spin_unlock to be always the
same cost it is today.

--
All rights reversed


Re: [PATCH 2/4] Expand the steal time msr to also contain the consigned time.

2013-02-06 Thread Rik van Riel

On 02/05/2013 04:49 PM, Michael Wolf wrote:

Expand the steal time msr to also contain the consigned time.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
  arch/x86/include/asm/paravirt.h   |4 ++--
  arch/x86/include/asm/paravirt_types.h |2 +-
  arch/x86/kernel/kvm.c |7 ++-
  kernel/sched/core.c   |   10 +-
  kernel/sched/cputime.c|2 +-
  5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5edd174..9b753ea 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
  extern struct static_key paravirt_steal_enabled;
  extern struct static_key paravirt_steal_rq_enabled;

-static inline u64 paravirt_steal_clock(int cpu)
+static inline void paravirt_steal_clock(int cpu, u64 *steal)
  {
-   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+   PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
  }


This may be a stupid question, but what happens if a KVM
guest with this change, runs on a kernel that still has
the old steal time interface?

What happens if the host has the new steal time interface,
but the guest uses the old interface?

Will both cases continue to work as expected with your
patch series?

If so, could you document (in the source code) why things
continue to work?



Re: [PATCH 3/4] Add the code to send the consigned time from the host to the guest

2013-02-06 Thread Rik van Riel

On 02/05/2013 04:49 PM, Michael Wolf wrote:

Change the paravirt calls that retrieve the steal-time information
from the host.  Add to it getting the consigned value as well as
the steal time.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com



diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 06fdbd9..55d617f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -42,9 +42,10 @@

  struct kvm_steal_time {
__u64 steal;
+   __u64 consigned;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u32 pad[10];
  };


The function kvm_register_steal_time passes the address of such
a structure to the host kernel, which then does something with
it.

Could running a guest with the above patch, on top of a host
with the old code, result in the values for version and
flags being written into consigned?

Could that result in confusing the guest kernel to no end,
and generally breaking things?


Re: qemu-kvm hangs at start up under 3.8.0-rc3-00074-gb719f43 (works with CONFIG_LOCKDEP)

2013-01-15 Thread Rik van Riel

On 01/14/2013 01:24 PM, Andrew Clayton wrote:

On Mon, 14 Jan 2013 15:27:36 +0200, Gleb Natapov wrote:


On Sun, Jan 13, 2013 at 10:29:58PM +, Andrew Clayton wrote:

 When running qemu-kvm under 64bit Fedora 16 under current 3.8, it
 just hangs at start up. Doing a ps -ef hangs the process at the
point where it would display the qemu process (trying to list the
qemu-kvm /proc pid directory contents just hangs ls).

I also noticed some other weirdness at this point like Firefox
hanging for many seconds at a time and increasing load average.

The qemu command I was trying to run was

$ qemu-kvm -m 512 -smp 2 -vga vmware -k en-gb -drive
file=/home/andrew/machines/qemu/f16-i386.img,if=virtio

Here's the last few lines of a strace on it at start up.

open(/home/andrew/machines/qemu/f16-i386.img,
O_RDWR|O_DSYNC|O_CLOEXEC) = 8 lseek(8, 0,
SEEK_END)   = 9100722176 pread(8,
QFI\373\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\0\0\2\200\0\0\0...,
512, 0) = 512 pread(8,
\200\0\0\0\0\4\0\0\200\0\0\0\0\10\0\0\200\0\0\0\2\210\0\0\200\0\0\0\2\233\0\0...,
512, 65536) = 512 brk(0)  =
0x7faf12db brk(0x7faf12ddd000

It's hanging in that brk syscall. The load average also then starts
to increase.


However. I can make it run fine, if I enable CONFIG_LOCKDEP. But
the only thing in dmesg I get is the usual.

kvm: SMP vm created on host with unstable TSC; guest TSC will not
be reliable

I've attached both working and non-working .configs. The only
difference being the lock checking enabled in config.good.

The most recent kernel I had it working in was 3.7.0

System is a Quad Core Intel running 64bit Fedora 16.


 Can you run echo t > /proc/sysrq-trigger and see where it hangs?


Here you go, here's the bash process, qemu and a kvm bit. (From the
above command)

bashS 88013b2b0d00 0  3203   3133 0x
  880114dabe58 0082 800113558065 880114dabfd8
  880114dabfd8 4000 88013b0c5b00 88013b2b0d00
  880114dabd88 8109067d ea0004536670 ea0004536640
Call Trace:
  [8109067d] ? default_wake_function+0xd/0x10
  [8108a315] ? atomic_notifier_call_chain+0x15/0x20
  [8133d84f] ? tty_get_pgrp+0x3f/0x50
  [810819ac] ? pid_vnr+0x2c/0x30
  [8133fe54] ? tty_ioctl+0x7b4/0xbd0
  [8106bf62] ? wait_consider_task+0x102/0xaf0
  [815c00e4] schedule+0x24/0x70
  [8106cb24] do_wait+0x1d4/0x200
  [8106d9cb] sys_wait4+0x9b/0xf0
  [8106b9f0] ? task_stopped_code+0x50/0x50
  [815c1ad2] system_call_fastpath+0x16/0x1b

qemu-kvmD 88011ab8c8b8 0  3345   3203 0x
  880112129cd8 0082 880112129c50 880112129fd8
  880112129fd8 4000 88013b04ce00 880139da1a00
   000280da 880112129d38 810d3300
Call Trace:
  [810d3300] ? __alloc_pages_nodemask+0xf0/0x7c0
  [811273c6] ? touch_atime+0x66/0x170
  [810cdabf] ? generic_file_aio_read+0x5bf/0x730
  [815c00e4] schedule+0x24/0x70
  [815c0cdd] rwsem_down_failed_common+0xbd/0x150
  [815c0da3] rwsem_down_write_failed+0x13/0x15
  [812d1be3] call_rwsem_down_write_failed+0x13/0x20
  [815bf4dd] ? down_write+0x2d/0x34
  [810f0724] vma_adjust+0xe4/0x610
  [810f0fa4] vma_merge+0x1b4/0x270
  [810f1fa6] do_brk+0x196/0x330
  [810f2217] sys_brk+0xd7/0x130
  [815c1ad2] system_call_fastpath+0x16/0x1b


This looks like qemu-kvm getting stuck trying to get the anon_vma lock.

That leads to the obvious question: what is holding the lock, and/or
failed to release it?

Do you have any other (qemu-kvm?) processes on your system that have
any code in the VM (or strace/ptrace/...) in the backtrace, that might
be holding this lock?

Do you have anything in your dmesg showing threads that had a BUG_ON
(and exited) while holding the lock?

--
All rights reversed


Re: [PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.

2012-10-22 Thread Rik van Riel

On 10/16/2012 10:23 PM, Michael Wolf wrote:

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat.  This can
cause confusion for the end user.


How do s390 and Power systems deal with reporting that kind
of information?

IMHO it would be good to see what those do, so we do not end
up re-inventing the wheel, and confusing admins with yet another
way of reporting the information...

--
All rights reversed


Re: [PATCH] mm: compaction: Correct the nr_strict_isolated check for CMA

2012-10-16 Thread Rik van Riel

On 10/16/2012 04:39 AM, Mel Gorman wrote:

Thierry reported that the iron out patch for isolate_freepages_block()
had problems due to the strict check being too strict with mm: compaction:
Iron out isolate_freepages_block() and isolate_freepages_range() -fix1.
It's possible that more pages than necessary are isolated but the check
still fails and I missed that this fix was not picked up before RC1. This
same problem has been identified in 3.7-RC1 by Tony Prisk and should be
addressed by the following patch.

Signed-off-by: Mel Gorman mgor...@suse.de
Tested-by: Tony Prisk li...@prisktech.co.nz


Acked-by: Rik van Riel r...@redhat.com

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-21 Thread Rik van Riel

On 09/21/2012 08:00 AM, Raghavendra K T wrote:

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

When total number of VCPUs of system is less than or equal to physical CPUs,
PLE exits become costly since each VCPU can have dedicated PCPU, and
trying to find a target VCPU to yield_to just burns time in PLE handler.

This patch reduces overhead, by simply doing a return in such scenarios by
checking the length of current cpu runqueue.


I am not convinced this is the way to go.

The VCPU that is holding the lock, and is not releasing it,
probably got scheduled out. That implies that VCPU is on a
runqueue with at least one other task.


--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1629,6 +1629,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
int pass;
int i;

+   if (unlikely(rq_nr_running() == 1))
+   return;
+
kvm_vcpu_set_in_spin_loop(me, true);
/*
 * We boost the priority of a VCPU that is runnable but not




--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler

2012-09-21 Thread Rik van Riel

On 09/21/2012 08:00 AM, Raghavendra K T wrote:

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

When PLE handler fails to find a better candidate to yield_to, it
goes back and does spin again. This is acceptable when we do not
have overcommit.
But in overcommitted scenarios (especially when we have large
number of small guests), it is better to yield.

Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com


Acked-by: Rik van Riel r...@redhat.com

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9] Reduce compaction scanning and lock contention

2012-09-21 Thread Rik van Riel

On 09/21/2012 06:46 AM, Mel Gorman wrote:

Hi Andrew,

Richard Davies and Shaohua Li have both reported lock contention
problems in compaction on the zone and LRU locks as well as
significant amounts of time being spent in compaction. This series
aims to reduce lock contention and scanning rates to reduce that CPU
usage. Richard reported at https://lkml.org/lkml/2012/9/21/91 that
this series made a big difference to a problem he reported in August
(http://marc.info/?l=kvm&m=134511507015614&w=2).



One way or the other, this series has a large impact on the amount of
scanning compaction does when there is a storm of THP allocations.


Andrew,

Mel and I have discussed the stuff in this series quite a bit,
and I am convinced this is the way forward with compaction.

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler

2012-09-21 Thread Rik van Riel

On 09/21/2012 09:46 AM, Takuya Yoshikawa wrote:

On Fri, 21 Sep 2012 17:30:20 +0530
Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:


From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

When PLE handler fails to find a better candidate to yield_to, it
goes back and does spin again. This is acceptable when we do not
have overcommit.
But in overcommitted scenarios (especially when we have large
number of small guests), it is better to yield.

Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
  virt/kvm/kvm_main.c |4 
  1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8323685..713b677 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
}
}
}
+   /* In overcommitted cases, yield instead of spinning */
+   if (!yielded && rq_nr_running() > 1)
+   schedule();


How about doing cond_resched() instead?


Actually, an actual call to yield() may be better.

That will set scheduler hints to make the scheduler pick
another task for one round, while preserving this task's
top position in the runqueue.
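
As a rough sketch (an illustration of the suggestion above, not a tested
patch), the overcommit fallback in the hunk would then read:

+   /* In overcommitted cases, yield instead of spinning */
+   if (!yielded && rq_nr_running() > 1)
+           yield();   /* set the scheduler skip hint, keep our runqueue slot */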

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/6] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

When compaction was implemented it was known that scanning could potentially
be excessive. The ideal was that a counter be maintained for each pageblock
but maintaining this information would incur a severe penalty due to a
shared writable cache line. It has reached the point where the scanning
costs are a serious problem, particularly on long-lived systems where a
large process starts and allocates a large number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock
is marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is when to ignore the cached
information? If it's ignored too often, the scanning rates will still
be excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch


Big hammer, but I guess it is effective...

Acked-by: Rik van Riel r...@redhat.com
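
For readers following along, the mechanism in the changelog reduces to
something like the sketch below; the helper names are hypothetical, not
the exact mm/compaction.c API:

/* Sketch only: one "skip" bit per pageblock, set when a scan isolates
 * nothing, checked before any further scanning of that block. */
struct pageblock_state { int skip; };

static int isolate_from(struct pageblock_state *pb);    /* pages isolated */

static int maybe_scan_pageblock(struct pageblock_state *pb)
{
        if (pb->skip)                   /* earlier scan found nothing here */
                return 0;               /* skip until the skip bits are reset */
        if (isolate_from(pb) == 0)      /* fruitless again: remember that */
                pb->skip = 1;
        return 1;
}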

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/6] Revert mm: have order > 0 compaction start off where it left

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

This reverts commit 7db8889a (mm: have order > 0 compaction start off
where it left) and commit de74f1cc (mm: have order > 0 compaction start
near a pageblock with free pages). These patches were a good idea and
tests confirmed that they massively reduced the amount of scanning but
the implementation is complex and tricky to understand. A later patch
will cache what pageblocks should be skipped and reimplements the
concept of compact_cached_free_pfn on top for both migration and
free scanners.

Signed-off-by: Mel Gorman mgor...@suse.de


Sure, it makes your next patches easier...

Acked-by: Rik van Riel r...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/6] mm: compaction: Restart compaction from near where it left off

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order allocations,
it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to compact
at the start of the zone, and for free pages to compact things to at the
end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at
the end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

This patch caches where the migration and free scanner should start from on
subsequent compaction invocations using the pageblock-skip information. When
compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.

Signed-off-by: Mel Gorman mgor...@suse.de


Together with patch 5/6, this has the effect of
skipping compaction in a zone if the free and
isolate markers have met, and it has been less
than 5 seconds since the skip information was
reset.

Compaction on zones where we cycle through more
slowly can continue, even when this particular
zone is experiencing problems, so I guess this
is desired behaviour...

Acked-by: Rik van Riel r...@redhat.com
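
To make the caching concrete, a minimal sketch of the idea (field names
are illustrative; the real patch keeps this state in struct zone and
struct compact_control):

/* Remember where each scanner stopped, so the next compaction run
 * resumes there instead of rescanning the whole zone. */
struct compact_restart {
        unsigned long cached_migrate_pfn;  /* migrate scanner resumes here */
        unsigned long cached_free_pfn;     /* free scanner resumes here    */
};

static void compact_begin(struct compact_restart *r,
                          unsigned long *migrate_pfn, unsigned long *free_pfn)
{
        *migrate_pfn = r->cached_migrate_pfn;   /* start from the cached points */
        *free_pfn    = r->cached_free_pfn;
}

static void compact_progress(struct compact_restart *r,
                             unsigned long migrate_pfn, unsigned long free_pfn)
{
        r->cached_migrate_pfn = migrate_pfn;    /* record progress for next run */
        r->cached_free_pfn    = free_pfn;
}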

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/6] mm: compaction: Acquire the zone-lock as late as possible

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

Compaction's free scanner acquires the zone->lock when checking for PageBuddy
pages and isolating them. It does this even if there are no PageBuddy pages
in the range.

This patch defers acquiring the zone lock for as long as possible. In the
event there are no free pages in the pageblock then the lock will not be
acquired at all which reduces contention on zone->lock.

Signed-off-by: Mel Gorman mgor...@suse.de


Acked-by: Rik van Riel r...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/6] mm: compaction: Acquire the zone-lru_lock as late as possible

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

Compaction's migrate scanner acquires the zone->lru_lock when scanning a range
of pages looking for LRU pages to acquire. It does this even if there are
no LRU pages in the range. If multiple processes are compacting then this
can cause severe locking contention. To make matters worse commit b2eef8c0
(mm: compaction: minimise the time IRQs are disabled while isolating pages
for migration) releases the lru_lock every SWAP_CLUSTER_MAX pages that are
scanned.

This patch makes two changes to how the migrate scanner acquires the LRU
lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if
the lock is contended. This reduces the number of times it unnecessarily
disables and re-enables IRQs. The second is that it defers acquiring the
LRU lock for as long as possible. If there are no LRU pages or the only
LRU pages are transhuge then the LRU lock will not be acquired at all
which reduces contention on zone->lru_lock.

Signed-off-by: Mel Gorman mgor...@suse.de


Acked-by: Rik van Riel r...@redhat.com
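
The "only drop the lock when it is contended" pattern described above
looks roughly like this; a simplified stand-in for compact_checklock_irqsave,
not its exact signature:

static bool check_contended_lock(spinlock_t *lock, unsigned long *flags,
                                 bool *locked)
{
        /* Drop the lock only when someone else wants it or we must resched. */
        if (*locked && (spin_is_contended(lock) || need_resched())) {
                spin_unlock_irqrestore(lock, *flags);
                *locked = false;
                cond_resched();
        }
        if (!*locked) {
                spin_lock_irqsave(lock, *flags);
                *locked = true;
        }
        return *locked;
}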

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] mm: compaction: Abort compaction loop if lock is contended or run too long

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

From: Shaohua Li s...@fusionio.com

Changelog since V2
o Fix BUG_ON triggered due to pages left on cc.migratepages
o Make compact_zone_order() require non-NULL arg `contended'

Changelog since V1
o only abort the compaction if lock is contended or run too long
o Rearranged the code by Andrea Arcangeli.

isolate_migratepages_range() might isolate no pages if for example when
zone->lru_lock is contended and running asynchronous compaction. In this
case, we should abort compaction, otherwise, compact_zone will run a
useless loop and make zone->lru_lock even more contended.

[minc...@kernel.org: Putback pages isolated for migration if aborting]
[a...@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
Signed-off-by: Andrea Arcangeli aarca...@redhat.com
Signed-off-by: Shaohua Li s...@fusionio.com
Signed-off-by: Mel Gorman mgor...@suse.de


Acked-by: Rik van Riel r...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2 2/2] make the compaction skip ahead logic robust

2012-09-17 Thread Rik van Riel

On 09/15/2012 11:55 AM, Richard Davies wrote:

Hi Rik, Mel and Shaohua,

Thank you for your latest patches. I attach my latest perf report for a slow
boot with all of these applied.

Mel asked for timings of the slow boots. It's very hard to give anything
useful here! A normal boot would be a minute or so, and many are like that,
but the slowest that I have seen (on 3.5.x) was several hours. Basically, I
just test many times until I get one which is noticeably slower than normal
and then run perf record on that one.

The latest perf report for a slow boot is below. For the fast boots, most of
the time is in clean_page_c in do_huge_pmd_anonymous_page, but for this slow
one there is a lot of lock contention above that.


How often do you run into slow boots, vs. fast ones?


# Overhead  Command Shared Object   
   Symbol
#   ...    
..
#
 58.49% qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|
--- _raw_spin_lock_irqsave
   |
   |--95.07%-- compact_checklock_irqsave
   |  |
   |  |--70.03%-- isolate_migratepages_range
   |  |  compact_zone
   |  |  compact_zone_order
   |  |  try_to_compact_pages
   |  |  __alloc_pages_direct_compact
   |  |  __alloc_pages_nodemask


Looks like it moved from isolate_freepages_block in your last
trace, to isolate_migratepages_range?

Mel, I wonder if we have any quadratic complexity problems
in this part of the code, too?

The isolate_freepages_block CPU use can be fixed by simply
restarting where the last invocation left off, instead of
always starting at the end of the zone.  Could we need
something similar for isolate_migratepages_range?

After all, Richard has a 128GB system, and runs 108GB worth
of KVM guests on it...

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] Revert mm: have order > 0 compaction start near a pageblock with free pages

2012-09-13 Thread Rik van Riel
On Wed, 12 Sep 2012 17:46:15 +0100
Richard Davies rich...@arachsys.com wrote:
 Mel Gorman wrote:
  I see that this is an old-ish bug but I did not read the full history.
  Is it now booting faster than 3.5.0 was? I'm asking because I'm
  interested to see if commit c67fe375 helped your particular case.
 
 Yes, I think 3.6.0-rc5 is already better than 3.5.x but can still be
 improved, as discussed.

Re-reading Mel's commit de74f1cc3b1e9730d9b58580cd11361d30cd182d,
I believe it re-introduces the quadratic behaviour that the code
was suffering from before, by not moving zone-compact_cached_free_pfn
down when no more free pfns are found in a page block.

This mail reverts that changeset, the next introduces what I hope to
be the proper fix.  Richard, would you be willing to give these patches
a try, since your system seems to reproduce this bug easily?

---8---

Revert mm: have order > 0 compaction start near a pageblock with free pages

This reverts commit de74f1cc3b1e9730d9b58580cd11361d30cd182d.

Mel found a real issue with my skip ahead logic in the
compaction code, but unfortunately his approach appears to
have re-introduced quadratic behaviour in that the value
of zone->compact_cached_free_pfn is never advanced until
the compaction run wraps around the start of the zone.

This merely moved the starting point for the quadratic behaviour
further into the zone, but the behaviour has still been observed.

It looks like another fix is required.

Signed-off-by: Rik van Riel r...@redhat.com
Reported-by: Richard Davies rich...@daviesmail.org

diff --git a/mm/compaction.c b/mm/compaction.c
index 7fcd3a5..771775d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -431,20 +431,6 @@ static bool suitable_migration_target(struct page *page)
 }
 
 /*
- * Returns the start pfn of the last page block in a zone.  This is the starting
- * point for full compaction of a zone.  Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
-   unsigned long free_pfn;
-   free_pfn = zone->zone_start_pfn + zone->spanned_pages;
-   free_pfn &= ~(pageblock_nr_pages-1);
-   return free_pfn;
-}
-
-/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -483,6 +469,17 @@ static void isolate_freepages(struct zone *zone,
pfn -= pageblock_nr_pages) {
unsigned long isolated;
 
+   /*
+* Skip ahead if another thread is compacting in the area
+* simultaneously. If we wrapped around, we can only skip
+* ahead if zone->compact_cached_free_pfn also wrapped to
+* above our starting point.
+*/
+   if (cc->order > 0 && (!cc->wrapped ||
+ zone->compact_cached_free_pfn >
+ cc->start_free_pfn))
+   pfn = min(pfn, zone->compact_cached_free_pfn);
+
if (!pfn_valid(pfn))
continue;
 
@@ -533,15 +530,7 @@ static void isolate_freepages(struct zone *zone,
 */
if (isolated) {
high_pfn = max(high_pfn, pfn);
-
-   /*
-* If the free scanner has wrapped, update
-* compact_cached_free_pfn to point to the highest
-* pageblock with free pages. This reduces excessive
-* scanning of full pageblocks near the end of the
-* zone
-*/
-   if (cc->order > 0 && cc->wrapped)
+   if (cc->order > 0)
zone->compact_cached_free_pfn = high_pfn;
}
}
@@ -551,11 +540,6 @@ static void isolate_freepages(struct zone *zone,
 
cc->free_pfn = high_pfn;
cc->nr_freepages = nr_freepages;
-
-   /* If compact_cached_free_pfn is reset then set it now */
-   if (cc->order > 0 && !cc->wrapped &&
-   zone->compact_cached_free_pfn == start_free_pfn(zone))
-   zone->compact_cached_free_pfn = high_pfn;
 }
 
 /*
@@ -642,6 +626,20 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
 }
 
+/*
+ * Returns the start pfn of the last page block in a zone.  This is the starting
+ * point for full compaction of a zone.  Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+   unsigned long free_pfn;
+   free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+   free_pfn &= ~(pageblock_nr_pages-1);
+   return free_pfn

[PATCH 2/2] make the compaction skip ahead logic robust

2012-09-13 Thread Rik van Riel
Make the skip ahead logic in compaction resistant to compaction
wrapping around to the end of the zone.  This can lead to less
efficient compaction when one thread has wrapped around to the
end of the zone, and another simultaneous compactor has not done
so yet. However, it should ensure that we do not suffer quadratic
behaviour any more.

Signed-off-by: Rik van Riel r...@redhat.com
Reported-by: Richard Davies rich...@daviesmail.org

diff --git a/mm/compaction.c b/mm/compaction.c
index 771775d..0656759 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -431,6 +431,24 @@ static bool suitable_migration_target(struct page *page)
 }
 
 /*
+ * We scan the zone in a circular fashion, starting at
+ * zone->compact_cached_free_pfn. Be careful not to skip if
+ * one compacting thread has just wrapped back to the end of the
+ * zone, but another thread has not.
+ */
+static bool compaction_may_skip(struct zone *zone,
+   struct compact_control *cc)
+{
+   if (!cc->wrapped && zone->compact_free_pfn < cc->start_pfn)
+   return true;
+
+   if (cc->wrapped && zone_compact_free_pfn > cc->start_pfn)
+   return true;
+
+   return false;
+}
+
+/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -471,13 +489,9 @@ static void isolate_freepages(struct zone *zone,
 
/*
 * Skip ahead if another thread is compacting in the area
-* simultaneously. If we wrapped around, we can only skip
-* ahead if zone->compact_cached_free_pfn also wrapped to
-* above our starting point.
+* simultaneously, and has finished with this page block.
 */
-   if (cc->order > 0 && (!cc->wrapped ||
- zone->compact_cached_free_pfn >
- cc->start_free_pfn))
+   if (cc->order > 0 && compaction_may_skip(zone, cc))
pfn = min(pfn, zone->compact_cached_free_pfn);
 
if (!pfn_valid(pfn))

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -v2 2/2] make the compaction skip ahead logic robust

2012-09-13 Thread Rik van Riel
Argh. And of course I send out the version from _before_ the compile test,
instead of the one after! I am not used to caffeine any more and have had
way too much tea...

---8---

Make the skip ahead logic in compaction resistant to compaction
wrapping around to the end of the zone.  This can lead to less
efficient compaction when one thread has wrapped around to the
end of the zone, and another simultaneous compactor has not done
so yet. However, it should ensure that we do not suffer quadratic
behaviour any more.

Signed-off-by: Rik van Riel r...@redhat.com
Reported-by: Richard Davies rich...@daviesmail.org

diff --git a/mm/compaction.c b/mm/compaction.c
index 771775d..0656759 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -431,6 +431,24 @@ static bool suitable_migration_target(struct page *page)
 }
 
 /*
+ * We scan the zone in a circular fashion, starting at
+ * zone->compact_cached_free_pfn. Be careful not to skip if
+ * one compacting thread has just wrapped back to the end of the
+ * zone, but another thread has not.
+ */
+static bool compaction_may_skip(struct zone *zone,
+   struct compact_control *cc)
+{
+   if (!cc->wrapped && zone->compact_cached_free_pfn < cc->start_free_pfn)
+   return true;
+
+   if (cc->wrapped && zone->compact_cached_free_pfn > cc->start_free_pfn)
+   return true;
+
+   return false;
+}
+
+/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -471,13 +489,9 @@ static void isolate_freepages(struct zone *zone,
 
/*
 * Skip ahead if another thread is compacting in the area
-* simultaneously. If we wrapped around, we can only skip
-* ahead if zone->compact_cached_free_pfn also wrapped to
-* above our starting point.
+* simultaneously, and has finished with this page block.
 */
-   if (cc->order > 0 && (!cc->wrapped ||
- zone->compact_cached_free_pfn >
- cc->start_free_pfn))
+   if (cc->order > 0 && compaction_may_skip(zone, cc))
pfn = min(pfn, zone->compact_cached_free_pfn);
 
if (!pfn_valid(pfn))

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Rik van Riel

On 09/10/2012 01:37 PM, Mike Waychison wrote:

On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin m...@redhat.com wrote:



Also can you pls answer Avi's question?
How is overcommit managed?


Overcommit in our deployments is managed using memory cgroups on the
host.  This allows us to have very directed policies as to how
competing VMs on a host may overcommit.


How do your memory cgroups lead to guests inflating their balloons?

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Rik van Riel

On 09/10/2012 04:19 PM, Peter Zijlstra wrote:

On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:

+   /*
+* if the target task is not running, then only yield if the
+* current task is in guest mode
+*/
+   if (!(p_rq->curr->flags & PF_VCPU))
+   goto out_irq;


This would make yield_to() only ever work on KVM, not that I mind this
too much, its a horrid thing and making it less useful for (ab)use is a
good thing, still this probably wants mention somewhere :-)


Also, it would not preempt a non-kvm task, even if we need
to do that to boost a VCPU. I think the lines above should
be dropped.

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/1] kvm: Use vcpu_id as pivot instead of last boosted vcpu in PLE handler

2012-09-02 Thread Rik van Riel

On 09/02/2012 06:12 AM, Gleb Natapov wrote:

On Thu, Aug 30, 2012 at 12:51:01AM +0530, Raghavendra K T wrote:

  The idea of starting from the next vcpu (source of yield_to + 1) seems to work
  well for an overcommitted guest rather than using the last boosted vcpu. We can also
  remove per VM variable with this approach.

  Iteration for eligible candidate after this patch starts from vcpu source+1
  and ends at source-1 (after wrapping)

  Thanks Nikunj for his quick verification of the patch.

  Please let me know if this patch is interesting and makes sense.


This last_boosted_vcpu thing caused us trouble during attempt to
implement vcpu destruction. It is good to see it removed from this POV.


I like this implementation.  It should achieve pretty much
the same as my old code, but without the downsides and without
having to keep the same amount of global state.
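
The circular walk described in the changelog (start at source + 1, wrap
around, end at source - 1) can be illustrated with a stand-alone snippet;
the vcpu numbers here are made up for the example:

#include <stdio.h>

int main(void)
{
        int nr_vcpus = 8, source = 5, i;

        for (i = 1; i < nr_vcpus; i++) {
                int idx = (source + i) % nr_vcpus;
                printf("consider vcpu %d\n", idx);  /* prints 6 7 0 1 2 3 4 */
        }
        return 0;
}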

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Windows slow boot: contractor wanted

2012-08-25 Thread Rik van Riel

On 08/25/2012 01:45 PM, Richard Davies wrote:


Are you talking about these patches?

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=c67fe3752abe6ab47639e2f9b836900c3dc3da84
http://marc.info/?l=linux-mm&m=134521289221259

If so, I believe those are in 3.6.0-rc3, so I tested with that.

Unfortunately, I can still get the slow boots and perf top showing
_raw_spin_lock_irqsave.


Here are two perf top traces on 3.6.0-rc3. They do look a bit different from
3.5.2, but _raw_spin_lock_irqsave is still at the top:

PerfTop:   35272 irqs/sec  kernel:98.1%  exact:  0.0% [4000Hz cycles],  
(all, 16 CPUs)
--

 61.85%  [kernel]  [k] _raw_spin_lock_irqsave
  7.18%  [kernel]  [k] sub_preempt_count
  5.03%  [kernel]  [k] isolate_freepages_block
  2.49%  [kernel]  [k] yield_to
  2.05%  [kernel]  [k] memcmp
  2.01%  [kernel]  [k] compact_zone
  1.76%  [kernel]  [k] add_preempt_count
  1.52%  [kernel]  [k] _raw_spin_lock
  1.31%  [kernel]  [k] kvm_vcpu_on_spin
  0.92%  [kernel]  [k] svm_vcpu_run


However, the compaction code is not as prominent as before.

Can you get a backtrace to that _raw_spin_lock_irqsave, to see
from where it is running into lock contention?

It would be good to know whether it is isolate_freepages_block,
yield_to, kvm_vcpu_on_spin or something else...

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Windows slow boot: contractor wanted

2012-08-22 Thread Rik van Riel

On 08/22/2012 10:41 AM, Richard Davies wrote:

Avi Kivity wrote:

Richard Davies wrote:

I can trigger the slow boots without KSM and they have the same profile,
with _raw_spin_lock_irqsave and isolate_freepages_block at the top.

I reduced to 3x 20GB 8-core VMs on a 128GB host (rather than 3x 40GB 8-core
VMs), and haven't managed to get a really slow boot yet (5 minutes). I'll
post agan when I get one.


I think you can go higher than that.  But 120GB on a 128GB host is
pushing it.


I've now triggered a very slow boot at 3x 36GB 8-core VMs on a 128GB host
(i.e. 108GB on a 128GB host).

It has the same profile with _raw_spin_lock_irqsave and
isolate_freepages_block at the top.


That's the page compaction code.

Mel Gorman and I have been working to fix that,
the latest fixes and improvements are in the -mm
kernel already.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler

2012-07-22 Thread Rik van Riel

On 07/22/2012 08:34 AM, Raghavendra K T wrote:

On 07/20/2012 11:06 PM, Marcelo Tosatti wrote:

On Wed, Jul 18, 2012 at 07:07:17PM +0530, Raghavendra K T wrote:


Currently Pause Loop Exit (PLE) handler is doing directed yield to a
random vcpu on pl-exit. We already have filtering while choosing
the candidate to yield_to. This change adds more checks while choosing
a candidate to yield_to.

On a large vcpu guests, there is a high probability of
yielding to the same vcpu who had recently done a pause-loop exit.
Such a yield can lead to the vcpu spinning again.

The patchset keeps track of the pause loop exit and gives chance to a
vcpu which has:

(a) Not done pause loop exit at all (probably he is preempted
lock-holder)

(b) vcpu skipped in last iteration because it did pause loop exit, and
probably has become eligible now (next eligible lock holder)

This concept also helps in cpu relax interception cases which use
same handler.

Changes since V4:
- Naming Change (Avi):
struct ple == struct spin_loop
cpu_relax_intercepted == in_spin_loop
vcpu_check_and_update_eligible == vcpu_eligible_for_directed_yield
- mark vcpu in spinloop as not eligible to avoid influence of
previous exit

Changes since V3:
- arch specific fix/changes (Christian)

Changes since v2:
- Move ple structure to common code (Avi)
- rename pause_loop_exited to cpu_relax_intercepted (Avi)
- add config HAVE_KVM_CPU_RELAX_INTERCEPT (Avi)
- Drop superfluous curly braces (Ingo)

Changes since v1:
- Add more documentation for structure and algorithm and Rename
plo == ple (Rik).
- change dy_eligible initial value to false. (otherwise very first
directed
yield will not be skipped. (Nikunj)
- fixup signoff/from issue

Future enhancements:
(1) Currently we have a boolean to decide on eligibility of vcpu. It
would be nice if I get feedback on guest (32 vcpu) whether we can
improve better with integer counter. (with counter = say f(log n )).

(2) We have not considered system load during iteration of vcpu. With
that information we can limit the scan and also decide whether
schedule()
is better. [ I am able to use #kicked vcpus to decide on this But may
be there are better ideas like information from global loadavg.]

(3) We can exploit this further with PV patches since it also knows
about
next eligible lock-holder.

Summary: There is a very good improvement for kvm based guest on PLE
machine.
The V5 has huge improvement for kbench.

+---+---+---++---+
base_rik stdev patched stdev %improve
+---+---+---++---+
kernbench (time in sec lesser is better)
+---+---+---++---+
1x 49.2300 1.0171 22.6842 0.3073 117.0233 %
2x 91.9358 1.7768 53.9608 1.0154 70.37516 %
+---+---+---++---+

+---+---+---++---+
ebizzy (records/sec more is better)
+---+---+---++---+
1x 1129.2500 28.6793 2125.6250 32.8239 88.23334 %
2x 1892.3750 75.1112 2377.1250 181.6822 25.61596 %
+---+---+---++---+

Note: The patches are tested on x86.

Links
V4: https://lkml.org/lkml/2012/7/16/80
V3: https://lkml.org/lkml/2012/7/12/437
V2: https://lkml.org/lkml/2012/7/10/392
V1: https://lkml.org/lkml/2012/7/9/32

Raghavendra K T (3):
config: Add config to support ple or cpu relax optimzation
kvm : Note down when cpu relax intercepted or pause loop exited
kvm : Choose a better candidate for directed yield
---
arch/s390/kvm/Kconfig | 1 +
arch/x86/kvm/Kconfig | 1 +
include/linux/kvm_host.h | 39 +++
virt/kvm/Kconfig | 3 +++
virt/kvm/kvm_main.c | 41 +
5 files changed, 85 insertions(+), 0 deletions(-)


Reviewed-by: Marcelo Tosattimtosa...@redhat.com



Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
if this series looks good now.


The series looks good to me.

Reviewed-by: Rik van Riel r...@redhat.com

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V4 3/3] kvm: Choose better candidate for directed yield

2012-07-16 Thread Rik van Riel

On 07/16/2012 06:07 AM, Avi Kivity wrote:


+{
+   bool eligible;
+
+   eligible = !vcpu->ple.cpu_relax_intercepted ||
+   (vcpu->ple.cpu_relax_intercepted &&
+vcpu->ple.dy_eligible);
+
+   if (vcpu->ple.cpu_relax_intercepted)
+   vcpu->ple.dy_eligible = !vcpu->ple.dy_eligible;


Probably should assign 'true', since the previous value is essentially
random.


I suspect the intended purpose of this conditional is to
flip the eligibility of a vcpu for being selected as a
direct yield target.

In other words, that bit of the code is correct.

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V2 1/2] kvm vcpu: Note down pause loop exit

2012-07-10 Thread Rik van Riel

On 07/10/2012 03:31 PM, Raghavendra K T wrote:

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

Noting pause loop exited vcpu helps in filtering right candidate to yield.
Yielding to same vcpu may result in more wastage of cpu.

Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com


Reviewed-by: Rik van Riel r...@redhat.com

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler

2012-07-09 Thread Rik van Riel

On 07/09/2012 02:20 AM, Raghavendra K T wrote:

Currently Pause Loop Exit (PLE) handler is doing directed yield to a
random VCPU on PL exit. Though we already have filtering while choosing
the candidate to yield_to, we can do better.

Problem is, for large vcpu guests, we have more probability of yielding
to a bad vcpu. We are not able to prevent directed yield to same guy who
has done PL exit recently, who perhaps spins again and wastes CPU.

Fix that by keeping track of who has done PL exit. So The Algorithm in series
give chance to a VCPU which has:

  (a) Not done PLE exit at all (probably he is preempted lock-holder)

  (b) VCPU skipped in last iteration because it did PL exit, and probably
  has become eligible now (next eligible lock holder)

Future enhancemnets:


Your patch series looks good to me. Simple changes with a
significant result.

However, the simple heuristic could use some comments :)

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield

2012-07-09 Thread Rik van Riel

On 07/09/2012 02:20 AM, Raghavendra K T wrote:


+bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *vcpu)
+{
+   bool eligible;
+
+   eligible = !vcpu->arch.plo.pause_loop_exited ||
+   (vcpu->arch.plo.pause_loop_exited &&
+vcpu->arch.plo.dy_eligible);
+
+   if (vcpu->arch.plo.pause_loop_exited)
+   vcpu->arch.plo.dy_eligible = !vcpu->arch.plo.dy_eligible;
+
+   return eligible;
+}


This is a nice simple mechanism to skip CPUs that were
eligible last time and had pause loop exits recently.

However, it could stand some documentation.  Please
add a good comment explaining how and why the algorithm
works, when arch.plo.pause_loop_exited is cleared, etc...

It would be good to make this heuristic understandable
to people who look at the code for the first time.
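
A commented restatement of the heuristic, in the spirit of the
documentation requested above (a stand-alone sketch, not the kernel
function itself):

#include <stdbool.h>

struct plo { bool pause_loop_exited; bool dy_eligible; };

static bool check_and_update_eligible(struct plo *p)
{
        /* A vcpu that never pause-loop exited is always a good yield target.
         * One that did PL-exit is only picked every other attempt, because
         * dy_eligible flips each time we consider it. */
        bool eligible = !p->pause_loop_exited ||
                        (p->pause_loop_exited && p->dy_eligible);

        if (p->pause_loop_exited)
                p->dy_eligible = !p->dy_eligible;   /* flip for the next round */

        return eligible;
}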

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit

2012-07-09 Thread Rik van Riel

On 07/09/2012 02:20 AM, Raghavendra K T wrote:


@@ -484,6 +484,13 @@ struct kvm_vcpu_arch {
u64 length;
u64 status;
} osvw;
+
+   /* Pause loop exit optimization */
+   struct {
+   bool pause_loop_exited;
+   bool dy_eligible;
+   } plo;


I know kvm_vcpu_arch is traditionally not a well documented
structure, but it would be really nice if each variable inside
this sub-structure could get some documentation.

Also, do we really want to introduce another acronym here?

Or would we be better off simply calling this struct .ple,
since that is a name people are already familiar with.

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

2012-07-02 Thread Rik van Riel

On 06/28/2012 06:55 PM, Vinod, Chegu wrote:

Hello,

I am just catching up on this email thread...

Perhaps one of you may be able to help answer this query.. preferably along 
with some data.  [BTW, I do understand the basic intent behind PLE in a typical 
[sweet spot] use case where there is over subscription etc. and the need to 
optimize the PLE handler in the host etc. ]

In a use case where the host has fewer but much larger guests (say 40VCPUs and 
higher) and there is no over subscription (i.e. # of vcpus across guests <=
physical cpus in the host, and perhaps each guest has their vcpus pinned to
specific physical cpus for other reasons), I would like to understand if/how the
PLE really helps?  For these use cases would it be ok to turn PLE off (ple_gap=0)
since there is no real need to take an exit and find some other VCPU to yield to?


Yes, that should be ok.

On a related note, I wonder if we should increase the ple_gap
significantly.

After all, 4096 cycles of spinning is not that much, when you
consider how much time is spent doing the subsequent vmexit,
scanning the other VCPU's status (200 cycles per cache miss),
deciding what to do, maybe poking another CPU, and eventually
a vmenter.

A factor 4 increase in ple_gap might be what it takes to
get the amount of time spent spinning equal to the amount of
time spent on the host side doing KVM stuff...

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Add a page cache-backed balloon device driver.

2012-06-26 Thread Rik van Riel

On 06/26/2012 04:32 PM, Frank Swiderski wrote:

This implementation of a virtio balloon driver uses the page cache to
store pages that have been released to the host.  The communication
(outside of target counts) is one way--the guest notifies the host when
it adds a page to the page cache, allowing the host to madvise(2) with
MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
(via the regular page reclaim).  This means that inflating the balloon
is similar to the existing balloon mechanism, but the deflate is
different--it re-uses existing Linux kernel functionality to
automatically reclaim.

Signed-off-by: Frank Swiderski f...@google.com


It is a great idea, but how can this memory balancing
possibly work if someone uses memory cgroups inside a
guest?

Having said that, we currently do not have proper
memory reclaim balancing between cgroups at all, so
requiring that of this balloon driver would be
unreasonable.

The code looks good to me, my only worry is the
code duplication. We now have 5 balloon drivers,
for 4 hypervisors, all implementing everything
from scratch...
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Add a page cache-backed balloon device driver.

2012-06-26 Thread Rik van Riel

On 06/26/2012 05:31 PM, Frank Swiderski wrote:

On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel r...@redhat.com wrote:



The code looks good to me, my only worry is the
code duplication. We now have 5 balloon drivers,
for 4 hypervisors, all implementing everything
from scratch...


Do you have any recommendations on this?  I could (I think reasonably
so) modify the existing virtio_balloon.c and have it change behavior
based on a feature bit or other configuration.  I'm not sure that
really addresses the root of what you're pointing out--it's still
adding a different implementation, but doing so as an extension of an
existing one.


Ideally, I believe we would have two balloon
top parts in a guest (one classical balloon,
one on the LRU), and four bottom parts (kvm,
xen, vmware  s390).

That way the virt specific bits of a balloon
driver would be essentially a -balloon_page
and -release_page callback for pages, as well
as methods to communicate with the host.

All the management of pages, including stuff
like putting them on the LRU, or isolating
them for migration, would be done with the
same common code, regardless of what virt
software we are running on.
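
In other words, something along these lines (a sketch with hypothetical
names; struct page as in the kernel):

/* Common balloon core handles page lifetime (LRU, isolation, migration);
 * each hypervisor only supplies the hooks below. */
struct balloon_host_ops {
        int  (*balloon_page)(struct page *page);    /* tell host the page is gone */
        int  (*release_page)(struct page *page);    /* tell host the page is back */
        void (*update_target)(unsigned long pages); /* host<->guest target protocol */
};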

Of course, that is a substantial amount of
work and I feel it would be unreasonable to
block anyone's code on that kind of thing
(especially considering that your code is good),
but I do believe the explosion of balloon
code is a little worrying.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

2012-06-20 Thread Rik van Riel

On 06/20/2012 04:12 PM, Raghavendra K T wrote:

On 06/20/2012 02:21 AM, Rik van Riel wrote:



Please let me know how it goes.


Yes, have got result today, too tired to summarize. got better
performance result too. will come back again tomorrow morning.
have to post, randomized start point patch also, which I discussed to
know the opinion.


The other person's problem has also gone away with this
patch.

Avi, could I convince you to apply this obvious bugfix
to kvm.git? :)


8

If last_boosted_vcpu == 0, then we fall through all test cases and
may end up with all VCPUs pouncing on vcpu 0. With a large enough
guest, this can result in enormous runqueue lock contention, which
can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.




Signed-off-by: Rik van Riel r...@redhat.com
---
virt/kvm/kvm_main.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
*/
for (pass = 0; pass < 2 && !yielded; pass++) {
kvm_for_each_vcpu(i, vcpu, kvm) {
- if (!pass && i < last_boosted_vcpu) {
+ if (!pass && i >= last_boosted_vcpu) {
i = last_boosted_vcpu;
continue;
} else if (pass && i > last_boosted_vcpu)







--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: handle last_boosted_vcpu = 0 case

2012-06-19 Thread Rik van Riel
On Wed, 20 Jun 2012 01:50:50 +0530
Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:

 
 In ple handler code, last_boosted_vcpu (lbv) variable is
 serving as reference point to start when we enter.

 Also statistical analysis (below) is showing lbv is not very well
 distributed with current approach.

You are the second person to spot this bug today (yes, today).

Due to time zones, the first person has not had a chance yet to
test the patch below, which might fix the issue...

Please let me know how it goes.

8

If last_boosted_vcpu == 0, then we fall through all test cases and
may end up with all VCPUs pouncing on vcpu 0.  With a large enough
guest, this can result in enormous runqueue lock contention, which
can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.

Signed-off-by: Rik van Riel r...@redhat.com
---
 virt/kvm/kvm_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 */
for (pass = 0; pass < 2 && !yielded; pass++) {
kvm_for_each_vcpu(i, vcpu, kvm) {
-   if (!pass && i < last_boosted_vcpu) {
+   if (!pass && i >= last_boosted_vcpu) {
i = last_boosted_vcpu;
continue;
} else if (pass && i > last_boosted_vcpu)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: don't call mmu_shrinker w/o used_mmu_pages

2012-04-23 Thread Rik van Riel

On 04/23/2012 12:40 PM, Ying Han wrote:


Avi, does this patch help the case as you mentioned above, where kvm
module is loaded but no virtual machines are present ? Why we have to
walk the empty while holding the spinlock?


It might help marginally, but rather than defending
the patch, it might be more useful to try finding a
solution to the problem Mike and Eric are triggering.

How can we prevent the code from taking the lock and
throwing out EPT/NPT pages when there is no real need
to, and they will be faulted back in soon anyway?

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: don't call mmu_shrinker w/o used_mmu_pages

2012-04-20 Thread Rik van Riel

On 04/20/2012 06:11 PM, Andrew Morton wrote:

On Fri, 13 Apr 2012 15:38:41 -0700
Ying Han ying...@google.com wrote:


The mmu_shrink() is heavy by itself by iterating all kvms and holding
the kvm_lock. spotted the code w/ Rik during LSF, and it turns out we
don't need to call the shrinker if nothing to shrink.



@@ -3900,6 +3905,9 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
if (nr_to_scan == 0)
goto out;

+   if (!get_kvm_total_used_mmu_pages())
+   return 0;
+



Do we actually know that this patch helps anything?  Any measurements? Is
kvm_total_used_mmu_pages==0 at all common?



On re-reading mmu.c, it looks like even with EPT or NPT,
we end up creating mmu pages for the nested page tables.

I have not had the time to look into it more, but it would
be nice to know if the patch has any effect at all.

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-12 Thread Rik van Riel

On 04/11/2012 01:21 PM, Chegu Vinod wrote:


Hello,

While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
60-way KVM guest) I noticed pretty bad performance when the guest was booted
with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
(RHEL6.2) kernel.



For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
kernel was nearly 12x better !



Turned on function tracing and found that there appears to be more time being
spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
guest.


Looks like you may be running into the ticket spinlock
code. During the early RHEL 6 days, Gleb came up with a
patch to automatically disable ticket spinlocks when
running inside a KVM guest.

IIRC that patch got rejected upstream at the time,
with upstream developers preferring to wait for a
better solution.

If such a better solution is not on its way upstream
now (two years later), maybe we should just merge
Gleb's patch upstream for the time being?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 4/9] KVM-HV: KVM Steal time implementation

2011-07-06 Thread Rik van Riel

On 07/04/2011 11:32 AM, Glauber Costa wrote:

To implement steal time, we need the hypervisor to pass the guest
information about how much time was spent running other processes
outside the VM, while the vcpu had meaningful work to do - halt
time does not count.

This information is acquired through the run_delay field of
delayacct/schedstats infrastructure, that counts time spent in a
runqueue but not running.

Steal time is a per-cpu information, so the traditional MSR-based
infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the
memory area address containing information about steal time

This patch contains the hypervisor part of the steal time infrastructure,
and can be backported independently of the guest portion.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net


Acked-by: Rik van Riel r...@redhat.com
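
For readers unfamiliar with the mechanism, the per-cpu record described
in the changelog behaves roughly like the sketch below; field and
constant names here are illustrative, not the final ABI:

/* Host fills this from run_delay; guest reads it with a version check so
 * that a concurrent host update is simply retried. */
struct steal_time_rec {
        u64 steal;      /* ns spent runnable but not running */
        u32 version;    /* even = stable, odd = host is updating */
        u32 pad[12];
};

static u64 read_steal(struct steal_time_rec *st)
{
        u32 version;
        u64 steal;

        do {
                version = st->version;
                rmb();                  /* order version read before data read */
                steal = st->steal;
                rmb();                  /* order data read before the recheck */
        } while ((version & 1) || version != st->version);

        return steal;
}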


--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

