Re: [PATCH 1/3] context_tracking: remove duplicate enabled check
On 10/27/2015 09:39 PM, Paolo Bonzini wrote:
> All calls to context_tracking_enter and context_tracking_exit
> are already checking context_tracking_is_enabled, except the
> context_tracking_user_enter and context_tracking_user_exit
> functions left in for the benefit of assembly calls.
>
> Pull the check up to those functions, by making them simple
> wrappers around the user_enter and user_exit inline functions.
>
> Cc: Andy Lutomirski <l...@kernel.org>
> Cc: Frederic Weisbecker <fweis...@gmail.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Rik van Riel <r...@redhat.com>
Tested-by: Rik van Riel <r...@redhat.com>
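For reference, a minimal sketch of the shape the changelog describes; the enum name and exact bodies are assumed from the context tracking code of that era, not quoted from the patch itself:

	/* include/linux/context_tracking.h: the inline fast path does the check */
	static inline void user_enter(void)
	{
		if (context_tracking_is_enabled())
			context_tracking_enter(CONTEXT_USER);
	}

	/* kernel/context_tracking.c: out-of-line wrapper kept for asm callers */
	void context_tracking_user_enter(void)
	{
		user_enter();
	}

(context_tracking_user_exit wraps user_exit the same way, so C callers that already checked context_tracking_is_enabled never pay for the check twice.)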
Re: [PATCH 2/3] context_tracking: avoid irq_save/irq_restore on guest entry and exit
On 10/27/2015 09:39 PM, Paolo Bonzini wrote:
> guest_enter and guest_exit must be called with interrupts disabled,
> since they take the vtime_seqlock with write_seq{lock,unlock}.
> Therefore, it is not necessary to check for exceptions, nor to
> save/restore the IRQ state, when context tracking functions are
> called by guest_enter and guest_exit.
>
> Split the body of context_tracking_entry and context_tracking_exit
> out to __-prefixed functions, and use them from KVM.
>
> Rik van Riel has measured this to speed up a tight vmentry/vmexit
> loop by about 2%.
>
> Cc: Andy Lutomirski <l...@kernel.org>
> Cc: Frederic Weisbecker <fweis...@gmail.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Rik van Riel <r...@redhat.com>
Tested-by: Rik van Riel <r...@redhat.com>
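A sketch of the split being described: the public function keeps the exception and IRQ-state handling, while the __-prefixed body can be called with interrupts already disabled. The bodies below are assumed from the upstream code of that era rather than quoted from the patch:

	void __context_tracking_exit(enum ctx_state state);	/* does the real work */

	void context_tracking_exit(enum ctx_state state)
	{
		unsigned long flags;

		if (in_interrupt())
			return;

		local_irq_save(flags);
		__context_tracking_exit(state);
		local_irq_restore(flags);
	}

KVM's guest entry/exit paths, which already run with interrupts disabled, can then call __context_tracking_exit (and __context_tracking_enter) directly and skip the save/restore entirely.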
Re: [PATCH] context_tracking: avoid extra checks on guest_enter and guest_exit
On 04/29/2015 05:21 AM, Paolo Bonzini wrote:
> guest_enter and guest_exit must be called with interrupts disabled,
> since they take the vtime_seqlock with write_seq{lock,unlock}.
> Therefore, it is not necessary to check for exceptions, nor to
> save/restore the IRQ state, when context tracking functions are
> called by guest_enter and guest_exit.
>
> Split the body of context_tracking_entry and context_tracking_exit
> out to __-prefixed functions, and use them from KVM.
>
> Rik van Riel has measured this to speed up a tight vmentry/vmexit
> loop by about 2%.
>
> Cc: Frederic Weisbecker <fweis...@gmail.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Rik van Riel <r...@redhat.com>
Tested-by: Rik van Riel <r...@redhat.com>

... now time to figure out the same thing for user_enter and/or user_exit; let's see which of those we can call with irqs already disabled :)
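On the KVM side, the result looks roughly like this — a sketch based on the upstream outcome of this patch; the state name is assumed. guest_enter runs with interrupts disabled, so it can call the __-prefixed variant directly:

	static inline void guest_enter(void)
	{
		if (vtime_accounting_enabled())
			vtime_guest_enter(current);
		else
			current->flags |= PF_VCPU;

		if (context_tracking_is_enabled())
			__context_tracking_enter(CONTEXT_GUEST);
	}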
Re: [PATCH] context_tracking: remove duplicate enabled check
On 04/28/2015 07:36 AM, Paolo Bonzini wrote:
> All calls to context_tracking_enter and context_tracking_exit
> are already checking context_tracking_is_enabled, except the
> context_tracking_user_enter and context_tracking_user_exit
> functions left in for the benefit of assembly calls.
>
> Pull the check up to those functions, by making them simple
> wrappers around the user_enter and user_exit inline functions.

With this and your other (not yet posted) patch, run time for the kvm-unit-tests vmexit.flat vmcall test with 10 million iterations drops from 17.8 to 17.5 seconds, which is about a 1.7% speedup.
Re: [PATCH] context_tracking: remove duplicate enabled check
On 04/28/2015 07:36 AM, Paolo Bonzini wrote:
> All calls to context_tracking_enter and context_tracking_exit
> are already checking context_tracking_is_enabled, except the
> context_tracking_user_enter and context_tracking_user_exit
> functions left in for the benefit of assembly calls.
>
> Pull the check up to those functions, by making them simple
> wrappers around the user_enter and user_exit inline functions.
>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Rik van Riel <r...@redhat.com>
Re: x86: kvmclock: drop rdtsc_barrier()
On 04/24/2015 09:36 PM, Marcelo Tosatti wrote:
> Drop the unnecessary rdtsc_barrier(), as it has been determined
> empirically to be unneeded; see commit
> 057e6a8c660e95c3f4e7162e00e2fee1fc90c50d for details.
>
> Noticed by Andy Lutomirski.
>
> Improves clock_gettime() by approximately 15% on an
> Intel i7-3520M @ 2.90GHz.
>
> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

Acked-by: Rik van Riel <r...@redhat.com>
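For context, the kvmclock read path relies on the pvclock version-counter protocol, which is what makes an extra serializing barrier around the TSC read redundant. A simplified sketch of that read loop follows; the field and helper names are taken from the arch/x86 pvclock code as I recall it, so treat this as an assumption rather than a quote from the patch:

	do {
		version = src->version;
		rmb();
		tsc = native_read_tsc();	/* previously wrapped in rdtsc_barrier() */
		delta = tsc - src->tsc_timestamp;
		ns = src->system_time +
		     pvclock_scale_delta(delta, src->tsc_to_system_mul,
					 src->tsc_shift);
		rmb();
	} while ((src->version & 1) || version != src->version);

An odd version means the hypervisor is mid-update, and any concurrent update changes the version, so a torn read is always retried.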
[PATCH] kvm,x86: load guest FPU context more eagerly
Currently KVM will clear the FPU bits in CR0.TS in the VMCS, and trap to re-load them every time the guest accesses the FPU after a switch back into the guest from the host.

This patch copies the x86 task switch semantics for FPU loading, with the FPU loaded eagerly after first use if the system uses eager fpu mode, or if the guest uses the FPU frequently.

In the latter case, after loading the FPU for 255 times, the fpu_counter will roll over, and we will revert to loading the FPU on demand, until it has been established that the guest is still actively using the FPU.

This mirrors the x86 task switch policy, which seems to work.

Signed-off-by: Rik van Riel <r...@redhat.com>
---
I still hope to put the larger FPU changes in at some point, but with all the current changes to the FPU code I am somewhat uncomfortable causing even more churn. After 4.1 I may send in the changes to defer loading of user space FPU context to do_notify_resume() - unless people want them sooner.

 arch/x86/kvm/x86.c       | 15 +--
 include/linux/kvm_host.h |  1 +
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e1a81267f3f6..2cdb2472a633 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7031,14 +7031,25 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 {
 	kvm_put_guest_xcr0(vcpu);

-	if (!vcpu->guest_fpu_loaded)
+	if (!vcpu->guest_fpu_loaded) {
+		vcpu->fpu_counter = 0;
 		return;
+	}

 	vcpu->guest_fpu_loaded = 0;
 	fpu_save_init(&vcpu->arch.guest_fpu);
 	__kernel_fpu_end();
 	++vcpu->stat.fpu_reload;
-	kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
+	/*
+	 * If using eager FPU mode, or if the guest is a frequent user
+	 * of the FPU, just leave the FPU active for next time.
+	 * Every 255 times fpu_counter rolls over to 0; a guest that uses
+	 * the FPU in bursts will revert to loading it on demand.
+	 */
+	if (!use_eager_fpu()) {
+		if (++vcpu->fpu_counter < 5)
+			kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
+	}
 	trace_kvm_fpu(0);
 }

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ad45054309a0..f197ad3f6316 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -230,6 +230,7 @@ struct kvm_vcpu {
 	int fpu_active;
 	int guest_fpu_loaded, guest_xcr0_loaded;
+	unsigned char fpu_counter;
 	wait_queue_head_t wq;
 	struct pid *pid;
 	int sigset_active;
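For reference, the request set above is what re-arms the CR0.TS trap; in the x86.c of this era, vcpu_enter_guest consumes it roughly like this (reconstructed from memory of the contemporary code, so treat it as a sketch):

	if (kvm_check_request(KVM_REQ_DEACTIVATE_FPU, vcpu)) {
		vcpu->fpu_active = 0;
		kvm_x86_ops->fpu_deactivate(vcpu);
	}

Skipping the request therefore leaves fpu_active set, so the next entry reloads the guest FPU eagerly instead of waiting for a #NM trap.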
Re: [v6] kvm/fpu: Enable fully eager restore kvm FPU
On 04/23/2015 06:57 PM, Wanpeng Li wrote:
> Cc Rik, who is doing the similar work. :)

Hi Liang,

I posted this patch earlier, which should have the same effect as your patch on more modern systems, while not loading the FPU context for guests that barely use it on older systems:

https://lkml.org/lkml/2015/4/23/349

I have to admit the diffstat on your patch looks very nice, but it might be good to know what impact it has on older systems...

> On Fri, Apr 24, 2015 at 05:13:03AM +0800, Liang Li wrote:
>> Remove lazy FPU logic and use eager FPU entirely. Eager FPU does
>> not have performance regression, and it can simplify the code.
>> When compiling kernel on westmere, the performance of eager FPU
>> is about 0.4% faster than lazy FPU.
>>
>> Signed-off-by: Liang Li <liang.z...@intel.com>
>> Signed-off-by: Xudong Hao <xudong@intel.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |  1 -
>>  arch/x86/kvm/svm.c              | 22 ++--
>>  arch/x86/kvm/vmx.c              | 74 +++--
>>  arch/x86/kvm/x86.c              |  8 +
>>  include/linux/kvm_host.h        |  2 --
>>  5 files changed, 9 insertions(+), 98 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index dea2e7e..5d84cc9 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -743,7 +743,6 @@ struct kvm_x86_ops {
>>  	void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
>>  	unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
>>  	void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
>> -	void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
>>
>>  	void (*tlb_flush)(struct kvm_vcpu *vcpu);
>>
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index ce741b8..1b3b29b 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -1087,7 +1087,6 @@ static void init_vmcb(struct vcpu_svm *svm)
>>  	struct vmcb_control_area *control = &svm->vmcb->control;
>>  	struct vmcb_save_area *save = &svm->vmcb->save;
>>
>> -	svm->vcpu.fpu_active = 1;
>>  	svm->vcpu.arch.hflags = 0;
>>
>>  	set_cr_intercept(svm, INTERCEPT_CR0_READ);
>> @@ -1529,15 +1528,12 @@ static void update_cr0_intercept(struct vcpu_svm *svm)
>>  	ulong gcr0 = svm->vcpu.arch.cr0;
>>  	u64 *hcr0 = &svm->vmcb->save.cr0;
>>
>> -	if (!svm->vcpu.fpu_active)
>> -		*hcr0 |= SVM_CR0_SELECTIVE_MASK;
>> -	else
>> -		*hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
>> -			| (gcr0 & SVM_CR0_SELECTIVE_MASK);
>> +	*hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
>> +		| (gcr0 & SVM_CR0_SELECTIVE_MASK);
>>
>>  	mark_dirty(svm->vmcb, VMCB_CR);
>>
>> -	if (gcr0 == *hcr0 && svm->vcpu.fpu_active) {
>> +	if (gcr0 == *hcr0) {
>>  		clr_cr_intercept(svm, INTERCEPT_CR0_READ);
>>  		clr_cr_intercept(svm, INTERCEPT_CR0_WRITE);
>>  	} else {
>> @@ -1568,8 +1564,6 @@ static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
>>  	if (!npt_enabled)
>>  		cr0 |= X86_CR0_PG | X86_CR0_WP;
>> -	if (!vcpu->fpu_active)
>> -		cr0 |= X86_CR0_TS;
>>  	/*
>>  	 * re-enable caching here because the QEMU bios
>>  	 * does not do it - this results in some delay at
>> @@ -1795,7 +1789,6 @@ static void svm_fpu_activate(struct kvm_vcpu *vcpu)
>>  	clr_exception_intercept(svm, NM_VECTOR);
>> -	svm->vcpu.fpu_active = 1;
>>  	update_cr0_intercept(svm);
>>  }
>> @@ -4139,14 +4132,6 @@ static bool svm_has_wbinvd_exit(void)
>>  	return true;
>>  }
>>
>> -static void svm_fpu_deactivate(struct kvm_vcpu *vcpu)
>> -{
>> -	struct vcpu_svm *svm = to_svm(vcpu);
>> -
>> -	set_exception_intercept(svm, NM_VECTOR);
>> -	update_cr0_intercept(svm);
>> -}
>> -
>>  #define PRE_EX(exit) { .exit_code = (exit), \
>>  			.stage = X86_ICPT_PRE_EXCEPT, }
>>  #define POST_EX(exit) { .exit_code = (exit), \
>> @@ -4381,7 +4366,6 @@ static struct kvm_x86_ops svm_x86_ops = {
>>  	.cache_reg = svm_cache_reg,
>>  	.get_rflags = svm_get_rflags,
>>  	.set_rflags = svm_set_rflags,
>> -	.fpu_deactivate = svm_fpu_deactivate,
>>
>>  	.tlb_flush = svm_flush_tlb,
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index f5e8dce..811a666 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -1567,7 +1567,7 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
>>  	u32 eb;
>>
>>  	eb = (1u << PF_VECTOR) | (1u << UD_VECTOR) | (1u << MC_VECTOR) |
>> -	     (1u << NM_VECTOR) | (1u << DB_VECTOR);
>> +	     (1u << DB_VECTOR);
>>  	if ((vcpu->guest_debug &
>>  	     (KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP)) ==
>>  	    (KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP))
>> @@ -1576,8 +1576,6 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
>>  		eb = ~0;
>>  	if (enable_ept)
>>  		eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */
>> -	if (vcpu->fpu_active)
>> -		eb &= ~(1u << NM_VECTOR);
>>
>>  	/* When we are running a nested L2 guest and L1 specified for it a
>>  	 * certain exception bitmap, we must trap the same exceptions and pass
>> @@ -1961,9 +1959,6 @@ static void
Re: [PATCH v15 16/16] unfair qspinlock: a queue based unfair lock
On 04/09/2015 10:13 AM, Peter Zijlstra wrote:
> On Thu, Apr 09, 2015 at 09:16:24AM -0400, Rik van Riel wrote:
>> On 04/09/2015 03:01 AM, Peter Zijlstra wrote:
>>> On Wed, Apr 08, 2015 at 02:32:19PM -0400, Waiman Long wrote:
>>>> For a virtual guest with the qspinlock patch, a simple unfair byte
>>>> lock will be used if PV spinlock is not configured in or the
>>>> hypervisor isn't either KVM or Xen. The byte lock works fine with
>>>> small guest of just a few vCPUs. On a much larger guest, however,
>>>> byte lock can have serious performance problem.
>>> Who cares?
>> There are some people out there running guests with dozens of
>> vCPUs. If the code exists to make those setups run better, is there
>> a good reason not to use it?
> Well use paravirt, !paravirt stuff sucks performance wise anyhow.
>
> The question really is: is the added complexity worth the maintenance
> burden. And I'm just not convinced !paravirt virt is a performance
> critical target.

Fair enough.
Re: [PATCH v15 16/16] unfair qspinlock: a queue based unfair lock
On 04/09/2015 03:01 AM, Peter Zijlstra wrote:
> On Wed, Apr 08, 2015 at 02:32:19PM -0400, Waiman Long wrote:
>> For a virtual guest with the qspinlock patch, a simple unfair byte
>> lock will be used if PV spinlock is not configured in or the
>> hypervisor isn't either KVM or Xen. The byte lock works fine with
>> small guest of just a few vCPUs. On a much larger guest, however,
>> byte lock can have serious performance problem.
> Who cares?

There are some people out there running guests with dozens of vCPUs. If the code exists to make those setups run better, is there a good reason not to use it?

Having said that, only KVM and Xen seem to support very large guests, and PV spinlock is available there. I believe both VMware and Hyper-V have a 32 VCPU limit, anyway.
Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL
On 02/12/2015 12:00 PM, Frederic Weisbecker wrote:
> On Thu, Feb 12, 2015 at 10:47:10AM -0500, Rik van Riel wrote:
>> On 02/12/2015 10:42 AM, Frederic Weisbecker wrote:
>>> On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote:
>>>> If exception_enter happens when already in IN_KERNEL state, the
>>>> code still calls context_tracking_exit, which ends up in
>>>> rcu_eqs_exit_common, which explodes with a WARN_ON when it is
>>>> called in a situation where dynticks are not enabled.
>>> Fortunately context_tracking_exit() already has a current_state ==
>>> IN_KERNEL check so this shouldn't be a problem.
>> No, it had a hard-coded current_state == IN_USER check, which is
>> very close, but ...
>>
>> ... I replaced that with a state argument, and forgot to ensure
>> that it never gets called with state == IN_KERNEL. This patch
>> fixes that.
> Ah that's right! Well I'm going to merge this patch to 1/5 then to
> avoid breaking bisection.

Thank you, Frederic!
Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL
On 02/12/2015 10:42 AM, Frederic Weisbecker wrote:
> On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote:
>> If exception_enter happens when already in IN_KERNEL state, the
>> code still calls context_tracking_exit, which ends up in
>> rcu_eqs_exit_common, which explodes with a WARN_ON when it is
>> called in a situation where dynticks are not enabled.
> Fortunately context_tracking_exit() already has a current_state ==
> IN_KERNEL check so this shouldn't be a problem.

No, it had a hard-coded current_state == IN_USER check, which is very close, but ...

... I replaced that with a state argument, and forgot to ensure that it never gets called with state == IN_KERNEL. This patch fixes that.

> Meanwhile I'll still take the patch, it's better to handle that from
> the caller. Thanks.
[PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL
If exception_enter happens when already in IN_KERNEL state, the code still calls context_tracking_exit, which ends up in rcu_eqs_exit_common, which explodes with a WARN_ON when it is called in a situation where dynticks are not enabled.

This can be avoided by having exception_enter only switch to IN_KERNEL state if the current state is not already IN_KERNEL.

Signed-off-by: Rik van Riel <r...@redhat.com>
Reported-by: Luiz Capitulino <lcapitul...@redhat.com>
---
Frederic, you will want this bonus patch, too :)

Thanks to Luiz for finding this one. Whatever I was running did not trigger this issue...

 include/linux/context_tracking.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b65fd1420e53..9da230406e8c 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -37,7 +37,8 @@ static inline enum ctx_state exception_enter(void)
 		return 0;

 	prev_ctx = this_cpu_read(context_tracking.state);
-	context_tracking_exit(prev_ctx);
+	if (prev_ctx != IN_KERNEL)
+		context_tracking_exit(prev_ctx);

 	return prev_ctx;
 }
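With the hunk applied, the whole inline reads as follows; the guard at the top is reconstructed from the hunk's surrounding context in this series, so treat it as a sketch rather than a quote:

	static inline enum ctx_state exception_enter(void)
	{
		enum ctx_state prev_ctx;

		if (!context_tracking_is_enabled())
			return 0;

		prev_ctx = this_cpu_read(context_tracking.state);
		if (prev_ctx != IN_KERNEL)
			context_tracking_exit(prev_ctx);

		return prev_ctx;
	}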
Re: [PATCH 1/6] rcu,nohz: add context_tracking_user_enter/exit wrapper functions
On 02/10/2015 10:28 AM, Frederic Weisbecker wrote:
> On Tue, Feb 10, 2015 at 09:41:45AM -0500, r...@redhat.com wrote:
>> From: Rik van Riel <r...@redhat.com>
>>
>> These wrapper functions allow architecture code (eg. ARM) to keep
>> calling context_tracking_user_enter & context_tracking_user_exit
>> the same way it always has, without error prone tricks like
>> duplicate defines of argument values in assembly code.
>>
>> Signed-off-by: Rik van Riel <r...@redhat.com>
> This patch alone doesn't make much sense.

Agreed, my goal was to keep things super easy to review, to reduce the chance of introducing bugs.

> The changelog says it's about keeping context_tracking_user_*()
> functions as wrappers but fails to explain to what they wrap, why
> and what are the new context_tracking_enter/exit functions for.
>
> Perhaps patches 1 and 2 should be merged together into something like:
>
>     context_tracking: Generalize context tracking APIs to
>     support user and guest
>
>     Do that because we'll also track guest... etc...
>
> And keep the old user context tracking APIs for now to avoid painful
> enum parameter support in ARM assembly.

Can do...

Paul, would you like me to resend the whole series, or just a merged patch that replaces patches 1 & 2? That is assuming Paul prefers having the patches merged into one :)
Re: [PATCH 6/6] kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest
On 02/10/2015 02:59 PM, Andy Lutomirski wrote:
> On 02/10/2015 06:41 AM, r...@redhat.com wrote:
>> From: Rik van Riel <r...@redhat.com>
>>
>> The host kernel is not doing anything while the CPU is executing
>> a KVM guest VCPU, so it can be marked as being in an extended
>> quiescent state, identical to that used when running user space
>> code.
>>
>> The only exception to that rule is when the host handles an
>> interrupt, which is already handled by the irq code, which calls
>> rcu_irq_enter and rcu_irq_exit.
>>
>> The guest_enter and guest_exit functions already switch vtime
>> accounting independent of context tracking. Leave those calls
>> where they are, instead of moving them into the context tracking
>> code.
>>
>> Signed-off-by: Rik van Riel <r...@redhat.com>
>> ---
>>  include/linux/context_tracking.h       | 6 ++
>>  include/linux/context_tracking_state.h | 1 +
>>  include/linux/kvm_host.h               | 3 ++-
>>  3 files changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>> index 954253283709..b65fd1420e53 100644
>> --- a/include/linux/context_tracking.h
>> +++ b/include/linux/context_tracking.h
>> @@ -80,10 +80,16 @@ static inline void guest_enter(void)
>>  		vtime_guest_enter(current);
>>  	else
>>  		current->flags |= PF_VCPU;
>> +
>> +	if (context_tracking_is_enabled())
>> +		context_tracking_enter(IN_GUEST);
>
> Why the if statement? Also, have you checked how much this hurts
> guest lightweight entry/exit latency? Context tracking is shockingly
> expensive for reasons I don't fully understand, but hopefully most
> of it is the vtime stuff.

Guest_enter and guest_exit already do the vtime stuff today. This patch series adds the rcu stuff, and modifies context_tracking_enter & context_tracking_exit to not do the vtime stuff twice.

> (Context tracking is *so* expensive that I almost think we should
> set the performance taint flag if we enable it, assuming that flag
> ended up getting merged. Also, we should make context tracking
> faster.)

I am all for making it faster :)
Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit
On 02/07/2015 03:06 PM, Paul E. McKenney wrote:
> On Sat, Feb 07, 2015 at 09:30:41AM +0100, Frederic Weisbecker wrote:
>> On Fri, Feb 06, 2015 at 11:14:53PM -0800, Paul E. McKenney wrote:
>>> On Fri, Feb 06, 2015 at 10:34:21PM -0800, Paul E. McKenney wrote:
>>>> On Fri, Feb 06, 2015 at 10:53:34PM -0500, Rik van Riel wrote:
>>>>> On 02/06/2015 06:15 PM, Frederic Weisbecker wrote:
>>>>>> Just a few things then:
>>>>>>
>>>>>> 1) In this case rename context_tracking_user_enter/exit() to
>>>>>> context_tracking_enter() and context_tracking_exit(), since it's
>>>>>> not anymore about user only but about any generic context.
>>>>>>
>>>>>> 2) We have the WARN_ON_ONCE(!current->mm); condition that is a
>>>>>> debug check specific to userspace transitions because kernel
>>>>>> threads aren't expected to resume to userspace. Can we also
>>>>>> expect that we never switch to/from guest from a kernel thread?
>>>>>> AFAICS this happens from an ioctl (thus user task) in x86 for
>>>>>> kvm. But I only know this case.
>>>>>>
>>>>>> 3) You might want to update a few comments that assume we only
>>>>>> deal with userspace transitions.
>>>>>>
>>>>>> 4) trace_user_enter/exit() should stay user-transitions specific.
>>>>> Paul, would you like me to send follow-up patches with the
>>>>> cleanups suggested by Frederic, or would you prefer me to send a
>>>>> new series with the cleanups integrated?
>>>> I would prefer a new series, in order to prevent possible future
>>>> confusion. Of course, if Frederic would rather push them himself,
>>>> I am fine with that.
>>> And in that case, you should ask him for his preferences, which
>>> just might differ from mine. ;-)
>> I prefer a new series too. Now whether you or me take the patches,
>> I don't mind either way :-)
>>
>> Also I wonder how this feature is going to be enabled. Will it be
>> enabled on full dynticks or should it be a separate feature
>> depending on full dynticks? Or even just CONFIG_RCU_USER_EQS?
>> Because I'm still unclear about how and what this is used, if it
>> involves full dynticks or only RCU extended quiescent states.
> Well, we certainly need it documented. And validation considerations
> would push for keeping the number of possible combinations low, while
> paranoia about the added feature would push for having it be
> separately enabled. And if distros are going to enable this at build
> time, we either need -serious- validation or a way to disable at boot
> time.
>
> On the desired/required combinations of features, let's see... If I
> understand this completely, which I probably don't, we have the
> following considerations:
>
> o	NO_HZ_FULL: Needed to get rid of the scheduling-clock interrupt
> 	during guest execution, though I am not sure whether we really
> 	have that completely wired up with this patch set. Regardless,
> 	Rik, for your use case, do you care about whether or not the
> 	guest gets interrupted by the host's scheduling-clock
> 	interrupts? (Based on discussion in this thread, my guess is
> 	yes.)
>
> o	RCU_NOCB_CPUS: Implied by NO_HZ_FULL, but only on CPUs actually
> 	enabled for NO_HZ_FULL operation, either by NO_HZ_FULL_ALL at
> 	build time or by nohz_full= at boot time. Needed to avoid
> 	interrupting the guest with host RCU callback invocation. Rik,
> 	does your use case care about guests being interrupted by RCU
> 	callback invocation? (Based on discussion in this thread, my
> 	guess is yes.)
>
> o	RCU_USER_EQS: Implied by NO_HZ_FULL, and I would have to go look
> 	to see what relation this has to nohz_full=. Needed for RCU to
> 	be able to recognize userspace-execution quiescent states on a
> 	given CPU without disturbing that CPU. Unless I am missing
> 	something subtle, you have to have this for this patch series
> 	to make sense.
>
> If my guesses are correct, the best approach would be to have this
> new mode of operation implied by NO_HZ_FULL.

I agree. It makes sense to have all three, and all three are enabled in the configuration we use. I cannot think of a case where someone would significantly benefit from just one or two of the above, except maybe for debugging reasons.

Having NO_HZ_FULL enable all the above, either through a boot time commandline option, or just by default, would make sense.
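As a concrete example of the combination discussed above: on a host kernel built with CONFIG_NO_HZ_FULL (which selects the RCU user extended quiescent state support), the CPUs that will run vCPU threads are typically isolated on the kernel command line, e.g.:

	nohz_full=2-7 rcu_nocbs=2-7

The CPU ranges here are illustrative only; rcu_nocbs= is implied by nohz_full= on kernels of this era, but is often spelled out explicitly anyway.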
Re: [PATCH -v3 0/6] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest
On 02/09/2015 08:15 PM, Will Deacon wrote:
> Hi Rik,
>
> On Mon, Feb 09, 2015 at 04:04:38PM +0000, r...@redhat.com wrote:
>> Apologies to Catalin and Will for not fixing up ARM. I am not
>> familiar with ARM assembly, and not sure how to pass a constant
>> argument to a function from assembly code on ARM :)
> It's a bit of a faff getting enum values into asm -- we actually
> have to duplicate the definitions using #defines to get at the
> constants.
>
> Perhaps it would be cleaner to leave context_tracking_user_{enter,exit}
> intact as C wrappers around context_tracking_{enter,exit} passing the
> appropriate constant? That way we don't actually need to change the
> arch code at all.

If Paul and Frederic have no objections, I would be happy to do that.

Paul, Frederic?
Re: [PATCH -v3 0/6] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest
On 02/09/2015 10:01 PM, Paul E. McKenney wrote:
> On Tue, Feb 10, 2015 at 02:44:17AM +0100, Frederic Weisbecker wrote:
>> On Mon, Feb 09, 2015 at 08:22:59PM -0500, Rik van Riel wrote:
>>> On 02/09/2015 08:15 PM, Will Deacon wrote:
>>>> Hi Rik,
>>>>
>>>> On Mon, Feb 09, 2015 at 04:04:38PM +0000, r...@redhat.com wrote:
>>>>> Apologies to Catalin and Will for not fixing up ARM. I am not
>>>>> familiar with ARM assembly, and not sure how to pass a constant
>>>>> argument to a function from assembly code on ARM :)
>>>> It's a bit of a faff getting enum values into asm -- we actually
>>>> have to duplicate the definitions using #defines to get at the
>>>> constants.
>>>>
>>>> Perhaps it would be cleaner to leave
>>>> context_tracking_user_{enter,exit} intact as C wrappers around
>>>> context_tracking_{enter,exit} passing the appropriate constant?
>>>> That way we don't actually need to change the arch code at all.
>>> If Paul and Frederic have no objections, I would be happy to do
>>> that. Paul, Frederic?
>> Sure, that's fine by me.
> And if it is fine by Frederic, it is fine by me!

I'll send a new series tomorrow that addresses Will's concern, as well as Paolo's latest suggestions.
Re: [PATCH 4/6] nohz,kvm: export context_tracking_user_enter/exit
On 02/09/2015 12:05 PM, Paolo Bonzini wrote:
> On 09/02/2015 17:04, r...@redhat.com wrote:
>> From: Rik van Riel <r...@redhat.com>
>>
>> Export context_tracking_user_enter/exit so it can be used by KVM.
> Wrong function name in the commit message...

Oops, indeed.

Paul, could I convince you to fix up the changelog and reorder patches 5 & 6, if you choose to apply these patches? :)
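The patch body itself is essentially just the export statements in kernel/context_tracking.c, with the function names following the rename earlier in the series (assumed, since the diff is not quoted in this reply):

	EXPORT_SYMBOL_GPL(context_tracking_enter);
	EXPORT_SYMBOL_GPL(context_tracking_exit);

Exporting them is what allows kvm.ko, which can be built as a module, to call into the context tracking code.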
Re: [PATCH 1/2] KVM: x86: extract guest running logic from __vcpu_run
On 02/06/2015 07:16 AM, Paolo Bonzini wrote:
> Rename the old __vcpu_run to vcpu_run, and extract its main logic
> to a new function. The next patch will add a new early-exit
> condition to __vcpu_run, avoid extra indentation.
>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Rik van Riel <r...@redhat.com>
Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit
On 02/07/2015 03:30 AM, Frederic Weisbecker wrote:
> I prefer a new series too. Now whether you or me take the patches,
> I don't mind either way :-)

I'll make it, no problem.

> Also I wonder how this feature is going to be enabled. Will it be
> enabled on full dynticks or should it be a separate feature
> depending on full dynticks? Or even just CONFIG_RCU_USER_EQS?
> Because I'm still unclear about how and what this is used, if it
> involves full dynticks or only RCU extended quiescent states.

It involves full dynticks and CONFIG_RCU_USER_EQS.
Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit
On 02/06/2015 01:23 PM, Frederic Weisbecker wrote:
> On Fri, Feb 06, 2015 at 01:20:21PM -0500, Rik van Riel wrote:
>> On 02/06/2015 12:22 PM, Frederic Weisbecker wrote:
>>> On Thu, Feb 05, 2015 at 03:23:48PM -0500, r...@redhat.com wrote:
>>>> From: Rik van Riel <r...@redhat.com>
>>>>
>>>> Add the expected ctx_state as a parameter to
>>>> context_tracking_user_enter and context_tracking_user_exit,
>>>> allowing the same functions to not just track kernel <-> user
>>>> space switching, but also kernel <-> guest transitions.
>>>>
>>>> Signed-off-by: Rik van Riel <r...@redhat.com>
>>> You should consider using guest_enter() and guest_exit() instead.
>>> These are context tracking APIs too but specifically for guest.
>> What do you mean instead? KVM already uses those. I just wanted
>> to avoid duplicating the code...
> I mean you can call rcu_user APIs directly from guest_enter/exit.
> You don't really need to call the context_tracking_user functions
> since guest_enter/guest_exit already handle the vtime accounting.

I would still have to modify exception_enter and exception_exit, and with them context_tracking_user_enter and context_tracking_user_exit. We have to re-enable RCU when an exception happens.

I suspect exceptions in a guest just trigger a VMEXIT, and we figure out later why the exception happened. However, if we were to get an exception during the code where we transition into or out of guest mode, we would still need exception_enter and exception_exit...
Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit
On 02/06/2015 06:15 PM, Frederic Weisbecker wrote:
> Just a few things then:
>
> 1) In this case rename context_tracking_user_enter/exit() to
> context_tracking_enter() and context_tracking_exit(), since it's not
> anymore about user only but about any generic context.
>
> 2) We have the WARN_ON_ONCE(!current->mm); condition that is a debug
> check specific to userspace transitions because kernel threads aren't
> expected to resume to userspace. Can we also expect that we never
> switch to/from guest from a kernel thread? AFAICS this happens from
> an ioctl (thus user task) in x86 for kvm. But I only know this case.
>
> 3) You might want to update a few comments that assume we only deal
> with userspace transitions.
>
> 4) trace_user_enter/exit() should stay user-transitions specific.

Paul, would you like me to send follow-up patches with the cleanups suggested by Frederic, or would you prefer me to send a new series with the cleanups integrated?

Frederic, I will also add the cleanup you suggested for patch 4/5.
Re: [PATCH v2 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest
On 02/06/2015 08:46 AM, Frederic Weisbecker wrote:
> On Thu, Feb 05, 2015 at 03:23:47PM -0500, r...@redhat.com wrote:
>> When running a KVM guest on a system with NOHZ_FULL enabled
> I just need to clarify the motivation first, does the above situation
> really happen?
>
> Ok some distros enable NOHZ_FULL to let the user stop the tick in
> userspace. So most of the time, CONFIG_NOHZ_FULL=y but nohz full is
> runtime disabled (we need to pass a nohz_full= boot parameter to
> enable it). And when it is runtime disabled, there should be no rcu
> nocb CPU. (Although not setting CPUs in nocb mode when nohz full is
> runtime disabled is perhaps a recent change.)
>
> So for the problem to arise, one need to enable nohz_full and run KVM
> guest. And I never heard about such workloads. That said it's
> potentially interesting to turn off the tick on the host when the
> guest runs.
>
>> , and the KVM guest running with idle=poll mode, we still get
>> wakeups of the rcuos/N threads.
> So we need nohz_full on the host and idle=poll mode on the guest.
> Is it likely to happen? (sorry, again I'm just trying to make sure
> we agree on why we do this change).

We have users interested in doing just that, in order to run KVM guests with the least amount of perturbation to the guest.
Re: [PATCH 1/5] rcu,nohz: add state parameter to context_tracking_user_enter/exit
On 02/06/2015 12:22 PM, Frederic Weisbecker wrote:
> On Thu, Feb 05, 2015 at 03:23:48PM -0500, r...@redhat.com wrote:
>> From: Rik van Riel <r...@redhat.com>
>>
>> Add the expected ctx_state as a parameter to
>> context_tracking_user_enter and context_tracking_user_exit,
>> allowing the same functions to not just track kernel <-> user
>> space switching, but also kernel <-> guest transitions.
>>
>> Signed-off-by: Rik van Riel <r...@redhat.com>
> You should consider using guest_enter() and guest_exit() instead.
> These are context tracking APIs too but specifically for guest.

What do you mean instead? KVM already uses those.

I just wanted to avoid duplicating the code...
Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest
On 02/05/2015 01:56 PM, Paul E. McKenney wrote:
> The real danger is doing neither. On tick_nohz_full_cpu() CPUs, the
> exit-to-userspace code should invoke rcu_user_enter(), which sets
> some per-CPU state telling RCU to ignore that CPU, since it cannot
> possibly do host RCU read-side critical sections while running a
> guest. In contrast, a non-tick_nohz_full_cpu() CPU doesn't let RCU
> know that it is executing in a guest or in userspace. So the
> rcu_virt_note_context_switch() does the notification in that case.

Looking at context_tracking.h, I see the function context_tracking_cpu_is_enabled(). That looks like it should do the right thing in this case.
Re: [PATCH RFC] kvm: x86: add halt_poll module parameter
On 02/05/2015 02:20 PM, Paolo Bonzini wrote:
> On 05/02/2015 19:55, Jan Kiszka wrote:
>>> This patch introduces a new module parameter for the KVM module;
>>> when it is present, KVM attempts a bit of polling on every HLT
>>> before scheduling itself out via kvm_vcpu_block.
>> Wouldn't it be better to tune this on a per-VM basis? Think of mixed
>> workloads with some latency-sensitive and some standard VMs.
> Yes, but:
>
> 1) this turned out to be very cheap, so a per-host tunable is not
> too bad;
>
> 2) it also affects only very few workloads (for example network
> workloads can already do polling in the guest) so it only affects
> few people;
>
> 3) long term anyway we want it to auto tune, which is better than
> tuning it per-VM.

We may want to auto tune it per VM. However, if we make auto tuning work well, I do not think we want to expose a user visible tunable per VM, and commit to keeping that kind of interface around forever.
Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest
On 02/05/2015 11:44 AM, Christian Borntraeger wrote:
> On 05.02.2015 at 17:35, r...@redhat.com wrote:
>> From: Rik van Riel <r...@redhat.com>
>>
>> The host kernel is not doing anything while the CPU is executing
>> a KVM guest VCPU, so it can be marked as being in an extended
>> quiescent state, identical to that used when running user space
>> code.
>>
>> The only exception to that rule is when the host handles an
>> interrupt, which is already handled by the irq code, which calls
>> rcu_irq_enter and rcu_irq_exit.
>>
>> The guest_enter and guest_exit functions already switch vtime
>> accounting independent of context tracking, so leave those calls
>> where they are, instead of moving them into the context tracking
>> code.
>>
>> Signed-off-by: Rik van Riel <r...@redhat.com>
>> ---
>>  include/linux/context_tracking.h       | 8 +++-
>>  include/linux/context_tracking_state.h | 1 +
>>  2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
>> index bd9f000fc98d..a5d3bb44b897 100644
>> --- a/include/linux/context_tracking.h
>> +++ b/include/linux/context_tracking.h
>> @@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
>>  static inline void exception_exit(enum ctx_state prev_ctx)
>>  {
>>  	if (context_tracking_is_enabled()) {
>> -		if (prev_ctx == IN_USER)
>> +		if (prev_ctx == IN_USER || prev_ctx == IN_GUEST)
>>  			context_tracking_user_enter(prev_ctx);
>>  	}
>>  }
>> @@ -78,6 +78,9 @@ static inline void guest_enter(void)
>>  		vtime_guest_enter(current);
>>  	else
>>  		current->flags |= PF_VCPU;
>> +
>> +	if (context_tracking_is_enabled())
>> +		context_tracking_user_enter(IN_GUEST);
>>  }
> Couldn't we make rcu_virt_note_context_switch(smp_processor_id());
> conditional in include/linux/kvm_host.h (kvm_guest_enter), e.g.
> something like
>
> 	if (!context_tracking_is_enabled())
> 		rcu_virt_note_context_switch(smp_processor_id());

Possibly. I considered the same, but I do not know whether or not just rcu_user_enter / rcu_user_exit is enough.

I could certainly try it out and see whether anything explodes, but I am not convinced that is careful enough when it comes to handling RCU code... Paul? :)
Re: [PATCH RFC] kvm: x86: add halt_poll module parameter
On 02/05/2015 11:05 AM, Paolo Bonzini wrote:
> This patch introduces a new module parameter for the KVM module; when
> it is present, KVM attempts a bit of polling on every HLT before
> scheduling itself out via kvm_vcpu_block.
>
> This parameter helps a lot for latency-bound workloads---in
> particular I tested it with O_DSYNC writes with a battery-backed disk
> in the host. In this case, writes are fast (because the data doesn't
> have to go all the way to the platters) but they cannot be merged by
> either the host or the guest. KVM's performance here is usually
> around 30% of bare metal, or 50% if you use cache=directsync or
> cache=writethrough (these parameters avoid that the guest sends
> pointless flush requests, and at the same time they are not slow
> because of the battery-backed cache).
>
> The bad performance happens because on every halt the host CPU
> decides to halt itself too. When the interrupt comes, the vCPU thread
> is then migrated to a new physical CPU, and in general the latency is
> horrible because the vCPU thread has to be scheduled back in.
>
> With this patch performance reaches 60-65% of bare metal and, more
> important, 99% of what you get if you use idle=poll in the guest.
> This means that the tunable gets rid of this particular bottleneck,
> and more work can be done to improve performance in the kernel or
> QEMU.
>
> Of course there is some price to pay; every time an otherwise idle
> vCPU is interrupted by an interrupt, it will poll unnecessarily and
> thus impose a little load on the host. The above results were
> obtained with a mostly random value of the parameter (200), and the
> load was around 1.5-2.5% CPU usage on one of the host's cores for
> each idle guest vCPU.
>
> The patch also adds a new stat,
> /sys/kernel/debug/kvm/halt_successful_poll, that can be used to tune
> the parameter. It counts how many HLT instructions received an
> interrupt during the polling period; each successful poll avoids that
> Linux schedules the VCPU thread out and back in, and may also avoid a
> likely trip to C1 and back for the physical CPU.
>
> In the long run, this value should probably be auto-tuned. However,
> it seems like a good idea to introduce this kind of thing one step
> at a time.
>
> While the VM is idle, a Linux 4 VCPU VM halts around 10 times per
> second. Of these halts, almost all are failed polls. During the
> benchmark, instead, basically all halts end within the polling
> period, except a more or less constant stream of 50 per second coming
> from vCPUs that are not running the benchmark. The wasted time is
> thus very low. Things may be slightly different for Windows VMs,
> which have a ~10 ms timer tick.
>
> The effect is also visible on Marcelo's recently-introduced latency
> test for the TSC deadline timer. Though of course a non-RT kernel has
> awful latency bounds, the latency of the timer is around 8000-10000
> clock cycles compared to 20000-120000 without setting halt_poll. For
> the TSC deadline timer, thus, the effect is both a smaller average
> latency and a smaller variance.
>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Acked-by: Rik van Riel <r...@redhat.com>
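A sketch of the polling idea described above: spin on the wakeup condition for up to the tunable amount of time before falling through to the normal blocking path. The helper names below are real KVM functions of this era, but the exact structure is an assumption rather than the patch's actual code:

	ktime_t start = ktime_get();

	do {
		if (kvm_arch_vcpu_runnable(vcpu)) {
			/* an interrupt arrived during the poll: no
			   schedule-out, no likely C-state round trip */
			++vcpu->stat.halt_successful_poll;
			return;
		}
		cpu_relax();
	} while (ktime_before(ktime_get(),
			      ktime_add_ns(start, halt_poll)));

	/* nothing arrived within the polling window: block as before */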
Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest
On 02/05/2015 01:56 PM, Paul E. McKenney wrote:
> On Thu, Feb 05, 2015 at 01:09:19PM -0500, Rik van Riel wrote:
>> On 02/05/2015 12:50 PM, Paul E. McKenney wrote:
>>> On Thu, Feb 05, 2015 at 11:52:37AM -0500, Rik van Riel wrote:
>>>> On 02/05/2015 11:44 AM, Christian Borntraeger wrote:
>>>>> On 05.02.2015 at 17:35, r...@redhat.com wrote:
>>>>>> From: Rik van Riel <r...@redhat.com>
>>>>>>
>>>>>> The host kernel is not doing anything while the CPU is
>>>>>> executing a KVM guest VCPU, so it can be marked as being in an
>>>>>> extended quiescent state, identical to that used when running
>>>>>> user space code. [...]
>>>>> Couldn't we make rcu_virt_note_context_switch(smp_processor_id());
>>>>> conditional in include/linux/kvm_host.h (kvm_guest_enter), e.g.
>>>>> something like
>>>>>
>>>>> 	if (!context_tracking_is_enabled())
>>>>> 		rcu_virt_note_context_switch(smp_processor_id());
>>>> Possibly. I considered the same, but I do not know whether or not
>>>> just rcu_user_enter / rcu_user_exit is enough. I could certainly
>>>> try it out and see whether anything explodes, but I am not
>>>> convinced that is careful enough when it comes to handling RCU
>>>> code... Paul? :)
>>> That can fail for some odd combinations of Kconfig and boot
>>> parameters. As I understand it, you instead want the following:
>>>
>>> 	if (!tick_nohz_full_cpu(smp_processor_id()))
>>> 		rcu_virt_note_context_switch(smp_processor_id());
>>>
>>> Frederic might know of a better approach.
>> Interesting, in what cases would that happen?
> My concern, perhaps misplaced, is that context_tracking_is_enabled(),
> but that the current CPU is not a nohz_full= CPU. In that case, the
> context-tracking code would be within its rights to not tell RCU
> about the transition to the guest, and thus RCU would be unaware that
> the CPU was in a quiescent state.
>
> In most workloads, you would expect the CPU to get interrupted or
> preempted or something at some point, which would eventually inform
> RCU, but too much delay would result in the infamous RCU CPU stall
> warning message. So let's code a bit more defensively.
>
>> If context_tracking_is_enabled() we end up eventually calling into
>> rcu_user_enter & rcu_user_exit. If it is not enabled, we call
>> rcu_virt_note_context_switch. In what cases would we need to call
>> both? I don't see exit-to-userspace do the things that
>> rcu_virt_note_context_switch does, and do not understand why
>> userspace is fine with that...
> The real danger is doing neither.
>
> On tick_nohz_full_cpu() CPUs, the exit-to-userspace code should
> invoke rcu_user_enter(), which sets some per-CPU state telling RCU
> to ignore that CPU, since it cannot possibly do host RCU read-side
> critical sections while running a guest.

Ahhh, I understand now. Thank you for your explanation, Paul.

I will make sure your suggestion is in version 2 of this series.
Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest
On 02/05/2015 12:50 PM, Paul E. McKenney wrote:
> On Thu, Feb 05, 2015 at 11:52:37AM -0500, Rik van Riel wrote:
>> On 02/05/2015 11:44 AM, Christian Borntraeger wrote:
>>> On 05.02.2015 at 17:35, r...@redhat.com wrote:
>>>> From: Rik van Riel <r...@redhat.com>
>>>>
>>>> The host kernel is not doing anything while the CPU is executing
>>>> a KVM guest VCPU, so it can be marked as being in an extended
>>>> quiescent state, identical to that used when running user space
>>>> code. [...]
>>> Couldn't we make rcu_virt_note_context_switch(smp_processor_id());
>>> conditional in include/linux/kvm_host.h (kvm_guest_enter), e.g.
>>> something like
>>>
>>> 	if (!context_tracking_is_enabled())
>>> 		rcu_virt_note_context_switch(smp_processor_id());
>> Possibly. I considered the same, but I do not know whether or not
>> just rcu_user_enter / rcu_user_exit is enough. I could certainly
>> try it out and see whether anything explodes, but I am not convinced
>> that is careful enough when it comes to handling RCU code... Paul? :)
> That can fail for some odd combinations of Kconfig and boot
> parameters. As I understand it, you instead want the following:
>
> 	if (!tick_nohz_full_cpu(smp_processor_id()))
> 		rcu_virt_note_context_switch(smp_processor_id());
>
> Frederic might know of a better approach.

Interesting, in what cases would that happen?

If context_tracking_is_enabled() we end up eventually calling into rcu_user_enter & rcu_user_exit. If it is not enabled, we call rcu_virt_note_context_switch. In what cases would we need to call both?

I don't see exit-to-userspace do the things that rcu_virt_note_context_switch does, and do not understand why userspace is fine with that...
Re: [PATCH 4/4] kvm,rcu: use RCU extended quiescent state when running KVM guest
On 02/05/2015 02:27 PM, Paul E. McKenney wrote: On Thu, Feb 05, 2015 at 02:02:35PM -0500, Rik van Riel wrote: Looking at context_tracking.h, I see the function context_tracking_cpu_is_enabled(). That looks like it should do the right thing in this case. Right you are -- that same check is used to guard the context_tracking_user_enter() function's call to rcu_user_enter(). My test suggests it is working. v2 incoming :) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
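For readers following the thread: v2 resolved this by guarding the RCU hint in kvm_guest_enter() with the same per-CPU check that context tracking uses internally. A minimal sketch of the idea, using the helper names discussed above (the merged code may differ in detail):

static inline void kvm_guest_enter(void)
{
	/* guest_enter() switches vtime accounting; callers run with irqs off */
	guest_enter();

	/*
	 * KVM holds no RCU read-side references while the CPU runs a
	 * guest. On CPUs where context tracking is enabled, the
	 * rcu_user_enter()/rcu_user_exit() calls already put RCU into
	 * an extended quiescent state; everywhere else, tell RCU this
	 * CPU just did the equivalent of a context switch.
	 */
	if (!context_tracking_cpu_is_enabled())
		rcu_virt_note_context_switch(smp_processor_id());
}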
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On 01/14/2015 12:12 PM, Marcelo Tosatti wrote:

The problem:

On -RT, an emulated LAPIC timer instance has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the LAPIC path for a KVM guest. This patch reduces the average latency in my tests from 14us to 11us.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Acked-by: Rik van Riel r...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe
On 01/14/2015 12:12 PM, Marcelo Tosatti wrote: Since lapic timer handler only wakes up a simple waitqueue, it can be executed from hardirq context. Reduces average cyclictest latency by 3us. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Acked-by: Rik van Riel r...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/2] KVM: x86: add option to advance tscdeadline hrtimer expiration
On 12/11/2014 04:07 PM, Andy Lutomirski wrote: On Thu, Dec 11, 2014 at 12:58 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Thu, Dec 11, 2014 at 12:48:36PM -0800, Andy Lutomirski wrote: On 12/10/2014 07:07 PM, Marcelo Tosatti wrote: On Thu, Dec 11, 2014 at 12:37:57AM +0100, Paolo Bonzini wrote: On 10/12/2014 21:57, Marcelo Tosatti wrote: For the hrtimer which emulates the tscdeadline timer in the guest, add an option to advance expiration, and busy spin on VM-entry waiting for the actual expiration time to elapse. This allows achieving low latencies in cyclictest (or any scenario which requires strict timing regarding timer expiration). Reduces cyclictest avg latency by 50%. Note: this option requires tuning to find the appropriate value for a particular hardware/guest combination. One method is to measure the average delay between apic_timer_fn and VM-entry. Another method is to start with 1000ns, and increase the value in say 500ns increments until avg cyclictest numbers stop decreasing. What values are you using in practice for the parameter? 7us. It takes 7us to get from TSC deadline expiration to the *start* of vmresume? That seems rather extreme. Is it possible that almost all of that latency is from deadline expiration to C-state exit? If so, can we teach the timer code to wake up early to account for that? We're supposed to know our idle exit latency these days. 7us includes: idle thread wakeup idle schedout ksoftirqd schedin ksoftirqd schedout qemu schedin vm-entry Is there some secret way to get perf to profile this? Like some way to tell perf to only record samples after the IRQ fires and before vmresume? Because 7 us seems way slower than it should be for this. Yes, Rik, I know that we're wasting all kinds of time doing dumb things with xstate, but that shouldn't be anywhere near 7us on modern hardware, unless we're actually taking a device-not-available exception in there. :) There might be a whopping size xstate operations, but those should be 60ns each, perhaps, if the state is dirty? Maybe everything misses cache? I don't expect the xstate stuff to be more than about half a microsecond, with cache misses and failure to optimize XSAVEOPT / XRSTOR across vmenter/vmexit. We get another microsecond or so from not trapping from the guest to the host every time the guest accesses the FPU for the first time. Marcelo already has a hack for that in his tree, and my series merely has something that achieves the same in an automated (and hopefully upstreamable) way. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
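The mechanism being tuned in this thread, reduced to a sketch: the hrtimer backing the guest's TSC-deadline timer is armed early by a configurable amount, and the vcpu-entry path busy-waits for the remainder, so the interrupt is delivered right at the guest's programmed deadline instead of a wakeup-and-context-switch later. All names below are hypothetical stand-ins, not the real patch:

/* tuned per host/guest combination, as described in the changelog */
static unsigned int tsc_deadline_advance_ns = 7000;

static void wait_for_guest_deadline(struct kvm_vcpu *vcpu)
{
	u64 deadline = vcpu_tsc_deadline(vcpu);	/* hypothetical accessor */

	/*
	 * The hrtimer fired tsc_deadline_advance_ns early, so the timer
	 * interrupt is already pending; burn the residue so the guest
	 * observes expiration at (almost) exactly the programmed time.
	 */
	while (guest_read_tsc(vcpu) < deadline)	/* hypothetical accessor */
		cpu_relax();
}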
Re: [patch 2/2] KVM: x86: add option to advance tscdeadline hrtimer expiration
On 12/10/2014 12:06 PM, Marcelo Tosatti wrote: For the hrtimer which emulates the tscdeadline timer in the guest, add an option to advance expiration, and busy spin on VM-entry waiting for the actual expiration time to elapse. This allows achieving low latencies in cyclictest (or any scenario which requires strict timing regarding timer expiration). Reduces cyclictest avg latency by 50%. Note: this option requires tuning to find the appropriate value for a particular hardware/guest combination. One method is to measure the average delay between apic_timer_fn and VM-entry. Another method is to start with 1000ns, and increase the value in say 500ns increments until avg cyclictest numbers stop decreasing. It would be good to document how this is used, in the changelog. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] KVM: x86: add method to test PIR bitmap vector
On 12/10/2014 12:06 PM, Marcelo Tosatti wrote:

kvm_x86_ops->test_posted_interrupt() returns true/false depending on whether 'vector' is set.

Is that good? Bad? How does this patch address the issue?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] KVM: x86: add method to test PIR bitmap vector
On 12/10/2014 12:27 PM, Marcelo Tosatti wrote:
On Wed, Dec 10, 2014 at 12:10:04PM -0500, Rik van Riel wrote:
On 12/10/2014 12:06 PM, Marcelo Tosatti wrote:

kvm_x86_ops->test_posted_interrupt() returns true/false depending on whether 'vector' is set.

Is that good? Bad? How does this patch address the issue?

What issue?

Why is this change being made?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm ivy bridge support?
On 12/05/2014 03:17 AM, Thomas Lau wrote:

virsh cpu-models x86_64 | sort
...
Penryn
SandyBridge
Westmere
...

interesting that it doesn't have Ivy Bridge, any reason?

The reason would be that you are running an older version.

--
All rights reversed

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:

The problem:

On -RT, an emulated LAPIC timer instance has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the LAPIC path for a KVM guest.

The solution:

Allow waking up vcpu thread from hardirq context, thus avoiding the need for ksoftirqd to be scheduled. Normal waitqueues make use of spinlocks, which on -RT are sleepable locks. Therefore, waking up a waitqueue waiter involves locking a sleeping lock, which is not allowed from hard interrupt context.

What are the consequences for the realtime kernel of making waitqueue traversal non-preemptible?

Is waitqueue traversal something that ever takes so long we need to care about it being non-preemptible?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On 11/25/2014 01:57 PM, Rik van Riel wrote:
On 11/25/2014 12:21 PM, Marcelo Tosatti wrote:

The problem:

On -RT, an emulated LAPIC timer instance has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the LAPIC path for a KVM guest.

The solution:

Allow waking up vcpu thread from hardirq context, thus avoiding the need for ksoftirqd to be scheduled. Normal waitqueues make use of spinlocks, which on -RT are sleepable locks. Therefore, waking up a waitqueue waiter involves locking a sleeping lock, which is not allowed from hard interrupt context.

What are the consequences for the realtime kernel of making waitqueue traversal non-preemptible? Is waitqueue traversal something that ever takes so long we need to care about it being non-preemptible?

I answered my own question. This patch only changes the kvm vcpu waitqueue, which should only ever have the vcpu thread itself waiting on it. In other words, it is a wait queue just one entry long, and the latency of traversing it will be absolutely minimal.

Unless someone can think of a better way to implement this patch, it gets my:

Acked-by: Rik van Riel r...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
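For reference, the shape of the "simple waitqueue" being acked here, as a sketch modeled on the -rt swait idea (not the exact API): the queue is protected by a raw spinlock, which remains a true spinning lock even on -RT, so a wakeup is legal from hard interrupt context; and since the vcpu queue holds at most one waiter, the non-preemptible traversal is trivially short.

struct swait_head {
	raw_spinlock_t   lock;	/* raw: never a sleeping lock on -RT */
	struct list_head list;
};

struct swaiter {
	struct task_struct *task;
	struct list_head    node;
};

static void swait_wake_all(struct swait_head *h)
{
	struct swaiter *w, *t;
	unsigned long flags;

	raw_spin_lock_irqsave(&h->lock, flags);
	/* traversal runs with irqs off; the vcpu queue has <= 1 entry */
	list_for_each_entry_safe(w, t, &h->list, node) {
		wake_up_process(w->task);
		list_del_init(&w->node);
	}
	raw_spin_unlock_irqrestore(&h->lock, flags);
}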
Re: [patch -rt 2/2] KVM: lapic: mark LAPIC timer handler as irqsafe
On 11/25/2014 12:21 PM, Marcelo Tosatti wrote: Since lapic timer handler only wakes up a simple waitqueue, it can be executed from hardirq context. Reduces average cyclictest latency by 3us. Can this patch be merged in the KVM tree, and go upstream via Paolo? Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Acked-by: Rik van Riel r...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: Fix page ageing bugs
On 09/22/2014 03:57 PM, Andres Lagar-Cavilla wrote:

1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does.

Acked-by: Rik van Riel r...@redhat.com

--
All rights reversed

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: status of reducing context switching overhead in KVM
On 06/21/2014 04:55 AM, Xuekun Hu wrote:

Hi, Rik

I saw your presentation at last year's 2013 KVM summit. (http://www.linux-kvm.org/wiki/images/2/27/Kvm-forum-2013-idle-latency.pdf). You said there would be some patches later, but I didn't find them. Are you continuing to do the optimization work? Any updates? :-)

I have gotten side-tracked by the automatic NUMA balancing work, but Michael Tsirkin has been doing some work on reducing idle latencies.

--
All rights reversed

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Implement Batched (group) ticket lock
On 05/28/2014 08:16 AM, Raghavendra K T wrote:

This patch looks very promising.

TODO:
- we need an intelligent way to nullify the effect of batching for baremetal (because extra cmpxchg is not required).

On (larger?) NUMA systems, the unfairness may be a nice performance benefit, reducing cache line bouncing through the system, and it could well outweigh the extra cmpxchg at times.

- we may have to make batch size as kernel arg to solve above problem (to run same kernel for host/guest). Increasing batch size also seem to help virtualized guest more, so we will have flexibility of tuning depending on vm size.

- My kernbench/ebizzy test on baremetal (32 cpu +ht sandybridge) did not seem to show the impact of extra cmpxchg. But there should be effect of extra cmpxchg.

Canceled out by better NUMA locality? Or maybe cmpxchg is cheap once you already own the cache line exclusively?

- virtualized guest had slight impact on 1x cases of some benchmarks but we have got impressive performance for >1x cases. So overall, patch needs exhaustive testing.

- we can further add dynamically changing batch_size implementation (inspiration and hint by Paul McKenney) as necessary.

I could see a larger batch size being beneficial. Currently the maximum wait time for a spinlock on a system with N CPUs is N times the length of the largest critical section. Having the batch size set equal to the number of CPUs would only double that, and better locality (CPUs local to the current lock holder winning the spinlock operation) might speed things up enough to cancel that part of that out again...

I have found that increasing batch size gives excellent improvements for overcommitted guests. I understand that we need more exhaustive testing. Please provide your suggestion and comments.

I have only nitpicks so far...

diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..b04c03d 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -3,15 +3,16 @@

 #include <linux/types.h>

+#define TICKET_LOCK_INC_SHIFT 1
+#define __TICKET_LOCK_TAIL_INC (1<<TICKET_LOCK_INC_SHIFT)
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define __TICKET_LOCK_INC	2
 #define TICKET_SLOWPATH_FLAG	((__ticket_t)1)
 #else
-#define __TICKET_LOCK_INC	1
 #define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
 #endif

For the !CONFIG_PARAVIRT case, TICKET_LOCK_INC_SHIFT used to be 0, now you are making it one. Probably not an issue, since even people who compile with 128 < CONFIG_NR_CPUS <= 256 will likely have their spinlocks padded out to 32 or 64 bits anyway in most data structures.

-#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
+#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_TAIL_INC))
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
 #else
@@ -19,7 +20,12 @@ typedef u16 __ticket_t;
 typedef u32 __ticketpair_t;
 #endif

-#define TICKET_LOCK_INC	((__ticket_t)__TICKET_LOCK_INC)
+#define TICKET_LOCK_TAIL_INC ((__ticket_t)__TICKET_LOCK_TAIL_INC)
+
+#define TICKET_LOCK_HEAD_INC ((__ticket_t)1)
+#define TICKET_BATCH	0x4 /* 4 waiters can contend simultaneously */
+#define TICKET_LOCK_BATCH_MASK (~(TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) + \
+				 TICKET_LOCK_TAIL_INC - 1)

I do not see the value in having TICKET_BATCH declared with a hexadecimal number, and it may be worth making sure the code does not compile if someone tried a TICKET_BATCH value that is not a power of 2.
-- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
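To see the batching idea without the kernel scaffolding, here is a self-contained toy model (my sketch using C11 atomics, not Raghavendra's patch): the tail advances by 2 per locker so bit 0 of the head word can double as the actual lock, and only waiters whose ticket falls inside the current batch window race for that bit with cmpxchg.

#include <stdatomic.h>
#include <stdbool.h>

#define TAIL_INC   2u			/* tickets move in steps of 2 */
#define BATCH      4u			/* waiters allowed to contend */
#define BATCH_MASK (~(BATCH * TAIL_INC - 1u))

struct batch_lock {
	_Atomic unsigned head;		/* bit 0 doubles as the lock bit */
	_Atomic unsigned tail;
};

static bool same_batch(unsigned ticket, unsigned head)
{
	return ((ticket ^ (head & ~1u)) & BATCH_MASK) == 0;
}

static void batch_lock(struct batch_lock *l)
{
	unsigned ticket = atomic_fetch_add(&l->tail, TAIL_INC);

	for (;;) {
		unsigned head = atomic_load(&l->head);

		/*
		 * Spin read-only until our batch is up and bit 0 clears;
		 * everyone in the batch then races for the lock bit, which
		 * is the controlled unfairness that tolerates a preempted
		 * ticket holder.
		 */
		if (same_batch(ticket, head) && !(head & 1u) &&
		    atomic_compare_exchange_weak(&l->head, &head, head | 1u))
			return;
	}
}

static void batch_unlock(struct batch_lock *l)
{
	/*
	 * Bit 0 is set while the lock is held: adding TAIL_INC - 1 both
	 * clears it and advances head to the next ticket in one atomic op.
	 */
	atomic_fetch_add(&l->head, TAIL_INC - 1u);
}

Since head only leaves a batch window after every ticket issued in that window has been retired, fairness is preserved across batches; within a batch it is deliberately given up.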
Re: [RFC] Implement Batched (group) ticket lock
On 05/28/2014 06:19 PM, Linus Torvalds wrote: If somebody has a P4 still, that's likely the worst case by far. I'm sure cmpxchg isn't the only thing making P4 the worst case :) -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: commit 0bf1457f0cfca7b mm: vmscan: do not swap anon pages just because free+file is low causes heavy performance regression on paging
On 04/22/2014 07:57 AM, Christian Borntraeger wrote:
On 22/04/14 12:55, Christian Borntraeger wrote:

While preparing/testing some KVM on s390 patches for the next merge window (target is kvm/next which is based on 3.15-rc1) I faced a very severe performance hiccup on guest paging (all anonymous memory). All memory bound guests are in D state now and the system is barely unusable. Reverting commit 0bf1457f0cfca7bc026a82323ad34bcf58ad035d "mm: vmscan: do not swap anon pages just because free+file is low" makes the problem go away. According to /proc/vmstat the system is now in direct reclaim almost all the time for every page fault (more than 10x more direct reclaims than kswap reclaims). With the patch being reverted everything is fine again. Any ideas?

Here is an idea to tackle my problem and the original problem: reverting 0bf1457f0cfca7bc026a82323ad34bcf58ad035d + checking against low, also seems to make my system usable.

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1923,7 +1923,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 */
 	if (global_reclaim(sc)) {
 		free = zone_page_state(zone, NR_FREE_PAGES);
-		if (unlikely(file + free <= high_wmark_pages(zone))) {
+		if (unlikely(file + free <= low_wmark_pages(zone))) {
 			scan_balance = SCAN_ANON;
 			goto out;
 		}

Looks reasonable to me. Johannes?

--
All rights reversed

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -mm 3/3] move mmu notifier call from change_protection to change_pmd_range
On 02/18/2014 09:24 PM, David Rientjes wrote: On Tue, 18 Feb 2014, r...@redhat.com wrote: From: Rik van Riel r...@redhat.com The NUMA scanning code can end up iterating over many gigabytes of unpopulated memory, especially in the case of a freshly started KVM guest with lots of memory. This results in the mmu notifier code being called even when there are no mapped pages in a virtual address range. The amount of time wasted can be enough to trigger soft lockup warnings with very large KVM guests. This patch moves the mmu notifier call to the pmd level, which represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier code is only called from the address in the PMD where present mappings are first encountered. The hugetlbfs code is left alone for now; hugetlb mappings are not relocatable, and as such are left alone by the NUMA code, and should never trigger this problem to begin with. Signed-off-by: Rik van Riel r...@redhat.com Cc: Peter Zijlstra pet...@infradead.org Cc: Andrea Arcangeli aarca...@redhat.com Reported-by: Xing Gang gang.x...@hp.com Tested-by: Chegu Vinod chegu_vi...@hp.com Acked-by: David Rientjes rient...@google.com Might have been cleaner to move the mmu_notifier_invalidate_range_{start,end}() to hugetlb_change_protection() as well, though. I can certainly do that if you want. Just let me know and I'll send a v2 of patch 3 :) -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -mm v2 3/3] move mmu notifier call from change_protection to change_pmd_range
On Tue, 18 Feb 2014 18:24:36 -0800 (PST) David Rientjes rient...@google.com wrote:

Acked-by: David Rientjes rient...@google.com

Might have been cleaner to move the mmu_notifier_invalidate_range_{start,end}() to hugetlb_change_protection() as well, though.

Way cleaner! Second version attached :)

Thanks, David.

---8---
Subject: move mmu notifier call from change_protection to change_pmd_range

The NUMA scanning code can end up iterating over many gigabytes of unpopulated memory, especially in the case of a freshly started KVM guest with lots of memory.

This results in the mmu notifier code being called even when there are no mapped pages in a virtual address range. The amount of time wasted can be enough to trigger soft lockup warnings with very large KVM guests.

This patch moves the mmu notifier call to the pmd level, which represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier code is only called from the address in the PMD where present mappings are first encountered.

The hugetlbfs code is left alone for now; hugetlb mappings are not relocatable, and as such are left alone by the NUMA code, and should never trigger this problem to begin with.

Signed-off-by: Rik van Riel r...@redhat.com
Acked-by: David Rientjes rient...@google.com
---
 mm/hugetlb.c  |  2 ++
 mm/mprotect.c | 15 ++++++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c0d930f..f0c5dfb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3065,6 +3065,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);

+	mmu_notifier_invalidate_range_start(mm, start, end);
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3094,6 +3095,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+	mmu_notifier_invalidate_range_end(mm, start, end);

 	return pages << h->order;
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d790166..76146fa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -115,9 +115,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pgprot_t newprot, int dirty_accountable, int prot_numa)
 {
 	pmd_t *pmd;
+	struct mm_struct *mm = vma->vm_mm;
 	unsigned long next;
 	unsigned long pages = 0;
 	unsigned long nr_huge_updates = 0;
+	unsigned long mni_start = 0;

 	pmd = pmd_offset(pud, addr);
 	do {
@@ -126,6 +128,13 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		next = pmd_addr_end(addr, end);
 		if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd))
 			continue;
+
+		/* invoke the mmu notifier if the pmd is populated */
+		if (!mni_start) {
+			mni_start = addr;
+			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+		}
+
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				split_huge_page_pmd(vma, addr, pmd);
@@ -149,6 +158,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			pages += this_pages;
 	} while (pmd++, addr = next, addr != end);

+	if (mni_start)
+		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
 	return pages;
@@ -208,15 +220,12 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end, pgprot_t newprot,
 		       int dirty_accountable, int prot_numa)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	unsigned long pages;

-	mmu_notifier_invalidate_range_start(mm, start, end);
 	if (is_vm_hugetlb_page(vma))
 		pages = hugetlb_change_protection(vma, start, end, newprot);
 	else
 		pages = change_protection_range(vma, start, end, newprot,
 						dirty_accountable, prot_numa);
-	mmu_notifier_invalidate_range_end(mm, start, end);

 	return pages;
 }

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mm: hugetlb: initialize PG_reserved for tail pages of gigantic compound pages
On 10/10/2013 12:12 PM, Andrea Arcangeli wrote: 11feeb498086a3a5907b8148bdf1786a9b18fc55 introduced a memory leak when KVM is run on gigantic compound pages. 11feeb498086a3a5907b8148bdf1786a9b18fc55 depends on the assumption that PG_reserved is identical for all head and tail pages of a compound page. So that if get_user_pages returns a tail page, we don't need to check the head page in order to know if we deal with a reserved page that requires different refcounting. The assumption that PG_reserved is the same for head and tail pages is certainly correct for THP and regular hugepages, but gigantic hugepages allocated through bootmem don't clear the PG_reserved on the tail pages (the clearing of PG_reserved is done later only if the gigantic hugepage is freed). This patch corrects the gigantic compound page initialization so that we can retain the optimization in 11feeb498086a3a5907b8148bdf1786a9b18fc55. The cacheline was already modified in order to set PG_tail so this won't affect the boot time of large memory systems. Reported-by: andy123 ajs124.ajs...@gmail.com Signed-off-by: Andrea Arcangeli aarca...@redhat.com Acked-by: Rik van Riel r...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On 05/13/2013 11:16 AM, Michael S. Tsirkin wrote: However, there's a big question mark: host specifies inflate, guest says deflate, who wins? If we're dealing with a NUMA guest, they could both win :) The host could see reduced memory use of the guest in one place, while the guest could see increased memory availability in another place... I also suspect that having some churn could help sort out exactly what the working set is. At some point Google sent patches that gave guest complete control over the balloon. This has the advantage that management isn't involved. I believe the Google patches still included some way for the host to initiate balloon inflation on the guest side, because the guest internal state alone is not enough to see when the host is under memory pressure. I discussed the project with the Google developers in question a little over a year ago, but I do not remember whether their pressure notification went through qemu, or directly from the host kernel to the guest kernel... And at some level it seems to make sense: why set an upper limit on size of the balloon? The bigger it is, the better. Response time. If too much of a guest's memory has been removed, it can take too long for the guest to react to user requests, be it over the web or ssh or something else... -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On 05/13/2013 11:35 AM, Michael S. Tsirkin wrote: On Mon, May 13, 2013 at 11:22:58AM -0400, Rik van Riel wrote: I believe the Google patches still included some way for the host to initiate balloon inflation on the guest side, because the guest internal state alone is not enough to see when the host is under memory pressure. I discussed the project with the Google developers in question a little over a year ago, but I do not remember whether their pressure notification went through qemu, or directly from the host kernel to the guest kernel... So increasing the max number of pages in balloon indicates host memory pressure to the guest? Fair enough but I wonder whether there's a way to make it more explicit in the interface, somehow. There may be a better way to do this, but I am really not sure what that would be. What properties would you like to see in the interface? What kind of behaviour are you looking for? -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] virtio_balloon: auto-ballooning support
On 05/12/2013 10:30 AM, Michael S. Tsirkin wrote:
On Thu, May 09, 2013 at 10:53:49AM -0400, Luiz Capitulino wrote:

Automatic ballooning consists of dynamically adjusting the guest's balloon according to memory pressure in the host and in the guest. This commit implements the guest side of automatic ballooning, which basically consists of registering a shrinker callback with the kernel, which will try to deflate the guest's balloon by the amount of pages being requested. The shrinker callback is only registered if the host supports the VIRTIO_BALLOON_F_AUTO_BALLOON feature bit.

OK so we have a new feature bit, such that:
- if AUTO_BALLOON is set in host, guest can leak a page from a balloon at any time

questions left unanswered
- what meaning does num_pages have now?

as large as we could go

- when will the balloon be re-inflated?

I believe the qemu changes Luiz wrote address that side, with qemu-kvm getting notifications from the host kernel when there is memory pressure, and shrinking the guest in reaction to that notification.

I'd like to see a spec patch addressing these questions. Would we ever want to mix the two types of ballooning? If yes possibly when we put a page in the balloon we might want to give host a hint this page might be leaked again soon.

It might not be the same page, and the host really does not care which page it is. The automatic inflation happens when the host needs to free up memory. Automatic inflate is performed by the host.

Here are some numbers. The test-case is to run 35 VMs (1G of RAM each) in parallel doing a kernel build. Host has 32GB of RAM and 16GB of swap. SWAP IN and SWAP OUT correspond to the number of pages swapped in and swapped out, respectively.

Auto-ballooning disabled:

RUN  TIME(s)  SWAP IN   SWAP OUT
1    634      930980    1588522
2    610      627422    1362174
3    649      1079847   1616367
4    543      953289    1635379
5    642      913237    1514000

Auto-ballooning enabled:

RUN  TIME(s)  SWAP IN   SWAP OUT
1    629      901       12537
2    624      981       18506
3    626      573       9085
4    631      2250      42534
5    627      1610      20808

So what exactly happens here? Much less swap in/out activity, but no gain in total runtime. Doesn't this show there's a bottleneck somewhere? Could be a problem in the implementation?

It could also be an issue with the benchmark chosen, which may not have swap as its bottleneck at any point. However, the reduced swapping is still very promising!

Also, what happened with the balloon? Did we end up with balloon completely inflated? deflated? One question to consider: possibly if we are going to reuse the page in the balloon soon, we might want to bypass notify before use for it? Maybe that will help speed things up.
Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
---
 drivers/virtio/virtio_balloon.c     | 55 +
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 56 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 9d5fe2b..f9dcae8 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -71,6 +71,9 @@ struct virtio_balloon
 	/* Memory statistics */
 	int need_stats_update;
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+	/* Memory shrinker */
+	struct shrinker shrinker;
 };

 static struct virtio_device_id id_table[] = {
@@ -126,6 +129,7 @@ static void set_page_pfns(u32 pfns[], struct page *page)
 		pfns[i] = page_to_balloon_pfn(page) + i;
 }

+/* This function should be called with vb->balloon_mutex held */
 static void fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
@@ -166,6 +170,7 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
 	}
 }

+/* This function should be called with vb->balloon_mutex held */
 static void leak_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct page *page;
@@ -285,6 +290,45 @@ static void update_balloon_size(struct virtio_balloon *vb)
 			      actual, sizeof(actual));
 }

+static unsigned long balloon_get_nr_pages(const struct virtio_balloon *vb)
+{
+	return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
+}
+
+static int balloon_shrinker(struct shrinker *shrinker, struct shrink_control *sc)
+{
+	unsigned int nr_pages, new_target;
+	struct virtio_balloon *vb;
+
+	vb = container_of(shrinker, struct virtio_balloon, shrinker);
+	if (!mutex_trylock(&vb->balloon_lock)) {
+		return -1;
+	}
+
+	nr_pages = balloon_get_nr_pages(vb);
+	if (!sc->nr_to_scan || !nr_pages) {
+		goto out;
+	}
+
+	/*
+	 * If the current balloon size is greater than the number of
+	 * pages being reclaimed by the kernel, deflate only the needed
+	 * amount.
Re: Preemptable Ticket Spinlock
On 04/22/2013 07:51 AM, Peter Zijlstra wrote: On Sun, 2013-04-21 at 17:12 -0400, Rik van Riel wrote: If we always incremented the ticket number by 2 (instead of 1), then we could use the lower bit of the ticket number as the spinlock. ISTR that paravirt ticket locks already do that and use the lsb to indicate the unlock needs to perform wakeups. Also, since all of this is virt nonsense, shouldn't it live in the paravirt ticket lock code and leave the native code as is? Sure, but that is still no reason not to have the virt implementation be as fast as possible, and share the same data type as the non-virt implementation. Also, is it guaranteed that the native spin_lock code has not been called yet before we switch over to the paravirt functions? If the native spin_lock code has been called already at that time, the native code would still need to be modified to increment the ticket number by 2, so we end up with a compatible value in each spin lock's .tickets field, and prevent a deadlock after we switch over to the paravirt variant. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Preemptable Ticket Spinlock
On 04/22/2013 03:49 PM, Peter Zijlstra wrote: On Mon, 2013-04-22 at 08:52 -0400, Rik van Riel wrote: If the native spin_lock code has been called already at that time, the native code would still need to be modified to increment the ticket number by 2, so we end up with a compatible value in each spin lock's .tickets field, and prevent a deadlock after we switch over to the paravirt variant. I thought the stuff already made it upstream, but apparently not; the lastest posting I'm aware of is here: https://lkml.org/lkml/2012/5/2/105 That stuff changes the normal ticket increment as well.. Jiannan, It looks like the patch above could make a good patch 1 (or 2) in your patch series :) -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Preemptable Ticket Spinlock
On 04/22/2013 04:08 PM, Peter Zijlstra wrote: On Mon, 2013-04-22 at 15:56 -0400, Rik van Riel wrote: On 04/22/2013 03:49 PM, Peter Zijlstra wrote: On Mon, 2013-04-22 at 08:52 -0400, Rik van Riel wrote: If the native spin_lock code has been called already at that time, the native code would still need to be modified to increment the ticket number by 2, so we end up with a compatible value in each spin lock's .tickets field, and prevent a deadlock after we switch over to the paravirt variant. I thought the stuff already made it upstream, but apparently not; the lastest posting I'm aware of is here: https://lkml.org/lkml/2012/5/2/105 That stuff changes the normal ticket increment as well.. Jiannan, It looks like the patch above could make a good patch 1 (or 2) in your patch series :) I much prefer the entire series from Jeremy since it maintains the ticket semantics and doesn't degrade the lock to unfair under contention. Now I suppose there's a reason its not been merged yet and I suspect its !paravirt hotpath impact which wasn't rightly justified or somesuch so maybe someone can work on that or so.. dunno. IIRC one of the reasons was that the performance improvement wasn't as obvious. Rescheduling VCPUs takes a fair amount of time, quite probably more than the typical hold time of a spinlock. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Preemptable Ticket Spinlock
On 04/22/2013 04:46 PM, Jiannan Ouyang wrote: It would still be very interesting to conduct more experiments to compare these two, to see if the fairness enforced by pv_lock is mandatory, and if preemptable-lock outperforms pv_lock in most cases, and how do they work with PLE. Given the fairly high cost of rescheduling a VCPU (which is likely to include an IPI), versus the short hold time of most spinlocks, I have the strong suspicion that your approach would win. The fairness is only compromised in a limited way and in certain circumstances, so I am not too worried about that. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Preemptable Ticket Spinlock
On 04/22/2013 04:48 PM, Peter Zijlstra wrote: Hmm.. it looked like under light overcommit the paravirt ticket lock still had some gain (~10%) and of course it brings the fairness thing which is always good. I can only imagine the mess unfair + vcpu preemption can bring to guest tasks. If you think unfairness + vcpu preemption is bad, you haven't tried full fairness + vcpu preemption :) -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Preemptable Ticket Spinlock
On 04/22/2013 04:55 PM, Peter Zijlstra wrote: On Mon, 2013-04-22 at 16:46 -0400, Jiannan Ouyang wrote: - pv-preemptable-lock has much less performance variance compare to pv_lock, because it adapts to preemption within VM, other than using rescheduling that increase VM interference I would say it has a _much_ worse worst case (and thus worse variance) than the paravirt ticket implementation from Jeremy. While full paravirt ticket lock results in vcpu scheduling it does maintain fairness. If you drop strict fairness you can end up in unbounded starvation cases and those are very ugly indeed. If needed, Jiannan's scheme could easily be bounded to prevent infinite starvation. For example, we could allow only the first 8 CPUs in line to jump the queue. However, given the way that virtual CPUs get scheduled in and out all the time, I suspect starvation is not a worry, and we will not need the additional complexity to deal with it. You may want to play around with virtualization a bit, to get a feel for how things work in virt land. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Preemptable Ticket Spinlock
On 04/22/2013 05:56 PM, Andi Kleen wrote: Rik van Riel r...@redhat.com writes: If we always incremented the ticket number by 2 (instead of 1), then we could use the lower bit of the ticket number as the spinlock. Spinning on a single bit is very inefficient, as you need to do try lock in a loop which is very unfriendly to the MESI state protocol. It's much better to have at least three states and allow spinning-while-reading-only. This is typically very visible on systems with 2S. Absolutely, the spinning should be read-only, until the CPU sees that the desired bit is clear. MESI-friendly spinning is essential. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
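The read-mostly spin Andi describes, as a self-contained sketch (mine, using C11 atomics): waiters loop on a plain load, so the cache line stays in the shared MESI state across all spinners, and only attempt the exclusive (read-for-ownership) atomic once the lock looks free.

#include <stdatomic.h>

typedef _Atomic int tts_lock_t;

static void tts_lock(tts_lock_t *l)
{
	for (;;) {
		/* read-only spin: no RFO traffic while the lock is held */
		while (atomic_load_explicit(l, memory_order_relaxed))
			;	/* a pause/cpu_relax() hint belongs here */

		/* the lock looks free: only now try the exclusive op */
		if (!atomic_exchange_explicit(l, 1, memory_order_acquire))
			return;
	}
}

static void tts_unlock(tts_lock_t *l)
{
	atomic_store_explicit(l, 0, memory_order_release);
}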
Re: Preemptable Ticket Spinlock
On 04/20/2013 06:12 PM, Jiannan Ouyang wrote:

Hello Everyone,

I recently came up with a spinlock algorithm that can adapt to preemption, which you may be interested in. The intuition is to downgrade a fair lock to an unfair lock automatically upon preemption, and preserve the fairness otherwise. It is a guest side optimization, and can be used as a complementary technique to host side optimizations like co-scheduling and Pause-Loop Exiting.

In my experiments, it improves VM performance by 5.32X on average, when running on a non paravirtual VMM, and by 7.91X when running on a VMM that supports a paravirtual locking interface (using a pv preemptable ticket spinlock), when executing a set of microbenchmarks as well as a realistic e-commerce benchmark.

A detailed algorithm description can be found in my VEE 2013 paper, "Preemptable Ticket Spinlocks: Improving Consolidated Performance in the Cloud", Jiannan Ouyang, John R. Lange (ouyang,jackla...@cs.pitt.edu), University of Pittsburgh
http://people.cs.pitt.edu/~ouyang/files/publication/preemptable_lock-ouyang-vee13.pdf

Your algorithm is very clever, and very promising. However, it does increase the size of the struct spinlock, and adds an additional atomic operation to spin_unlock, neither of which I suspect are necessary.

If we always incremented the ticket number by 2 (instead of 1), then we could use the lower bit of the ticket number as the spinlock.

If we do NOT run virtualized, we simply increment the ticket by 2 in spin_unlock, and the code can remain otherwise the same. If we do run virtualized, we take that spinlock after acquiring the ticket (or timing out), just like in your code. In the virtualized spin_unlock, we can then release the spinlock and increment the ticket in one operation: by simply increasing the ticket by 1.

In other words, we should be able to keep the overhead of this to an absolute minimum, and keep spin_unlock to be always the same cost it is today.

--
All rights reversed

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
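A sketch of the unlock halves of that proposal (my code, not a tested implementation): with tickets moving in steps of 2 and bit 0 of the head word reserved as the lock, bare metal never touches bit 0, while the paravirt release folds "drop the lock bit" and "advance the ticket" into one increment.

#include <stdatomic.h>

/* head word layout: bit 0 = lock bit (paravirt only), bits 1.. = ticket */

static inline void native_ticket_unlock(_Atomic unsigned *head)
{
	/* bit 0 is never set on bare metal: just advance the ticket */
	atomic_fetch_add_explicit(head, 2u, memory_order_release);
}

static inline void pv_ticket_unlock(_Atomic unsigned *head)
{
	/*
	 * Bit 0 is set while held, so +1 clears it and advances the
	 * ticket by 2 in a single atomic op: unlock stays one add.
	 */
	atomic_fetch_add_explicit(head, 1u, memory_order_release);
}

Either way, spin_unlock remains a single atomic add, which is the point of the proposal.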
Re: [PATCH 2/4] Expand the steal time msr to also contain the consigned time.
On 02/05/2013 04:49 PM, Michael Wolf wrote:

Expand the steal time msr to also contain the consigned time.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/paravirt.h       |  4 ++--
 arch/x86/include/asm/paravirt_types.h |  2 +-
 arch/x86/kernel/kvm.c                 |  7 ++-
 kernel/sched/core.c                   | 10 +-
 kernel/sched/cputime.c                |  2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5edd174..9b753ea 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;

-static inline u64 paravirt_steal_clock(int cpu)
+static inline void paravirt_steal_clock(int cpu, u64 *steal)
 {
-	return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+	PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }

This may be a stupid question, but what happens if a KVM guest with this change, runs on a kernel that still has the old steal time interface? What happens if the host has the new steal time interface, but the guest uses the old interface?

Will both cases continue to work as expected with your patch series? If so, could you document (in the source code) why things continue to work?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] Add the code to send the consigned time from the host to the guest
On 02/05/2013 04:49 PM, Michael Wolf wrote: Change the paravirt calls that retrieve the steal-time information from the host. Add to it getting the consigned value as well as the steal time. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 06fdbd9..55d617f 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; + __u64 consigned; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 pad[10]; }; The function kvm_register_steal_time passes the address of such a structure to the host kernel, which then does something with it. Could running a guest with the above patch, on top of a host with the old code, result in the values for version and flags being written into consigned? Could that result in confusing the guest kernel to no end, and generally breaking things? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm hangs at start up under 3.8.0-rc3-00074-gb719f43 (works with CONFIG_LOCKDEP)
On 01/14/2013 01:24 PM, Andrew Clayton wrote:
On Mon, 14 Jan 2013 15:27:36 +0200, Gleb Natapov wrote:
On Sun, Jan 13, 2013 at 10:29:58PM +0000, Andrew Clayton wrote:

When running qemu-kvm under 64bit Fedora 16 under current 3.8, it just hangs at start up. Doing a ps -ef hangs the process at the point where it would display the qemu process (trying to list the qemu-kvm /proc pid directory contents just hangs ls). I also noticed some other weirdness at this point like Firefox hanging for many seconds at a time and increasing load average.

The qemu command I was trying to run was

$ qemu-kvm -m 512 -smp 2 -vga vmware -k en-gb -drive file=/home/andrew/machines/qemu/f16-i386.img,if=virtio

Here's the last few lines of a strace on it at start up.

open("/home/andrew/machines/qemu/f16-i386.img", O_RDWR|O_DSYNC|O_CLOEXEC) = 8
lseek(8, 0, SEEK_END)          = 9100722176
pread(8, "QFI\373\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\0\0\2\200\0\0\0"..., 512, 0) = 512
pread(8, "\200\0\0\0\0\4\0\0\200\0\0\0\0\10\0\0\200\0\0\0\2\210\0\0\200\0\0\0\2\233\0\0"..., 512, 65536) = 512
brk(0)                         = 0x7faf12db
brk(0x7faf12ddd000

It's hanging in that brk syscall. The load average also then starts to increase.

However, I can make it run fine if I enable CONFIG_LOCKDEP. But the only thing in dmesg I get is the usual

kvm: SMP vm created on host with unstable TSC; guest TSC will not be reliable

I've attached both working and non-working .configs. The only difference being the lock checking enabled in config.good. The most recent kernel I had it working in was 3.7.0. System is a Quad Core Intel running 64bit Fedora 16.

Can you run echo t > /proc/sysrq-trigger and see where it hangs?

Here you go, here's the bash process, qemu and a kvm bit. (From the above command)

bash            S 88013b2b0d00     0  3203   3133 0x
 880114dabe58 0082 800113558065 880114dabfd8
 880114dabfd8 4000 88013b0c5b00 88013b2b0d00
 880114dabd88 8109067d ea0004536670 ea0004536640
Call Trace:
 [8109067d] ? default_wake_function+0xd/0x10
 [8108a315] ? atomic_notifier_call_chain+0x15/0x20
 [8133d84f] ? tty_get_pgrp+0x3f/0x50
 [810819ac] ? pid_vnr+0x2c/0x30
 [8133fe54] ? tty_ioctl+0x7b4/0xbd0
 [8106bf62] ? wait_consider_task+0x102/0xaf0
 [815c00e4] schedule+0x24/0x70
 [8106cb24] do_wait+0x1d4/0x200
 [8106d9cb] sys_wait4+0x9b/0xf0
 [8106b9f0] ? task_stopped_code+0x50/0x50
 [815c1ad2] system_call_fastpath+0x16/0x1b

qemu-kvm        D 88011ab8c8b8     0  3345   3203 0x
 880112129cd8 0082 880112129c50 880112129fd8
 880112129fd8 4000 88013b04ce00 880139da1a00
 000280da 880112129d38 810d3300
Call Trace:
 [810d3300] ? __alloc_pages_nodemask+0xf0/0x7c0
 [811273c6] ? touch_atime+0x66/0x170
 [810cdabf] ? generic_file_aio_read+0x5bf/0x730
 [815c00e4] schedule+0x24/0x70
 [815c0cdd] rwsem_down_failed_common+0xbd/0x150
 [815c0da3] rwsem_down_write_failed+0x13/0x15
 [812d1be3] call_rwsem_down_write_failed+0x13/0x20
 [815bf4dd] ? down_write+0x2d/0x34
 [810f0724] vma_adjust+0xe4/0x610
 [810f0fa4] vma_merge+0x1b4/0x270
 [810f1fa6] do_brk+0x196/0x330
 [810f2217] sys_brk+0xd7/0x130
 [815c1ad2] system_call_fastpath+0x16/0x1b

This looks like qemu-kvm getting stuck trying to get the anon_vma lock. That leads to the obvious question: what is holding the lock, and/or failed to release it?

Do you have any other (qemu-kvm?) processes on your system that have any code in the VM (or strace/ptrace/...) in the backtrace, that might be holding this lock?

Do you have anything in your dmesg showing threads that had a BUG_ON (and exited) while holding the lock?
-- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.
On 10/16/2012 10:23 PM, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. How do s390 and Power systems deal with reporting that kind of information? IMHO it would be good to see what those do, so we do not end up re-inventing the wheel, and confusing admins with yet another way of reporting the information... -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mm: compaction: Correct the nr_strict_isolated check for CMA
On 10/16/2012 04:39 AM, Mel Gorman wrote: Thierry reported that the iron out patch for isolate_freepages_block() had problems due to the strict check being too strict with mm: compaction: Iron out isolate_freepages_block() and isolate_freepages_range() -fix1. It's possible that more pages than necessary are isolated but the check still fails and I missed that this fix was not picked up before RC1. This same problem has been identified in 3.7-RC1 by Tony Prisk and should be addressed by the following patch. Signed-off-by: Mel Gorman mgor...@suse.de Tested-by: Tony Prisk li...@prisktech.co.nz Acked-by: Rik van Riel r...@redhat.com -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On 09/21/2012 08:00 AM, Raghavendra K T wrote:

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

When the total number of VCPUs of the system is less than or equal to physical CPUs, PLE exits become costly since each VCPU can have a dedicated PCPU, and trying to find a target VCPU to yield_to just burns time in the PLE handler. This patch reduces overhead, by simply doing a return in such scenarios by checking the length of the current cpu runqueue.

I am not convinced this is the way to go.

The VCPU that is holding the lock, and is not releasing it, probably got scheduled out. That implies that VCPU is on a runqueue with at least one other task.

--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1629,6 +1629,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int pass;
 	int i;

+	if (unlikely(rq_nr_running() == 1))
+		return;
+
 	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
 	 * We boost the priority of a VCPU that is runnable but not

--
All rights reversed

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On 09/21/2012 08:00 AM, Raghavendra K T wrote: From: Raghavendra K T raghavendra...@linux.vnet.ibm.com When PLE handler fails to find a better candidate to yield_to, it goes back and does spin again. This is acceptable when we do not have overcommit. But in overcommitted scenarios (especially when we have large number of small guests), it is better to yield. Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com Acked-by: Rik van Riel r...@redhat.com -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/9] Reduce compaction scanning and lock contention
On 09/21/2012 06:46 AM, Mel Gorman wrote:

Hi Andrew,

Richard Davies and Shaohua Li have both reported lock contention problems in compaction on the zone and LRU locks as well as significant amounts of time being spent in compaction. This series aims to reduce lock contention and scanning rates to reduce that CPU usage. Richard reported at https://lkml.org/lkml/2012/9/21/91 that this series made a big difference to a problem he reported in August (http://marc.info/?l=kvm&m=134511507015614&w=2). One way or the other, this series has a large impact on the amount of scanning compaction does when there is a storm of THP allocations.

Andrew, Mel and I have discussed the stuff in this series quite a bit, and I am convinced this is the way forward with compaction.

--
All rights reversed

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On 09/21/2012 09:46 AM, Takuya Yoshikawa wrote:
On Fri, 21 Sep 2012 17:30:20 +0530, Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

When the PLE handler fails to find a better candidate to yield_to, it goes back and does spin again. This is acceptable when we do not have overcommit. But in overcommitted scenarios (especially when we have large number of small guests), it is better to yield.

Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 virt/kvm/kvm_main.c | 4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8323685..713b677 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 			}
 		}
 	}
+	/* In overcommitted cases, yield instead of spinning */
+	if (!yielded && rq_nr_running() > 1)
+		schedule();

How about doing cond_resched() instead?

Actually, an actual call to yield() may be better.

That will set scheduler hints to make the scheduler pick another task for one round, while preserving this task's top position in the runqueue.

--
All rights reversed

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
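Rik's alternative, spelled out as a sketch (scheduler semantics as of this era): a bare schedule() from a still-runnable task may simply pick the same task again, whereas yield() sets the skip-buddy hint so the task sits out one round while staying runnable.

	/* In overcommitted cases, give the pcpu to someone else */
	if (!yielded && rq_nr_running() > 1)
		yield();	/* skip one scheduling round, remain runnable */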
Re: [PATCH 5/6] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
On 09/20/2012 10:04 AM, Mel Gorman wrote:

When compaction was implemented it was known that scanning could potentially be excessive. The ideal was that a counter be maintained for each pageblock but maintaining this information would incur a severe penalty due to a shared writable cache line. It has reached the point where the scanning costs are a serious problem, particularly on long-lived systems where a large process starts and allocates a large number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the pageblock flags called PG_migrate_skip. If a pageblock is scanned by either migrate or free scanner and 0 pages were isolated, the pageblock is marked to be skipped in the future. When scanning, this bit is checked before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is when to ignore the cached information? If it's ignored too often, the scanning rates will still be excessive. If the information is too stale then allocations will fail that might have otherwise succeeded. In this patch

Big hammer, but I guess it is effective...

Acked-by: Rik van Riel r...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
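Mechanically, the cache is one extra bit per pageblock alongside the existing pageblock flags, consulted before a scan and set when a scan isolates nothing. A sketch of the helpers, close to but not verbatim the merged mm/compaction.c:

/* skip this pageblock if a previous scan isolated nothing from it */
static inline bool isolation_suitable(struct compact_control *cc,
				      struct page *page)
{
	if (cc->ignore_skip_hint)
		return true;

	return !get_pageblock_skip(page);
}

/* remember that scanning this pageblock was a waste of time */
static void update_pageblock_skip(struct page *page,
				  unsigned long nr_isolated)
{
	if (!nr_isolated)
		set_pageblock_skip(page);
}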
Re: [PATCH 4/6] Revert "mm: have order 0 compaction start off where it left"
On 09/20/2012 10:04 AM, Mel Gorman wrote: This reverts commit 7db8889a ("mm: have order 0 compaction start off where it left") and commit de74f1cc ("mm: have order 0 compaction start near a pageblock with free pages"). These patches were a good idea and tests confirmed that they massively reduced the amount of scanning, but the implementation is complex and tricky to understand. A later patch will cache what pageblocks should be skipped and reimplement the concept of compact_cached_free_pfn on top for both migration and free scanners. Signed-off-by: Mel Gorman mgor...@suse.de Sure, it makes your next patches easier... Acked-by: Rik van Riel r...@redhat.com -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/6] mm: compaction: Restart compaction from near where it left off
On 09/20/2012 10:04 AM, Mel Gorman wrote: This is almost entirely based on Rik's previous patches and discussions with him about how this might be implemented. Order 0 compaction stops when enough free pages of the correct page order have been coalesced. When doing subsequent higher order allocations, it is possible for compaction to be invoked many times. However, the compaction code always starts out looking for things to compact at the start of the zone, and for free pages to compact things to at the end of the zone. This can cause quadratic behaviour, with isolate_freepages starting at the end of the zone each time, even though previous invocations of the compaction code already filled up all free memory on that end of the zone. This can cause isolate_freepages to take enormous amounts of CPU with certain workloads on larger memory systems. This patch caches where the migration and free scanner should start from on subsequent compaction invocations using the pageblock-skip information. When compaction starts it begins from the cached restart points and will update the cached restart points until a page is isolated or a pageblock is skipped that would have been scanned by synchronous compaction. Signed-off-by: Mel Gorman mgor...@suse.de Together with patch 5/6, this has the effect of skipping compaction in a zone if the free and isolate markers have met, and it has been less than 5 seconds since the skip information was reset. Compaction on zones where we cycle through more slowly can continue, even when this particular zone is experiencing problems, so I guess this is desired behaviour... Acked-by: Rik van Riel r...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
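The restart bookkeeping described above can be pictured like this (sketch; field names modelled on the zone and compact_control members discussed in this series, not quoted from the patch):

    /* On compaction start: resume where the scanners last stopped. */
    cc->migrate_pfn = zone->compact_cached_migrate_pfn;
    cc->free_pfn = zone->compact_cached_free_pfn;

    /* While scanning: record progress for the next invocation, so a
     * follow-up call does not rescan the same already-compacted area. */
    zone->compact_cached_migrate_pfn = cc->migrate_pfn;
    zone->compact_cached_free_pfn = cc->free_pfn;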
Re: [PATCH 3/6] mm: compaction: Acquire the zone->lock as late as possible
On 09/20/2012 10:04 AM, Mel Gorman wrote: Compaction's free scanner acquires the zone->lock when checking for PageBuddy pages and isolating them. It does this even if there are no PageBuddy pages in the range. This patch defers acquiring the zone lock for as long as possible. In the event there are no free pages in the pageblock then the lock will not be acquired at all, which reduces contention on zone->lock. Signed-off-by: Mel Gorman mgor...@suse.de Acked-by: Rik van Riel r...@redhat.com -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/6] mm: compaction: Acquire the zone->lru_lock as late as possible
On 09/20/2012 10:04 AM, Mel Gorman wrote: Compaction's migrate scanner acquires the zone->lru_lock when scanning a range of pages looking for LRU pages to acquire. It does this even if there are no LRU pages in the range. If multiple processes are compacting then this can cause severe locking contention. To make matters worse, commit b2eef8c0 (mm: compaction: minimise the time IRQs are disabled while isolating pages for migration) releases the lru_lock every SWAP_CLUSTER_MAX pages that are scanned. This patch makes two changes to how the migrate scanner acquires the LRU lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if the lock is contended. This reduces the number of times it unnecessarily disables and re-enables IRQs. The second is that it defers acquiring the LRU lock for as long as possible. If there are no LRU pages, or the only LRU pages are transhuge, then the LRU lock will not be acquired at all, which reduces contention on zone->lru_lock. Signed-off-by: Mel Gorman mgor...@suse.de Acked-by: Rik van Riel r...@redhat.com -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
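The deferred acquisition reads roughly like this in outline (sketch with invented locals, not the literal hunk):

    /* Cheap test first: skip non-LRU pages without touching the lock. */
    if (!PageLRU(page))
            continue;

    /* Only take zone->lru_lock once a candidate LRU page is found. */
    if (!locked) {
            spin_lock_irqsave(&zone->lru_lock, flags);
            locked = true;
    }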
Re: [PATCH 1/6] mm: compaction: Abort compaction loop if lock is contended or run too long
On 09/20/2012 10:04 AM, Mel Gorman wrote: From: Shaohua Li s...@fusionio.com

Changelog since V2
o Fix BUG_ON triggered due to pages left on cc.migratepages
o Make compact_zone_order() require non-NULL arg `contended'

Changelog since V1
o only abort the compaction if lock is contended or run too long
o Rearranged the code by Andrea Arcangeli

isolate_migratepages_range() might isolate no pages, for example when zone->lru_lock is contended and we are running asynchronous compaction. In this case, we should abort compaction; otherwise, compact_zone will run a useless loop and make zone->lru_lock even more contended.

[minc...@kernel.org: Putback pages isolated for migration if aborting]
[a...@linux-foundation.org: compact_zone_order requires non-NULL arg contended]

Signed-off-by: Andrea Arcangeli aarca...@redhat.com
Signed-off-by: Shaohua Li s...@fusionio.com
Signed-off-by: Mel Gorman mgor...@suse.de

Acked-by: Rik van Riel r...@redhat.com -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
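The abort itself boils down to something like this (hedged sketch; cc->sync and cc->contended follow the compact_control naming of that period, and the exact placement in the patch differs):

    /* Async compaction backs off rather than spinning on a hot lock. */
    if (!cc->sync && spin_is_contended(&zone->lru_lock)) {
            cc->contended = true;   /* reported back to the allocator */
            return false;           /* abort this isolation pass */
    }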
Re: [PATCH -v2 2/2] make the compaction skip ahead logic robust
On 09/15/2012 11:55 AM, Richard Davies wrote: Hi Rik, Mel and Shaohua, Thank you for your latest patches. I attach my latest perf report for a slow boot with all of these applied. Mel asked for timings of the slow boots. It's very hard to give anything useful here! A normal boot would be a minute or so, and many are like that, but the slowest that I have seen (on 3.5.x) was several hours. Basically, I just test many times until I get one which is noticeably slower than normal and then run perf record on that one. The latest perf report for a slow boot is below. For the fast boots, most of the time is in clear_page_c in do_huge_pmd_anonymous_page, but for this slow one there is a lot of lock contention above that.

How often do you run into slow boots, vs. fast ones?

# Overhead  Command       Shared Object      Symbol
# ........  .......       .............      ......
#
    58.49%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
            |
            --- _raw_spin_lock_irqsave
               |
               |--95.07%-- compact_checklock_irqsave
               |          |
               |          |--70.03%-- isolate_migratepages_range
               |          |          compact_zone
               |          |          compact_zone_order
               |          |          try_to_compact_pages
               |          |          __alloc_pages_direct_compact
               |          |          __alloc_pages_nodemask

Looks like it moved from isolate_freepages_block in your last trace, to isolate_migratepages_range? Mel, I wonder if we have any quadratic complexity problems in this part of the code, too? The isolate_freepages_block CPU use can be fixed by simply restarting where the last invocation left off, instead of always starting at the end of the zone. Could we need something similar for isolate_migratepages_range? After all, Richard has a 128GB system, and runs 108GB worth of KVM guests on it... -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] Revert "mm: have order 0 compaction start near a pageblock with free pages"
On Wed, 12 Sep 2012 17:46:15 +0100 Richard Davies rich...@arachsys.com wrote: Mel Gorman wrote: I see that this is an old-ish bug but I did not read the full history. Is it now booting faster than 3.5.0 was? I'm asking because I'm interested to see if commit c67fe375 helped your particular case. Yes, I think 3.6.0-rc5 is already better than 3.5.x but can still be improved, as discussed.

Re-reading Mel's commit de74f1cc3b1e9730d9b58580cd11361d30cd182d, I believe it re-introduces the quadratic behaviour that the code was suffering from before, by not moving zone->compact_cached_free_pfn down when no more free pfns are found in a page block. This mail reverts that changeset, the next introduces what I hope to be the proper fix. Richard, would you be willing to give these patches a try, since your system seems to reproduce this bug easily?

---8<---

Revert "mm: have order 0 compaction start near a pageblock with free pages"

This reverts commit de74f1cc3b1e9730d9b58580cd11361d30cd182d.

Mel found a real issue with my skip ahead logic in the compaction code, but unfortunately his approach appears to have re-introduced quadratic behaviour in that the value of zone->compact_cached_free_pfn is never advanced until the compaction run wraps around the start of the zone. This merely moved the starting point for the quadratic behaviour further into the zone, but the behaviour has still been observed. It looks like another fix is required.

Signed-off-by: Rik van Riel r...@redhat.com
Reported-by: Richard Davies rich...@daviesmail.org

diff --git a/mm/compaction.c b/mm/compaction.c
index 7fcd3a5..771775d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -431,20 +431,6 @@ static bool suitable_migration_target(struct page *page)
 }
 
 /*
- * Returns the start pfn of the last page block in a zone. This is the starting
- * point for full compaction of a zone. Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
-	unsigned long free_pfn;
-	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	free_pfn &= ~(pageblock_nr_pages-1);
-	return free_pfn;
-}
-
-/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -483,6 +469,17 @@ static void isolate_freepages(struct zone *zone,
 					pfn -= pageblock_nr_pages) {
 		unsigned long isolated;
 
+		/*
+		 * Skip ahead if another thread is compacting in the area
+		 * simultaneously. If we wrapped around, we can only skip
+		 * ahead if zone->compact_cached_free_pfn also wrapped to
+		 * above our starting point.
+		 */
+		if (cc->order > 0 && (!cc->wrapped ||
+				      zone->compact_cached_free_pfn >
+				      cc->start_free_pfn))
+			pfn = min(pfn, zone->compact_cached_free_pfn);
+
 		if (!pfn_valid(pfn))
 			continue;
 
@@ -533,15 +530,7 @@ static void isolate_freepages(struct zone *zone,
 		 */
 		if (isolated) {
 			high_pfn = max(high_pfn, pfn);
-
-			/*
-			 * If the free scanner has wrapped, update
-			 * compact_cached_free_pfn to point to the highest
-			 * pageblock with free pages. This reduces excessive
-			 * scanning of full pageblocks near the end of the
-			 * zone
-			 */
-			if (cc->order > 0 && cc->wrapped)
+			if (cc->order > 0)
 				zone->compact_cached_free_pfn = high_pfn;
 		}
 	}
@@ -551,11 +540,6 @@ static void isolate_freepages(struct zone *zone,
 	cc->free_pfn = high_pfn;
 	cc->nr_freepages = nr_freepages;
-
-	/* If compact_cached_free_pfn is reset then set it now */
-	if (cc->order > 0 && !cc->wrapped &&
-	    zone->compact_cached_free_pfn == start_free_pfn(zone))
-		zone->compact_cached_free_pfn = high_pfn;
 }
 
 /*
@@ -642,6 +626,20 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	return ISOLATE_SUCCESS;
 }
 
+/*
+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+	unsigned long free_pfn;
+	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	free_pfn &= ~(pageblock_nr_pages-1);
+	return free_pfn;
+}
[PATCH 2/2] make the compaction skip ahead logic robust
Make the skip ahead logic in compaction resistant to compaction wrapping around to the end of the zone. This can lead to less efficient compaction when one thread has wrapped around to the end of the zone, and another simultaneous compactor has not done so yet. However, it should ensure that we do not suffer quadratic behaviour any more.

Signed-off-by: Rik van Riel r...@redhat.com
Reported-by: Richard Davies rich...@daviesmail.org

diff --git a/mm/compaction.c b/mm/compaction.c
index 771775d..0656759 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -431,6 +431,24 @@ static bool suitable_migration_target(struct page *page)
 }
 
 /*
+ * We scan the zone in a circular fashion, starting at
+ * zone->compact_cached_free_pfn. Be careful not to skip if
+ * one compacting thread has just wrapped back to the end of the
+ * zone, but another thread has not.
+ */
+static bool compaction_may_skip(struct zone *zone,
+				struct compact_control *cc)
+{
+	if (!cc->wrapped && zone->compact_free_pfn < cc->start_pfn)
+		return true;
+
+	if (cc->wrapped && zone_compact_free_pfn > cc->start_pfn)
+		return true;
+
+	return false;
+}
+
+/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -471,13 +489,9 @@ static void isolate_freepages(struct zone *zone,
 
 		/*
 		 * Skip ahead if another thread is compacting in the area
-		 * simultaneously. If we wrapped around, we can only skip
-		 * ahead if zone->compact_cached_free_pfn also wrapped to
-		 * above our starting point.
+		 * simultaneously, and has finished with this page block.
 		 */
-		if (cc->order > 0 && (!cc->wrapped ||
-				      zone->compact_cached_free_pfn >
-				      cc->start_free_pfn))
+		if (cc->order > 0 && compaction_may_skip(zone, cc))
 			pfn = min(pfn, zone->compact_cached_free_pfn);
 
 		if (!pfn_valid(pfn))

-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v2 2/2] make the compaction skip ahead logic robust
Argh. And of course I send out the version from _before_ the compile test, instead of the one after! I am not used to caffeine any more and have had way too much tea...

---8<---

Make the skip ahead logic in compaction resistant to compaction wrapping around to the end of the zone. This can lead to less efficient compaction when one thread has wrapped around to the end of the zone, and another simultaneous compactor has not done so yet. However, it should ensure that we do not suffer quadratic behaviour any more.

Signed-off-by: Rik van Riel r...@redhat.com
Reported-by: Richard Davies rich...@daviesmail.org

diff --git a/mm/compaction.c b/mm/compaction.c
index 771775d..0656759 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -431,6 +431,24 @@ static bool suitable_migration_target(struct page *page)
 }
 
 /*
+ * We scan the zone in a circular fashion, starting at
+ * zone->compact_cached_free_pfn. Be careful not to skip if
+ * one compacting thread has just wrapped back to the end of the
+ * zone, but another thread has not.
+ */
+static bool compaction_may_skip(struct zone *zone,
+				struct compact_control *cc)
+{
+	if (!cc->wrapped && zone->compact_cached_free_pfn < cc->start_free_pfn)
+		return true;
+
+	if (cc->wrapped && zone->compact_cached_free_pfn > cc->start_free_pfn)
+		return true;
+
+	return false;
+}
+
+/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -471,13 +489,9 @@ static void isolate_freepages(struct zone *zone,
 
 		/*
 		 * Skip ahead if another thread is compacting in the area
-		 * simultaneously. If we wrapped around, we can only skip
-		 * ahead if zone->compact_cached_free_pfn also wrapped to
-		 * above our starting point.
+		 * simultaneously, and has finished with this page block.
 		 */
-		if (cc->order > 0 && (!cc->wrapped ||
-				      zone->compact_cached_free_pfn >
-				      cc->start_free_pfn))
+		if (cc->order > 0 && compaction_may_skip(zone, cc))
 			pfn = min(pfn, zone->compact_cached_free_pfn);
 
 		if (!pfn_valid(pfn))

-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Add a page cache-backed balloon device driver.
On 09/10/2012 01:37 PM, Mike Waychison wrote: On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin m...@redhat.com wrote: Also can you pls answer Avi's question? How is overcommit managed? Overcommit in our deployments is managed using memory cgroups on the host. This allows us to have very directed policies as to how competing VMs on a host may overcommit. How do your memory cgroups lead to guests inflating their balloons? -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
On 09/10/2012 04:19 PM, Peter Zijlstra wrote: On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:

+	/*
+	 * if the target task is not running, then only yield if the
+	 * current task is in guest mode
+	 */
+	if (!(p_rq->curr->flags & PF_VCPU))
+		goto out_irq;

This would make yield_to() only ever work on KVM, not that I mind this too much, it's a horrid thing and making it less useful for (ab)use is a good thing, still this probably wants mention somewhere :-)

Also, it would not preempt a non-kvm task, even if we need to do that to boost a VCPU. I think the lines above should be dropped. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 1/1] kvm: Use vcpu_id as pivot instead of last boosted vcpu in PLE handler
On 09/02/2012 06:12 AM, Gleb Natapov wrote: On Thu, Aug 30, 2012 at 12:51:01AM +0530, Raghavendra K T wrote: The idea of starting from the next vcpu (source of yield_to + 1) seems to work well for overcommitted guests, rather than using the last boosted vcpu. We can also remove the per-VM variable with this approach. Iteration for an eligible candidate after this patch starts from vcpu source+1 and ends at source-1 (after wrapping). Thanks Nikunj for his quick verification of the patch. Please let me know if this patch is interesting and makes sense. This last_boosted_vcpu thing caused us trouble during an attempt to implement vcpu destruction. It is good to see it removed from this POV. I like this implementation. It should achieve pretty much the same as my old code, but without the downsides and without having to keep the same amount of global state. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Windows slow boot: contractor wanted
On 08/25/2012 01:45 PM, Richard Davies wrote: Are you talking about these patches? http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=c67fe3752abe6ab47639e2f9b836900c3dc3da84 http://marc.info/?l=linux-mm&m=134521289221259 If so, I believe those are in 3.6.0-rc3, so I tested with that. Unfortunately, I can still get the slow boots and perf top showing _raw_spin_lock_irqsave. Here are two perf top traces on 3.6.0-rc3. They do look a bit different from 3.5.2, but _raw_spin_lock_irqsave is still at the top:

   PerfTop: 35272 irqs/sec  kernel:98.1%  exact: 0.0% [4000Hz cycles], (all, 16 CPUs)
--------------------------------------------------------------------------------

    61.85%  [kernel]  [k] _raw_spin_lock_irqsave
     7.18%  [kernel]  [k] sub_preempt_count
     5.03%  [kernel]  [k] isolate_freepages_block
     2.49%  [kernel]  [k] yield_to
     2.05%  [kernel]  [k] memcmp
     2.01%  [kernel]  [k] compact_zone
     1.76%  [kernel]  [k] add_preempt_count
     1.52%  [kernel]  [k] _raw_spin_lock
     1.31%  [kernel]  [k] kvm_vcpu_on_spin
     0.92%  [kernel]  [k] svm_vcpu_run

However, the compaction code is not as prominent as before.

Can you get a backtrace to that _raw_spin_lock_irqsave, to see from where it is running into lock contention? It would be good to know whether it is isolate_freepages_block, yield_to, kvm_vcpu_on_spin or something else... -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
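For reference, one standard way to capture those call chains (shown as a suggestion, not quoted from the thread):

    # system-wide sampling with call graphs during a slow boot window
    perf record -a -g -- sleep 30
    # then inspect the callers of the hot lock
    perf report

The -g option records backtraces, so perf report can attribute _raw_spin_lock_irqsave samples to their callers.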
Re: Windows slow boot: contractor wanted
On 08/22/2012 10:41 AM, Richard Davies wrote: Avi Kivity wrote: Richard Davies wrote: I can trigger the slow boots without KSM and they have the same profile, with _raw_spin_lock_irqsave and isolate_freepages_block at the top. I reduced to 3x 20GB 8-core VMs on a 128GB host (rather than 3x 40GB 8-core VMs), and haven't managed to get a really slow boot yet (> 5 minutes). I'll post again when I get one. I think you can go higher than that. But 120GB on a 128GB host is pushing it. I've now triggered a very slow boot at 3x 36GB 8-core VMs on a 128GB host (i.e. 108GB on a 128GB host). It has the same profile with _raw_spin_lock_irqsave and isolate_freepages_block at the top. That's the page compaction code. Mel Gorman and I have been working to fix that, the latest fixes and improvements are in the -mm kernel already. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
On 07/22/2012 08:34 AM, Raghavendra K T wrote: On 07/20/2012 11:06 PM, Marcelo Tosatti wrote: On Wed, Jul 18, 2012 at 07:07:17PM +0530, Raghavendra K T wrote: Currently the Pause Loop Exit (PLE) handler is doing directed yield to a random vcpu on pl-exit. We already have filtering while choosing the candidate to yield_to. This change adds more checks while choosing a candidate to yield_to. On large vcpu guests, there is a high probability of yielding to the same vcpu who had recently done a pause-loop exit. Such a yield can lead to the vcpu spinning again. The patchset keeps track of the pause loop exit and gives a chance to a vcpu which has:

(a) Not done pause loop exit at all (probably he is the preempted lock-holder)
(b) vcpu skipped in last iteration because it did pause loop exit, and probably has become eligible now (next eligible lock holder)

This concept also helps in cpu relax interception cases which use the same handler.

Changes since V4:
- Naming change (Avi): struct ple ==> struct spin_loop, cpu_relax_intercepted ==> in_spin_loop, vcpu_check_and_update_eligible ==> vcpu_eligible_for_directed_yield
- mark vcpu in spinloop as not eligible to avoid influence of previous exit

Changes since V3:
- arch specific fix/changes (Christian)

Changes since V2:
- Move ple structure to common code (Avi)
- rename pause_loop_exited to cpu_relax_intercepted (Avi)
- add config HAVE_KVM_CPU_RELAX_INTERCEPT (Avi)
- Drop superfluous curly braces (Ingo)

Changes since V1:
- Add more documentation for structure and algorithm and rename plo ==> ple (Rik)
- change dy_eligible initial value to false (otherwise the very first directed yield will not be skipped) (Nikunj)
- fixup signoff/from issue

Future enhancements:
(1) Currently we have a boolean to decide on eligibility of vcpu. It would be nice if I get feedback on guest (32 vcpu) whether we can do better with an integer counter (with counter = say f(log n)).
(2) We have not considered system load during iteration of vcpu. With that information we can limit the scan and also decide whether schedule() is better. [I am able to use #kicked vcpus to decide on this, but maybe there are better ideas like information from global loadavg.]
(3) We can exploit this further with PV patches since it also knows about the next eligible lock-holder.

Summary: There is a very good improvement for kvm based guests on PLE machines. V5 has a huge improvement for kernbench.

kernbench (time in sec, lower is better)
     base_rik     stdev     patched      stdev      %improve
1x   49.2300      1.0171    22.6842      0.3073     117.0233 %
2x   91.9358      1.7768    53.9608      1.0154     70.37516 %

ebizzy (records/sec, higher is better)
     base_rik     stdev     patched      stdev      %improve
1x   1129.2500    28.6793   2125.6250    32.8239    88.23334 %
2x   1892.3750    75.1112   2377.1250    181.6822   25.61596 %

Note: The patches are tested on x86.

Links:
V4: https://lkml.org/lkml/2012/7/16/80
V3: https://lkml.org/lkml/2012/7/12/437
V2: https://lkml.org/lkml/2012/7/10/392
V1: https://lkml.org/lkml/2012/7/9/32

Raghavendra K T (3):
  config: Add config to support ple or cpu relax optimzation
  kvm : Note down when cpu relax intercepted or pause loop exited
  kvm : Choose a better candidate for directed yield

 arch/s390/kvm/Kconfig    |  1 +
 arch/x86/kvm/Kconfig     |  1 +
 include/linux/kvm_host.h | 39 +++
 virt/kvm/Kconfig         |  3 +++
 virt/kvm/kvm_main.c      | 41 +
 5 files changed, 85 insertions(+), 0 deletions(-)

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com

Thanks Marcelo for the review.
Avi, Rik, Christian, please let me know if this series looks good now. The series looks good to me. Reviewed-by: Rik van Riel r...@redhat.com -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V4 3/3] kvm: Choose better candidate for directed yield
On 07/16/2012 06:07 AM, Avi Kivity wrote:

+{
+	bool eligible;
+
+	eligible = !vcpu->ple.cpu_relax_intercepted ||
+		(vcpu->ple.cpu_relax_intercepted &&
+		 vcpu->ple.dy_eligible);
+
+	if (vcpu->ple.cpu_relax_intercepted)
+		vcpu->ple.dy_eligible = !vcpu->ple.dy_eligible;

Probably should assign 'true', since the previous value is essentially random.

I suspect the intended purpose of this conditional is to flip the eligibility of a vcpu for being selected as a direct yield target. In other words, that bit of the code is correct. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
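Spelled out, the flip works like this (a commentary sketch on the patch's logic, using its field names):

    /*
     * cpu_relax_intercepted  dy_eligible   eligible this round?
     * ---------------------  -----------   --------------------
     * false                  (any)         yes - never PLE-exited
     * true                   false         no  - skipped this once
     * true                   true          yes - was skipped last time
     *
     * dy_eligible is then toggled for any vcpu that did intercept
     * cpu_relax, so such a vcpu is skipped at most every other scan.
     */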
Re: [PATCH RFC V2 1/2] kvm vcpu: Note down pause loop exit
On 07/10/2012 03:31 PM, Raghavendra K T wrote: From: Raghavendra K T raghavendra...@linux.vnet.ibm.com Noting the pause loop exited vcpu helps in filtering the right candidate to yield to. Yielding to the same vcpu may result in more wastage of cpu. Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com Reviewed-by: Rik van Riel r...@redhat.com -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
On 07/09/2012 02:20 AM, Raghavendra K T wrote: Currently the Pause Loop Exit (PLE) handler is doing directed yield to a random VCPU on PL exit. Though we already have filtering while choosing the candidate to yield_to, we can do better. The problem is, for large vcpu guests, we have more probability of yielding to a bad vcpu. We are not able to prevent directed yield to the same guy who has done PL exit recently, who perhaps spins again and wastes CPU. Fix that by keeping track of who has done PL exit. So the algorithm in this series gives a chance to a VCPU which has: (a) Not done PLE exit at all (probably he is the preempted lock-holder) (b) VCPU skipped in last iteration because it did PL exit, and probably has become eligible now (next eligible lock holder) Future enhancements: ... Your patch series looks good to me. Simple changes with a significant result. However, the simple heuristic could use some comments :) -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 2/2] kvm PLE handler: Choose better candidate for directed yield
On 07/09/2012 02:20 AM, Raghavendra K T wrote:

+bool kvm_arch_vcpu_check_and_update_eligible(struct kvm_vcpu *vcpu)
+{
+	bool eligible;
+
+	eligible = !vcpu->arch.plo.pause_loop_exited ||
+		(vcpu->arch.plo.pause_loop_exited &&
+		 vcpu->arch.plo.dy_eligible);
+
+	if (vcpu->arch.plo.pause_loop_exited)
+		vcpu->arch.plo.dy_eligible = !vcpu->arch.plo.dy_eligible;
+
+	return eligible;
+}

This is a nice simple mechanism to skip CPUs that were eligible last time and had pause loop exits recently. However, it could stand some documentation. Please add a good comment explaining how and why the algorithm works, when arch.plo.pause_loop_exited is cleared, etc... It would be good to make this heuristic understandable to people who look at the code for the first time. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 1/2] kvm vcpu: Note down pause loop exit
On 07/09/2012 02:20 AM, Raghavendra K T wrote:

@@ -484,6 +484,13 @@ struct kvm_vcpu_arch {
 		u64 length;
 		u64 status;
 	} osvw;
+
+	/* Pause loop exit optimization */
+	struct {
+		bool pause_loop_exited;
+		bool dy_eligible;
+	} plo;

I know kvm_vcpu_arch is traditionally not a well documented structure, but it would be really nice if each variable inside this sub-structure could get some documentation. Also, do we really want to introduce another acronym here? Or would we be better off simply calling this struct .ple, since that is a name people are already familiar with. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/28/2012 06:55 PM, Vinod, Chegu wrote: Hello, I am just catching up on this email thread... Perhaps one of you may be able to help answer this query.. preferably along with some data. [BTW, I do understand the basic intent behind PLE in a typical [sweet spot] use case where there is over subscription etc. and the need to optimize the PLE handler in the host etc.] In a use case where the host has fewer but much larger guests (say 40 VCPUs and higher) and there is no over subscription (i.e. # of vcpus across guests <= physical cpus in the host, and perhaps each guest has their vcpus pinned to specific physical cpus for other reasons), I would like to understand if/how the PLE really helps? For these use cases would it be ok to turn PLE off (ple_gap=0) since there is no real need to take an exit and find some other VCPU to yield to?

Yes, that should be ok.

On a related note, I wonder if we should increase the ple_gap significantly. After all, 4096 cycles of spinning is not that much, when you consider how much time is spent doing the subsequent vmexit, scanning the other VCPU's status (200 cycles per cache miss), deciding what to do, maybe poking another CPU, and eventually a vmenter. A factor 4 increase in ple_gap might be what it takes to get the amount of time spent spinning equal to the amount of time spent on the host side doing KVM stuff... -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
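For anyone wanting to experiment along those lines on an Intel host, the knobs mentioned above are kvm_intel module parameters (shown as an example; adjust for your setup):

    # disable Pause Loop Exiting entirely
    modprobe kvm_intel ple_gap=0

    # or inspect the current values at runtime
    cat /sys/module/kvm_intel/parameters/ple_gap
    cat /sys/module/kvm_intel/parameters/ple_window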
Re: [PATCH] Add a page cache-backed balloon device driver.
On 06/26/2012 04:32 PM, Frank Swiderski wrote: This implementation of a virtio balloon driver uses the page cache to store pages that have been released to the host. The communication (outside of target counts) is one way--the guest notifies the host when it adds a page to the page cache, allowing the host to madvise(2) with MADV_DONTNEED. Reclaim in the guest is therefore automatic and implicit (via the regular page reclaim). This means that inflating the balloon is similar to the existing balloon mechanism, but the deflate is different--it re-uses existing Linux kernel functionality to automatically reclaim. Signed-off-by: Frank Swiderski f...@google.com It is a great idea, but how can this memory balancing possibly work if someone uses memory cgroups inside a guest? Having said that, we currently do not have proper memory reclaim balancing between cgroups at all, so requiring that of this balloon driver would be unreasonable. The code looks good to me, my only worry is the code duplication. We now have 5 balloon drivers, for 4 hypervisors, all implementing everything from scratch... -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Add a page cache-backed balloon device driver.
On 06/26/2012 05:31 PM, Frank Swiderski wrote: On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel r...@redhat.com wrote: The code looks good to me, my only worry is the code duplication. We now have 5 balloon drivers, for 4 hypervisors, all implementing everything from scratch... Do you have any recommendations on this? I could (I think reasonably so) modify the existing virtio_balloon.c and have it change behavior based on a feature bit or other configuration. I'm not sure that really addresses the root of what you're pointing out--it's still adding a different implementation, but doing so as an extension of an existing one.

Ideally, I believe we would have two balloon top parts in a guest (one classical balloon, one on the LRU), and four bottom parts (kvm, xen, vmware, s390). That way the virt specific bits of a balloon driver would be essentially a ->balloon_page and ->release_page callback for pages, as well as methods to communicate with the host. All the management of pages, including stuff like putting them on the LRU, or isolating them for migration, would be done with the same common code, regardless of what virt software we are running on.

Of course, that is a substantial amount of work and I feel it would be unreasonable to block anyone's code on that kind of thing (especially considering that your code is good), but I do believe the explosion of balloon code is a little worrying. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
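As a rough illustration of that split (every name here is invented for the sketch; no such interface exists in the tree):

    /*
     * Per-hypervisor "bottom part": the only code each hypervisor would
     * still have to supply under the proposed layering.
     */
    struct balloon_ops {
            /* hand a guest page to the host (inflate) */
            int (*balloon_page)(struct page *page);
            /* take a page back from the host (deflate) */
            int (*release_page)(struct page *page);
            /* notify the host of a changed target/count */
            void (*tell_host)(unsigned long nr_pages);
    };

The common "top part" would then own LRU placement, isolation for migration, and target management, calling into these ops only at the edges.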
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/20/2012 04:12 PM, Raghavendra K T wrote: On 06/20/2012 02:21 AM, Rik van Riel wrote: Please let me know how it goes. Yes, have got result today, too tired to summarize. got better performance result too. will come back again tomorrow morning. have to post, randomized start point patch also, which I discussed to know the opinion.

The other person's problem has also gone away with this patch. Avi, could I convince you to apply this obvious bugfix to kvm.git? :)

---8<---

If last_boosted_vcpu == 0, then we fall through all test cases and may end up with all VCPUs pouncing on vcpu 0. With a large enough guest, this can result in enormous runqueue lock contention, which can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.

Signed-off-by: Rik van Riel r...@redhat.com
---
 virt/kvm/kvm_main.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
 			} else if (pass && i > last_boosted_vcpu)

-- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm: handle last_boosted_vcpu = 0 case
On Wed, 20 Jun 2012 01:50:50 +0530 Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: In ple handler code, last_boosted_vcpu (lbv) variable is serving as reference point to start when we enter. Also statistical analysis (below) is showing lbv is not very well distributed with current approach.

You are the second person to spot this bug today (yes, today). Due to time zones, the first person has not had a chance yet to test the patch below, which might fix the issue... Please let me know how it goes.

---8<---

If last_boosted_vcpu == 0, then we fall through all test cases and may end up with all VCPUs pouncing on vcpu 0. With a large enough guest, this can result in enormous runqueue lock contention, which can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.

Signed-off-by: Rik van Riel r...@redhat.com
---
 virt/kvm/kvm_main.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
 			} else if (pass && i > last_boosted_vcpu)

-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
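To see why the one-character change matters, here is a tiny stand-alone user-space rendering of the two-pass walk (illustrative only; kvm_for_each_vcpu is modelled as a plain loop):

    #include <stdio.h>

    int main(void)
    {
            int n = 4, last_boosted_vcpu = 0, pass, i;

            for (pass = 0; pass < 2; pass++) {
                    for (i = 0; i < n; i++) {
                            /* pass 0: skip up to and including the last
                             * boosted vcpu; pass 1: wrap around to it. */
                            if (!pass && i <= last_boosted_vcpu) {
                                    i = last_boosted_vcpu;
                                    continue;
                            } else if (pass && i > last_boosted_vcpu) {
                                    break;
                            }
                            printf("pass %d: try to boost vcpu %d\n", pass, i);
                    }
            }
            return 0;
    }

With the fixed '<=' test this prints vcpus 1, 2, 3 on pass 0 and vcpu 0 on pass 1. With the old '<' test and last_boosted_vcpu == 0, pass 0 starts at vcpu 0, so every spinning VCPU piles onto vcpu 0's runqueue lock.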
Re: [PATCH] kvm: don't call mmu_shrinker w/o used_mmu_pages
On 04/23/2012 12:40 PM, Ying Han wrote: Avi, does this patch help the case as you mentioned above, where the kvm module is loaded but no virtual machines are present? Why do we have to walk the empty list while holding the spinlock?

It might help marginally, but rather than defending the patch, it might be more useful to try finding a solution to the problem Mike and Eric are triggering. How can we prevent the code from taking the lock and throwing out EPT/NPT pages when there is no real need to, and they will be faulted back in soon anyway? -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: don't call mmu_shrinker w/o used_mmu_pages
On 04/20/2012 06:11 PM, Andrew Morton wrote: On Fri, 13 Apr 2012 15:38:41 -0700 Ying Han ying...@google.com wrote: The mmu_shrink() is heavy by itself by iterating all kvms and holding the kvm_lock. spotted the code w/ Rik during LSF, and it turns out we don't need to call the shrinker if nothing to shrink.

@@ -3900,6 +3905,9 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	if (nr_to_scan == 0)
 		goto out;
 
+	if (!get_kvm_total_used_mmu_pages())
+		return 0;
+

Do we actually know that this patch helps anything? Any measurements? Is kvm_total_used_mmu_pages==0 at all common?

On re-reading mmu.c, it looks like even with EPT or NPT, we end up creating mmu pages for the nested page tables. I have not had the time to look into it more, but it would be nice to know if the patch has any effect at all. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
On 04/11/2012 01:21 PM, Chegu Vinod wrote: Hello, While running an AIM7 (workfile.high_systime) workload in a single 40-way (or a single 60-way) KVM guest, I noticed pretty bad performance when the guest was booted with the 3.3.1 kernel, compared to the same guest booted with the 2.6.32-220 (RHEL6.2) kernel. For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run, the older guest kernel was nearly 12x better! Turned on function tracing and found that there appears to be more time being spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220 guest.

Looks like you may be running into the ticket spinlock code. During the early RHEL 6 days, Gleb came up with a patch to automatically disable ticket spinlocks when running inside a KVM guest. IIRC that patch got rejected upstream at the time, with upstream developers preferring to wait for a better solution. If such a better solution is not on its way upstream now (two years later), maybe we should just merge Gleb's patch upstream for the time being? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/9] KVM-HV: KVM Steal time implementation
On 07/04/2011 11:32 AM, Glauber Costa wrote: To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM, while the vcpu had meaningful work to do - halt time does not count. This information is acquired through the run_delay field of the delayacct/schedstats infrastructure, which counts time spent in a runqueue but not running. Steal time is per-cpu information, so the traditional MSR-based infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the memory area address containing information about steal time. This patch contains the hypervisor part of the steal time infrastructure, and can be backported independently of the guest portion. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net Acked-by: Rik van Riel r...@redhat.com -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
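For context, the host-side number described above comes from the scheduler's per-task accounting; roughly (sketch, not the exact KVM hunk):

    /* run_delay: nanoseconds this task sat runnable on a runqueue
     * without actually running - exactly the "stolen" time a vcpu
     * thread experiences on a busy host. */
    u64 steal_ns = current->sched_info.run_delay;

The hypervisor snapshots this value for the vcpu task and exposes the accumulated amount to the guest through the steal time MSR area.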