Re: [PATCH] gpu: drm: use struct_size() in kmalloc()
Daniel, what you are talking about is totally wrong.
1) AFAIK, only one zero-size array can be at the end of a struct.
2) Two struct_size() calls will add up the struct itself twice; the sum is wrong then.

No offense. I can't help feeling lucky that you are in intel.

From: Daniel Vetter on behalf of Daniel Vetter
Sent: 2019-05-21 0:28
To: Pan, Xinhui
Cc: Deucher, Alexander; Koenig, Christian; Zhou, David(ChunMing); airl...@linux.ie; dan...@ffwll.ch; Quan, Evan; xiaolinkui; amd-...@lists.freedesktop.org; dri-de...@lists.freedesktop.org; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] gpu: drm: use struct_size() in kmalloc()

[CAUTION: External Email]

On Fri, May 17, 2019 at 04:44:30PM +, Pan, Xinhui wrote:
> I am going to put more members which are also array after this struct,
> not only obj[]. Looks like this struct_size did not help on multiple
> array case. Thanks anyway.

You can then add them up, e.g. kmalloc(struct_size()+struct_size(),
GFP_KERNEL), so this patch here still looks like a good idea.

Reviewed-by: Daniel Vetter

Cheers, Daniel

> From: xiaolinkui
> Sent: Friday, May 17, 2019 4:46:00 PM
> To: Deucher, Alexander; Koenig, Christian; Zhou, David(ChunMing);
> airl...@linux.ie; dan...@ffwll.ch; Pan, Xinhui; Quan, Evan
> Cc: amd-...@lists.freedesktop.org; dri-de...@lists.freedesktop.org;
> linux-kernel@vger.kernel.org; xiaolin...@kylinos.cn
> Subject: [PATCH] gpu: drm: use struct_size() in kmalloc()
>
> [CAUTION: External Email]
>
> Use struct_size() helper to keep code simple.
>
> Signed-off-by: xiaolinkui
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 22bd21e..4717a64 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1375,8 +1375,7 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>          if (con)
>                  return 0;
>
> -        con = kmalloc(sizeof(struct amdgpu_ras) +
> -                        sizeof(struct ras_manager) * AMDGPU_RAS_BLOCK_COUNT,
> +        con = kmalloc(struct_size(con, objs, AMDGPU_RAS_BLOCK_COUNT),
>                          GFP_KERNEL|__GFP_ZERO);
>          if (!con)
>                  return -ENOMEM;
> --
> 2.7.4
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
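The double-counting objection in point 2 above can be checked in userspace. The following is only a sketch: SKETCH_STRUCT_SIZE is a simplified stand-in for the kernel's struct_size() (it does no overflow saturation), and struct container, struct obj and struct extra are hypothetical types, not the amdgpu ones. Alignment of the second trailing array is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element types for the two trailing arrays. */
struct obj { int id; long data; };
struct extra { short tag; };

struct container {
    unsigned int flags;
    struct obj objs[];      /* the single flexible array member C allows */
};

/* Simplified struct_size(): header plus n trailing elements.
 * Unlike the kernel helper, this does not saturate on overflow. */
#define SKETCH_STRUCT_SIZE(type, member, n) \
    (sizeof(type) + (n) * sizeof(((type *)0)->member[0]))

/* One header, then n objs, then m extras appended by hand. */
size_t sketch_size_header_once(size_t n, size_t m)
{
    return SKETCH_STRUCT_SIZE(struct container, objs, n) +
           m * sizeof(struct extra);
}

/* What adding two struct_size()-style results would yield if the second
 * array were also sized "header + elements". */
size_t sketch_size_header_twice(size_t n, size_t m)
{
    return SKETCH_STRUCT_SIZE(struct container, objs, n) +
           (sizeof(struct container) + m * sizeof(struct extra));
}

/* Bytes counted in excess by the two-call sum. */
size_t sketch_double_count_excess(size_t n, size_t m)
{
    size_t once = sketch_size_header_once(n, m);
    size_t twice = sketch_size_header_twice(n, m);

    assert(twice >= once);
    return twice - once;
}
```

Under these assumptions the excess is exactly one sizeof(struct container), independent of n and m. That makes the two-call sum an over-allocation rather than an under-allocation.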
Re: [PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 2017/2/23 22:13, Waiman Long wrote:
> All the locking related cmpxchg's in the following functions are
> replaced with the _acquire variants:
>  - pv_queued_spin_steal_lock()
>  - trylock_clear_pending()
>
> This change should help performance on architectures that use LL/SC.
>
> On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> enabled, the performance of a locking microbenchmark with and without
> this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> were as follows:
>
>   # of thread     w/o patch      with patch     % Change
>   -----------     ---------      ----------     --------
>        4        4053.3 Mop/s   4223.7 Mop/s      +4.2%
>        8        3310.4 Mop/s   3406.0 Mop/s      +2.9%
>       12        2576.4 Mop/s   2674.6 Mop/s      +3.8%
>
> Signed-off-by: Waiman Long
> ---

Works on my side :)

Reviewed-by: Pan Xinhui

>  v4->v5:
>   - Correct some grammatical issues in comment.
>
>  v3->v4:
>   - Update the comment in pv_kick_node() to mention that the code
>     may not work in some archs.
>
>  v2->v3:
>   - Reduce scope by relaxing cmpxchg's in fast path only.
>
>  v1->v2:
>   - Add comments in changelog and code for the rationale of the change.
>
>  kernel/locking/qspinlock_paravirt.h | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index e6b2f7a..4614e39 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
>          struct __qspinlock *l = (void *)lock;
>
>          if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> -            (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> +            (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>                  qstat_inc(qstat_pv_lock_stealing, true);
>                  return true;
>          }
> @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
>
>  /*
>   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> - * just to be sure that it will get it.
> + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> + * lock just to be sure that it will get it.
>   */
>  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>  {
>          struct __qspinlock *l = (void *)lock;
>
>          return !READ_ONCE(l->locked) &&
> -               (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> -                        == _Q_PENDING_VAL);
> +               (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> +                                _Q_LOCKED_VAL) == _Q_PENDING_VAL);
>  }
>  #else /* _Q_PENDING_BITS == 8 */
>  static __always_inline void set_pending(struct qspinlock *lock)
> @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>                   */
>                  old = val;
>                  new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> -                val = atomic_cmpxchg(&lock->val, old, new);
> +                val = atomic_cmpxchg_acquire(&lock->val, old, new);
>
>                  if (val == old)
>                          return 1;
> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>           * observe its next->locked value and advance itself.
>           *
>           * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +         *
> +         * The write to next->locked in arch_mcs_spin_unlock_contended()
> +         * must be ordered before the read of pn->state in the cmpxchg()
> +         * below for the code to work correctly. However, this is not
> +         * guaranteed on all architectures when the cmpxchg() call fails.
> +         * Both x86 and PPC can provide that guarantee, but other
> +         * architectures not necessarily.
>           */
>          if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>                  return;
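For readers outside the kernel, the idea behind the _acquire trylock can be approximated with C11 atomics. This is only a userspace analogue under stated assumptions: struct sketch_lock and SKETCH_LOCKED_VAL are made-up stand-ins for the qspinlock byte and _Q_LOCKED_VAL, and C11 memory orders are not a one-to-one match for the kernel's memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_LOCKED_VAL 1u    /* stand-in for _Q_LOCKED_VAL */

struct sketch_lock {
    atomic_uint locked;
};

/* Try to take the lock: 0 -> LOCKED. Acquire ordering applies only when
 * the exchange succeeds; a failed attempt imposes no ordering. */
bool sketch_trylock_acquire(struct sketch_lock *l)
{
    unsigned int expected = 0;

    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, SKETCH_LOCKED_VAL,
        memory_order_acquire,   /* ordering on success */
        memory_order_relaxed);  /* ordering on failure */
}

/* Release the lock; pairs with the acquire in sketch_trylock_acquire(). */
void sketch_unlock(struct sketch_lock *l)
{
    /* Unlocking a lock that is not held would be a caller bug. */
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) ==
           SKETCH_LOCKED_VAL);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Acquire on the lock side and release on the unlock side are enough to keep the critical section from leaking out of the lock; the fully ordered variant additionally orders earlier accesses against the lock operation, which is where the extra barrier cost on LL/SC machines comes from.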
Re: [PATCH] powerpc/xmon: Fix an unexpected xmon onoff state change
On 2017/2/17 14:05, Michael Ellerman wrote:
> Pan Xinhui writes:
>> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
>> index 9c0e17c..f6e5c3d 100644
>> --- a/arch/powerpc/xmon/xmon.c
>> +++ b/arch/powerpc/xmon/xmon.c
>> @@ -76,6 +76,7 @@ static int xmon_gate;
>>  #endif /* CONFIG_SMP */
>>
>>  static unsigned long in_xmon __read_mostly = 0;
>> +static int xmon_off = !IS_ENABLED(CONFIG_XMON_DEFAULT);
>
> I think the logic would probably be clearer if we invert this to become
> xmon_on.

Yep, makes sense.

>> @@ -3266,16 +3269,16 @@ static int __init setup_xmon_sysrq(void)
>>  __initcall(setup_xmon_sysrq);
>>  #endif /* CONFIG_MAGIC_SYSRQ */
>>
>> -static int __initdata xmon_early, xmon_off;
>> +static int __initdata xmon_early;
>>
>>  static int __init early_parse_xmon(char *p)
>>  {
>>          if (!p || strncmp(p, "early", 5) == 0) {
>>                  /* just "xmon" is equivalent to "xmon=early" */
>> -                xmon_init(1);
>>                  xmon_early = 1;
>> +                xmon_off = 0;
>>          } else if (strncmp(p, "on", 2) == 0)
>> -                xmon_init(1);
>> +                xmon_off = 0;
>
> You've just changed the timing of when xmon gets enabled for the above
> two cases, from here which is called very early, to xmon_setup() which
> is called much later in boot.
>
> That effectively disables xmon for most of the boot, which we do not
> want to do.

It is not often that the kernel gets stuck during boot, but yes, the
behavior changed anyway. Will fix that in v3.

> cheers
Re: [PATCH] powerpc/xmon: Fix an unexpected xmon onoff state change
On 2017/2/16 18:57, Guilherme G. Piccoli wrote:
> On 16/02/2017 03:09, Michael Ellerman wrote:
>> Pan Xinhui writes:
>>> Once xmon is triggered by sysrq-x, it is enabled always afterwards even
>>> if it is disabled during boot. This will cause a system reset interrupt
>>> fail to dump. So keep xmon in its original state after exit.
>>>
>>> Signed-off-by: Pan Xinhui
>>> ---
>>>  arch/powerpc/xmon/xmon.c | 5 -
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
>>> index 9c0e17c..721212f 100644
>>> --- a/arch/powerpc/xmon/xmon.c
>>> +++ b/arch/powerpc/xmon/xmon.c
>>> @@ -76,6 +76,7 @@ static int xmon_gate;
>>>  #endif /* CONFIG_SMP */
>>>
>>>  static unsigned long in_xmon __read_mostly = 0;
>>> +static int xmon_off = 0;
>>>
>>>  static unsigned long adrs;
>>>  static int size = 1;
>>> @@ -3250,6 +3251,8 @@ static void sysrq_handle_xmon(int key)
>>>          /* ensure xmon is enabled */
>>>          xmon_init(1);
>>>          debugger(get_irq_regs());
>>> +        if (xmon_off)
>>> +                xmon_init(0);
>>>  }
>>
>> I don't think this is right.
>>
>> xmon_off is only true if you boot with xmon=off on the command line.
>>
>> So if you boot with CONFIG_XMON_DEFAULT=n, and nothing on the command
>> line, then enter xmon via sysrq, then exit, xmon will still be enabled.
>
> Agreed, noticed it after some work in V2 of my patch. I'm addressing it
> there, so maybe no harm in keeping this way here..

Hi, mpe
I cooked a new patch. And as Paul mentioned in slack, we need to keep
xmon on too if xmon=early is set in cmdline.

hi, Guilherme
feel free to include it in your new patchset. :)

thanks
xinhui

--- patch ---
powerpc/xmon: Fix an unexpected xmon onoff state change

Once xmon is triggered by sysrq-x, it is enabled always afterwards even
if it is disabled during boot. This will cause a system reset interrupt
fail to dump. So keep xmon in its original state after exit.

We have several ways to set xmon on or off.
1) by a build config CONFIG_XMON_DEFAULT.
2) by a boot cmdline with xmon or xmon=early or xmon=on to enable xmon
   and xmon=off to disable xmon. This value will override that in step 1.
3) by a debugfs interface. We need someone to implement it in the
   future. And this value can override those in step 1 and 2.

Signed-off-by: Pan Xinhui
---
 arch/powerpc/xmon/xmon.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 9c0e17c..f6e5c3d 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -76,6 +76,7 @@ static int xmon_gate;
 #endif /* CONFIG_SMP */

 static unsigned long in_xmon __read_mostly = 0;
+static int xmon_off = !IS_ENABLED(CONFIG_XMON_DEFAULT);

 static unsigned long adrs;
 static int size = 1;
@@ -3250,6 +3251,8 @@ static void sysrq_handle_xmon(int key)
         /* ensure xmon is enabled */
         xmon_init(1);
         debugger(get_irq_regs());
+        if (xmon_off)
+                xmon_init(0);
 }

 static struct sysrq_key_op sysrq_xmon_op = {
@@ -3266,16 +3269,16 @@ static int __init setup_xmon_sysrq(void)
 __initcall(setup_xmon_sysrq);
 #endif /* CONFIG_MAGIC_SYSRQ */

-static int __initdata xmon_early, xmon_off;
+static int __initdata xmon_early;

 static int __init early_parse_xmon(char *p)
 {
         if (!p || strncmp(p, "early", 5) == 0) {
                 /* just "xmon" is equivalent to "xmon=early" */
-                xmon_init(1);
                 xmon_early = 1;
+                xmon_off = 0;
         } else if (strncmp(p, "on", 2) == 0)
-                xmon_init(1);
+                xmon_off = 0;
         else if (strncmp(p, "off", 3) == 0)
                 xmon_off = 1;
         else if (strncmp(p, "nobt", 4) == 0)
@@ -3289,10 +3292,9 @@ early_param("xmon", early_parse_xmon);

 void __init xmon_setup(void)
 {
-#ifdef CONFIG_XMON_DEFAULT
-        if (!xmon_off)
-                xmon_init(1);
-#endif
+        if (xmon_off)
+                return;
+        xmon_init(1);
         if (xmon_early)
                 debugger(NULL);
 }
--
2.9.3

> Thanks,
> Guilherme
>
>> cheers
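The precedence rules and the restore-on-exit behaviour described in the patch above can be modelled in plain userspace C. This is a sketch of the state logic only: every sketch_* name is hypothetical, the debugfs override (step 3) is left out because it does not exist yet, and boot-time ordering (when the hooks get installed) is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool sketch_hooks;   /* models what xmon_init(1)/xmon_init(0) toggles */
static bool sketch_xmon_on; /* configured state, i.e. the inverse of xmon_off */

static void sketch_xmon_init(bool enable)
{
    sketch_hooks = enable;
}

/*
 * Step 1: the build default. Step 2: the cmdline overrides it.
 * arg == NULL means no xmon parameter was given at all;
 * arg == ""   models a bare "xmon", treated like "xmon=early".
 */
void sketch_configure(bool build_default, const char *arg)
{
    sketch_xmon_on = build_default;
    if (arg != NULL) {
        if (arg[0] == '\0' || strncmp(arg, "early", 5) == 0 ||
            strncmp(arg, "on", 2) == 0)
            sketch_xmon_on = true;
        else if (strncmp(arg, "off", 3) == 0)
            sketch_xmon_on = false;
    }
    sketch_xmon_init(sketch_xmon_on);
}

/* Models sysrq-x: force xmon on for the session, then put the
 * configured state back instead of leaving it enabled. */
void sketch_sysrq_handle_xmon(void)
{
    sketch_xmon_init(true);
    assert(sketch_hooks);           /* debugger() would run here */
    if (!sketch_xmon_on)
        sketch_xmon_init(false);
}

bool sketch_hooks_enabled(void)
{
    return sketch_hooks;
}
```

With build default off and no cmdline parameter, the hooks are off again after the sysrq session, which is the case Michael pointed out as broken in the first version.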
[PATCH] powerpc/xmon: Fix an unexpected xmon onoff state change
Once xmon is triggered by sysrq-x, it is enabled always afterwards even
if it is disabled during boot. This will cause a system reset interrupt
fail to dump. So keep xmon in its original state after exit.

Signed-off-by: Pan Xinhui
---
 arch/powerpc/xmon/xmon.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 9c0e17c..721212f 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -76,6 +76,7 @@ static int xmon_gate;
 #endif /* CONFIG_SMP */

 static unsigned long in_xmon __read_mostly = 0;
+static int xmon_off = 0;

 static unsigned long adrs;
 static int size = 1;
@@ -3250,6 +3251,8 @@ static void sysrq_handle_xmon(int key)
         /* ensure xmon is enabled */
         xmon_init(1);
         debugger(get_irq_regs());
+        if (xmon_off)
+                xmon_init(0);
 }

 static struct sysrq_key_op sysrq_xmon_op = {
@@ -3266,7 +3269,7 @@ static int __init setup_xmon_sysrq(void)
 __initcall(setup_xmon_sysrq);
 #endif /* CONFIG_MAGIC_SYSRQ */

-static int __initdata xmon_early, xmon_off;
+static int __initdata xmon_early;

 static int __init early_parse_xmon(char *p)
 {
--
2.4.11
[PATCH] powerpc/xmon: add turn off xmon option
Once xmon is triggered, there is no interface to turn it off again.
However there exists disable/enable xmon code flows. And more important,
System reset interrupt on powerVM will fire an oops to make a dump. At
that time, xmon should not be triggered.

So add 'z' option after current 'x|X' exit commands. Turn xmon off if 'z'
is following.

Signed-off-by: Pan Xinhui
---
 arch/powerpc/xmon/xmon.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 9c0e17c..2f4e7b1 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -76,6 +76,7 @@ static int xmon_gate;
 #endif /* CONFIG_SMP */

 static unsigned long in_xmon __read_mostly = 0;
+static int xmon_off = 0;

 static unsigned long adrs;
 static int size = 1;
@@ -255,8 +256,8 @@ Commands:\n\
   Sr #  read SPR #\n\
   Sw #v write v to SPR #\n\
   t     print backtrace\n\
-  x     exit monitor and recover\n\
-  X     exit monitor and don't recover\n"
+  x[z]  exit monitor and recover, turn off xmon with 'z'\n\
+  X[z]  exit monitor and don't recover, turn off xmon with 'z'\n"
 #if defined(CONFIG_PPC64) && !defined(CONFIG_PPC_BOOK3E)
 "  u     dump segment table or SLB\n"
 #elif defined(CONFIG_PPC_STD_MMU_32)
@@ -952,6 +953,8 @@ cmds(struct pt_regs *excp)
                         break;
                 case 'x':
                 case 'X':
+                        if (inchar() == 'z')
+                                xmon_off = 1;
                         return cmd;
                 case EOF:
                         printf(" \n");
@@ -3248,8 +3251,11 @@ static void xmon_init(int enable)
 static void sysrq_handle_xmon(int key)
 {
         /* ensure xmon is enabled */
+        xmon_off = 0;
         xmon_init(1);
         debugger(get_irq_regs());
+        if (xmon_off)
+                xmon_init(0);
 }

 static struct sysrq_key_op sysrq_xmon_op = {
@@ -3266,7 +3272,7 @@ static int __init setup_xmon_sysrq(void)
 __initcall(setup_xmon_sysrq);
 #endif /* CONFIG_MAGIC_SYSRQ */

-static int __initdata xmon_early, xmon_off;
+static int __initdata xmon_early;

 static int __init early_parse_xmon(char *p)
 {
--
2.4.11
Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 2017/2/8 14:09, Boqun Feng wrote:
> On Wed, Feb 08, 2017 at 12:05:40PM +0800, Boqun Feng wrote:
>> On Wed, Feb 08, 2017 at 11:39:10AM +0800, Xinhui Pan wrote:
>>> 2016-12-26 4:26 GMT+08:00 Waiman Long :
>>>> A number of cmpxchg calls in qspinlock_paravirt.h were replaced by
>>>> more relaxed versions to improve performance on architectures that
>>>> use LL/SC.
>>>>
>>>> All the locking related cmpxchg's are replaced with the _acquire
>>>> variants:
>>>>  - pv_queued_spin_steal_lock()
>>>>  - trylock_clear_pending()
>>>>
>>>> The cmpxchg's related to hashing are replaced by either the
>>>> _release or the _relaxed variants. See the inline comment for
>>>> details.
>>>>
>>>> Signed-off-by: Waiman Long
>>>>
>>>>  v1->v2:
>>>>   - Add comments in changelog and code for the rationale of the
>>>>     change.
>>>>
>>>> ---
>>>>  kernel/locking/qspinlock_paravirt.h | 50 -- ---
>>>>  1 file changed, 33 insertions(+), 17 deletions(-)
>>>>
>>>> @@ -323,8 +329,14 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>>>>                   * If pv_kick_node() changed us to vcpu_hashed, retain that
>>>>                   * value so that pv_wait_head_or_lock() knows to not also try
>>>>                   * to hash this lock.
>>>> +                 *
>>>> +                 * The smp_store_mb() and control dependency above will ensure
>>>> +                 * that state change won't happen before that. Synchronizing
>>>> +                 * with pv_kick_node() wrt hashing by this waiter or by the
>>>> +                 * lock holder is done solely by the state variable. There is
>>>> +                 * no other ordering requirement.
>>>>                   */
>>>> -                cmpxchg(&pn->state, vcpu_halted, vcpu_running);
>>>> +                cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);
>>>>
>>>>                  /*
>>>>                   * If the locked flag is still not set after wakeup, it is a
>>>> @@ -360,9 +372,12 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>>>           * pv_wait_node(). If OTOH this fails, the vCPU was running and will
>>>>           * observe its next->locked value and advance itself.
>>>>           *
>>>> -         * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>>> +         * Matches with smp_store_mb() and cmpxchg_relaxed() in pv_wait_node().
>>>> +         * A release barrier is used here to ensure that node->locked is
>>>> +         * always set before changing the state. See comment in pv_wait_node().
>>>>           */
>>>> -        if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>>> +        if (cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
>>>> +                        != vcpu_halted)
>>>>                  return;
>>>
>>> hi, Waiman
>>> We can't use _release here, a full barrier is needed.
>>>
>>> There is pv_kick_node vs pv_wait_head_or_lock
>>>
>>> [w] l->locked = _Q_SLOW_VAL //reordered here
>>>                                  if (READ_ONCE(pn->state) == vcpu_hashed) //False.
>>>                                          lp = (struct qspinlock **)1;
>>>
>>> [STORE] pn->state = vcpu_hashed  lp = pv_hash(lock, pn);
>>> pv_hash()                        if (xchg(&l->locked, _Q_SLOW_VAL) == 0) // false, not unhashed.
>>
>> This analysis is correct, but..
>
> Hmm.. look at this again, I don't think this analysis is meaningful,
> let's say the reordering didn't happen, we still got(similar to your
> case):

but there is
        cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);

>                                  if (READ_ONCE(pn->state) == vcpu_hashed) // false.
>                                          lp = (struct qspinlock **)1;
>
> cmpxchg(pn->state, vcpu_halted, vcpu_hashed);

this cmpxchg will observe the cmpxchg_relaxed above, so this cmpxchg
will fail as pn->state is vcpu_running. No bug here..

>                                  if(!lp) {
>                                          lp = pv_hash(lock, pn);
> WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> pv_hash();
>                                          if (xchg(&l->locked, _Q_SLOW_VAL) == 0) // false, not unhashed.
>
> , right?
>
> Actually, I think this or your case could not happen because we have
>
>         cmpxchg(pn->state, vcpu_halted, vcpu_running);
>
> in pv_wait_node(), which makes us either observe vcpu_hashed or set
> pn->state to vcpu_running before pv_kick_node() trying to do the hash.
>
> I may miss something subtle, but does switching back to cmpxchg() could
> fix the RCU stall you observed?
>
> Regards,
> Boqun
>
>>> Then the same lock has hashed twice but only unhashed once. So at
>>> last as the hash table grows big, we hit RCU stall.
>>>
>>> I hit RCU stall when I run netperf benchmark
>>
>> how will a big hash table hit RCU stall? Do you have the call trace for
>> your RCU stall?
>>
>> Regards,
>> Boqun
>>
>>> thanks
>>> xinhui
>>>
>>>> --
>>>> 1.8.3.1
Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 2017/2/8 14:09, Boqun Feng wrote:
> On Wed, Feb 08, 2017 at 12:05:40PM +0800, Boqun Feng wrote:
>> On Wed, Feb 08, 2017 at 11:39:10AM +0800, Xinhui Pan wrote:
>>> 2016-12-26 4:26 GMT+08:00 Waiman Long :
>>>> A number of cmpxchg calls in qspinlock_paravirt.h were replaced by
>>>> more relaxed versions to improve performance on architectures that
>>>> use LL/SC.
>>>>
>>>> All the locking related cmpxchg's are replaced with the _acquire
>>>> variants:
>>>>  - pv_queued_spin_steal_lock()
>>>>  - trylock_clear_pending()
>>>>
>>>> The cmpxchg's related to hashing are replaced by either the
>>>> _release or the _relaxed variants. See the inline comment for
>>>> details.
>>>>
>>>> Signed-off-by: Waiman Long
>>>>
>>>>  v1->v2:
>>>>   - Add comments in changelog and code for the rationale of the
>>>>     change.
>>>>
>>>> ---
>>>>  kernel/locking/qspinlock_paravirt.h | 50 -- ---
>>>>  1 file changed, 33 insertions(+), 17 deletions(-)
>>>>
>>>> @@ -323,8 +329,14 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>>>>                   * If pv_kick_node() changed us to vcpu_hashed, retain that
>>>>                   * value so that pv_wait_head_or_lock() knows to not also try
>>>>                   * to hash this lock.
>>>> +                 *
>>>> +                 * The smp_store_mb() and control dependency above will ensure
>>>> +                 * that state change won't happen before that. Synchronizing
>>>> +                 * with pv_kick_node() wrt hashing by this waiter or by the
>>>> +                 * lock holder is done solely by the state variable. There is
>>>> +                 * no other ordering requirement.
>>>>                   */
>>>> -                cmpxchg(&pn->state, vcpu_halted, vcpu_running);
>>>> +                cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);
>>>>
>>>>                  /*
>>>>                   * If the locked flag is still not set after wakeup, it is a
>>>> @@ -360,9 +372,12 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>>>           * pv_wait_node(). If OTOH this fails, the vCPU was running and will
>>>>           * observe its next->locked value and advance itself.
>>>>           *
>>>> -         * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>>> +         * Matches with smp_store_mb() and cmpxchg_relaxed() in pv_wait_node().
>>>> +         * A release barrier is used here to ensure that node->locked is
>>>> +         * always set before changing the state. See comment in pv_wait_node().
>>>>           */
>>>> -        if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>>> +        if (cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
>>>> +                        != vcpu_halted)
>>>>                  return;
>>>
>>> hi, Waiman
>>> We can't use _release here, a full barrier is needed.
>>>
>>> There is pv_kick_node vs pv_wait_head_or_lock
>>>
>>> [w] l->locked = _Q_SLOW_VAL //reordered here
>>>                                  if (READ_ONCE(pn->state) == vcpu_hashed) //False.
>>>                                          lp = (struct qspinlock **)1;
>>>
>>> [STORE] pn->state = vcpu_hashed  lp = pv_hash(lock, pn);
>>> pv_hash()                        if (xchg(&l->locked, _Q_SLOW_VAL) == 0) // false, not unhashed.
>>
>> This analysis is correct, but..
>
> Hmm.. look at this again, I don't think this analysis is meaningful,
> let's say the reordering didn't happen, we still got(similar to your
> case):
>
>                                  if (READ_ONCE(pn->state) == vcpu_hashed) // false.
>                                          lp = (struct qspinlock **)1;
>
> cmpxchg(pn->state, vcpu_halted, vcpu_hashed);
>                                  if(!lp) {
>                                          lp = pv_hash(lock, pn);
> WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> pv_hash();
>                                          if (xchg(&l->locked, _Q_SLOW_VAL) == 0) // false, not unhashed.
>
> , right?
>
> Actually, I think this or your case could not happen because we have
>
>         cmpxchg(pn->state, vcpu_halted, vcpu_running);
>
> in pv_wait_node(), which makes us either observe vcpu_hashed or set
> pn->state to vcpu_running before pv_kick_node() trying to do the hash.

yep, there is still a race. We have to fix it.
so I think we must check
        old = xchg(&l->locked, _Q_SLOW_VAL)
        if (old == 0)
                do something
        else if (old == _Q_SLOW_VAL)
                do something else

> I may miss something subtle, but does switching back to cmpxchg() could
> fix the RCU stall you observed?

yes, just fix this cmpxchg and then no RCU stall.

> Regards,
> Boqun
>
>>> Then the same lock has hashed twice but only unhashed once. So at
>>> last as the hash table grows big, we hit RCU stall.
>>>
>>> I hit RCU stall when I run netperf benchmark
>>
>> how will a big hash table hit RCU stall? Do you have the call trace for
>> your RCU stall?

maybe too much time spent on hashing? I am not sure.

>> Regards,
>> Boqun
>>
>>> thanks
>>> xinhui
>>>
>>>> --
>>>> 1.8.3.1
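Boqun's point that the waiter and the kicker settle the hashing question through pn->state alone can be illustrated with C11 atomics. This is a userspace sketch, not the kernel code: the SK_* states and sketch_* functions are hypothetical stand-ins, and a C11 seq_cst compare-exchange is only an approximation of the kernel's fully ordered cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum sketch_vcpu_state { SK_RUNNING = 0, SK_HALTED, SK_HASHED };

/* Waiter side after wakeup: halted -> running. Relaxed, as in the v2
 * patch under discussion. Returns true if the waiter made the change. */
bool sketch_waiter_resume(atomic_int *state)
{
    int expected = SK_HALTED;

    return atomic_compare_exchange_strong_explicit(
        state, &expected, SK_RUNNING,
        memory_order_relaxed, memory_order_relaxed);
}

/* Kicker side: halted -> hashed, using the default (seq_cst) ordering in
 * place of the full barrier asked for in the thread. Returns true only
 * if the kicker won and therefore has to hash the lock itself. */
bool sketch_kicker_claim(atomic_int *state)
{
    int expected = SK_HALTED;
    bool won = atomic_compare_exchange_strong(state, &expected, SK_HASHED);

    /* On failure the state was already moved on by someone else. */
    assert(won || expected == SK_RUNNING || expected == SK_HASHED);
    return won;
}
```

Whichever compare-exchange lands first moves the state away from halted, so the other one fails; atomicity alone guarantees that at most one side wins. What the relaxed and release variants change is the ordering of the surrounding accesses (l->locked, the hash table), which is the part the thread is arguing about and which this single-variable sketch does not capture.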
[tip:locking/core] locking/pvqspinlock: Don't wait if vCPU is preempted
Commit-ID:  75437bb304b20a2b350b9a8e9f9238d5e24e12ba
Gitweb:     http://git.kernel.org/tip/75437bb304b20a2b350b9a8e9f9238d5e24e12ba
Author:     Pan Xinhui
AuthorDate: Tue, 10 Jan 2017 02:56:46 -0500
Committer:  Ingo Molnar
CommitDate: Thu, 12 Jan 2017 09:35:57 +0100

locking/pvqspinlock: Don't wait if vCPU is preempted

If prev node is not in running state or its vCPU is preempted, we can
give up our vCPU slices in pv_wait_node() ASAP.

Signed-off-by: Pan Xinhui
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: long...@redhat.com
Link: http://lkml.kernel.org/r/1484035006-6787-1-git-send-email-xinhui@linux.vnet.ibm.com
[ Fixed typos in the changelog, removed ugly linebreak from the code. ]
Signed-off-by: Ingo Molnar
---
 kernel/locking/qspinlock_paravirt.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e3b5520..e6b2f7a 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -263,7 +263,7 @@ pv_wait_early(struct pv_node *prev, int loop)
         if ((loop & PV_PREV_CHECK_MASK) != 0)
                 return false;

-        return READ_ONCE(prev->state) != vcpu_running;
+        return READ_ONCE(prev->state) != vcpu_running || vcpu_is_preempted(prev->cpu);
 }

 /*
[PATCH v2] locking/pvqspinlock: Wait early if vCPU is preempted
If prev node is not in running state or its vCPU is preempted, we can
give up our vCPU slices ASAP in pv_wait_node.

After commit d9345c65eb79 ("sched/core: Introduce the
vcpu_is_preempted(cpu) interface") the kernel knows whether a vCPU is
running or not.

Signed-off-by: Pan Xinhui
---
v2: rewrite the commit message as Ingo pointed out the mistake.
---
 kernel/locking/qspinlock_paravirt.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e3b5520..48648dc 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -263,7 +263,8 @@ pv_wait_early(struct pv_node *prev, int loop)
         if ((loop & PV_PREV_CHECK_MASK) != 0)
                 return false;

-        return READ_ONCE(prev->state) != vcpu_running;
+        return READ_ONCE(prev->state) != vcpu_running ||
+                        vcpu_is_preempted(prev->cpu);
 }

 /*
--
2.4.11
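The decision made by the patched pv_wait_early() can be exercised outside the kernel with a small model. Everything named sk_* below is hypothetical: sk_vcpu_is_preempted() is a table lookup standing in for the hypervisor-backed vcpu_is_preempted(), and the mask value 0xff is an assumption about PV_PREV_CHECK_MASK, not taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_PREV_CHECK_MASK 0xff /* assumed value of PV_PREV_CHECK_MASK */
#define SK_NR_CPUS 4

enum sk_state { SK_VCPU_RUNNING = 0, SK_VCPU_HALTED };

struct sk_node {
    int cpu;
    enum sk_state state;
};

/* Stand-in for the per-vCPU preempted flag published by the host. */
static bool sk_preempted[SK_NR_CPUS];

void sk_set_preempted(int cpu, bool preempted)
{
    assert(cpu >= 0 && cpu < SK_NR_CPUS);
    sk_preempted[cpu] = preempted;
}

bool sk_vcpu_is_preempted(int cpu)
{
    assert(cpu >= 0 && cpu < SK_NR_CPUS);
    return sk_preempted[cpu];
}

/* Only look at the previous node when the low bits of the loop counter
 * are zero; give up spinning early if that node is not running or its
 * vCPU has been preempted. */
bool sk_wait_early(const struct sk_node *prev, int loop)
{
    if ((loop & SK_PREV_CHECK_MASK) != 0)
        return false;

    return prev->state != SK_VCPU_RUNNING ||
           sk_vcpu_is_preempted(prev->cpu);
}
```

The mask keeps the check cheap: the previous node's state is examined only once every SK_PREV_CHECK_MASK + 1 spins, so a preempted predecessor is noticed with some delay rather than immediately.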
Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 2017/1/4 17:41, Peter Zijlstra wrote:
> On Tue, Jan 03, 2017 at 05:07:54PM -0500, Waiman Long wrote:
>> On 01/03/2017 11:18 AM, Peter Zijlstra wrote:
>>> On Sun, Dec 25, 2016 at 03:26:01PM -0500, Waiman Long wrote:
>>>> A number of cmpxchg calls in qspinlock_paravirt.h were replaced by more
>>>> relaxed versions to improve performance on architectures that use LL/SC.
>>>
>>> Claim without numbers ;-)
>>
>> Well it is hard to produce actual numbers here as I don't have the
>> setup to gather data.
>
> Surely RHT has big PPC machines around? I know that getting to them is a
> wee bit of a bother, but they should be available somewhere.

hi,
I ran some tests of cmpxchg and cmpxchg_acquire on ppc a while ago.

The number of loops completed in 15s for each cmpxchg variant is below.

cmpxchg_relaxed: 336663
cmpxchg_release: 369054
cmpxchg_acquire: 363364
cmpxchg:         179435

So cmpxchg is much more expensive than the others.

But I also have doubts about cmpxchg_relaxed: it should be the cheapest,
yet in these tests release/acquire are faster than it.

thanks
xinhui
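A rough userspace analogue of that kind of measurement can be written with C11 atomics. This is a sketch under several assumptions: it times a fixed number of iterations instead of counting loops in 15s, it is single-threaded and uncontended, and the sk_* names are made up. Each memory order gets its own function because the order has to be a compile-time constant for the compiler to emit the weaker instruction sequence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Generates a loop doing `iters` successful compare-exchanges with the
 * given success ordering. *v must start at 0; returns the final value. */
#define SK_DEFINE_CAS_LOOP(name, order)                                 \
unsigned long name(atomic_ulong *v, unsigned long iters)                \
{                                                                       \
    for (unsigned long i = 0; i < iters; i++) {                         \
        unsigned long expected = i;                                     \
        bool ok = atomic_compare_exchange_strong_explicit(              \
            v, &expected, i + 1, order, memory_order_relaxed);          \
        assert(ok);                                                     \
    }                                                                   \
    return atomic_load_explicit(v, memory_order_relaxed);               \
}

SK_DEFINE_CAS_LOOP(sk_cas_relaxed, memory_order_relaxed)
SK_DEFINE_CAS_LOOP(sk_cas_acquire, memory_order_acquire)
SK_DEFINE_CAS_LOOP(sk_cas_release, memory_order_release)
SK_DEFINE_CAS_LOOP(sk_cas_full, memory_order_seq_cst)

/* CPU seconds spent running one of the loops above for `iters` rounds. */
double sk_seconds(unsigned long (*loop)(atomic_ulong *, unsigned long),
                  unsigned long iters)
{
    atomic_ulong v;
    clock_t start, end;
    unsigned long done;

    atomic_init(&v, 0);
    start = clock();
    done = loop(&v, iters);
    end = clock();
    assert(done == iters);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

On x86 all four variants are expected to compile to the same locked cmpxchg, so no difference should show up there; any gap only appears on LL/SC machines such as PPC, where the fully ordered form needs extra barrier instructions.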
Re: [PATCH] locking/pvqspinlock: Wait early if vcpu is preempted
On 2017/1/5 16:23, Ingo Molnar wrote:
> * Pan Xinhui wrote:
>
>> If prev node is not in running state or its cpu is preempted, we need
>> wait early in pv_wait_node. After commit "sched/core: Introduce the
>> vcpu_is_preempted(cpu) interface" kernel has knowledge of one vcpu is
>> running or not. So lets use it.
>
> That's not how we reference upstream commits. I had to search the git
> log the hard way to figure out that it's commit d9345c65eb79 ...

hi, Ingo
Sorry. Such a stupid mistake will not happen next time.

> Also, once you look at that commit you'll see how sloppy the
> capitalization in your changelog is. I fixed that up for all the
> original commits in the hope of you learning from those mistakes - but
> that was last year and the experiment failed.

So sorry, I paid most attention to the spelling of words. :(
Maybe I rely on ./scripts/checkpatch.pl too much.

I will rewrite the comments and resend this patch.

> Thanks,
>
>         Ingo
Re: [PATCH v7 08/11] x86, kvm/x86.c: support vcpu preempted check
Hi Andrea,

Thanks for your reply. :)

On 2016/12/19 19:42, Andrea Arcangeli wrote:
> Hello,
>
> On Wed, Nov 02, 2016 at 05:08:35AM -0400, Pan Xinhui wrote:
>> Support the vcpu_is_preempted() functionality under KVM. This will
>> enhance lock performance on overcommitted hosts (more runnable vcpus
>> than physical cpus in the system) as doing busy waits for preempted
>> vcpus will hurt system performance far worse than early yielding.
>>
>> Use one field of struct kvm_steal_time ::preempted to indicate that if
>> one vcpu is running or not.
>>
>> Signed-off-by: Pan Xinhui
>> Acked-by: Paolo Bonzini
>> ---
>>  arch/x86/include/uapi/asm/kvm_para.h |  4 +++-
>>  arch/x86/kvm/x86.c                   | 16
>>  2 files changed, 19 insertions(+), 1 deletion(-)
>>
> [..]
>
>> +static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
>> +{
>> +	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
>> +		return;
>> +
>> +	vcpu->arch.st.steal.preempted = 1;
>> +
>> +	kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
>> +			&vcpu->arch.st.steal.preempted,
>> +			offsetof(struct kvm_steal_time, preempted),
>> +			sizeof(vcpu->arch.st.steal.preempted));
>> +}
>> +
>>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>  {
>> +	kvm_steal_time_set_preempted(vcpu);
>>  	kvm_x86_ops->vcpu_put(vcpu);
>>  	kvm_put_guest_fpu(vcpu);
>>  	vcpu->arch.last_host_tsc = rdtsc();
>
> You can't call kvm_steal_time_set_preempted in atomic context (neither
> in sched_out notifier nor in vcpu_put() after
> preempt_disable)). __copy_to_user in kvm_write_guest_offset_cached
> schedules and locks up the host.

Yes, you are right! :) We are aware of the problem.

I am going to introduce something like kvm_write_guest_XXX_atomic and use
it instead of kvm_write_guest_offset_cached. Within
pagefault_disable()/enable(), I think we cannot call __copy_to_user.

> kvm->srcu (or kvm->slots_lock) is also not taken and
> kvm_write_guest_offset_cached needs to call kvm_memslots which
> requires it.

Let me check the details later. Thanks for pointing it out.

> This I think is why postcopy live migration locks up with current
> upstream, and it doesn't seem related to userfaultfd at all (initially
> I suspected the vmf conversion but it wasn't that) and in theory it
> can happen with heavy swapping or page migration too.
>
> Just the page is written so frequently it's unlikely to be swapped
> out. The page being written so frequently also means it's very likely
> found as re-dirtied when postcopy starts and that pretty much
> guarantees an userfault will trigger a scheduling event in
> kvm_steal_time_set_preempted in destination. There are opposite
> probabilities of reproducing this with swapping vs postcopy live
> migration.

Good analysis. :)

> For now I applied the below two patches, but this just will skip the
> write and only prevent the host instability as nobody checks the
> retval of __copy_to_user (what happens to guest after the write is
> skipped is not as clear and should be investigated, but at least the
> host will survive and not all guests will care about this flag being
> updated). For this to be fully safe the preempted information should
> be just an hint and not fundamental for correct functionality of the
> guest pv spinlock code.
>
> This bug was introduced in commit
> 0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb in v4.9-rc7.
>
> From 458897fd44aa9b91459a006caa4051a7d1628a23 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli
> Date: Sat, 17 Dec 2016 18:43:52 +0100
> Subject: [PATCH 1/2] kvm: fix schedule in atomic in
>  kvm_steal_time_set_preempted()
>
> kvm_steal_time_set_preempted() isn't disabling the pagefaults before
> calling __copy_to_user and the kernel debug notices.
>
> Signed-off-by: Andrea Arcangeli
> ---
>  arch/x86/kvm/x86.c | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1f0d238..2dabaeb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2844,7 +2844,17 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
>
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
> +	/*
> +	 * Disable page faults because we're in atomic context here.
> +	 * kvm_write_guest_offset_cached() would call might_fault()
> +	 * that relies on pagefault_disable() to tell if there's a
> +	 * bug. NOTE: the write to guest memory may not go through if
> +	 * during postcopy live migration or if there's heavy guest
> +	 * paging.
> +	 */
> +	pagefault_disable();
>  	kvm_steal_time_set_preempted(vcpu);
> +	pagefault_enable();

Can we just add this? I think it would be better to modify
kvm_steal_time_set_preempted() itself so that it runs correctly in atomic
context.

thanks
xinhui

>  	kvm_x86_ops->vcpu_put(vcpu);
>  	kvm_put_guest_fpu(vcpu);
>  	vcpu->arch.last_host_tsc = rdtsc();
>
> From 2845eba22ac74c5e313e3b590f9dac33e1b3cfef Mon Sep 17 00:00:00 2001
> From: And
Re: [GIT PULL] KVM fixes for 4.10 merge window
On 2016/12/17 03:42, Linus Torvalds wrote:
> On Fri, Dec 16, 2016 at 8:57 AM, Paolo Bonzini wrote:
>>
>>   git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus
>
> This piece-of-shit branch has obviously never been even compile-tested:
>
>   arch/x86/kernel/kvm.c: In function ‘__kvm_vcpu_is_preempted’:
>   arch/x86/kernel/kvm.c:596:14: error: ‘struct kvm_steal_time’ has no
>   member named ‘preempted’

Hi Linus,

Oh, my bad as well. I introduced this struct member and used it in the same
patch. It would have been better to separate them into two patches.

I made one fix patch below. Sorry again.

Hi Paolo,

I know where the problem is. Could we set ->preempted later, after
preempt_enable(), or just introduce something like write_guest_nosleep
(a per-cpu memory section in the guest, so there is no page fault or any
other cannot-sleep problem)?

thanks
xinhui

From d4fa3ea0b8b6f3e5ff511604a4a6665d1cbb74c3 Mon Sep 17 00:00:00 2001
From: Pan Xinhui
Date: Sat, 17 Dec 2016 02:56:33 -0500
Subject: [PATCH] kvm: fix compile issue

We reverted commit 0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb, which calls a
sleeping function under preempt_disable on the host side. But the revert
removed struct kvm_steal_time::preempted too. This causes a compile problem
as both guest and host code use it.

Fix it by adding struct kvm_steal_time::preempted back.

Signed-off-by: Pan Xinhui
---
 arch/x86/include/uapi/asm/kvm_para.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..1421a65 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,9 @@ struct kvm_steal_time {
 	__u64 steal;
 	__u32 version;
 	__u32 flags;
-	__u32 pad[12];
+	__u8 preempted;
+	__u8 u8_pad[3];
+	__u32 pad[11];
 };

 #define KVM_STEAL_ALIGNMENT_BITS 5
--
2.4.11

> where commit b94c3698b4b0 ("Revert "x86/kvm: Support the vCPU
> preemption check"") removed the "preempted" field from struct
> kvm_steal_time, but you left this in place:
>
>   __visible bool __kvm_vcpu_is_preempted(int cpu)
>   {
>           struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
>
>           return !!src->preempted;
>   }
>
> And no, that is not a merge artifact in my tree (although that
> function did come in from Ingo). That compile failure comes from your
> very own branch.
>
> Am I upset? You bet I am. Get your act together. You can't just
> randomly revert things without checking the end result.
>
>           Linus
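The layout change in the fix patch above can be checked in isolation. What follows is only a userspace sketch under stated assumptions, not kernel code: the names steal_time_old, steal_time_new and steal_time_size_delta() are made up for illustration, and the <stdint.h> fixed-width types stand in for __u64/__u32/__u8. It shows that carving preempted plus three pad bytes out of one pad word leaves the structure at 64 bytes and places preempted where pad[0] used to be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of the layout before the fix: 16 bytes of fields
 * followed by 12 padding words (48 bytes). */
struct steal_time_old {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

/* Hypothetical copy of the layout after the fix: the first padding word
 * is split into a one-byte flag and three explicit pad bytes. */
struct steal_time_new {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/* Compile-time checks: the total size must not change. */
_Static_assert(sizeof(struct steal_time_old) == 64, "old layout is 64 bytes");
_Static_assert(sizeof(struct steal_time_new) == 64, "new layout is 64 bytes");

/* The new flag occupies the first byte of what used to be pad[0]. */
_Static_assert(offsetof(struct steal_time_new, preempted) ==
	       offsetof(struct steal_time_old, pad),
	       "preempted reuses the old padding");

/* Illustrative helper: size difference between the two layouts. */
static long steal_time_size_delta(void)
{
	return (long)sizeof(struct steal_time_new) -
	       (long)sizeof(struct steal_time_old);
}
```

If the sketch compiles, the static assertions have already passed; steal_time_size_delta() is there only so the result can also be checked at run time.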
Re: [PATCH v5 1/2] sysctl: introduce new proc handler proc_dobool
On 2016/12/15 15:24, Jia He wrote:
> This is to let bool variable could be correctly displayed in
> big/little endian sysctl procfs. sizeof(bool) is arch dependent,
> proc_dobool should work in all arches.
>
> Suggested-by: Pan Xinhui
> Signed-off-by: Jia He
> ---
>  include/linux/sysctl.h |  2 ++
>  kernel/sysctl.c        | 41 +
>  2 files changed, 43 insertions(+)

Reviewed-by: Pan Xinhui

> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index adf4e51..255a9c7 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -41,6 +41,8 @@ typedef int proc_handler (struct ctl_table *ctl, int write,
>
>  extern int proc_dostring(struct ctl_table *, int,
>  			 void __user *, size_t *, loff_t *);
> +extern int proc_dobool(struct ctl_table *, int,
> +			 void __user *, size_t *, loff_t *);
>  extern int proc_dointvec(struct ctl_table *, int,
>  			 void __user *, size_t *, loff_t *);
>  extern int proc_douintvec(struct ctl_table *, int,
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 706309f..c4bec65 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2112,6 +2112,20 @@ static int proc_put_char(void __user **buf, size_t *size, char c)
>  	return 0;
>  }
>
> +static int do_proc_dobool_conv(bool *negp, unsigned long *lvalp,
> +				int *valp,
> +				int write, void *data)
> +{
> +	if (write)
> +		*(bool *)valp = *lvalp;
> +	else {
> +		int val = *(bool *)valp;
> +
> +		*lvalp = (unsigned long)val;
> +	}
> +	return 0;
> +}
> +
>  static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
>  				 int *valp,
>  				 int write, void *data)
> @@ -2258,6 +2272,26 @@ static int do_proc_dointvec(struct ctl_table *table, int write,
>  }
>
>  /**
> + * proc_dobool - read/write a bool
> + * @table: the sysctl table
> + * @write: %TRUE if this is a write to the sysctl file
> + * @buffer: the user buffer
> + * @lenp: the size of the user buffer
> + * @ppos: file position
> + *
> + * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
> + * values from/to the user buffer, treated as an ASCII string.
> + *
> + * Returns 0 on success.
> + */
> +int proc_dobool(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	return do_proc_dointvec(table, write, buffer, lenp, ppos,
> +				do_proc_dobool_conv, NULL);
> +}
> +
> +/**
>   * proc_dointvec - read a vector of integers
>   * @table: the sysctl table
>   * @write: %TRUE if this is a write to the sysctl file
> @@ -2885,6 +2919,12 @@ int proc_dostring(struct ctl_table *table, int write,
>  	return -ENOSYS;
>  }
>
> +int proc_dobool(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	return -ENOSYS;
> +}
> +
>  int proc_dointvec(struct ctl_table *table, int write,
>  		  void __user *buffer, size_t *lenp, loff_t *ppos)
>  {
> @@ -2941,6 +2981,7 @@ int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
>   * No sense putting this after each symbol definition, twice,
>   * exception granted :-)
>   */
> +EXPORT_SYMBOL(proc_dobool);
>  EXPORT_SYMBOL(proc_dointvec);
>  EXPORT_SYMBOL(proc_douintvec);
>  EXPORT_SYMBOL(proc_dointvec_jiffies);
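The conversion callback in the patch above can be exercised outside the kernel. The following is a rough userspace sketch, assuming nothing beyond standard C: dobool_conv() mirrors the shape of do_proc_dobool_conv() but drops the unused negp/data arguments, and struct guarded_bool and write_then_read() are hypothetical helpers added only to make the effect visible. The point it tries to show is that storing through a bool pointer touches exactly one object, so the bytes next to the flag stay intact, which would not be guaranteed if an int were stored at the same address.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the conversion callback: on write, any non-zero
 * value collapses to true; on read, the bool is widened to 0 or 1. */
static int dobool_conv(unsigned long *lvalp, void *valp, int write)
{
	if (write)
		*(bool *)valp = *lvalp;
	else
		*lvalp = (unsigned long)*(bool *)valp;
	return 0;
}

/* Hypothetical test fixture: a bool followed by bytes that must survive. */
struct guarded_bool {
	bool flag;
	unsigned char neighbours[3];
};

/* Write a value into the flag, then read it back through the same path. */
static unsigned long write_then_read(struct guarded_bool *g, unsigned long in)
{
	unsigned long out = 0;

	dobool_conv(&in, &g->flag, 1);
	dobool_conv(&out, &g->flag, 0);
	return out;
}
```

Calling write_then_read() with 5 should hand back 1, and the neighbours array should be unchanged afterwards.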
Re: [PATCH v5 2/2] lockd: change the proc_handler for nsm_use_hostnames
On 2016/12/15 15:24, Jia He wrote:
> nsm_use_hostnames is a module parameter and it will be exported to sysctl
> procfs. This is to let user sometimes change it from userspace. But the
> minimal unit for sysctl procfs read/write it sizeof(int).
                                         ^^^is^^^
> In big endian system, the converting from/to bool to/from int will cause
> error for proc items.
>
> This patch use a new proc_handler proc_dobool to fixe it.
                                                ^^^fix^^^
>
> Signed-off-by: Jia He
> ---

Other than that, it looks okay to me.

Reviewed-by: Pan Xinhui

>  fs/lockd/svc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
> index fc4084e..bd6fcf9 100644
> --- a/fs/lockd/svc.c
> +++ b/fs/lockd/svc.c
> @@ -561,7 +561,7 @@ static struct ctl_table nlm_sysctls[] = {
>  		.data		= &nsm_use_hostnames,
>  		.maxlen		= sizeof(int),
>  		.mode		= 0644,
> -		.proc_handler	= proc_dointvec,
> +		.proc_handler	= proc_dobool,
>  	},
>  	{
>  		.procname	= "nsm_local_state",
Re: [PATCH v2 1/1] lockd: Change nsm_use_hostnames from bool to u32
On 2016/12/12 01:43, Pan Xinhui wrote:
> hi, jia
> nice catch!
> However I think we should fix it totally.
> This is because do_proc_dointvec_conv() try to get a int value from a bool *.
>
> something like below might help. pls. ignore the code style and this is tested :)

_untested_.
Re: [PATCH v2 0/1] lockd: Change nsm_use_hostnames from bool to u32
On 2016/12/11 23:36, Jia He wrote:
> nsm_use_hostnames is a module parameter and it will be exported to sysctl
> procfs. This is to let user sometimes change it from userspace. But the
> minimal unit for sysctl procfs read/write it sizeof(int).
> In big endian system, the converting from/to bool to/from int will cause
> error for proc items.

Hi Jia,

Not only on BE systems. :) The current code is simply dereferencing a pointer
of the wrong type.

Here are some tests based on yours:

u8 __read_mostly nsm_use_hostnames[4] = {1, 2, 3, 4}; // an array of u8, and [0] is passed to ctl_table as data

static struct ctl_table my_sysctl[] = {
	{
		.procname	= "nsm_use_hostnames",
		.data		= &nsm_use_hostnames[0], // u8
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,
	},
	{}
};

Then run your test and the result will be:

root@ltcalpine2-lp13:~/linux/bench# cat /proc/sys/mysysctl/nsm_use_hostnames
67305985 (this is 0x4030201, expected 0x1)

So your fix patch works around it. But I suggest we support u8/u16 as well,
not only int/double int.

thanks
xinhui

> This patch changes the type definition of nsm_use_hostnames.
>
> The test case I used:
> /***/
> #include
> #include
> #include
>
> bool __read_mostly nsm_use_hostnames;
> module_param(nsm_use_hostnames, bool, 0644);
>
> static struct ctl_table my_sysctl[] = {
>         {
>                 .procname       = "nsm_use_hostnames",
>                 .data           = &nsm_use_hostnames,
>                 .maxlen         = sizeof(int),
>                 .mode           = 0644,
>                 .proc_handler   = &proc_dointvec,
>         },
>         {}
> };
>
> static struct ctl_table my_root[] = {
>         {
>                 .procname       = "mysysctl",
>                 .mode           = 0555,
>                 .child          = my_sysctl,
>         },
>         {}
> };
>
> static struct ctl_table_header * my_ctl_header;
>
> static int __init sysctl_exam_init(void)
> {
>         my_ctl_header = register_sysctl_table(&my_root);
>         if (my_ctl_header == NULL)
>                 printk("error regiester sysctl");
>
>         return 0;
> }
>
> static void __exit sysctl_exam_exit(void)
> {
>         unregister_sysctl_table(my_ctl_header);
> }
>
> module_init(sysctl_exam_init);
> module_exit(sysctl_exam_exit);
> MODULE_LICENSE("GPL");
> //
>
> [root@bigendian my]# insmod -f /root/my/hello.ko nsm_use_hostnames=1
> [root@bigendian my]# cat /proc/sys/mysysctl/nsm_use_hostnames
> 16777216
>
> After I change the bool to int:
> [root@bigendian my]# insmod -f /root/my/hello.ko nsm_use_hostnames=1
> [root@bigendian my]# cat /proc/sys/mysysctl/nsm_use_hostnames
> 1
>
> In little endian system, there is no such issue.
>
> Jia He (1):
>   lockd: Change nsm_use_hostnames from bool to u32
>
>  fs/lockd/mon.c              | 2 +-
>  fs/lockd/svc.c              | 2 +-
>  include/linux/lockd/lockd.h | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
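The two numbers reported in this thread (67305985 and 16777216) can be reproduced with plain arithmetic. The sketch below assumes only standard C; load_le32(), load_be32() and the two byte arrays are illustrative names, not kernel helpers. It spells out what a 4-byte integer load sees when the handler is pointed at a 1-byte object: the target byte plus whatever three bytes happen to follow it. For the bool case it further assumes that the three following bytes are zero, which matches the output quoted above but is not guaranteed in general.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret four consecutive bytes as a little-endian 32-bit integer. */
static uint32_t load_le32(const uint8_t *b)
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Interpret four consecutive bytes as a big-endian 32-bit integer. */
static uint32_t load_be32(const uint8_t *b)
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* The u8 array from the reply: an int-sized read starting at [0] drags in
 * the three unrelated neighbours as well. */
static const uint8_t u8_array_case[4] = { 1, 2, 3, 4 };

/* A bool set to true, followed by three bytes assumed to be zero. */
static const uint8_t bool_true_case[4] = { 1, 0, 0, 0 };
```

On a little-endian machine the array reads back as 0x04030201 (67305985). On a big-endian machine the lone true byte lands in the most significant position, giving 0x01000000 (16777216), while the same bytes read as 1 on little-endian, which is why the problem stayed hidden there.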
Re: [PATCH v2 1/1] lockd: Change nsm_use_hostnames from bool to u32
Hi Jia,

Nice catch!
However, I think we should fix it completely.
This is because do_proc_dointvec_conv() tries to get an int value from a bool *.

Something like below might help. Please ignore the code style and this is tested :)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index fc4084e..7eeaee4 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -519,6 +519,8 @@ EXPORT_SYMBOL_GPL(lockd_down);
  * Sysctl parameters (same as module parameters, different interface).
  */

+int proc_dou8vec(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos);
 static struct ctl_table nlm_sysctls[] = {
 	{
 		.procname	= "nlm_grace_period",
@@ -561,7 +563,7 @@ static struct ctl_table nlm_sysctls[] = {
 		.data		= &nsm_use_hostnames,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_dou8vec,
 	},
 	{
 		.procname	= "nsm_local_state",
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 706309f..6307737 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2112,6 +2112,30 @@ static int proc_put_char(void __user **buf, size_t *size, char c)
 	return 0;
 }

+
+static int do_proc_dou8vec_conv(bool *negp, unsigned long *lvalp,
+				u8 *valp,
+				int write, void *data)
+{
+	if (write) {
+		if (*negp) {
+			*valp = -*lvalp;
+		} else {
+			*valp = *lvalp;
+		}
+	} else {
+		int val = *valp;
+		if (val < 0) {
+			*negp = true;
+			*lvalp = -(unsigned long)val;
+		} else {
+			*negp = false;
+			*lvalp = (unsigned long)val;
+		}
+	}
+	return 0;
+}
+
 static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
 				 int *valp,
 				 int write, void *data)
@@ -2296,6 +2320,14 @@ int proc_douintvec(struct ctl_table *table, int write,
 				 do_proc_douintvec_conv, NULL);
 }

+int proc_dou8vec(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return do_proc_dointvec(table, write, buffer, lenp, ppos,
+				do_proc_dou8vec_conv, NULL);
+}
+
+

On 2016/12/11 23:36, Jia He wrote:
> nsm_use_hostnames is a module parameter and it will be exported to sysctl
> procfs. This is to let user sometimes change it from userspace. But the
> minimal unit for sysctl procfs read/write it sizeof(int).
> In big endian system, the converting from/to bool to/from int will cause
> error for proc items.
>
> This patch changes the type definition of nsm_use_hostnames.
>
> V2: Changes extern type in lockd.h
> Signed-off-by: Jia He
> ---
>  fs/lockd/mon.c              | 2 +-
>  fs/lockd/svc.c              | 2 +-
>  include/linux/lockd/lockd.h | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
> index 19166d4..3e7ff4d 100644
> --- a/fs/lockd/mon.c
> +++ b/fs/lockd/mon.c
> @@ -57,7 +57,7 @@ static DEFINE_SPINLOCK(nsm_lock);
>   * Local NSM state
>   */
>  u32	__read_mostly	nsm_local_state;
> -bool	__read_mostly	nsm_use_hostnames;
> +u32	__read_mostly	nsm_use_hostnames;
>
>  static inline struct sockaddr *nsm_addr(const struct nsm_handle *nsm)
>  {
> diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
> index fc4084e..308033d 100644
> --- a/fs/lockd/svc.c
> +++ b/fs/lockd/svc.c
> @@ -658,7 +658,7 @@ module_param_call(nlm_udpport, param_set_port, param_get_int,
>  		  &nlm_udpport, 0644);
>  module_param_call(nlm_tcpport, param_set_port, param_get_int,
>  		  &nlm_tcpport, 0644);
> -module_param(nsm_use_hostnames, bool, 0644);
> +module_param(nsm_use_hostnames, u32, 0644);
>  module_param(nlm_max_connections, uint, 0644);
>
>  static int lockd_init_net(struct net *net)
> diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
> index c153738..db52152 100644
> --- a/include/linux/lockd/lockd.h
> +++ b/include/linux/lockd/lockd.h
> @@ -196,7 +196,7 @@ extern struct svc_procedure nlmsvc_procedures4[];
>  #endif
>  extern int		nlmsvc_grace_period;
>  extern unsigned long	nlmsvc_timeout;
> -extern bool		nsm_use_hostnames;
> +extern u32		nsm_use_hostnames;
>  extern u32		nsm_local_state;
>
>  /*
Re: [PATCH 2/2] x86, paravirt: Fix bool return type for PVOP_CALL
Hi Peter,

I think I see the point. Then could we just declare __eax as rettype (here it
is bool) rather than unsigned long? I have not tested this idea.

@@ -461,7 +461,9 @@ int paravirt_disable_iospace(void);
 #define PVOP_VCALL_ARGS \
 	unsigned long __eax = __eax, __edx = __edx, __ecx = __ecx; \
 	register void *__sp asm("esp")
-#define PVOP_CALL_ARGS PVOP_VCALL_ARGS
+#define PVOP_CALL_ARGS \
+	rettype __eax = __eax, __edx = __edx, __ecx = __ecx;\
+	register void *__sp asm("esp")
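To make the concern behind this thread concrete, here is a small userspace sketch. It rests on an assumption that the mail itself does not spell out: that a callee returning bool only guarantees the low byte of the return register, so the upper bits may hold stale data. The function names and the register value used below are invented for illustration; this is not the PVOP macro code.

```c
#include <assert.h>
#include <stdbool.h>

/* Converting the whole register to bool: any stale upper bit makes the
 * result true, even when the callee returned false in the low byte. */
static bool ret_from_full_register(unsigned long eax)
{
	return (bool)eax;
}

/* Looking only at the low byte, which is what a bool-typed return value
 * is assumed to occupy here. */
static bool ret_from_low_byte(unsigned long eax)
{
	return (bool)(eax & 0xff);
}
```

With a register value such as 0xdeadbe00 (low byte zero, upper bits dirty), the first helper reports true and the second reports false, which is the kind of mismatch a wrongly typed __eax could produce.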
Re: [PATCH-tip] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 2016/12/7 03:14, Waiman Long wrote:
> A number of cmpxchg calls in qspinlock_paravirt.h were replaced by more
> relaxed versions to improve performance on architectures that use LL/SC.
>
> Signed-off-by: Waiman Long
> ---

Thanks! I applied it on my tree, and the tests are okay.

>  kernel/locking/qspinlock_paravirt.h | 36 +++-
>  1 file changed, 19 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index e3b5520..9d2205f 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
>  	struct __qspinlock *l = (void *)lock;
>
>  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>  		qstat_inc(qstat_pv_lock_stealing, true);
>  		return true;
>  	}
> @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
>
>  /*
>   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> - * just to be sure that it will get it.
> + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> + * lock to provide the proper memory barrier.
>   */
>  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>  {
>  	struct __qspinlock *l = (void *)lock;
>
>  	return !READ_ONCE(l->locked) &&
> -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> -			== _Q_PENDING_VAL);
> +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> +				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
>  }
>  #else /* _Q_PENDING_BITS == 8 */
>  static __always_inline void set_pending(struct qspinlock *lock)
> @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>  		 */
>  		old = val;
>  		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> -		val = atomic_cmpxchg(&lock->val, old, new);
> +		val = atomic_cmpxchg_acquire(&lock->val, old, new);
>
>  		if (val == old)
>  			return 1;
> @@ -211,7 +211,7 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
>
>  	for_each_hash_entry(he, offset, hash) {
>  		hopcnt++;
> -		if (!cmpxchg(&he->lock, NULL, lock)) {
> +		if (!cmpxchg_relaxed(&he->lock, NULL, lock)) {
>  			WRITE_ONCE(he->node, node);
>  			qstat_hop(hopcnt);
>  			return &he->lock;
> @@ -309,7 +309,7 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>  		 *       MB			      MB
>  		 * [L] pn->locked		[RmW] pn->state = vcpu_hashed
>  		 *
> -		 * Matches the cmpxchg() from pv_kick_node().
> +		 * Matches the cmpxchg_release() from pv_kick_node().
>  		 */
>  		smp_store_mb(pn->state, vcpu_halted);
>
> @@ -324,7 +324,7 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>  		 * value so that pv_wait_head_or_lock() knows to not also try
>  		 * to hash this lock.
>  		 */
> -		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
> +		cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);
>
>  		/*
>  		 * If the locked flag is still not set after wakeup, it is a
> @@ -360,9 +360,10 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>  	 * pv_wait_node(). If OTOH this fails, the vCPU was running and will
>  	 * observe its next->locked value and advance itself.
>  	 *
> -	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +	 * Matches with smp_store_mb() and cmpxchg_relaxed() in pv_wait_node().
>  	 */
> -	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> +	if (cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
> +	    != vcpu_halted)
>  		return;
>
>  	/*
> @@ -461,8 +462,8 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>  	}
>
>  	/*
> -	 * The cmpxchg() or xchg() call before coming here provides the
> -	 * acquire semantics for locking. The dummy ORing of _Q_LOCKED_VAL
> +	 * The cmpxchg_acquire() or xchg() call before coming here provides
> +	 * the acquire semantics for locking. The dummy ORing of _Q_LOCKED_VAL
>  	 * here is to indicate to the compiler that the value will always
>  	 * be nozero to enable better code optimization.
>  	 */
> @@ -488,11 +489,12 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinl
[PATCH] locking/pvqspinlock: Wait early if vcpu is preempted
If the prev node is not in running state or its cpu is preempted, we need to
wait early in pv_wait_node. After commit "sched/core: Introduce the
vcpu_is_preempted(cpu) interface" the kernel knows whether a vcpu is
running or not. So let's use it.

Signed-off-by: Pan Xinhui
---
 kernel/locking/qspinlock_paravirt.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e3b5520..48648dc 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -263,7 +263,8 @@ pv_wait_early(struct pv_node *prev, int loop)
 	if ((loop & PV_PREV_CHECK_MASK) != 0)
 		return false;

-	return READ_ONCE(prev->state) != vcpu_running;
+	return READ_ONCE(prev->state) != vcpu_running ||
+			vcpu_is_preempted(prev->cpu);
 }

 /*
--
2.4.11
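The decision made by the patched pv_wait_early() can be modelled in userspace. The following is a sketch under stated assumptions rather than the kernel implementation: every name ending in _sketch, the preempted_map array and the value chosen for PV_PREV_CHECK_MASK are placeholders for illustration, and READ_ONCE() is replaced by a plain read because nothing here runs concurrently.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value for illustration: check the previous node only once
 * every 256 spin iterations. */
#define PV_PREV_CHECK_MASK 0xff

enum vcpu_state_sketch { vcpu_running = 0, vcpu_halted, vcpu_hashed };

/* Hypothetical reduced node: only the fields the check looks at. */
struct pv_node_sketch {
	enum vcpu_state_sketch state;
	int cpu;
};

/* Hypothetical stand-in for the hypervisor's view of each vCPU. */
static bool preempted_map[8];

static bool vcpu_is_preempted_sketch(int cpu)
{
	return preempted_map[cpu];
}

/* Wait early if the previous waiter is not running, or (new with the
 * patch) if its vCPU is currently preempted by the host. */
static bool pv_wait_early_sketch(const struct pv_node_sketch *prev, int loop)
{
	if ((loop & PV_PREV_CHECK_MASK) != 0)
		return false;

	return prev->state != vcpu_running ||
	       vcpu_is_preempted_sketch(prev->cpu);
}
```

The added clause matters in the case where prev->state still says vcpu_running but the host has descheduled that vCPU: before the patch the waiter kept spinning, whereas with the extra check it gives up the CPU early.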
[PATCH v9 4/6] powerpc/pv-qspinlock: powerpc support pv-qspinlock
The default pv-qspinlock uses qspinlock(native version of pv-qspinlock). pv_lock initialization should be done in bootstage with irq disabled. And if we run as a guest with powerKVM/pHyp shared_processor mode, restore pv_lock_ops callbacks to pv-qspinlock(pv version) which makes full use of virtualization. There is a hash table, we store cpu number into it and the key is lock. So everytime pv_wait can know who is the lock holder by searching the lock. Also store the lock in a per_cpu struct, and remove it when we own the lock. Then pv_wait can know which lock we are spinning on. But the cpu in the hash table might not be the correct lock holder, as for performace issue, we does not take care of hash conflict. Also introduce spin_lock_holder, which tells who owns the lock now. currently the only user is spin_unlock_wait. Signed-off-by: Pan Xinhui --- arch/powerpc/include/asm/qspinlock.h | 29 +++- arch/powerpc/include/asm/qspinlock_paravirt.h | 36 + .../powerpc/include/asm/qspinlock_paravirt_types.h | 13 ++ arch/powerpc/kernel/paravirt.c | 153 + arch/powerpc/lib/locks.c | 8 +- arch/powerpc/platforms/pseries/setup.c | 5 + 6 files changed, 241 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt_types.h create mode 100644 arch/powerpc/kernel/paravirt.c diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h index 4c89256..8fd6349 100644 --- a/arch/powerpc/include/asm/qspinlock.h +++ b/arch/powerpc/include/asm/qspinlock.h @@ -15,7 +15,7 @@ static inline u8 *__qspinlock_lock_byte(struct qspinlock *lock) return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN); } -static inline void queued_spin_unlock(struct qspinlock *lock) +static inline void native_queued_spin_unlock(struct qspinlock *lock) { /* release semantics is required */ smp_store_release(__qspinlock_lock_byte(lock), 0); @@ -27,6 +27,33 @@ static inline int 
queued_spin_is_locked(struct qspinlock *lock) return atomic_read(&lock->val); } +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#include +/* + * try to know who is the lock holder, however it is not always true + * Return: + * -1, we did not know the lock holder. + * other value, likely is the lock holder. + */ +extern int spin_lock_holder(void *lock); + +static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) +{ + pv_queued_spin_lock(lock, val); +} + +static inline void queued_spin_unlock(struct qspinlock *lock) +{ + pv_queued_spin_unlock(lock); +} +#else +#define spin_lock_holder(l) (-1) +static inline void queued_spin_unlock(struct qspinlock *lock) +{ + native_queued_spin_unlock(lock); +} +#endif + #include /* we need override it as ppc has io_sync stuff */ diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h new file mode 100644 index 000..d87cda0 --- /dev/null +++ b/arch/powerpc/include/asm/qspinlock_paravirt.h @@ -0,0 +1,36 @@ +#ifndef CONFIG_PARAVIRT_SPINLOCKS +#error "do not include this file" +#endif + +#ifndef _ASM_QSPINLOCK_PARAVIRT_H +#define _ASM_QSPINLOCK_PARAVIRT_H + +#include + +extern void pv_lock_init(void); +extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void __pv_init_lock_hash(void); +extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void __pv_queued_spin_unlock(struct qspinlock *lock); + +static inline void pv_queued_spin_lock(struct qspinlock *lock, u32 val) +{ + pv_lock_op.lock(lock, val); +} + +static inline void pv_queued_spin_unlock(struct qspinlock *lock) +{ + pv_lock_op.unlock(lock); +} + +static inline void pv_wait(u8 *ptr, u8 val) +{ + pv_lock_op.wait(ptr, val); +} + +static inline void pv_kick(int cpu) +{ + pv_lock_op.kick(cpu); +} + +#endif diff --git a/arch/powerpc/include/asm/qspinlock_paravirt_types.h b/arch/powerpc/include/asm/qspinlock_paravirt_types.h new file mode 100644 index 000..83611ed 
--- /dev/null +++ b/arch/powerpc/include/asm/qspinlock_paravirt_types.h @@ -0,0 +1,13 @@ +#ifndef _ASM_QSPINLOCK_PARAVIRT_TYPES_H +#define _ASM_QSPINLOCK_PARAVIRT_TYPES_H + +struct pv_lock_ops { + void (*lock)(struct qspinlock *lock, u32 val); + void (*unlock)(struct qspinlock *lock); + void (*wait)(u8 *ptr, u8 val); + void (*kick)(int cpu); +}; + +extern struct pv_lock_ops pv_lock_op; + +#endif diff --git a/arch/powerpc/kernel/paravirt.c b/arch/powerpc/kernel/paravirt.c new file mode 100644 index 000..e697b17 --- /dev/null +++ b/arch/powerpc/kernel/paravirt.c @@ -0,0 +1,153 @@ +/* + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + *
[PATCH v9 0/6] Implement qspinlock/pv-qspinlock on ppc
bufsize 8000 maxblocks                   2851.7     2838.1     2785.5
Pipe Throughput                          1221.9     1265.3     1250.4
Pipe-based Context Switching              529.8      578.1      564.2
Process Creation                          408.4      421.6      287.6
Shell Scripts (1 concurrent)             1201.8     1215.3     1185.8
Shell Scripts (8 concurrent)             3758.4     3799.3     3878.9
System Call Overhead                     1008.3     1122.6     1134.2
                                       ========
System Benchmarks Index Score            1072.0     1108.9     1050.6

Pan Xinhui (6):
  powerpc/qspinlock: powerpc support qspinlock
  powerpc: platforms/Kconfig: Add qspinlock build config
  powerpc: lib/locks.c: Add cpu yield/wake helper function
  powerpc/pv-qspinlock: powerpc support pv-qspinlock
  powerpc: pSeries: Add pv-qspinlock build config/make
  powerpc/pv-qspinlock: Optimise native unlock path

 arch/powerpc/include/asm/qspinlock.h               |  93
 arch/powerpc/include/asm/qspinlock_paravirt.h      |  52 +++
 .../powerpc/include/asm/qspinlock_paravirt_types.h |  13 ++
 arch/powerpc/include/asm/spinlock.h                |  35 +++--
 arch/powerpc/include/asm/spinlock_types.h          |   4 +
 arch/powerpc/kernel/Makefile                       |   1 +
 arch/powerpc/kernel/paravirt.c                     | 157 +
 arch/powerpc/lib/locks.c                           | 123
 arch/powerpc/platforms/Kconfig                     |   9 ++
 arch/powerpc/platforms/pseries/Kconfig             |   8 ++
 arch/powerpc/platforms/pseries/setup.c             |   5 +
 11 files changed, 487 insertions(+), 13 deletions(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock.h
 create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
 create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt_types.h
 create mode 100644 arch/powerpc/kernel/paravirt.c

--
2.4.11
[PATCH v9 6/6] powerpc/pv-qspinlock: Optimise native unlock path
Avoid a function call under the native version of qspinlock. On powerNV,
before applying this patch, every unlock is expensive. This small
optimization enhances the performance.

We use static_key with jump_label, which removes unnecessary loads of
lppaca and its stuff.

Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/qspinlock_paravirt.h | 18 +-
 arch/powerpc/kernel/paravirt.c                |  4
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h
index d87cda0..8d39446 100644
--- a/arch/powerpc/include/asm/qspinlock_paravirt.h
+++ b/arch/powerpc/include/asm/qspinlock_paravirt.h
@@ -6,12 +6,14 @@
 #define _ASM_QSPINLOCK_PARAVIRT_H

 #include
+#include

 extern void pv_lock_init(void);
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
 extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_queued_spin_unlock(struct qspinlock *lock);
+extern struct static_key_true sharedprocessor_key;

 static inline void pv_queued_spin_lock(struct qspinlock *lock, u32 val)
 {
@@ -20,7 +22,21 @@ static inline void pv_queued_spin_lock(struct qspinlock *lock, u32 val)

 static inline void pv_queued_spin_unlock(struct qspinlock *lock)
 {
-	pv_lock_op.unlock(lock);
+	/*
+	 * on powerNV and pSeries with jump_label, code will be
+	 *	PowerNV:		pSeries:
+	 *	nop;			b 2f;
+	 *	native unlock		2:
+	 *				pv unlock;
+	 * In this way, we can do unlock quick in native case.
+	 *
+	 * IF jump_label is not enabled, we fall back into
+	 * if condition, IOW, ld && cmp && bne.
+	 */
+	if (static_branch_likely(&sharedprocessor_key))
+		native_queued_spin_unlock(lock);
+	else
+		pv_lock_op.unlock(lock);
 }

 static inline void pv_wait(u8 *ptr, u8 val)
diff --git a/arch/powerpc/kernel/paravirt.c b/arch/powerpc/kernel/paravirt.c
index e697b17..a0a000e 100644
--- a/arch/powerpc/kernel/paravirt.c
+++ b/arch/powerpc/kernel/paravirt.c
@@ -140,6 +140,9 @@ struct pv_lock_ops pv_lock_op = {
 };
 EXPORT_SYMBOL(pv_lock_op);

+struct static_key_true sharedprocessor_key = STATIC_KEY_TRUE_INIT;
+EXPORT_SYMBOL(sharedprocessor_key);
+
 void __init pv_lock_init(void)
 {
 	if (SHARED_PROCESSOR) {
@@ -149,5 +152,6 @@ void __init pv_lock_init(void)
 		pv_lock_op.unlock = __pv_queued_spin_unlock;
 		pv_lock_op.wait = __pv_wait;
 		pv_lock_op.kick = __pv_kick;
+		static_branch_disable(&sharedprocessor_key);
 	}
 }
--
2.4.11
[PATCH v9 5/6] powerpc: pSeries: Add pv-qspinlock build config/make
pSeries runs as a guest and might need pv-qspinlock.

Signed-off-by: Pan Xinhui
---
 arch/powerpc/kernel/Makefile           | 1 +
 arch/powerpc/platforms/pseries/Kconfig | 8
 2 files changed, 9 insertions(+)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 1925341..4780415 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_PPC_970_NAP) += idle_power4.o
 obj-$(CONFIG_PPC_P7_NAP)	+= idle_book3s.o
 procfs-y			:= proc_powerpc.o
 obj-$(CONFIG_PROC_FS)		+= $(procfs-y)
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)	+= paravirt.o
 rtaspci-$(CONFIG_PPC64)-$(CONFIG_PCI)	:= rtas_pci.o
 obj-$(CONFIG_PPC_RTAS)		+= rtas.o rtas-rtc.o $(rtaspci-y-y)
 obj-$(CONFIG_PPC_RTAS_DAEMON)	+= rtasd.o
diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index bec90fb..c9cc064 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -33,6 +33,14 @@ config PPC_SPLPAR
 	  processors, that is, which share physical processors between
 	  two or more partitions.

+config PARAVIRT_SPINLOCKS
+	bool "Paravirtualization support for qspinlock"
+	depends on PPC_SPLPAR && QUEUED_SPINLOCKS
+	default y
+	help
+	  If the kernel needs to run as a guest then enable this option.
+	  Generally it lets the kernel achieve better performance.
+
 config DTL
 	bool "Dispatch Trace Log"
 	depends on PPC_SPLPAR && DEBUG_FS
--
2.4.11
[PATCH v9 1/6] powerpc/qspinlock: powerpc support qspinlock
This patch adds basic code to enable qspinlock on powerpc. qspinlock is
one kind of fair-lock implementation, and we have seen some performance
improvement under some scenarios.

queued_spin_unlock() releases the lock with a single write of 0 to the
::locked field, which sits at different places on the two endianness
systems.

We override some arch_spin_XXX as powerpc has io_sync stuff which makes
sure the io operations are protected by the lock correctly.

There is another special case, see commit
2c610022711 ("locking/qspinlock: Fix spin_unlock_wait() some more")

Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/qspinlock.h      | 66 +++
 arch/powerpc/include/asm/spinlock.h       | 31 +--
 arch/powerpc/include/asm/spinlock_types.h |  4 ++
 arch/powerpc/lib/locks.c                  | 62 +
 4 files changed, 150 insertions(+), 13 deletions(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock.h

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
new file mode 100644
index 0000000..4c89256
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -0,0 +1,66 @@
+#ifndef _ASM_POWERPC_QSPINLOCK_H
+#define _ASM_POWERPC_QSPINLOCK_H
+
+#include
+
+#define SPIN_THRESHOLD (1 << 15)
+#define queued_spin_unlock queued_spin_unlock
+#define queued_spin_is_locked queued_spin_is_locked
+#define queued_spin_unlock_wait queued_spin_unlock_wait
+
+extern void queued_spin_unlock_wait(struct qspinlock *lock);
+
+static inline u8 *__qspinlock_lock_byte(struct qspinlock *lock)
+{
+	return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
+}
+
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+	/* release semantics is required */
+	smp_store_release(__qspinlock_lock_byte(lock), 0);
+}
+
+static inline int queued_spin_is_locked(struct qspinlock *lock)
+{
+	smp_mb();
+	return atomic_read(&lock->val);
+}
+
+#include
+
+/* we need override it as ppc has io_sync stuff */
+#undef arch_spin_trylock
+#undef arch_spin_lock
+#undef arch_spin_lock_flags
+#undef arch_spin_unlock
+#define arch_spin_trylock arch_spin_trylock
+#define arch_spin_lock arch_spin_lock
+#define arch_spin_lock_flags arch_spin_lock_flags
+#define arch_spin_unlock arch_spin_unlock
+
+static inline int arch_spin_trylock(arch_spinlock_t *lock)
+{
+	CLEAR_IO_SYNC;
+	return queued_spin_trylock(lock);
+}
+
+static inline void arch_spin_lock(arch_spinlock_t *lock)
+{
+	CLEAR_IO_SYNC;
+	queued_spin_lock(lock);
+}
+
+static inline
+void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
+{
+	CLEAR_IO_SYNC;
+	queued_spin_lock(lock);
+}
+
+static inline void arch_spin_unlock(arch_spinlock_t *lock)
+{
+	SYNC_IO;
+	queued_spin_unlock(lock);
+}
+#endif /* _ASM_POWERPC_QSPINLOCK_H */
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 8c1b913..954099e 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -60,6 +60,23 @@ static inline bool vcpu_is_preempted(int cpu)
 }
 #endif
 
+#if defined(CONFIG_PPC_SPLPAR)
+/* We only yield to the hypervisor if we are in shared processor mode */
+#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
+extern void __spin_yield(arch_spinlock_t *lock);
+extern void __rw_yield(arch_rwlock_t *lock);
+#else /* SPLPAR */
+#define __spin_yield(x)	barrier()
+#define __rw_yield(x)	barrier()
+#define SHARED_PROCESSOR	0
+#endif
+
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include
+#else
+
+#define arch_spin_relax(lock)	__spin_yield(lock)
+
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return lock.slock == 0;
@@ -114,18 +131,6 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
  * held. Conveniently, we have a word in the paca that holds this
  * value.
  */
-
-#if defined(CONFIG_PPC_SPLPAR)
-/* We only yield to the hypervisor if we are in shared processor mode */
-#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
-extern void __spin_yield(arch_spinlock_t *lock);
-extern void __rw_yield(arch_rwlock_t *lock);
-#else /* SPLPAR */
-#define __spin_yield(x)	barrier()
-#define __rw_yield(x)	barrier()
-#define SHARED_PROCESSOR	0
-#endif
-
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
 	CLEAR_IO_SYNC;
@@ -203,6 +208,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 	smp_mb();
 }
 
+#endif /* !CONFIG_QUEUED_SPINLOCKS */
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
@@ -338,7 +344,6 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
 #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
 
-#define arch_spin_relax(lock)	__spin_yield(lock)
 #define arch_read_relax(loc
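To illustrate the lock-byte offset that the commit message describes, here is a minimal userspace sketch, not the kernel code. It assumes the 32-bit lock word keeps its locked byte in the least significant 8 bits, which is why that byte sits at offset 0 on little endian and offset 3 on big endian. The names lock_byte_offset() and lock_byte_unlock() are hypothetical, and the plain store below has none of the release semantics that smp_store_release() provides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: byte offset of the least significant byte of a
 * 32-bit word in memory. 0 on little endian, 3 on big endian.
 */
static size_t lock_byte_offset(void)
{
	const uint32_t probe = 1;
	uint8_t first;

	memcpy(&first, &probe, 1);
	return first ? 0 : 3;
}

/*
 * Clear only the low ("locked") byte of the lock word and leave the
 * upper 24 bits untouched. Plain store, no memory ordering implied.
 */
static void lock_byte_unlock(uint32_t *val)
{
	uint8_t *bytes = (uint8_t *)val;

	assert(val != NULL);
	bytes[lock_byte_offset()] = 0;
}
```

The runtime probe stands in for the compile-time IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN) test used in the patch.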
[PATCH v9 2/6] powerpc: platforms/Kconfig: Add qspinlock build config
pSeries/powerNV will use qspinlock from now on.

Signed-off-by: Pan Xinhui
---
 arch/powerpc/platforms/Kconfig | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index fbdae83..3559bbf 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -20,6 +20,15 @@ source "arch/powerpc/platforms/44x/Kconfig"
 source "arch/powerpc/platforms/40x/Kconfig"
 source "arch/powerpc/platforms/amigaone/Kconfig"
 
+config ARCH_USE_QUEUED_SPINLOCKS
+	depends on PPC_PSERIES || PPC_POWERNV
+	bool "Enable qspinlock"
+	default y
+	help
+	  Enabling this option will let kernel use qspinlock which is a kind of
+	  fairlock. It has shown a good performance improvement on x86 and also
+	  ppc especially in high contention cases.
+
 config KVM_GUEST
 	bool "KVM Guest support"
 	default n
-- 
2.4.11
[PATCH v9 3/6] powerpc: lib/locks.c: Add cpu yield/wake helper function
Add two corresponding helper functions to support pv-qspinlock.

For normal use, __spin_yield_cpu will confer the current vcpu's slices
to the target vcpu (say, a lock holder). If the target vcpu is not
specified or it is in running state, whether we confer to the lpar
depends on the second parameter. The hcall itself introduces latency and
a little overhead, and we do NOT want to suffer any latency in some
cases, e.g. in an interrupt handler. The second parameter *confer*
indicates such a case.

__spin_wake_cpu is simpler: it will wake up one vcpu regardless of its
current vcpu state.

Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/spinlock.h |  4 +++
 arch/powerpc/lib/locks.c            | 57 +
 2 files changed, 61 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 954099e..6426bd5 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -64,9 +64,13 @@ static inline bool vcpu_is_preempted(int cpu)
 /* We only yield to the hypervisor if we are in shared processor mode */
 #define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
 extern void __spin_yield(arch_spinlock_t *lock);
+extern void __spin_yield_cpu(int cpu, int confer);
+extern void __spin_wake_cpu(int cpu);
 extern void __rw_yield(arch_rwlock_t *lock);
 #else /* SPLPAR */
 #define __spin_yield(x)	barrier()
+#define __spin_yield_cpu(x, y) barrier()
+#define __spin_wake_cpu(x) barrier()
 #define __rw_yield(x)	barrier()
 #define SHARED_PROCESSOR	0
 #endif
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index 8f6dbb0..dff0bfa 100644
--- a/arch/powerpc/lib/locks.c
+++ b/arch/powerpc/lib/locks.c
@@ -23,6 +23,63 @@
 #include
 #include
 
+/*
+ * confer our slices to a specified cpu and return. If it is in running state
+ * or cpu is -1, then we will check confer. If confer is NULL, we will return
+ * otherwise we confer our slices to lpar.
+ */
+void __spin_yield_cpu(int cpu, int confer)
+{
+	unsigned int yield_count;
+
+	if (cpu == -1)
+		goto yield_to_lpar;
+
+	BUG_ON(cpu >= nr_cpu_ids);
+	yield_count = be32_to_cpu(lppaca_of(cpu).yield_count);
+
+	/* if cpu is running, confer slices to lpar conditionally*/
+	if ((yield_count & 1) == 0)
+		goto yield_to_lpar;
+
+	plpar_hcall_norets(H_CONFER,
+		get_hard_smp_processor_id(cpu), yield_count);
+	return;
+
+yield_to_lpar:
+	if (confer)
+		plpar_hcall_norets(H_CONFER, -1, 0);
+}
+EXPORT_SYMBOL_GPL(__spin_yield_cpu);
+
+void __spin_wake_cpu(int cpu)
+{
+	BUG_ON(cpu >= nr_cpu_ids);
+	/*
+	 * NOTE: we should always do this hcall regardless of
+	 * the yield_count of the holder_cpu.
+	 * as there might be a case like below;
+	 *	CPU 1				CPU 2
+	 *				yielded = true
+	 *	if (yielded)
+	 *	__spin_wake_cpu()
+	 *				__spin_yield_cpu()
+	 *
+	 * So we might lose a wake if we check the yield_count and
+	 * return directly if the holder_cpu is running.
+	 * IOW. do NOT code like below.
+	 *	yield_count = be32_to_cpu(lppaca_of(cpu).yield_count);
+	 *	if ((yield_count & 1) == 0)
+	 *		return;
+	 *
+	 * a PROD hcall marks the target_cpu proded, which cause the next cede
+	 * or confer called on the target_cpu invalid.
+	 */
+	plpar_hcall_norets(H_PROD,
+		get_hard_smp_processor_id(cpu));
+}
+EXPORT_SYMBOL_GPL(__spin_wake_cpu);
+
 #ifndef CONFIG_QUEUED_SPINLOCKS
 void __spin_yield(arch_spinlock_t *lock)
 {
-- 
2.4.11
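The branching in __spin_yield_cpu() can be sketched as a pure function, which may make the three outcomes easier to see. This is only a userspace model under stated assumptions: the enum and the name yield_decision() are hypothetical, no hcall is made, and yield_count is passed in rather than read from the lppaca. As in the patch, an even yield_count is taken to mean the target vcpu is running.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for what __spin_yield_cpu() ends up doing. */
enum yield_action {
	YIELD_NONE,		/* return without any hcall */
	YIELD_TO_LPAR,		/* H_CONFER with target -1 */
	YIELD_TO_HOLDER,	/* H_CONFER to the given cpu */
};

/* Hypothetical model of the decision, with no side effects. */
static enum yield_action yield_decision(int cpu, uint32_t yield_count,
					int confer)
{
	assert(cpu >= -1);

	/* An odd yield_count means the target vcpu is not running. */
	if (cpu != -1 && (yield_count & 1))
		return YIELD_TO_HOLDER;

	/* No target, or target already running: confer only if asked. */
	return confer ? YIELD_TO_LPAR : YIELD_NONE;
}
```

A caller in an interrupt handler would pass confer = 0 so that the function returns without paying for an hcall when the holder is already running.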
Re: [PATCH v8 2/6] powerpc: pSeries/Kconfig: Add qspinlock build config
On 2016/12/6 09:24, Pan Xinhui wrote:
>
> On 2016/12/6 08:58, Boqun Feng wrote:
>> On Mon, Dec 05, 2016 at 10:19:22AM -0500, Pan Xinhui wrote:
>>> pSeries/powerNV will use qspinlock from now on.
>>>
>>> Signed-off-by: Pan Xinhui
>>> ---
>>>  arch/powerpc/platforms/pseries/Kconfig | 8 
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
>>> index bec90fb..8a87d06 100644
>>> --- a/arch/powerpc/platforms/pseries/Kconfig
>>> +++ b/arch/powerpc/platforms/pseries/Kconfig
>>
>> Why here? Not arch/powerpc/platforms/Kconfig?
>>
>>> @@ -23,6 +23,14 @@ config PPC_PSERIES
>>>  	select PPC_DOORBELL
>>>  	default y
>>>
>>> +config ARCH_USE_QUEUED_SPINLOCKS
>>> +	default y
>>> +	bool "Enable qspinlock"
>>
>> I think you just enable qspinlock by default for all PPC platforms. I
>> guess you need to put
>>
>>	depends on PPC_PSERIES || PPC_POWERNV
>>
>> here to achieve what you mean in your commit message.

Oh, yes, it needs depends on PPC_PSERIES || PPC_POWERNV.

> Yes, that is another good way.
> I prefer to put it in pseries/Kconfig, the same as the pv-qspinlocks config.
> When we build nv, it still includes pSeries's config anyway.
>
> thanks
> xinhui
>
>> Regards,
>> Boqun
>>
>>> +	help
>>> +	  Enabling this option will let kernel use qspinlock which is a kind of
>>> +	  fairlock. It has shown a good performance improvement on x86 and also ppc
>>> +	  especially in high contention cases.
>>> +
>>>  config PPC_SPLPAR
>>>  	depends on PPC_PSERIES
>>>  	bool "Support for shared-processor logical partitions"
>>> --
>>> 2.4.11
Re: [PATCH v8 2/6] powerpc: pSeries/Kconfig: Add qspinlock build config
On 2016/12/6 08:58, Boqun Feng wrote:
> On Mon, Dec 05, 2016 at 10:19:22AM -0500, Pan Xinhui wrote:
>> pSeries/powerNV will use qspinlock from now on.
>>
>> Signed-off-by: Pan Xinhui
>> ---
>>  arch/powerpc/platforms/pseries/Kconfig | 8 
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
>> index bec90fb..8a87d06 100644
>> --- a/arch/powerpc/platforms/pseries/Kconfig
>> +++ b/arch/powerpc/platforms/pseries/Kconfig
>
> Why here? Not arch/powerpc/platforms/Kconfig?
>
>> @@ -23,6 +23,14 @@ config PPC_PSERIES
>>  	select PPC_DOORBELL
>>  	default y
>>
>> +config ARCH_USE_QUEUED_SPINLOCKS
>> +	default y
>> +	bool "Enable qspinlock"
>
> I think you just enable qspinlock by default for all PPC platforms. I
> guess you need to put
>
>	depends on PPC_PSERIES || PPC_POWERNV
>
> here to achieve what you mean in your commit message.

Yes, that is another good way.
I prefer to put it in pseries/Kconfig, the same as the pv-qspinlocks config.
When we build nv, it still includes pSeries's config anyway.

thanks
xinhui

> Regards,
> Boqun
>
>> +	help
>> +	  Enabling this option will let kernel use qspinlock which is a kind of
>> +	  fairlock. It has shown a good performance improvement on x86 and also ppc
>> +	  especially in high contention cases.
>> +
>>  config PPC_SPLPAR
>>  	depends on PPC_PSERIES
>>  	bool "Support for shared-processor logical partitions"
>> --
>> 2.4.11
Re: [PATCH v8 1/6] powerpc/qspinlock: powerpc support qspinlock
Correct Waiman's address.

On 2016/12/6 08:47, Boqun Feng wrote:
> On Mon, Dec 05, 2016 at 10:19:21AM -0500, Pan Xinhui wrote:
>> This patch adds basic code to enable qspinlock on powerpc. qspinlock is
>> one kind of fair-lock implementation, and we have seen some performance
>> improvement under some scenarios.
>>
>> queued_spin_unlock() releases the lock with a single write of 0 to the
>> ::locked field, which sits at different places on the two endianness
>> systems.
>>
>> We override some arch_spin_XXX as powerpc has io_sync stuff which makes
>> sure the io operations are protected by the lock correctly.
>>
>> There is another special case, see commit
>> 2c610022711 ("locking/qspinlock: Fix spin_unlock_wait() some more")
>>
>> Signed-off-by: Pan Xinhui
>> ---
>>  arch/powerpc/include/asm/qspinlock.h      | 66 +++
>>  arch/powerpc/include/asm/spinlock.h       | 31 +--
>>  arch/powerpc/include/asm/spinlock_types.h |  4 ++
>>  arch/powerpc/lib/locks.c                  | 59 +++
>>  4 files changed, 147 insertions(+), 13 deletions(-)
>>  create mode 100644 arch/powerpc/include/asm/qspinlock.h
[PATCH v8 1/6] powerpc/qspinlock: powerpc support qspinlock
This patch adds basic code to enable qspinlock on powerpc. qspinlock is
one kind of fair-lock implementation, and we have seen some performance
improvement under some scenarios.

queued_spin_unlock() releases the lock with a single write of 0 to the
::locked field, which sits at different places on the two endianness
systems.

We override some arch_spin_XXX as powerpc has io_sync stuff which makes
sure the io operations are protected by the lock correctly.

There is another special case, see commit
2c610022711 ("locking/qspinlock: Fix spin_unlock_wait() some more")

Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/qspinlock.h      | 66 +++
 arch/powerpc/include/asm/spinlock.h       | 31 +--
 arch/powerpc/include/asm/spinlock_types.h |  4 ++
 arch/powerpc/lib/locks.c                  | 59 +++
 4 files changed, 147 insertions(+), 13 deletions(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock.h

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
new file mode 100644
index 0000000..4c89256
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -0,0 +1,66 @@
+#ifndef _ASM_POWERPC_QSPINLOCK_H
+#define _ASM_POWERPC_QSPINLOCK_H
+
+#include
+
+#define SPIN_THRESHOLD (1 << 15)
+#define queued_spin_unlock queued_spin_unlock
+#define queued_spin_is_locked queued_spin_is_locked
+#define queued_spin_unlock_wait queued_spin_unlock_wait
+
+extern void queued_spin_unlock_wait(struct qspinlock *lock);
+
+static inline u8 *__qspinlock_lock_byte(struct qspinlock *lock)
+{
+	return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
+}
+
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+	/* release semantics is required */
+	smp_store_release(__qspinlock_lock_byte(lock), 0);
+}
+
+static inline int queued_spin_is_locked(struct qspinlock *lock)
+{
+	smp_mb();
+	return atomic_read(&lock->val);
+}
+
+#include
+
+/* we need override it as ppc has io_sync stuff */
+#undef arch_spin_trylock
+#undef arch_spin_lock
+#undef arch_spin_lock_flags
+#undef arch_spin_unlock
+#define arch_spin_trylock arch_spin_trylock
+#define arch_spin_lock arch_spin_lock
+#define arch_spin_lock_flags arch_spin_lock_flags
+#define arch_spin_unlock arch_spin_unlock
+
+static inline int arch_spin_trylock(arch_spinlock_t *lock)
+{
+	CLEAR_IO_SYNC;
+	return queued_spin_trylock(lock);
+}
+
+static inline void arch_spin_lock(arch_spinlock_t *lock)
+{
+	CLEAR_IO_SYNC;
+	queued_spin_lock(lock);
+}
+
+static inline
+void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
+{
+	CLEAR_IO_SYNC;
+	queued_spin_lock(lock);
+}
+
+static inline void arch_spin_unlock(arch_spinlock_t *lock)
+{
+	SYNC_IO;
+	queued_spin_unlock(lock);
+}
+#endif /* _ASM_POWERPC_QSPINLOCK_H */
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 8c1b913..954099e 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -60,6 +60,23 @@ static inline bool vcpu_is_preempted(int cpu)
 }
 #endif
 
+#if defined(CONFIG_PPC_SPLPAR)
+/* We only yield to the hypervisor if we are in shared processor mode */
+#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
+extern void __spin_yield(arch_spinlock_t *lock);
+extern void __rw_yield(arch_rwlock_t *lock);
+#else /* SPLPAR */
+#define __spin_yield(x)	barrier()
+#define __rw_yield(x)	barrier()
+#define SHARED_PROCESSOR	0
+#endif
+
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include
+#else
+
+#define arch_spin_relax(lock)	__spin_yield(lock)
+
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return lock.slock == 0;
@@ -114,18 +131,6 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
  * held. Conveniently, we have a word in the paca that holds this
  * value.
  */
-
-#if defined(CONFIG_PPC_SPLPAR)
-/* We only yield to the hypervisor if we are in shared processor mode */
-#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
-extern void __spin_yield(arch_spinlock_t *lock);
-extern void __rw_yield(arch_rwlock_t *lock);
-#else /* SPLPAR */
-#define __spin_yield(x)	barrier()
-#define __rw_yield(x)	barrier()
-#define SHARED_PROCESSOR	0
-#endif
-
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
 	CLEAR_IO_SYNC;
@@ -203,6 +208,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 	smp_mb();
 }
 
+#endif /* !CONFIG_QUEUED_SPINLOCKS */
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
@@ -338,7 +344,6 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
 #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
 
-#define arch_spin_relax(lock)	__spin_yield(lock)
 #define arch_read_relax(loc
[PATCH v8 2/6] powerpc: pSeries/Kconfig: Add qspinlock build config
pSeries/powerNV will use qspinlock from now on.

Signed-off-by: Pan Xinhui
---
 arch/powerpc/platforms/pseries/Kconfig | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index bec90fb..8a87d06 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -23,6 +23,14 @@ config PPC_PSERIES
 	select PPC_DOORBELL
 	default y
 
+config ARCH_USE_QUEUED_SPINLOCKS
+	default y
+	bool "Enable qspinlock"
+	help
+	  Enabling this option will let kernel use qspinlock which is a kind of
+	  fairlock. It has shown a good performance improvement on x86 and also ppc
+	  especially in high contention cases.
+
 config PPC_SPLPAR
 	depends on PPC_PSERIES
 	bool "Support for shared-processor logical partitions"
-- 
2.4.11
[PATCH v8 4/6] powerpc/pv-qspinlock: powerpc support pv-qspinlock
The default pv-qspinlock uses qspinlock (the native version of
pv-qspinlock). pv_lock initialization should be done at boot stage with
irqs disabled. If we run as a guest in powerKVM/pHyp shared_processor
mode, the pv_lock_ops callbacks are set to pv-qspinlock (the pv version),
which makes full use of virtualization.

There is a hash table: we store the cpu number in it, and the key is the
lock. So every time, pv_wait can find out who the lock holder is by
looking up the lock. We also store the lock in a per_cpu struct and
remove it when we own the lock, so pv_wait can know which lock we are
spinning on. But the cpu in the hash table might not be the correct lock
holder: for performance reasons, we do not handle hash conflicts.

Also introduce spin_lock_holder, which tells who owns the lock now.
Currently the only user is spin_unlock_wait.

Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/qspinlock.h               |  29 +++-
 arch/powerpc/include/asm/qspinlock_paravirt.h      |  36 +
 .../powerpc/include/asm/qspinlock_paravirt_types.h |  13 ++
 arch/powerpc/kernel/paravirt.c                     | 153 +
 arch/powerpc/lib/locks.c                           |   8 +-
 arch/powerpc/platforms/pseries/setup.c             |   5 +
 6 files changed, 241 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
 create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt_types.h
 create mode 100644 arch/powerpc/kernel/paravirt.c

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 4c89256..8fd6349 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -15,7 +15,7 @@ static inline u8 *__qspinlock_lock_byte(struct qspinlock *lock)
 	return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
 }
 
-static inline void queued_spin_unlock(struct qspinlock *lock)
+static inline void native_queued_spin_unlock(struct qspinlock *lock)
 {
 	/* release semantics is required */
 	smp_store_release(__qspinlock_lock_byte(lock), 0);
@@ -27,6 +27,33 @@ static inline int queued_spin_is_locked(struct qspinlock *lock)
 	return atomic_read(&lock->val);
 }
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#include
+/*
+ * try to know who is the lock holder, however it is not always true
+ * Return:
+ * -1, we did not know the lock holder.
+ * other value, likely is the lock holder.
+ */
+extern int spin_lock_holder(void *lock);
+
+static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+	pv_queued_spin_lock(lock, val);
+}
+
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+	pv_queued_spin_unlock(lock);
+}
+#else
+#define spin_lock_holder(l) (-1)
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+	native_queued_spin_unlock(lock);
+}
+#endif
+
 #include
 
 /* we need override it as ppc has io_sync stuff */
diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h
new file mode 100644
index 0000000..d87cda0
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_paravirt.h
@@ -0,0 +1,36 @@
+#ifndef CONFIG_PARAVIRT_SPINLOCKS
+#error "do not include this file"
+#endif
+
+#ifndef _ASM_QSPINLOCK_PARAVIRT_H
+#define _ASM_QSPINLOCK_PARAVIRT_H
+
+#include
+
+extern void pv_lock_init(void);
+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_init_lock_hash(void);
+extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
+
+static inline void pv_queued_spin_lock(struct qspinlock *lock, u32 val)
+{
+	pv_lock_op.lock(lock, val);
+}
+
+static inline void pv_queued_spin_unlock(struct qspinlock *lock)
+{
+	pv_lock_op.unlock(lock);
+}
+
+static inline void pv_wait(u8 *ptr, u8 val)
+{
+	pv_lock_op.wait(ptr, val);
+}
+
+static inline void pv_kick(int cpu)
+{
+	pv_lock_op.kick(cpu);
+}
+
+#endif
diff --git a/arch/powerpc/include/asm/qspinlock_paravirt_types.h b/arch/powerpc/include/asm/qspinlock_paravirt_types.h
new file mode 100644
index 0000000..83611ed
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_paravirt_types.h
@@ -0,0 +1,13 @@
+#ifndef _ASM_QSPINLOCK_PARAVIRT_TYPES_H
+#define _ASM_QSPINLOCK_PARAVIRT_TYPES_H
+
+struct pv_lock_ops {
+	void (*lock)(struct qspinlock *lock, u32 val);
+	void (*unlock)(struct qspinlock *lock);
+	void (*wait)(u8 *ptr, u8 val);
+	void (*kick)(int cpu);
+};
+
+extern struct pv_lock_ops pv_lock_op;
+
+#endif
diff --git a/arch/powerpc/kernel/paravirt.c b/arch/powerpc/kernel/paravirt.c
new file mode 100644
index 0000000..e697b17
--- /dev/null
+++ b/arch/powerpc/kernel/paravirt.c
@@ -0,0 +1,153 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ *
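The lock-to-cpu hash table described in the commit message can be pictured with the following userspace sketch. The paravirt.c implementation is cut off above, so nothing here is taken from it: the table size, the hash function and the names holder_set() and holder_get() are all hypothetical. The one property carried over from the description is that collisions are not resolved, so a later entry simply overwrites an earlier one and a lookup may fail to find the real holder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HOLDER_SLOTS 64	/* hypothetical size, must be a power of two */

struct holder_slot {
	const void *lock;	/* key: address of the lock */
	int cpu;		/* value: cpu that recorded itself */
};

static struct holder_slot holder_table[HOLDER_SLOTS];

static size_t holder_hash(const void *lock)
{
	/* Drop the low alignment bits, then mask into the table. */
	return ((uintptr_t)lock >> 2) & (HOLDER_SLOTS - 1);
}

/* Record cpu as the holder of lock; a colliding entry is overwritten. */
static void holder_set(const void *lock, int cpu)
{
	struct holder_slot *slot = &holder_table[holder_hash(lock)];

	assert(lock != NULL);
	slot->lock = lock;
	slot->cpu = cpu;
}

/* Return the recorded cpu, or -1 if the slot now belongs to another lock. */
static int holder_get(const void *lock)
{
	const struct holder_slot *slot = &holder_table[holder_hash(lock)];

	return slot->lock == lock ? slot->cpu : -1;
}
```

Returning -1 on a mismatch mirrors the "we did not know the lock holder" return value documented for spin_lock_holder(), but that pairing is an assumption of this sketch.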
[PATCH v8 6/6] powerpc/pv-qspinlock: Optimise native unlock path
Avoid a function call in the native version of qspinlock. On powerNV,
before applying this patch, every unlock is expensive. This small
optimization improves performance.

We use a static_key with jump_label, which removes unnecessary loads of
lppaca and related data.

Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/qspinlock_paravirt.h | 18 +-
 arch/powerpc/kernel/paravirt.c                |  4 
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h
index d87cda0..8d39446 100644
--- a/arch/powerpc/include/asm/qspinlock_paravirt.h
+++ b/arch/powerpc/include/asm/qspinlock_paravirt.h
@@ -6,12 +6,14 @@
 #define _ASM_QSPINLOCK_PARAVIRT_H
 
 #include
+#include
 
 extern void pv_lock_init(void);
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
 extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_queued_spin_unlock(struct qspinlock *lock);
+extern struct static_key_true sharedprocessor_key;
 
 static inline void pv_queued_spin_lock(struct qspinlock *lock, u32 val)
 {
@@ -20,7 +22,21 @@ static inline void pv_queued_spin_lock(struct qspinlock *lock, u32 val)
 
 static inline void pv_queued_spin_unlock(struct qspinlock *lock)
 {
-	pv_lock_op.unlock(lock);
+	/*
+	 * on powerNV and pSeries with jump_label, code will be
+	 *	PowerNV:		pSeries:
+	 *	nop;			b 2f;
+	 *	native unlock		2:
+	 *				pv unlock;
+	 * In this way, we can do unlock quick in native case.
+	 *
+	 * IF jump_label is not enabled, we fall back into
+	 * if condition, IOW, ld && cmp && bne.
+	 */
+	if (static_branch_likely(&sharedprocessor_key))
+		native_queued_spin_unlock(lock);
+	else
+		pv_lock_op.unlock(lock);
 }
 
 static inline void pv_wait(u8 *ptr, u8 val)
diff --git a/arch/powerpc/kernel/paravirt.c b/arch/powerpc/kernel/paravirt.c
index e697b17..a0a000e 100644
--- a/arch/powerpc/kernel/paravirt.c
+++ b/arch/powerpc/kernel/paravirt.c
@@ -140,6 +140,9 @@ struct pv_lock_ops pv_lock_op = {
 };
 EXPORT_SYMBOL(pv_lock_op);
 
+struct static_key_true sharedprocessor_key = STATIC_KEY_TRUE_INIT;
+EXPORT_SYMBOL(sharedprocessor_key);
+
 void __init pv_lock_init(void)
 {
 	if (SHARED_PROCESSOR) {
@@ -149,5 +152,6 @@ void __init pv_lock_init(void)
 		pv_lock_op.unlock = __pv_queued_spin_unlock;
 		pv_lock_op.wait = __pv_wait;
 		pv_lock_op.kick = __pv_kick;
+		static_branch_disable(&sharedprocessor_key);
 	}
 }
-- 
2.4.11
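The shape of this fast path can be shown with a small userspace sketch. It is only an approximation under stated assumptions: a plain bool named native_mode stands in for sharedprocessor_key (which, as in the patch, starts true and is turned off in shared-processor mode), so the branch here still costs a load and compare, whereas a real static key patches the instruction stream. All names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct lock_ops {
	void (*unlock)(uint32_t *lock);
};

static int pv_unlock_calls;	/* counts slow-path calls, for illustration */

static void native_unlock(uint32_t *lock)
{
	*lock &= ~UINT32_C(0xff);	/* clear the locked byte */
}

static void pv_unlock(uint32_t *lock)
{
	pv_unlock_calls++;
	native_unlock(lock);	/* a real pv unlock would also kick a waiter */
}

static struct lock_ops lock_op = { .unlock = native_unlock };
static bool native_mode = true;	/* stand-in for the static key */

/* Run once at init; switches to the pv callbacks in shared-processor mode. */
static void lock_ops_init(bool shared_processor)
{
	if (shared_processor) {
		lock_op.unlock = pv_unlock;
		native_mode = false;
	}
}

static void do_unlock(uint32_t *lock)
{
	assert(lock != NULL);
	if (native_mode)
		native_unlock(lock);	/* direct call, no function pointer */
	else
		lock_op.unlock(lock);
}
```

The point of the design is that the common native case never goes through the function pointer in the ops table.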
[PATCH v8 0/6] Implement qspinlock/pv-qspinlock on ppc
d Context Switching                      529.8     578.1     564.2
Process Creation                         408.4     421.6     287.6
Shell Scripts (1 concurrent)            1201.8    1215.3    1185.8
Shell Scripts (8 concurrent)            3758.4    3799.3    3878.9
System Call Overhead                    1008.3    1122.6    1134.2
                                      ========
System Benchmarks Index Score           1072.0    1108.9    1050.6
----

Pan Xinhui (6):
  powerpc/qspinlock: powerpc support qspinlock
  powerpc: pSeries/Kconfig: Add qspinlock build config
  powerpc: lib/locks.c: Add cpu yield/wake helper function
  powerpc/pv-qspinlock: powerpc support pv-qspinlock
  powerpc: pSeries: Add pv-qspinlock build config/make
  powerpc/pv-qspinlock: Optimise native unlock path

 arch/powerpc/include/asm/qspinlock.h               |  93 
 arch/powerpc/include/asm/qspinlock_paravirt.h      |  52 +++
 .../powerpc/include/asm/qspinlock_paravirt_types.h |  13 ++
 arch/powerpc/include/asm/spinlock.h                |  35 +++--
 arch/powerpc/include/asm/spinlock_types.h          |   4 +
 arch/powerpc/kernel/Makefile                       |   1 +
 arch/powerpc/kernel/paravirt.c                     | 157 +
 arch/powerpc/lib/locks.c                           | 122 
 arch/powerpc/platforms/pseries/Kconfig             |  16 +++
 arch/powerpc/platforms/pseries/setup.c             |   5 +
 10 files changed, 485 insertions(+), 13 deletions(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock.h
 create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
 create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt_types.h
 create mode 100644 arch/powerpc/kernel/paravirt.c
-- 
2.4.11
[PATCH v8 5/6] powerpc: pSeries: Add pv-qspinlock build config/make
pSeries runs as a guest and might need pv-qspinlock.

Signed-off-by: Pan Xinhui
---
 arch/powerpc/kernel/Makefile           | 1 +
 arch/powerpc/platforms/pseries/Kconfig | 8 
 2 files changed, 9 insertions(+)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 1925341..4780415 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_PPC_970_NAP) += idle_power4.o
 obj-$(CONFIG_PPC_P7_NAP)	+= idle_book3s.o
 procfs-y			:= proc_powerpc.o
 obj-$(CONFIG_PROC_FS)		+= $(procfs-y)
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)	+= paravirt.o
 rtaspci-$(CONFIG_PPC64)-$(CONFIG_PCI)	:= rtas_pci.o
 obj-$(CONFIG_PPC_RTAS)		+= rtas.o rtas-rtc.o $(rtaspci-y-y)
 obj-$(CONFIG_PPC_RTAS_DAEMON)	+= rtasd.o
diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index 8a87d06..0288c78 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -31,6 +31,14 @@ config ARCH_USE_QUEUED_SPINLOCKS
 	  fairlock. It has shown a good performance improvement on x86 and also ppc
 	  especially in high contention cases.
 
+config PARAVIRT_SPINLOCKS
+	bool "Paravirtualization support for qspinlock"
+	depends on PPC_SPLPAR && QUEUED_SPINLOCKS
+	default y
+	help
+	  If kernel need run as a guest then enable this option.
+	  Generally it can let kernel have a better performance.
+
 config PPC_SPLPAR
 	depends on PPC_PSERIES
 	bool "Support for shared-processor logical partitions"
-- 
2.4.11
[PATCH v8 3/6] powerpc: lib/locks.c: Add cpu yield/wake helper function
Add two corresponding helper functions to support pv-qspinlock. For normal use, __spin_yield_cpu will confer current vcpu slices to the target vcpu(say, a lock holder). If target vcpu is not specified or it is in running state, such conferging to lpar happens or not depends. Because hcall itself will introduce latency and a little overhead. And we do NOT want to suffer any latency on some cases, e.g. in interrupt handler. The second parameter *confer* can indicate such case. __spin_wake_cpu is simpiler, it will wake up one vcpu regardless of its current vcpu state. Signed-off-by: Pan Xinhui --- arch/powerpc/include/asm/spinlock.h | 4 +++ arch/powerpc/lib/locks.c| 59 + 2 files changed, 63 insertions(+) diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h index 954099e..6426bd5 100644 --- a/arch/powerpc/include/asm/spinlock.h +++ b/arch/powerpc/include/asm/spinlock.h @@ -64,9 +64,13 @@ static inline bool vcpu_is_preempted(int cpu) /* We only yield to the hypervisor if we are in shared processor mode */ #define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr)) extern void __spin_yield(arch_spinlock_t *lock); +extern void __spin_yield_cpu(int cpu, int confer); +extern void __spin_wake_cpu(int cpu); extern void __rw_yield(arch_rwlock_t *lock); #else /* SPLPAR */ #define __spin_yield(x)barrier() +#define __spin_yield_cpu(x, y) barrier() +#define __spin_wake_cpu(x) barrier() #define __rw_yield(x) barrier() #define SHARED_PROCESSOR 0 #endif diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c index 6574626..bd872c9 100644 --- a/arch/powerpc/lib/locks.c +++ b/arch/powerpc/lib/locks.c @@ -23,6 +23,65 @@ #include #include +/* + * confer our slices to a specified cpu and return. If it is in running state + * or cpu is -1, then we will check confer. If confer is NULL, we will return + * otherwise we confer our slices to lpar. 
+ */ +void __spin_yield_cpu(int cpu, int confer) +{ + unsigned int holder_cpu = cpu, yield_count; + + if (cpu == -1) + goto yield_to_lpar; + + BUG_ON(holder_cpu >= nr_cpu_ids); + yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count); + + /* if cpu is running, confer slices to lpar conditionally*/ + if ((yield_count & 1) == 0) + goto yield_to_lpar; + + plpar_hcall_norets(H_CONFER, + get_hard_smp_processor_id(holder_cpu), yield_count); + return; + +yield_to_lpar: + if (confer) + plpar_hcall_norets(H_CONFER, -1, 0); +} +EXPORT_SYMBOL_GPL(__spin_yield_cpu); + +void __spin_wake_cpu(int cpu) +{ + unsigned int holder_cpu = cpu; + + BUG_ON(holder_cpu >= nr_cpu_ids); + /* +* NOTE: we should always do this hcall regardless of +* the yield_count of the holder_cpu. +* as thers might be a case like below; +* CPU 1 CPU 2 +* yielded = true +* if (yielded) +* __spin_wake_cpu() +* __spin_yield_cpu() +* +* So we might lose a wake if we check the yield_count and +* return directly if the holder_cpu is running. +* IOW. do NOT code like below. +* yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count); +* if ((yield_count & 1) == 0) +* return; +* +* a PROD hcall marks the target_cpu proded, which cause the next cede +* or confer called on the target_cpu invalid. +*/ + plpar_hcall_norets(H_PROD, + get_hard_smp_processor_id(holder_cpu)); +} +EXPORT_SYMBOL_GPL(__spin_wake_cpu); + #ifndef CONFIG_QUEUED_SPINLOCKS void __spin_yield(arch_spinlock_t *lock) { -- 2.4.11
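For anyone reading along, the branch structure of __spin_yield_cpu() above can be sketched as a pure decision function in user-space C. This is only an illustration under stated assumptions: `enum yield_action` and `pick_yield_action()` are names made up here, and the real function issues H_CONFER hcalls instead of returning a value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: what the helper would do, without doing any hcall. */
enum yield_action {
	YIELD_NONE,		/* return without any hcall */
	YIELD_TO_TARGET,	/* confer our slices to the given vCPU */
	YIELD_TO_LPAR,		/* confer our slices to the partition */
};

/*
 * Mirrors the branches of __spin_yield_cpu() in the patch.
 * yield_count is ignored when cpu == -1, as the patch never reads it then.
 */
static enum yield_action pick_yield_action(int cpu, uint32_t yield_count,
					   int confer)
{
	/* No target, or target is running (even count): confer is optional. */
	if (cpu == -1 || (yield_count & 1) == 0)
		return confer ? YIELD_TO_LPAR : YIELD_NONE;

	/* Target is not running (odd count): always confer to it. */
	return YIELD_TO_TARGET;
}
```

Note that *confer* only matters on the fallback path. A caller that cannot tolerate hcall latency would pass 0 and still confer to a target that is known not to be running.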
Re: [PATCH] powerpc: cputime: fix a compile warning
On 2016/12/2 12:35, yjin wrote: On 2016-12-02 12:22, Balbir Singh wrote: On Fri, Dec 2, 2016 at 3:15 PM, Michael Ellerman wrote: yanjiang@windriver.com writes: diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h index 4f60db0..4423e97 100644 --- a/arch/powerpc/include/asm/cputime.h +++ b/arch/powerpc/include/asm/cputime.h @@ -228,7 +228,8 @@ static inline cputime_t clock_t_to_cputime(const unsigned long clk) return (__force cputime_t) ct; } -#define cputime64_to_clock_t(ct) cputime_to_clock_t((cputime_t)(ct)) +#define cputime64_to_clock_t(ct) \ + (__force u64)(cputime_to_clock_t((cputime_t)(ct))) Given the name of the function is "cputime64 to clock_t", surely we should be returning a clock_t ? Please fix it in cpuacct.c Also check out git commit 527b0a76f41d062381adbb55c8eb61e32cb0bfc9 sched/cpuacct: Avoid %lld seq_printf warning Hi Balbir, Where can I find this commit? hello, it is in next tree. :) commit 527b0a76f41d062381adbb55c8eb61e32cb0bfc9 Author: Martin Schwidefsky Date: Fri Nov 11 15:27:49 2016 +0100 sched/cpuacct: Avoid %lld seq_printf warning For s390 kernel builds I keep getting this warning: kernel/sched/cpuacct.c: In function 'cpuacct_stats_show': kernel/sched/cpuacct.c:298:25: warning: format '%lld' expects argument of type 'long long int', but argument 4 has type 'clock_t {aka long int}' [-Wformat=] seq_printf(sf, "%s %lld\n", Silence the warning by adding an explicit cast. 
Signed-off-by: Martin Schwidefsky Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/2016142749.6545-1-schwidef...@de.ibm.com Signed-off-by: Ingo Molnar diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index bc0b309c..9add206 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -297,7 +297,7 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v) for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++) { seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[stat], - cputime64_to_clock_t(val[stat])); + (long long)cputime64_to_clock_t(val[stat])); } return 0; Thanks! Yanjiang Balbir
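The fix referenced above (cast at the call site rather than in the macro) can be reproduced outside the kernel. A minimal user-space sketch follows. `demo_clock_t`, `format_stat()` and the "user" label are assumptions for illustration, and snprintf() stands in for seq_printf().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for a clock_t that is 'long int', as in the s390 warning. */
typedef long demo_clock_t;

/*
 * "%lld" expects a long long argument. Passing a plain long is what
 * triggers -Wformat, so the value is cast explicitly at the call site.
 */
static int format_stat(char *buf, size_t size, const char *name,
		       demo_clock_t val)
{
	return snprintf(buf, size, "%s %lld\n", name, (long long)val);
}
```

Casting where the value is printed leaves the helper's return type alone, which appears to be the point of the "surely we should be returning a clock_t" remark.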
[tip:locking/core] x86/kvm: Support the vCPU preemption check
Commit-ID: 0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb Gitweb: http://git.kernel.org/tip/0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb Author: Pan Xinhui AuthorDate: Wed, 2 Nov 2016 05:08:35 -0400 Committer: Ingo Molnar CommitDate: Tue, 22 Nov 2016 12:48:08 +0100 x86/kvm: Support the vCPU preemption check Support the vcpu_is_preempted() functionality under KVM. This will enhance lock performance on overcommitted hosts (more runnable vCPUs than physical CPUs in the system) as doing busy waits for preempted vCPUs will hurt system performance far worse than early yielding. Use struct kvm_steal_time::preempted to indicate that if a vCPU is running or not. Signed-off-by: Pan Xinhui Signed-off-by: Peter Zijlstra (Intel) Acked-by: Paolo Bonzini Cc: david.lai...@aculab.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: b...@kernel.crashing.org Cc: boqun.f...@gmail.com Cc: borntrae...@de.ibm.com Cc: bsinghar...@gmail.com Cc: d...@stgolabs.net Cc: jgr...@suse.com Cc: kernel...@gmail.com Cc: konrad.w...@oracle.com Cc: linuxppc-...@lists.ozlabs.org Cc: m...@ellerman.id.au Cc: paul...@linux.vnet.ibm.com Cc: pau...@samba.org Cc: rkrc...@redhat.com Cc: virtualizat...@lists.linux-foundation.org Cc: will.dea...@arm.com Cc: xen-devel-requ...@lists.xenproject.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-9-git-send-email-xinhui@linux.vnet.ibm.com [ Typo fixes. 
] Signed-off-by: Ingo Molnar --- arch/x86/include/uapi/asm/kvm_para.h | 4 +++- arch/x86/kvm/x86.c | 16 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 94dc8ca..1421a65 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -45,7 +45,9 @@ struct kvm_steal_time { __u64 steal; __u32 version; __u32 flags; - __u32 pad[12]; + __u8 preempted; + __u8 u8_pad[3]; + __u32 pad[11]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 04c5d96..59c2d6f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2071,6 +2071,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu) &vcpu->arch.st.steal, sizeof(struct kvm_steal_time return; + vcpu->arch.st.steal.preempted = 0; + if (vcpu->arch.st.steal.version & 1) vcpu->arch.st.steal.version += 1; /* first time write, random junk */ @@ -2826,8 +2828,22 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); } +static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu) +{ + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) + return; + + vcpu->arch.st.steal.preempted = 1; + + kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime, + &vcpu->arch.st.steal.preempted, + offsetof(struct kvm_steal_time, preempted), + sizeof(vcpu->arch.st.steal.preempted)); +} + void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) { + kvm_steal_time_set_preempted(vcpu); kvm_x86_ops->vcpu_put(vcpu); kvm_put_guest_fpu(vcpu); vcpu->arch.last_host_tsc = rdtsc();
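The kvm_steal_time change above carves the new field out of the existing padding, so the shared structure should keep its size. A small sketch to check that arithmetic with fixed-width user-space types is below. The struct names and `preempted_offset()` are made up here, and the asserted numbers assume a mainstream ABI with natural alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout before the patch (user-space copy, hypothetical name). */
struct steal_time_old {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

/* Layout after the patch: one pad word becomes preempted + 3 pad bytes. */
struct steal_time_new {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

static_assert(sizeof(struct steal_time_old) == 64, "old layout is 64 bytes");
static_assert(sizeof(struct steal_time_new) == sizeof(struct steal_time_old),
	      "the new field must not change the structure size");
static_assert(offsetof(struct steal_time_new, pad) == 20,
	      "remaining pad words start right after the u8 padding");

/* Offset a host would write to when it updates only this one byte. */
static size_t preempted_offset(void)
{
	return offsetof(struct steal_time_new, preempted);
}
```

The offset is the kind of value the host side passes as offsetof(struct kvm_steal_time, preempted) when it writes the single byte.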
[tip:locking/core] x86/kvm: Support the vCPU preemption check
Commit-ID: 1885aa7041c9e801e5d5b093b9dad38937ca37f6 Gitweb: http://git.kernel.org/tip/1885aa7041c9e801e5d5b093b9dad38937ca37f6 Author: Pan Xinhui AuthorDate: Wed, 2 Nov 2016 05:08:36 -0400 Committer: Ingo Molnar CommitDate: Tue, 22 Nov 2016 12:48:08 +0100 x86/kvm: Support the vCPU preemption check Support the vcpu_is_preempted() functionality under KVM. This will enhance lock performance on overcommitted hosts (more runnable vCPUs than physical CPUs in the system) as doing busy waits for preempted vCPUs will hurt system performance far worse than early yielding. struct kvm_steal_time::preempted indicates that if one vCPU is running or not after commit "x86, kvm/x86.c: support vCPU preempted check". unix benchmark result: host: kernel 4.8.1, i5-4570, 4 cpus guest: kernel 4.8.1, 8 vcpus test-case after-patch before-patch Execl Throughput |18307.9 lps |11701.6 lps File Copy 1024 bufsize 2000 maxblocks | 1352407.3 KBps | 790418.9 KBps File Copy 256 bufsize 500 maxblocks| 367555.6 KBps | 222867.7 KBps File Copy 4096 bufsize 8000 maxblocks | 3675649.7 KBps | 1780614.4 KBps Pipe Throughput| 11872208.7 lps | 11855628.9 lps Pipe-based Context Switching | 1495126.5 lps | 1490533.9 lps Process Creation |29881.2 lps |28572.8 lps Shell Scripts (1 concurrent) |23224.3 lpm |22607.4 lpm Shell Scripts (8 concurrent) | 3531.4 lpm | 3211.9 lpm System Call Overhead | 10385653.0 lps | 10419979.0 lps Signed-off-by: Pan Xinhui Signed-off-by: Peter Zijlstra (Intel) Acked-by: Paolo Bonzini Cc: david.lai...@aculab.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: b...@kernel.crashing.org Cc: boqun.f...@gmail.com Cc: borntrae...@de.ibm.com Cc: bsinghar...@gmail.com Cc: d...@stgolabs.net Cc: jgr...@suse.com Cc: kernel...@gmail.com Cc: konrad.w...@oracle.com Cc: linuxppc-...@lists.ozlabs.org Cc: m...@ellerman.id.au Cc: paul...@linux.vnet.ibm.com Cc: pau...@samba.org Cc: rkrc...@redhat.com Cc: virtualizat...@lists.linux-foundation.org Cc: will.dea...@arm.com Cc: 
xen-devel-requ...@lists.xenproject.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-10-git-send-email-xinhui@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- arch/x86/kernel/kvm.c | 12 1 file changed, 12 insertions(+) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index edbbfc8..0b48dd2 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -415,6 +415,15 @@ void kvm_disable_steal_time(void) wrmsr(MSR_KVM_STEAL_TIME, 0, 0); } +static bool kvm_vcpu_is_preempted(int cpu) +{ + struct kvm_steal_time *src; + + src = &per_cpu(steal_time, cpu); + + return !!src->preempted; +} + #ifdef CONFIG_SMP static void __init kvm_smp_prepare_boot_cpu(void) { @@ -471,6 +480,9 @@ void __init kvm_guest_init(void) if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { has_steal_clock = 1; pv_time_ops.steal_clock = kvm_steal_clock; +#ifdef CONFIG_PARAVIRT_SPINLOCKS + pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted; +#endif } if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
[tip:locking/core] locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted
Commit-ID: 05ffc951392df57edecc2519327b169210c3df75 Gitweb: http://git.kernel.org/tip/05ffc951392df57edecc2519327b169210c3df75 Author: Pan Xinhui AuthorDate: Wed, 2 Nov 2016 05:08:30 -0400 Committer: Ingo Molnar CommitDate: Tue, 22 Nov 2016 12:48:10 +0100 locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted An over-committed guest with more vCPUs than pCPUs has a heavy overload in the two spin_on_owner. This blames on the lock holder preemption issue. Break out of the loop if the vCPU is preempted: if vcpu_is_preempted(cpu) is true. test-case: perf record -a perf bench sched messaging -g 400 -p && perf report before patch: 20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner 8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock 4.12% sched-messaging [kernel.vmlinux] [k] system_call 3.01% sched-messaging [kernel.vmlinux] [k] system_call_common 2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7 2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner 2.00% sched-messaging [kernel.vmlinux] [k] osq_lock after patch: 9.99% sched-messaging [kernel.vmlinux] [k] mutex_unlock 5.28% sched-messaging [unknown] [H] 0xc00768e0 4.27% sched-messaging [kernel.vmlinux] [k] __copy_tofrom_user_power7 3.77% sched-messaging [kernel.vmlinux] [k] copypage_power7 3.24% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq 3.02% sched-messaging [kernel.vmlinux] [k] system_call 2.69% sched-messaging [kernel.vmlinux] [k] wait_consider_task Tested-by: Juergen Gross Signed-off-by: Pan Xinhui Signed-off-by: Peter Zijlstra (Intel) Acked-by: Christian Borntraeger Acked-by: Paolo Bonzini Cc: david.lai...@aculab.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: b...@kernel.crashing.org Cc: boqun.f...@gmail.com Cc: bsinghar...@gmail.com Cc: d...@stgolabs.net Cc: kernel...@gmail.com Cc: konrad.w...@oracle.com Cc: linuxppc-...@lists.ozlabs.org Cc: m...@ellerman.id.au Cc: paul...@linux.vnet.ibm.com Cc: 
pau...@samba.org Cc: rkrc...@redhat.com Cc: virtualizat...@lists.linux-foundation.org Cc: will.dea...@arm.com Cc: xen-devel-requ...@lists.xenproject.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-4-git-send-email-xinhui@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- kernel/locking/mutex.c | 13 +++-- kernel/locking/rwsem-xadd.c | 14 +++--- 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index c073168..9b34961 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -364,7 +364,11 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) */ barrier(); - if (!owner->on_cpu || need_resched()) { + /* +* Use vcpu_is_preempted to detect lock holder preemption issue. +*/ + if (!owner->on_cpu || need_resched() || + vcpu_is_preempted(task_cpu(owner))) { ret = false; break; } @@ -389,8 +393,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock) rcu_read_lock(); owner = __mutex_owner(lock); + + /* +* As lock holder preemption issue, we both skip spinning if task is not +* on cpu or its cpu is preempted +*/ if (owner) - retval = owner->on_cpu; + retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner)); rcu_read_unlock(); /* diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 263e744..6315060 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) goto done; } - ret = owner->on_cpu; + /* +* As lock holder preemption issue, we both skip spinning if task is not +* on cpu or its cpu is preempted +*/ + ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner)); done: rcu_read_unlock(); return ret; @@ -362,8 +366,12 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem) */ barrier(); - /* abort spinning when need_resched or owner is not running */ - if (!owner->on_cpu || need_resched()) { 
+ /* +* abort spinning when need_resched or owner is not running or +* owner's cpu is preempted. +*/ + if (!owner->on_cpu || need_resched() || + vcpu_is_preempted(task_cpu(owner))) { rcu_read_unlock(); return false; }
[tip:locking/core] Documentation/virtual/kvm: Support the vCPU preemption check
Commit-ID: 3dd3e0ce7989b645eee0174b17f5095e187c7f28 Gitweb: http://git.kernel.org/tip/3dd3e0ce7989b645eee0174b17f5095e187c7f28 Author: Pan Xinhui AuthorDate: Wed, 2 Nov 2016 05:08:38 -0400 Committer: Ingo Molnar CommitDate: Tue, 22 Nov 2016 12:48:09 +0100 Documentation/virtual/kvm: Support the vCPU preemption check Commit ("x86/kvm: support vCPU preemption check") added a new struct kvm_steal_time::preempted field. This field tells us if a vCPU is running or not. It is zero if some old KVM does not support this field or if the vCPU is not preempted. Other values means the vCPU has been preempted. Signed-off-by: Pan Xinhui Signed-off-by: Peter Zijlstra (Intel) Acked-by: Radim Krčmář Acked-by: Paolo Bonzini Cc: david.lai...@aculab.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: b...@kernel.crashing.org Cc: boqun.f...@gmail.com Cc: borntrae...@de.ibm.com Cc: bsinghar...@gmail.com Cc: d...@stgolabs.net Cc: jgr...@suse.com Cc: kernel...@gmail.com Cc: konrad.w...@oracle.com Cc: linuxppc-...@lists.ozlabs.org Cc: m...@ellerman.id.au Cc: paul...@linux.vnet.ibm.com Cc: pau...@samba.org Cc: virtualizat...@lists.linux-foundation.org Cc: will.dea...@arm.com Cc: xen-devel-requ...@lists.xenproject.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-12-git-send-email-xinhui@linux.vnet.ibm.com [ Various typo fixes. ] Signed-off-by: Ingo Molnar --- Documentation/virtual/kvm/msr.txt | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt index 2a71c8f..0a9ea51 100644 --- a/Documentation/virtual/kvm/msr.txt +++ b/Documentation/virtual/kvm/msr.txt @@ -208,7 +208,9 @@ MSR_KVM_STEAL_TIME: 0x4b564d03 __u64 steal; __u32 version; __u32 flags; - __u32 pad[12]; + __u8 preempted; + __u8 u8_pad[3]; + __u32 pad[11]; } whose data will be filled in by the hypervisor periodically. Only one @@ -232,6 +234,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03 nanoseconds. 
Time during which the vcpu is idle, will not be reported as steal time. + preempted: indicate the vCPU who owns this struct is running or + not. Non-zero values mean the vCPU has been preempted. Zero + means the vCPU is not preempted. NOTE, it is always zero if the + the hypervisor doesn't support this field. + MSR_KVM_EOI_EN: 0x4b564d04 data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0 when disabled. Bit 1 is reserved and must be zero. When PV end of
[tip:locking/core] locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()
Commit-ID: 5aff60a191e579ae00ae5ca6ce16c13b687bc8a3 Gitweb: http://git.kernel.org/tip/5aff60a191e579ae00ae5ca6ce16c13b687bc8a3 Author: Pan Xinhui AuthorDate: Wed, 2 Nov 2016 05:08:29 -0400 Committer: Ingo Molnar CommitDate: Tue, 22 Nov 2016 12:48:10 +0100 locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock() An over-committed guest with more vCPUs than pCPUs has a heavy overload in osq_lock(). This is because if vCPU-A holds the osq lock and yields out, vCPU-B ends up waiting for per_cpu node->locked to be set. IOW, vCPU-B waits for vCPU-A to run and unlock the osq lock. Use the new vcpu_is_preempted(cpu) interface to detect if a vCPU is currently running or not, and break out of the spin-loop if so. test case: $ perf record -a perf bench sched messaging -g 400 -p && perf report before patch: 18.09% sched-messaging [kernel.vmlinux] [k] osq_lock 12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner 5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock 3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task 3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq 3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is 2.49% sched-messaging [kernel.vmlinux] [k] system_call after patch: 20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner 8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock 4.12% sched-messaging [kernel.vmlinux] [k] system_call 3.01% sched-messaging [kernel.vmlinux] [k] system_call_common 2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7 2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner 2.00% sched-messaging [kernel.vmlinux] [k] osq_lock Suggested-by: Boqun Feng Tested-by: Juergen Gross Signed-off-by: Pan Xinhui Signed-off-by: Peter Zijlstra (Intel) Acked-by: Christian Borntraeger Acked-by: Paolo Bonzini Cc: david.lai...@aculab.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: b...@kernel.crashing.org Cc: bsinghar...@gmail.com Cc: 
d...@stgolabs.net Cc: kernel...@gmail.com Cc: konrad.w...@oracle.com Cc: linuxppc-...@lists.ozlabs.org Cc: m...@ellerman.id.au Cc: paul...@linux.vnet.ibm.com Cc: pau...@samba.org Cc: rkrc...@redhat.com Cc: virtualizat...@lists.linux-foundation.org Cc: will.dea...@arm.com Cc: xen-devel-requ...@lists.xenproject.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-3-git-send-email-xinhui@linux.vnet.ibm.com [ Translated to English. ] Signed-off-by: Ingo Molnar --- kernel/locking/osq_lock.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c index 4ea2710..a316794 100644 --- a/kernel/locking/osq_lock.c +++ b/kernel/locking/osq_lock.c @@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr) return cpu_nr + 1; } +static inline int node_cpu(struct optimistic_spin_node *node) +{ + return node->cpu - 1; +} + static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val) { int cpu_nr = encoded_cpu_val - 1; @@ -118,8 +123,10 @@ bool osq_lock(struct optimistic_spin_queue *lock) while (!READ_ONCE(node->locked)) { /* * If we need to reschedule bail... so we can block. +* Use vcpu_is_preempted() to avoid waiting for a preempted +* lock holder: */ - if (need_resched()) + if (need_resched() || vcpu_is_preempted(node_cpu(node->prev))) goto unqueue; cpu_relax();
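To make the osq_lock() change easier to follow, here is a user-space sketch of the CPU encoding and of the new bail-out test. It is an assumption-laden toy: `struct spin_node`, `preempted_hint[]` and `should_unqueue()` are invented for the example, and the real code calls need_resched() and vcpu_is_preempted() inside the wait loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct optimistic_spin_node. */
struct spin_node {
	struct spin_node *prev;
	int locked;
	int cpu;	/* encoded CPU number: cpu_nr + 1, so 0 means "no CPU" */
};

static int encode_cpu(int cpu_nr)
{
	return cpu_nr + 1;
}

/* Inverse of encode_cpu() for a node, as added by the patch. */
static int node_cpu(const struct spin_node *node)
{
	return node->cpu - 1;
}

/* Hypothetical hint table; the kernel asks the hypervisor interface. */
static bool preempted_hint[8];

/* One evaluation of the bail-out condition from the wait loop. */
static bool should_unqueue(const struct spin_node *node, bool need_resched)
{
	return need_resched || preempted_hint[node_cpu(node->prev)];
}
```

The check looks at the previous node's CPU because that is the vCPU which has to run before this waiter's node->locked can be set.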
[tip:locking/core] kvm: Introduce kvm_write_guest_offset_cached()
Commit-ID: 4ec6e863625625a54f527464ab91ce1a1cb16c42 Gitweb: http://git.kernel.org/tip/4ec6e863625625a54f527464ab91ce1a1cb16c42 Author: Pan Xinhui AuthorDate: Wed, 2 Nov 2016 05:08:34 -0400 Committer: Ingo Molnar CommitDate: Tue, 22 Nov 2016 12:48:07 +0100 kvm: Introduce kvm_write_guest_offset_cached() It allows us to update some status or field of a structure partially. We can also save a kvm_read_guest_cached() call if we just update one field of the struct regardless of its current value. Signed-off-by: Pan Xinhui Signed-off-by: Peter Zijlstra (Intel) Acked-by: Paolo Bonzini Cc: david.lai...@aculab.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: b...@kernel.crashing.org Cc: boqun.f...@gmail.com Cc: borntrae...@de.ibm.com Cc: bsinghar...@gmail.com Cc: d...@stgolabs.net Cc: jgr...@suse.com Cc: kernel...@gmail.com Cc: konrad.w...@oracle.com Cc: linuxppc-...@lists.ozlabs.org Cc: m...@ellerman.id.au Cc: paul...@linux.vnet.ibm.com Cc: pau...@samba.org Cc: rkrc...@redhat.com Cc: virtualizat...@lists.linux-foundation.org Cc: will.dea...@arm.com Cc: xen-devel-requ...@lists.xenproject.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-8-git-send-email-xinhui@linux.vnet.ibm.com [ Typo fixes. 
] Signed-off-by: Ingo Molnar --- include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c | 20 ++-- 2 files changed, 16 insertions(+), 6 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 01c0b9c..6f00237 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -645,6 +645,8 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data, unsigned long len); int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, void *data, unsigned long len); +int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, + void *data, int offset, unsigned long len); int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc, gpa_t gpa, unsigned long len); int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 5c36034..2f38ce5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1972,30 +1972,38 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc, } EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init); -int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, - void *data, unsigned long len) +int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, + void *data, int offset, unsigned long len) { struct kvm_memslots *slots = kvm_memslots(kvm); int r; + gpa_t gpa = ghc->gpa + offset; - BUG_ON(len > ghc->len); + BUG_ON(len + offset > ghc->len); if (slots->generation != ghc->generation) kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa, ghc->len); if (unlikely(!ghc->memslot)) - return kvm_write_guest(kvm, ghc->gpa, data, len); + return kvm_write_guest(kvm, gpa, data, len); if (kvm_is_error_hva(ghc->hva)) return -EFAULT; - r = __copy_to_user((void __user *)ghc->hva, data, len); + r = __copy_to_user((void __user *)ghc->hva + offset, data, len); if (r) return -EFAULT; - mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT); + 
mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT); return 0; } +EXPORT_SYMBOL_GPL(kvm_write_guest_offset_cached); + +int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, + void *data, unsigned long len) +{ + return kvm_write_guest_offset_cached(kvm, ghc, data, 0, len); +} EXPORT_SYMBOL_GPL(kvm_write_guest_cached); int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
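The refactoring pattern in this patch (add an offset-taking variant, then turn the old function into its offset-0 wrapper) can be sketched in user-space C. `struct region_cache` and both function names are hypothetical. Unlike the kernel code, which uses BUG_ON() for an out-of-range request, this sketch returns -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a cached mapping of a guest region. */
struct region_cache {
	unsigned char *base;
	size_t len;
};

/* Write len bytes at offset; refuse anything that leaves the region. */
static int cache_write_offset(struct region_cache *rc, const void *data,
			      size_t offset, size_t len)
{
	/* Written this way so that offset + len cannot overflow. */
	if (offset > rc->len || len > rc->len - offset)
		return -EINVAL;

	memcpy(rc->base + offset, data, len);
	return 0;
}

/* The whole-region write becomes the offset-0 special case. */
static int cache_write(struct region_cache *rc, const void *data, size_t len)
{
	return cache_write_offset(rc, data, 0, len);
}
```

Updating one byte through the offset variant avoids reading and rewriting the rest of the structure, which is the saving the changelog describes.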
[tip:locking/core] locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests
Commit-ID: 446f3dc8cc0af59259c6c8b898726fae7ed2c055 Gitweb: http://git.kernel.org/tip/446f3dc8cc0af59259c6c8b898726fae7ed2c055 Author: Pan Xinhui AuthorDate: Wed, 2 Nov 2016 05:08:33 -0400 Committer: Ingo Molnar CommitDate: Tue, 22 Nov 2016 12:48:07 +0100 locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests Optimize spinlock and mutex busy-loops by providing a vcpu_is_preempted(cpu) function on KVM and Xen platforms. Extend the pv_lock_ops interface accordingly and implement the callbacks on KVM and Xen. Signed-off-by: Pan Xinhui Signed-off-by: Peter Zijlstra (Intel) [ Translated to English. ] Acked-by: Paolo Bonzini Cc: david.lai...@aculab.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: b...@kernel.crashing.org Cc: boqun.f...@gmail.com Cc: borntrae...@de.ibm.com Cc: bsinghar...@gmail.com Cc: d...@stgolabs.net Cc: jgr...@suse.com Cc: kernel...@gmail.com Cc: konrad.w...@oracle.com Cc: linuxppc-...@lists.ozlabs.org Cc: m...@ellerman.id.au Cc: paul...@linux.vnet.ibm.com Cc: pau...@samba.org Cc: rkrc...@redhat.com Cc: virtualizat...@lists.linux-foundation.org Cc: will.dea...@arm.com Cc: xen-devel-requ...@lists.xenproject.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-7-git-send-email-xinhui@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- arch/x86/include/asm/paravirt_types.h | 2 ++ arch/x86/include/asm/spinlock.h | 8 arch/x86/kernel/paravirt-spinlocks.c | 6 ++ 3 files changed, 16 insertions(+) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 0f400c0..38c3bb7 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -310,6 +310,8 @@ struct pv_lock_ops { void (*wait)(u8 *ptr, u8 val); void (*kick)(int cpu); + + bool (*vcpu_is_preempted)(int cpu); }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h 
index 921bea7..0526f59 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -26,6 +26,14 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define vcpu_is_preempted vcpu_is_preempted +static inline bool vcpu_is_preempted(int cpu) +{ + return pv_lock_ops.vcpu_is_preempted(cpu); +} +#endif + #include /* diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 2c55a00..2f204dd 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -21,12 +21,18 @@ bool pv_is_native_spin_unlock(void) __raw_callee_save___native_queued_spin_unlock; } +static bool native_vcpu_is_preempted(int cpu) +{ + return 0; +} + struct pv_lock_ops pv_lock_ops = { #ifdef CONFIG_SMP .queued_spin_lock_slowpath = native_queued_spin_lock_slowpath, .queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock), .wait = paravirt_nop, .kick = paravirt_nop, + .vcpu_is_preempted = native_vcpu_is_preempted, #endif /* SMP */ }; EXPORT_SYMBOL(pv_lock_ops);
[tip:locking/core] sched/core: Introduce the vcpu_is_preempted(cpu) interface
Commit-ID: d9345c65eb7930ac6755cf593ee7686f4029ccf4 Gitweb: http://git.kernel.org/tip/d9345c65eb7930ac6755cf593ee7686f4029ccf4 Author: Pan Xinhui AuthorDate: Wed, 2 Nov 2016 05:08:28 -0400 Committer: Ingo Molnar CommitDate: Tue, 22 Nov 2016 12:48:05 +0100 sched/core: Introduce the vcpu_is_preempted(cpu) interface This patch is the first step to add support to improve lock holder preemption behaviour. vcpu_is_preempted(cpu) does the obvious thing: it tells us whether a vCPU is preempted or not. Defaults to false on architectures that don't support it. Suggested-by: Peter Zijlstra (Intel) Tested-by: Juergen Gross Signed-off-by: Pan Xinhui Signed-off-by: Peter Zijlstra (Intel) [ Translated the changelog to English. ] Acked-by: Christian Borntraeger Acked-by: Paolo Bonzini Cc: david.lai...@aculab.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: b...@kernel.crashing.org Cc: boqun.f...@gmail.com Cc: bsinghar...@gmail.com Cc: d...@stgolabs.net Cc: kernel...@gmail.com Cc: konrad.w...@oracle.com Cc: linuxppc-...@lists.ozlabs.org Cc: m...@ellerman.id.au Cc: paul...@linux.vnet.ibm.com Cc: pau...@samba.org Cc: rkrc...@redhat.com Cc: virtualizat...@lists.linux-foundation.org Cc: will.dea...@arm.com Cc: xen-devel-requ...@lists.xenproject.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-2-git-send-email-xinhui@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- include/linux/sched.h | 12 1 file changed, 12 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index dc37cbe..37261af 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -3510,6 +3510,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) #endif /* CONFIG_SMP */ +/* + * In order to reduce various lock holder preemption latencies provide an + * interface to see if a vCPU is currently running or not. 
+ * + * This allows us to terminate optimistic spin loops and block, analogous to + * the native optimistic spin heuristic of testing if the lock owner task is + * running or not. + */ +#ifndef vcpu_is_preempted +# define vcpu_is_preempted(cpu)false +#endif + extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask); extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
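The override convention used by this interface (an architecture defines both a function and a same-named macro, and the generic header supplies a fallback only if the macro is absent) can be tried out in user-space C. What follows is a sketch, not kernel code: `ARCH_HAS_PREEMPT_HINT`, `fake_preempted[]` and `should_keep_spinning()` are assumptions made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

#ifdef ARCH_HAS_PREEMPT_HINT	/* hypothetical switch, not a Kconfig symbol */
/* Hypothetical per-CPU state standing in for what a hypervisor exposes. */
static bool fake_preempted[4];

/* An "architecture" provides its version and defines the same-named macro. */
#define vcpu_is_preempted vcpu_is_preempted
static inline bool vcpu_is_preempted(int cpu)
{
	return fake_preempted[cpu];
}
#endif

/* Generic fallback as in the patch: without an override, never preempted. */
#ifndef vcpu_is_preempted
# define vcpu_is_preempted(cpu)	false
#endif

/* Caller in the style of the spin-on-owner checks: spin only if useful. */
static bool should_keep_spinning(bool owner_on_cpu, int owner_cpu)
{
	if (owner_cpu < 0)	/* no known owner CPU in this toy model */
		return false;

	return owner_on_cpu && !vcpu_is_preempted(owner_cpu);
}
```

Built without the switch, the extra test compiles down to a constant, so callers on architectures without support should pay nothing for it.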
[tip:locking/core] locking/core, powerpc: Implement vcpu_is_preempted(cpu)
Commit-ID: 41946c86876ea6a3e8857182356e6d76dbfe7fb6 Gitweb: http://git.kernel.org/tip/41946c86876ea6a3e8857182356e6d76dbfe7fb6 Author: Pan Xinhui AuthorDate: Wed, 2 Nov 2016 05:08:31 -0400 Committer: Ingo Molnar CommitDate: Tue, 22 Nov 2016 12:48:06 +0100 locking/core, powerpc: Implement vcpu_is_preempted(cpu) Optimize spinlock and mutex busy-loops by providing a vcpu_is_preempted(cpu) function on pSeries. We do not support PowerNV. All this can be achieved by using lppaca->yield_count, which is zero on PowerNV. Suggested-by: Boqun Feng Suggested-by: Peter Zijlstra (Intel) Signed-off-by: Pan Xinhui Signed-off-by: Peter Zijlstra (Intel) Cc: david.lai...@aculab.com Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: b...@kernel.crashing.org Cc: borntrae...@de.ibm.com Cc: bsinghar...@gmail.com Cc: d...@stgolabs.net Cc: jgr...@suse.com Cc: kernel...@gmail.com Cc: konrad.w...@oracle.com Cc: linuxppc-...@lists.ozlabs.org Cc: m...@ellerman.id.au Cc: paul...@linux.vnet.ibm.com Cc: pau...@samba.org Cc: pbonz...@redhat.com Cc: rkrc...@redhat.com Cc: virtualizat...@lists.linux-foundation.org Cc: will.dea...@arm.com Cc: xen-devel-requ...@lists.xenproject.org Cc: xen-de...@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-5-git-send-email-xinhui@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- arch/powerpc/include/asm/spinlock.h | 8 1 file changed, 8 insertions(+) diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h index fa37fe9..8c1b913 100644 --- a/arch/powerpc/include/asm/spinlock.h +++ b/arch/powerpc/include/asm/spinlock.h @@ -52,6 +52,14 @@ #define SYNC_IO #endif +#ifdef CONFIG_PPC_PSERIES +#define vcpu_is_preempted vcpu_is_preempted +static inline bool vcpu_is_preempted(int cpu) +{ + return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1); +} +#endif + static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) { return lock.slock == 0;
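The pSeries check above boils down to a parity test on a big-endian counter. Here is a user-space sketch, assuming the counter is available as four raw bytes. `be32_decode()` and `yield_count_says_preempted()` are invented names standing in for be32_to_cpu() and the lppaca access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decode a big-endian 32-bit value from raw bytes, on any host order. */
static uint32_t be32_decode(const unsigned char b[4])
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/*
 * Same test as the patch: an odd count is treated as "preempted", an
 * even count as "running". A constant zero therefore never reports
 * preemption.
 */
static bool yield_count_says_preempted(const unsigned char raw[4])
{
	return (be32_decode(raw) & 1) != 0;
}
```

Since the changelog says yield_count is zero on PowerNV, the same code reports "not preempted" there without a separate platform check.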
Re: [PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check
On 2016/11/16 18:23, Peter Zijlstra wrote:
> On Wed, Nov 16, 2016 at 12:19:09PM +0800, Pan Xinhui wrote:
>> Hi, Peter.
>> I think we can avoid a function call in a simpler way. How about below
>>
>> static inline bool vcpu_is_preempted(int cpu)
>> {
>> 	/* only set in pv case */
>> 	if (pv_lock_ops.vcpu_is_preempted)
>> 		return pv_lock_ops.vcpu_is_preempted(cpu);
>> 	return false;
>> }
>
> That is still more expensive. It needs to do an actual load and makes it
> hard to predict the branch, you'd have to actually wait for the load to
> complete etc.

Yes, one more load in the native case. I think this is acceptable as
vcpu_is_preempted is not a critical function. However, if we use
pv_callee_save_regs_thunk, more unnecessary registers might be
saved/restored in the pv case, which will introduce a little overhead.

But I think I am okay with your idea. I can make another patch based on
this patchset with your Suggested-by.

thanks
xinhui

> Also, it generates more code.
>
> Paravirt muck should strive to be as cheap as possible when ran on
> native hardware.
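For readers comparing the two plain-C options in this thread, a user-space sketch is below. It only contrasts "always call through a native default" with "test the pointer for NULL first". The third option discussed, a callee-save thunk that is patched at boot, cannot be shown in portable C. `struct lock_ops`, `demo_pv_vcpu_is_preempted()` and the two `check_*` helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy version of an ops table with one hook. */
struct lock_ops {
	bool (*vcpu_is_preempted)(int cpu);
};

static bool native_vcpu_is_preempted(int cpu)
{
	(void)cpu;
	return false;
}

/* Hypothetical paravirt backend: pretends that only CPU 1 is preempted. */
static bool demo_pv_vcpu_is_preempted(int cpu)
{
	return cpu == 1;
}

/* Variant 1: the hook always points somewhere, so callers just call it. */
static struct lock_ops ops_default = {
	.vcpu_is_preempted = native_vcpu_is_preempted,
};

static bool check_always_call(int cpu)
{
	return ops_default.vcpu_is_preempted(cpu);
}

/* Variant 2: the hook is NULL on native, so callers load and test it. */
static struct lock_ops ops_nullable;

static bool check_null_test(int cpu)
{
	if (ops_nullable.vcpu_is_preempted)
		return ops_nullable.vcpu_is_preempted(cpu);
	return false;
}
```

Both variants give the same answers. The disagreement in the thread is about cost on native hardware: variant 1 pays an indirect call, variant 2 pays a load plus a branch.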
Re: [PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check
On 2016/11/15 23:47, Peter Zijlstra wrote:
> On Wed, Nov 02, 2016 at 05:08:33AM -0400, Pan Xinhui wrote:
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>> index 0f400c0..38c3bb7 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -310,6 +310,8 @@ struct pv_lock_ops {
>>  	void (*wait)(u8 *ptr, u8 val);
>>  	void (*kick)(int cpu);
>> +
>> +	bool (*vcpu_is_preempted)(int cpu);
>>  };
>
> So that ends up with a full function call in the native case. I did
> something like the below on top, completely untested, not been near a
> compiler etc..

Hi, Peter.
I think we can avoid a function call in a simpler way. How about the below:

static inline bool vcpu_is_preempted(int cpu)
{
	/* only set in pv case */
	if (pv_lock_ops.vcpu_is_preempted)
		return pv_lock_ops.vcpu_is_preempted(cpu);
	return false;
}

> It doesn't get rid of the branch, but at least it avoids the function
> call, and hardware should have no trouble predicting a constant
> condition.
>
> Also, it looks like you end up not setting vcpu_is_preempted when KVM
> doesn't support steal clock, which would end up in an instant NULL
> deref. Fixed that too.

Maybe not true. There is

	.vcpu_is_preempted = native_vcpu_is_preempted

when we define pv_lock_ops.

Your patch is a good example for anyone who wants to add a native/pv function.
:) thanks xinhui --- --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -673,6 +673,11 @@ static __always_inline void pv_kick(int PVOP_VCALL1(pv_lock_ops.kick, cpu); } +static __always_inline void pv_vcpu_is_prempted(int cpu) +{ + PVOP_VCALLEE1(pv_lock_ops.vcpu_is_preempted, cpu); +} + #endif /* SMP && PARAVIRT_SPINLOCKS */ #ifdef CONFIG_X86_32 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -309,7 +309,7 @@ struct pv_lock_ops { void (*wait)(u8 *ptr, u8 val); void (*kick)(int cpu); - bool (*vcpu_is_preempted)(int cpu); + struct paravirt_callee_save vcpu_is_preempted; }; /* This contains all the paravirt structures: we get a convenient --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -32,6 +32,12 @@ static inline void queued_spin_unlock(st { pv_queued_spin_unlock(lock); } + +#define vcpu_is_preempted vcpu_is_preempted +static inline bool vcpu_is_preempted(int cpu) +{ + return pv_vcpu_is_preempted(cpu); +} #else static inline void queued_spin_unlock(struct qspinlock *lock) { --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -26,14 +26,6 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); -#ifdef CONFIG_PARAVIRT_SPINLOCKS -#define vcpu_is_preempted vcpu_is_preempted -static inline bool vcpu_is_preempted(int cpu) -{ - return pv_lock_ops.vcpu_is_preempted(cpu); -} -#endif - #include /* --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -415,15 +415,6 @@ void kvm_disable_steal_time(void) wrmsr(MSR_KVM_STEAL_TIME, 0, 0); } -static bool kvm_vcpu_is_preempted(int cpu) -{ - struct kvm_steal_time *src; - - src = &per_cpu(steal_time, cpu); - - return !!src->preempted; -} - #ifdef CONFIG_SMP static void __init kvm_smp_prepare_boot_cpu(void) { @@ -480,9 +471,6 @@ void __init kvm_guest_init(void) if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { has_steal_clock = 1; 
pv_time_ops.steal_clock = kvm_steal_clock; -#ifdef CONFIG_PARAVIRT_SPINLOCKS - pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted; -#endif } if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) @@ -604,6 +592,14 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } +static bool __kvm_vcpu_is_preempted(int cpu) +{ + struct kvm_steal_time *src = &per_cpu(steal_time, cpu); + + return !!src->preempted; +} +PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); + /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. */ @@ -620,6 +616,12 @@ void __init kvm_spinlock_init(void) pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock); pv_lock_ops.wait = kvm_wait; pv_lock_ops.kick = kvm_kick_cpu; + pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(__native_vcpu_is_preempted); + + if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { + pv_lock_ops.vcpu_is_preempted = + PV_CALLEE_SAVE(__kvm_vcpu_is_preempted); + } } static __init int kvm_spinlock_init_jump(void) --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -12,7 +12,6 @@ __visible void __native_queued_spin_unlo { native_queued_spin_unlock(lock); } - PV_CALLEE_SAVE_REGS_THUNK(__na
[PATCH v7 10/11] x86, xen: support vcpu preempted check
From: Juergen Gross

Support the vcpu_is_preempted() functionality under Xen. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

A quick test (4 vcpus on 1 physical cpu doing a parallel build job
with "make -j 8") reduced system time by about 5% with this patch.

Signed-off-by: Juergen Gross
Signed-off-by: Pan Xinhui
---
 arch/x86/xen/spinlock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3d6e006..74756bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
 	per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
 	pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
 	pv_lock_ops.wait = xen_qlock_wait;
 	pv_lock_ops.kick = xen_qlock_kick;
+
+	pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
 }
 
 /*
--
2.4.11
[PATCH v7 11/11] Documentation: virtual: kvm: Support vcpu preempted check
Commit ("x86, kvm: support vcpu preempted check") adds one field,
"__u8 preempted", to struct kvm_steal_time. This field tells whether a
vcpu is running or not.

It is zero if 1) an old KVM does not support this field, or 2) the vcpu
is not preempted. Any other value means the vcpu has been preempted.

Signed-off-by: Pan Xinhui
Acked-by: Radim Krčmář
Acked-by: Paolo Bonzini
---
 Documentation/virtual/kvm/msr.txt | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..ab2ab76 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,9 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		__u64 steal;
 		__u32 version;
 		__u32 flags;
-		__u32 pad[12];
+		__u8  preempted;
+		__u8  u8_pad[3];
+		__u32 pad[11];
 	}
 
 	whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +234,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		nanoseconds. Time during which the vcpu is idle, will not be
 		reported as steal time.
 
+		preempted: indicate the VCPU who owns this struct is running or
+		not. Non-zero values mean the VCPU has been preempted. Zero
+		means the VCPU is not preempted. NOTE, it is always zero if the
+		the hypervisor doesn't support this field.
+
 MSR_KVM_EOI_EN: 0x4b564d04
 	data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
 	when disabled. Bit 1 is reserved and must be zero. When PV end of
--
2.4.11
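The layout change documented above keeps the structure size unchanged, because one 32-bit pad word is split into the new byte plus three pad bytes. A minimal user-space sketch of that arithmetic follows; the struct names `steal_time_v1` and `steal_time_v2` and the helper `guest_sees_preempted` are made up for illustration, and `<stdint.h>` types are assumed to match the `__u8`/`__u32`/`__u64` widths.

```c
#include <stddef.h>
#include <stdint.h>

/* Old layout: 8 + 4 + 4 + 12 * 4 = 64 bytes. */
struct steal_time_v1 {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

/* New layout: 8 + 4 + 4 + 1 + 3 + 11 * 4 = 64 bytes. */
struct steal_time_v2 {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

_Static_assert(sizeof(struct steal_time_v1) == 64, "old layout is 64 bytes");
_Static_assert(sizeof(struct steal_time_v2) == 64, "new layout is 64 bytes");
_Static_assert(offsetof(struct steal_time_v2, preempted) == 16,
	       "preempted takes the first byte of the old pad[0]");

/* Non-zero means preempted; an old host leaves the pad byte at zero. */
static int guest_sees_preempted(const struct steal_time_v2 *st)
{
	return st->preempted != 0;
}
```

Since the new byte lands where the first pad word used to be, a host that never writes it is indistinguishable from a host reporting "not preempted", which matches the NOTE in the documentation text.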
[PATCH v7 08/11] x86, kvm/x86.c: support vcpu preempted check
Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

Use one field of struct kvm_steal_time, ::preempted, to indicate whether
a vcpu is running or not.

Signed-off-by: Pan Xinhui
Acked-by: Paolo Bonzini
---
 arch/x86/include/uapi/asm/kvm_para.h |  4 +++-
 arch/x86/kvm/x86.c                   | 16 ++++++++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..1421a65 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,9 @@ struct kvm_steal_time {
 	__u64 steal;
 	__u32 version;
 	__u32 flags;
-	__u32 pad[12];
+	__u8  preempted;
+	__u8  u8_pad[3];
+	__u32 pad[11];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e375235..f06e115 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
 		return;
 
+	vcpu->arch.st.steal.preempted = 0;
+
 	if (vcpu->arch.st.steal.version & 1)
 		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
 
@@ -2810,8 +2812,22 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 }
 
+static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
+{
+	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	vcpu->arch.st.steal.preempted = 1;
+
+	kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
+			&vcpu->arch.st.steal.preempted,
+			offsetof(struct kvm_steal_time, preempted),
+			sizeof(vcpu->arch.st.steal.preempted));
+}
+
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+	kvm_steal_time_set_preempted(vcpu);
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();
--
2.4.11
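The host-side protocol in this patch is small: the flag is set when the vCPU is put (descheduled) and cleared when steal time is recorded before the vCPU runs again. The sketch below models only that state machine in plain C. `shared_steal`, `host_vcpu_put`, `host_record_steal_time` and `guest_vcpu_is_preempted` are hypothetical names, the `msr_enabled` flag stands in for the KVM_MSR_ENABLED test, and real guest memory access is replaced by a direct store.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared record; only the field relevant here. */
struct shared_steal {
	uint8_t preempted;
};

/* Host side: the vCPU is descheduled (cf. kvm_arch_vcpu_put()). */
static void host_vcpu_put(struct shared_steal *st, bool msr_enabled)
{
	if (!msr_enabled)	/* guest never enabled steal time */
		return;
	st->preempted = 1;
}

/* Host side: steal time is recorded before the vCPU runs again. */
static void host_record_steal_time(struct shared_steal *st, bool msr_enabled)
{
	if (!msr_enabled)
		return;
	st->preempted = 0;
}

/* Guest side: what the guest's check boils down to. */
static bool guest_vcpu_is_preempted(const struct shared_steal *st)
{
	return st->preempted != 0;
}
```

Writing a single byte on put, rather than the whole structure, is what the offset-based write helper from patch 07 is used for.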
[PATCH v7 09/11] x86, kernel/kvm.c: support vcpu preempted check
Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

struct kvm_steal_time::preempted indicates whether a vcpu is running or
not, since commit ("x86, kvm/x86.c: support vcpu preempted check").

unix benchmark result:
host:  kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

	test-case                              |  after-patch     |  before-patch
Execl Throughput                       |     18307.9 lps  |     11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |   1352407.3 KBps |    790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |    367555.6 KBps |    222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |   3675649.7 KBps |   1780614.4 KBps
Pipe Throughput                        |  11872208.7 lps  |  11855628.9 lps
Pipe-based Context Switching           |   1495126.5 lps  |   1490533.9 lps
Process Creation                       |     29881.2 lps  |     28572.8 lps
Shell Scripts (1 concurrent)           |     23224.3 lpm  |     22607.4 lpm
Shell Scripts (8 concurrent)           |      3531.4 lpm  |      3211.9 lpm
System Call Overhead                   |  10385653.0 lps  |  10419979.0 lps

Signed-off-by: Pan Xinhui
Acked-by: Paolo Bonzini
---
 arch/x86/kernel/kvm.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0b48dd2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
 	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+static bool kvm_vcpu_is_preempted(int cpu)
+{
+	struct kvm_steal_time *src;
+
+	src = &per_cpu(steal_time, cpu);
+
+	return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -471,6 +480,9 @@ void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
 		has_steal_clock = 1;
 		pv_time_ops.steal_clock = kvm_steal_clock;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+		pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+#endif
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
--
2.4.11
[PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself.
Currently the kernel has an interface, bool vcpu_is_preempted(int cpu).
It takes the cpu as parameter and returns true if the cpu is preempted.
The kernel can then break out of the spin loops based on the return
value of vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

To deal with kernel and kvm/xen, add vcpu_is_preempted into struct
pv_lock_ops.

Then kvm or xen can provide their own implementation to support
vcpu_is_preempted.

Signed-off-by: Pan Xinhui
Acked-by: Paolo Bonzini
---
 arch/x86/include/asm/paravirt_types.h | 2 ++
 arch/x86/include/asm/spinlock.h       | 8 ++++++++
 arch/x86/kernel/paravirt-spinlocks.c  | 6 ++++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..38c3bb7 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -310,6 +310,8 @@ struct pv_lock_ops {
 
 	void (*wait)(u8 *ptr, u8 val);
 	void (*kick)(int cpu);
+
+	bool (*vcpu_is_preempted)(int cpu);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..0526f59 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@ extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return pv_lock_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
 #include
 
 /*
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 2c55a00..2f204dd 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -21,12 +21,18 @@ bool pv_is_native_spin_unlock(void)
 		__raw_callee_save___native_queued_spin_unlock;
 }
 
+static bool native_vcpu_is_preempted(int cpu)
+{
+	return 0;
+}
+
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
 	.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
 	.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
 	.wait = paravirt_nop,
 	.kick = paravirt_nop,
+	.vcpu_is_preempted = native_vcpu_is_preempted,
 #endif /* SMP */
 };
 EXPORT_SYMBOL(pv_lock_ops);
--
2.4.11
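The pattern this patch relies on, an ops table whose hook always points at something callable, can be shown in isolation. This is a simplified user-space sketch and not the kernel's paravirt machinery: `lock_ops`, `native_is_preempted`, `hv_is_preempted` and `vcpu_preempted` are illustrative names, and there is no runtime patching of call sites here.

```c
#include <stdbool.h>

/* Simplified ops table with a single hook. */
struct lock_ops {
	bool (*vcpu_is_preempted)(int cpu);
};

/* Native default: bare metal never has a preempted vCPU. */
static bool native_is_preempted(int cpu)
{
	(void)cpu;
	return false;
}

/* The hook is never NULL because the table starts with the native default. */
static struct lock_ops ops = {
	.vcpu_is_preempted = native_is_preempted,
};

/* Hypothetical hypervisor backend: pretend only CPU 1 is preempted. */
static bool hv_is_preempted(int cpu)
{
	return cpu == 1;
}

/* Callers go through the table without a NULL check. */
static bool vcpu_preempted(int cpu)
{
	return ops.vcpu_is_preempted(cpu);
}
```

Installing the default in the static initializer is the design choice that makes an unconditional indirect call safe; a hypervisor backend overrides the pointer during its own init (here, `ops.vcpu_is_preempted = hv_is_preempted;`).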
[PATCH v7 01/11] kernel/sched: introduce vcpu preempted check interface
This patch helps to fix the lock holder preemption issue.

Kernel users can call bool vcpu_is_preempted(int cpu) to detect whether
a vcpu is preempted or not.

The default implementation is a macro defined as false, so the compiler
can optimize it out if the arch does not support such a vcpu preempted
check.

Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
Acked-by: Christian Borntraeger
Acked-by: Paolo Bonzini
Tested-by: Juergen Gross
---
 include/linux/sched.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with a various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu)	false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
--
2.4.11
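The override mechanism may be easier to see in a single file. The sketch below puts a pretend "arch header" and the generic fallback back to back; the `cpu == 2` rule and `should_keep_spinning` are invented for illustration. An arch that provides its own function also defines a macro of the same name, so the generic `#ifndef` is skipped.

```c
#include <stdbool.h>

/* "Arch header": remove this block to fall back to the generic default. */
#define vcpu_is_preempted vcpu_is_preempted
static inline bool vcpu_is_preempted(int cpu)
{
	return cpu == 2;	/* hypothetical: only CPU 2 is preempted */
}

/* "Generic header": only defines the fallback if no arch version exists. */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu)	false
#endif

/* A caller that works unchanged with either definition. */
static bool should_keep_spinning(int owner_cpu)
{
	return !vcpu_is_preempted(owner_cpu);
}
```

With the fallback in effect the call expands to the constant `false`, which is why the changelog can say the compiler drops the check entirely on architectures without support.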
[PATCH v7 04/11] powerpc/spinlock: support vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself.
Currently the kernel has an interface, bool vcpu_is_preempted(int cpu).
It takes the cpu as parameter and returns true if the cpu is preempted.
The kernel can then break out of the spin loops based on the return
value of vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

Only pSeries needs to support it. PowerNV is built into the same kernel
image as pSeries, so we need to return false if we are running as
PowerNV. Another fact is that lppaca->yield_count stays zero on PowerNV,
so we can just skip the machine type check.

Suggested-by: Boqun Feng
Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/spinlock.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index fa37fe9..8c1b913 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif
 
+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return lock.slock == 0;
--
2.4.11
[PATCH v7 02/11] locking/osq: Drop the overload of osq_lock()
An over-committed guest with more vCPUs than pCPUs has a heavy overload
in osq_lock().

This is because vCPU A holds the osq lock and yields out, while vCPU B
waits for per_cpu node->locked to be set. IOW, vCPU B waits for vCPU A
to run and unlock the osq lock.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to detect
whether a vCPU is currently running or not. So break the spin loops when
it returns true.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng
Signed-off-by: Pan Xinhui
Acked-by: Christian Borntraeger
Acked-by: Paolo Bonzini
Tested-by: Juergen Gross
---
 kernel/locking/osq_lock.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..091f97f 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
 	return cpu_nr + 1;
 }
 
+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+	return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
 	int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
+		 * Use vcpu_is_preempted to detect lock holder preemption issue.
 		 */
-		if (need_resched())
+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
 			goto unqueue;
 
 		cpu_relax_lowlatency();
--
2.4.11
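The new bail-out condition depends on the queue node storing its CPU number plus one, so that node_cpu() has to subtract one again before asking about the predecessor. A stripped-down sketch of that decision, outside the kernel: `spin_node`, `preempted_map`, `fake_vcpu_is_preempted` and `should_unqueue` are hypothetical names, and the real queue handling, memory ordering and per-CPU storage are left out.

```c
#include <stdbool.h>

/* Reduced queue node: CPU is stored encoded so that 0 can mean "no CPU". */
struct spin_node {
	struct spin_node *prev;
	int cpu;		/* CPU number + 1 */
};

static int encode_cpu(int cpu_nr)
{
	return cpu_nr + 1;
}

static int node_cpu(const struct spin_node *node)
{
	return node->cpu - 1;
}

/* Hypothetical stand-in for the hypervisor's view of each vCPU. */
static bool preempted_map[4];

static bool fake_vcpu_is_preempted(int cpu)
{
	return preempted_map[cpu];
}

/* Decision taken on each iteration of the wait loop in the patch. */
static bool should_unqueue(const struct spin_node *node, bool need_resched)
{
	return need_resched || fake_vcpu_is_preempted(node_cpu(node->prev));
}
```

The check looks at the predecessor in the queue rather than the lock owner, because in an MCS-style queue each waiter can only make progress once the node directly ahead of it has run.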
[PATCH v7 07/11] KVM: Introduce kvm_write_guest_offset_cached
It allows us to update some status or field of a struct partially.

We can also save one kvm_read_guest_cached if we just update one field
of the struct regardless of its current value.

Signed-off-by: Pan Xinhui
Acked-by: Paolo Bonzini
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 20 ++++++++++++++------
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01c0b9c..6f00237 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -645,6 +645,8 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 		    unsigned long len);
 int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			   void *data, unsigned long len);
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, int offset, unsigned long len);
 int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			      gpa_t gpa, unsigned long len);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2907b7b..95308ee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1972,30 +1972,38 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);
 
-int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
-			   void *data, unsigned long len)
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, int offset, unsigned long len)
 {
 	struct kvm_memslots *slots = kvm_memslots(kvm);
 	int r;
+	gpa_t gpa = ghc->gpa + offset;
 
-	BUG_ON(len > ghc->len);
+	BUG_ON(len + offset > ghc->len);
 
 	if (slots->generation != ghc->generation)
 		kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa, ghc->len);
 
 	if (unlikely(!ghc->memslot))
-		return kvm_write_guest(kvm, ghc->gpa, data, len);
+		return kvm_write_guest(kvm, gpa, data, len);
 
 	if (kvm_is_error_hva(ghc->hva))
 		return -EFAULT;
 
-	r = __copy_to_user((void __user *)ghc->hva, data, len);
+	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_offset_cached);
+
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len)
+{
+	return kvm_write_guest_offset_cached(kvm, ghc, data, 0, len);
+}
 EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
 
 int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
--
2.4.11
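The shape of this refactoring, a bounded write at an offset with the full write as the offset-0 case, can be sketched without any KVM types. Everything here is hypothetical (`region_cache`, `write_offset_cached`, `write_cached`); a plain `memcpy` replaces the guest copy, and where the kernel patch uses BUG_ON() for the bounds check this sketch returns an error.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a cached guest mapping. */
struct region_cache {
	unsigned char *hva;	/* start of the mapped region */
	size_t len;		/* length of the region */
};

/* Write len bytes at offset, refusing anything outside the region. */
static int write_offset_cached(struct region_cache *rc, const void *data,
			       size_t offset, size_t len)
{
	/* Written this way so that offset + len cannot overflow. */
	if (offset > rc->len || len > rc->len - offset)
		return -EINVAL;
	memcpy(rc->hva + offset, data, len);
	return 0;
}

/* The whole-region write becomes the offset-0 special case. */
static int write_cached(struct region_cache *rc, const void *data, size_t len)
{
	return write_offset_cached(rc, data, 0, len);
}
```

Updating one byte in place this way avoids the read-modify-write of the whole structure that the changelog mentions.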
[PATCH v7 05/11] s390/spinlock: Provide vcpu_is_preempted
From: Christian Borntraeger this implements the s390 backend for commit "kernel/sched: introduce vcpu preempted check interface" by reworking the existing smp_vcpu_scheduled into arch_vcpu_is_preempted. We can then also get rid of the local cpu_is_preempted function by moving the CIF_ENABLED_WAIT test into arch_vcpu_is_preempted. Signed-off-by: Christian Borntraeger Acked-by: Heiko Carstens --- arch/s390/include/asm/spinlock.h | 8 arch/s390/kernel/smp.c | 9 +++-- arch/s390/lib/spinlock.c | 25 - 3 files changed, 23 insertions(+), 19 deletions(-) diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h index 7e9e09f..7ecd890 100644 --- a/arch/s390/include/asm/spinlock.h +++ b/arch/s390/include/asm/spinlock.h @@ -23,6 +23,14 @@ _raw_compare_and_swap(unsigned int *lock, unsigned int old, unsigned int new) return __sync_bool_compare_and_swap(lock, old, new); } +#ifndef CONFIG_SMP +static inline bool arch_vcpu_is_preempted(int cpu) { return false; } +#else +bool arch_vcpu_is_preempted(int cpu); +#endif + +#define vcpu_is_preempted arch_vcpu_is_preempted + /* * Simple spin lock operations. There are two variants, one clears IRQ's * on the local processor, one does not. 
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 35531fe..b988ed1 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -368,10 +368,15 @@ int smp_find_processor_id(u16 address) return -1; } -int smp_vcpu_scheduled(int cpu) +bool arch_vcpu_is_preempted(int cpu) { - return pcpu_running(pcpu_devices + cpu); + if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu)) + return false; + if (pcpu_running(pcpu_devices + cpu)) + return false; + return true; } +EXPORT_SYMBOL(arch_vcpu_is_preempted); void smp_yield_cpu(int cpu) { diff --git a/arch/s390/lib/spinlock.c b/arch/s390/lib/spinlock.c index e5f50a7..e48a48e 100644 --- a/arch/s390/lib/spinlock.c +++ b/arch/s390/lib/spinlock.c @@ -37,15 +37,6 @@ static inline void _raw_compare_and_delay(unsigned int *lock, unsigned int old) asm(".insn rsy,0xeb22,%0,0,%1" : : "d" (old), "Q" (*lock)); } -static inline int cpu_is_preempted(int cpu) -{ - if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu)) - return 0; - if (smp_vcpu_scheduled(cpu)) - return 0; - return 1; -} - void arch_spin_lock_wait(arch_spinlock_t *lp) { unsigned int cpu = SPINLOCK_LOCKVAL; @@ -62,7 +53,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp) continue; } /* First iteration: check if the lock owner is running. */ - if (first_diag && cpu_is_preempted(~owner)) { + if (first_diag && arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; continue; @@ -81,7 +72,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp) * yield the CPU unconditionally. For LPAR rely on the * sense running status. */ - if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) { + if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; } @@ -108,7 +99,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags) continue; } /* Check if the lock owner is running. 
*/ - if (first_diag && cpu_is_preempted(~owner)) { + if (first_diag && arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; continue; @@ -127,7 +118,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags) * yield the CPU unconditionally. For LPAR rely on the * sense running status. */ - if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) { + if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; } @@ -165,7 +156,7 @@ void _raw_read_lock_wait(arch_rwlock_t *rw) owner = 0; while (1) { if (count-- <= 0) { - if (owner && cpu_is_preempted(~owner)) + if (owner && arch_vcpu_is_preempted(~owner)) smp_yield_cpu(~owner); count = spin_retry; } @@ -211,7 +202,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw, unsigned int prev) owner = 0; while (1) { if (count-- <= 0) { - if (owner && cpu_is_preempted(~owner)) + if
[PATCH v7 03/11] kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
An over-committed guest with more vCPUs than pCPUs has a heavy overload
in the two spin_on_owner functions. This is caused by the lock holder
preemption issue.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see
whether a vCPU is currently running or not. So break the spin loops when
it returns true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown]         [H] 0xc00768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Signed-off-by: Pan Xinhui
Acked-by: Christian Borntraeger
Acked-by: Paolo Bonzini
Tested-by: Juergen Gross
---
 kernel/locking/mutex.c      | 13 +++++++++++--
 kernel/locking/rwsem-xadd.c | 14 +++++++++++---
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..24face6 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,11 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 		 */
 		barrier();
 
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Use vcpu_is_preempted to detect lock holder preemption issue.
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			ret = false;
 			break;
 		}
@@ -261,8 +265,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
 
 	rcu_read_lock();
 	owner = READ_ONCE(lock->owner);
+
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
 	if (owner)
-		retval = owner->on_cpu;
+		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 	rcu_read_unlock();
 	/*
 	 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..b664ce1 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		goto done;
 	}
 
-	ret = owner->on_cpu;
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
+	ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
 	rcu_read_unlock();
 	return ret;
@@ -362,8 +366,12 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		barrier();
 
-		/* abort spinning when need_resched or owner is not running */
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * abort spinning when need_resched or owner is not running or
+		 * owner's cpu is preempted.
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			rcu_read_unlock();
 			return false;
 		}
--
2.4.11
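Both lock types end up with the same two predicates: one decides whether to start spinning at all, the other whether to keep going. The sketch below isolates them in user space. `fake_task`, `preempted_cpus`, `fake_vcpu_is_preempted`, `can_spin_on_owner` and `keep_spinning` are illustrative names only, and RCU, barriers and the lock itself are omitted.

```c
#include <stdbool.h>

/* Reduced view of the owning task. */
struct fake_task {
	bool on_cpu;	/* owner is currently scheduled in the guest */
	int cpu;	/* vCPU the owner last ran on */
};

/* Hypothetical per-vCPU state as reported by the hypervisor. */
static bool preempted_cpus[8];

static bool fake_vcpu_is_preempted(int cpu)
{
	return preempted_cpus[cpu];
}

/* Entry check: with no owner recorded, spinning is still attempted. */
static bool can_spin_on_owner(const struct fake_task *owner)
{
	if (!owner)
		return true;
	return owner->on_cpu && !fake_vcpu_is_preempted(owner->cpu);
}

/* Per-iteration check inside the spin loop. */
static bool keep_spinning(const struct fake_task *owner, bool need_resched)
{
	return owner->on_cpu && !need_resched &&
	       !fake_vcpu_is_preempted(owner->cpu);
}
```

`on_cpu` only tells the guest scheduler's view; the extra term covers the case where the guest thinks the owner is running but the host has taken its vCPU away.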
[PATCH v7 00/11] implement vcpu preempted check
change from v6:
	fix typos and remove unnecessary comments.
change from v5:
	split x86/kvm patch into guest/host part.
	introduce kvm_write_guest_offset_cached.
	fix some typos.
	rebase patch onto 4.9.2
change from v4:
	split x86 kvm vcpu preempted check into two patches.
	add documentation patch.
	add x86 vcpu preempted check patch under xen
	add s390 vcpu preempted check patch
change from v3:
	add x86 vcpu preempted check patch
change from v2:
	no code change, fix typos, update some comments
change from v1:
	a simpler definition of default vcpu_is_preempted
	skip machine type check on ppc, and add config. remove dedicated macro.
	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
	add more comments
	thanks boqun and Peter's suggestion.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in
some spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused RCU stalls before we applied
this patch set.

We have also observed some performance improvements in unix benchmark
tests.

PPC test result:
 1 copy -  0.94%
 2 copy -  7.17%
 4 copy - 11.9%
 8 copy -  3.04%
16 copy - 15.11%

details below:
Without patch:
 1 copy - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 1 samples)
 2 copy - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 1 samples)
 4 copy - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 1 samples)
 8 copy - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 s, 1 samples)

With patch:
 1 copy - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 1 samples)
 2 copy - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 1 samples)
 4 copy - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 1 samples)
 8 copy - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 s, 1 samples)

X86 test result:
	test-case                              |  after-patch     |  before-patch
Execl Throughput                       |     18307.9 lps  |     11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |   1352407.3 KBps |    790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |    367555.6 KBps |    222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |   3675649.7 KBps |   1780614.4 KBps
Pipe Throughput                        |  11872208.7 lps  |  11855628.9 lps
Pipe-based Context Switching           |   1495126.5 lps  |   1490533.9 lps
Process Creation                       |     29881.2 lps  |     28572.8 lps
Shell Scripts (1 concurrent)           |     23224.3 lpm  |     22607.4 lpm
Shell Scripts (8 concurrent)           |      3531.4 lpm  |      3211.9 lpm
System Call Overhead                   |  10385653.0 lps  |  10419979.0 lps

Christian Borntraeger (1):
  s390/spinlock: Provide vcpu_is_preempted

Juergen Gross (1):
  x86, xen: support vcpu preempted check

Pan Xinhui (9):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, paravirt: Add interface to support kvm/xen vcpu preempted check
  KVM: Introduce kvm_write_guest_offset_cached
  x86, kvm/x86.c: support vcpu preempted check
  x86, kernel/kvm.c: support vcpu preempted check
  Documentation: virtual: kvm: Support vcpu preempted check

 Documentation/virtual/kvm/msr.txt     |  9 ++++++++-
 arch/powerpc/include/asm/spinlock.h   |  8 ++++++++
 arch/s390/include/asm/spinlock.h      |  8 ++++++++
 arch/s390/kernel/smp.c                |  9 +++++++--
 arch/s390/lib/spinlock.c              | 25 ++++++++-----------------
 arch/x86/include/asm/paravirt_types.h |  2 ++
 arch/x86/include/asm/spinlock.h       |  8 ++++++++
 arch/x86/include/uapi/asm/kvm_para.h  |  4 +++-
 arch/x86/kernel/kvm.c                 | 12 ++++++++++++
 arch/x86/kernel/paravirt-spinlocks.c  |  6 ++++++
 arch/x86/kvm/x86.c                    | 16 ++++++++++++++++
 arch/x86/xen/spinlock.c               |  3 ++-
 include/lin
Re: [Xen-devel] [PATCH v6 00/11] implement vcpu preempted check
On 2016/10/29 03:38, Konrad Rzeszutek Wilk wrote:
On Fri, Oct 28, 2016 at 04:11:16AM -0400, Pan Xinhui wrote:

change from v5:
	split x86/kvm patch into guest/host part.
	introduce kvm_write_guest_offset_cached.
	fix some typos.
	rebase patch onto 4.9.2
change from v4:
	split x86 kvm vcpu preempted check into two patches.
	add documentation patch.
	add x86 vcpu preempted check patch under xen
	add s390 vcpu preempted check patch
change from v3:
	add x86 vcpu preempted check patch
change from v2:
	no code change, fix typos, update some comments
change from v1:
	a simpler definition of default vcpu_is_preempted
	skip machine type check on ppc, and add config. remove dedicated macro.
	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
	add more comments
	thanks boqun and Peter's suggestion.

This patch set aims to fix lock holder preemption issues.

Do you have a git tree with these patches?

Currently no, sorry :(
I made a tar file for this patch set. Maybe a little easier to apply :)

thanks
xinhui

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused RCU stalls before we applied this patch set.

We have also observed some performance improvements in unix benchmark tests.
PPC test result: 1 copy - 0.94% 2 copy - 7.17% 4 copy - 11.9% 8 copy - 3.04% 16 copy - 15.11% details below: Without patch: 1 copy - File Write 4096 bufsize 8000 maxblocks 2188223.0 KBps (30.0 s, 1 samples) 2 copy - File Write 4096 bufsize 8000 maxblocks 1804433.0 KBps (30.0 s, 1 samples) 4 copy - File Write 4096 bufsize 8000 maxblocks 1237257.0 KBps (30.0 s, 1 samples) 8 copy - File Write 4096 bufsize 8000 maxblocks 1032658.0 KBps (30.0 s, 1 samples) 16 copy - File Write 4096 bufsize 8000 maxblocks 768000.0 KBps (30.1 s, 1 samples) With patch: 1 copy - File Write 4096 bufsize 8000 maxblocks 2209189.0 KBps (30.0 s, 1 samples) 2 copy - File Write 4096 bufsize 8000 maxblocks 1943816.0 KBps (30.0 s, 1 samples) 4 copy - File Write 4096 bufsize 8000 maxblocks 1405591.0 KBps (30.0 s, 1 samples) 8 copy - File Write 4096 bufsize 8000 maxblocks 1065080.0 KBps (30.0 s, 1 samples) 16 copy - File Write 4096 bufsize 8000 maxblocks 904762.0 KBps (30.0 s, 1 samples) X86 test result: test-case after-patch before-patch Execl Throughput |18307.9 lps |11701.6 lps File Copy 1024 bufsize 2000 maxblocks | 1352407.3 KBps | 790418.9 KBps File Copy 256 bufsize 500 maxblocks| 367555.6 KBps | 222867.7 KBps File Copy 4096 bufsize 8000 maxblocks | 3675649.7 KBps | 1780614.4 KBps Pipe Throughput| 11872208.7 lps | 11855628.9 lps Pipe-based Context Switching | 1495126.5 lps | 1490533.9 lps Process Creation |29881.2 lps |28572.8 lps Shell Scripts (1 concurrent) |23224.3 lpm |22607.4 lpm Shell Scripts (8 concurrent) | 3531.4 lpm | 3211.9 lpm System Call Overhead | 10385653.0 lps | 10419979.0 lps Christian Borntraeger (1): s390/spinlock: Provide vcpu_is_preempted Juergen Gross (1): x86, xen: support vcpu preempted check Pan Xinhui (9): kernel/sched: introduce vcpu preempted check interface locking/osq: Drop the overload of osq_lock() kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner powerpc/spinlock: support vcpu preempted check x86, paravirt: Add interface to support kvm/xen vcpu 
preempted check KVM: Introduce kvm_write_guest_offset_cached x86, kvm/x86.c: support vcpu preempted check x86, kernel/kvm.c: support vcpu preempted check Documentation: virtual: kvm: Support vcpu preempted check Documentation/virtual/kvm/msr.txt | 9 - arch/powerpc/include/asm/spinlock.h | 8 arch/s390/include/asm/spinlock.h | 8 arch/s390/kernel/smp.c| 9 +++-- arch/s390/lib/spinlock.c | 25 - arch/x86/include/asm/paravirt_types.h | 2 ++ arch/x86/include/asm/spinlock.h | 8 arch/x86/include/uapi/asm/kvm_para.h | 4 +++- arch/x86/kernel/kvm.c | 12 +
[PATCH v6 05/11] s390/spinlock: Provide vcpu_is_preempted
From: Christian Borntraeger

This implements the s390 backend for commit "kernel/sched: introduce vcpu preempted check interface" by reworking the existing smp_vcpu_scheduled into arch_vcpu_is_preempted. We can then also get rid of the local cpu_is_preempted function by moving the CIF_ENABLED_WAIT test into arch_vcpu_is_preempted.

Signed-off-by: Christian Borntraeger
Acked-by: Heiko Carstens
---
 arch/s390/include/asm/spinlock.h |  8 ++++++++
 arch/s390/kernel/smp.c           |  9 +++++++--
 arch/s390/lib/spinlock.c         | 25 ++++++++-----------------
 3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h
index 7e9e09f..7ecd890 100644
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -23,6 +23,14 @@ _raw_compare_and_swap(unsigned int *lock, unsigned int old, unsigned int new)
 	return __sync_bool_compare_and_swap(lock, old, new);
 }
 
+#ifndef CONFIG_SMP
+static inline bool arch_vcpu_is_preempted(int cpu) { return false; }
+#else
+bool arch_vcpu_is_preempted(int cpu);
+#endif
+
+#define vcpu_is_preempted arch_vcpu_is_preempted
+
 /*
  * Simple spin lock operations.  There are two variants, one clears IRQ's
  * on the local processor, one does not.
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 35531fe..b988ed1 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -368,10 +368,15 @@ int smp_find_processor_id(u16 address)
 	return -1;
 }
 
-int smp_vcpu_scheduled(int cpu)
+bool arch_vcpu_is_preempted(int cpu)
 {
-	return pcpu_running(pcpu_devices + cpu);
+	if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
+		return false;
+	if (pcpu_running(pcpu_devices + cpu))
+		return false;
+	return true;
 }
+EXPORT_SYMBOL(arch_vcpu_is_preempted);
 
 void smp_yield_cpu(int cpu)
 {
diff --git a/arch/s390/lib/spinlock.c b/arch/s390/lib/spinlock.c
index e5f50a7..e48a48e 100644
--- a/arch/s390/lib/spinlock.c
+++ b/arch/s390/lib/spinlock.c
@@ -37,15 +37,6 @@ static inline void _raw_compare_and_delay(unsigned int *lock, unsigned int old)
 	asm(".insn rsy,0xeb22,%0,0,%1" : : "d" (old), "Q" (*lock));
 }
 
-static inline int cpu_is_preempted(int cpu)
-{
-	if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
-		return 0;
-	if (smp_vcpu_scheduled(cpu))
-		return 0;
-	return 1;
-}
-
 void arch_spin_lock_wait(arch_spinlock_t *lp)
 {
 	unsigned int cpu = SPINLOCK_LOCKVAL;
@@ -62,7 +53,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
 			continue;
 		}
 		/* First iteration: check if the lock owner is running. */
-		if (first_diag && cpu_is_preempted(~owner)) {
+		if (first_diag && arch_vcpu_is_preempted(~owner)) {
 			smp_yield_cpu(~owner);
 			first_diag = 0;
 			continue;
@@ -81,7 +72,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
 		 * yield the CPU unconditionally. For LPAR rely on the
 		 * sense running status.
 		 */
-		if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+		if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
 			smp_yield_cpu(~owner);
 			first_diag = 0;
 		}
@@ -108,7 +99,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags)
 			continue;
 		}
 		/* Check if the lock owner is running. */
-		if (first_diag && cpu_is_preempted(~owner)) {
+		if (first_diag && arch_vcpu_is_preempted(~owner)) {
 			smp_yield_cpu(~owner);
 			first_diag = 0;
 			continue;
@@ -127,7 +118,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags)
 		 * yield the CPU unconditionally. For LPAR rely on the
 		 * sense running status.
 		 */
-		if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+		if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
 			smp_yield_cpu(~owner);
 			first_diag = 0;
 		}
@@ -165,7 +156,7 @@ void _raw_read_lock_wait(arch_rwlock_t *rw)
 	owner = 0;
 	while (1) {
 		if (count-- <= 0) {
-			if (owner && cpu_is_preempted(~owner))
+			if (owner && arch_vcpu_is_preempted(~owner))
 				smp_yield_cpu(~owner);
 			count = spin_retry;
 		}
@@ -211,7 +202,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw, unsigned int prev)
 	owner = 0;
 	while (1) {
 		if (count-- <= 0) {
-			if (owner && cpu_is_preempted(~owner))
+			if
[PATCH v6 07/11] KVM: Introduce kvm_write_guest_offset_cached
It allows us to update some status or field of a struct partially.

We can also save one kvm_read_guest_cached call if we just update one field of the struct regardless of its current value.

Signed-off-by: Pan Xinhui
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 20 ++++++++++++++------
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01c0b9c..6f00237 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -645,6 +645,8 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 		    unsigned long len);
 int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			   void *data, unsigned long len);
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, int offset, unsigned long len);
 int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			      gpa_t gpa, unsigned long len);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2907b7b..95308ee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1972,30 +1972,38 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);
 
-int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
-			   void *data, unsigned long len)
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, int offset, unsigned long len)
 {
 	struct kvm_memslots *slots = kvm_memslots(kvm);
 	int r;
+	gpa_t gpa = ghc->gpa + offset;
 
-	BUG_ON(len > ghc->len);
+	BUG_ON(len + offset > ghc->len);
 
 	if (slots->generation != ghc->generation)
 		kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa, ghc->len);
 
 	if (unlikely(!ghc->memslot))
-		return kvm_write_guest(kvm, ghc->gpa, data, len);
+		return kvm_write_guest(kvm, gpa, data, len);
 
 	if (kvm_is_error_hva(ghc->hva))
 		return -EFAULT;
 
-	r = __copy_to_user((void __user *)ghc->hva, data, len);
+	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_offset_cached);
+
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len)
+{
+	return kvm_write_guest_offset_cached(kvm, ghc, data, 0, len);
+}
 EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
 
 int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
-- 
2.4.11
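For readers who want to see the partial-write idea outside the kernel, here is a minimal user-space sketch. It is not the KVM code: struct demo_cache, demo_write_offset_cached() and demo_write_cached() are hypothetical stand-ins, a plain byte buffer replaces guest memory, and the sketch returns an error where the patch uses BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a cached guest mapping: a buffer and its size. */
struct demo_cache {
	unsigned char *hva;
	size_t len;
};

/*
 * Write len bytes at the given offset inside the cached region.
 * The bounds test is written so that offset + len cannot overflow.
 */
static int demo_write_offset_cached(struct demo_cache *c, const void *data,
				    size_t offset, size_t len)
{
	assert(c && data);
	if (offset > c->len || len > c->len - offset)
		return -1;
	memcpy(c->hva + offset, data, len);
	return 0;
}

/* The old whole-region entry point becomes the offset-0 special case. */
static int demo_write_cached(struct demo_cache *c, const void *data, size_t len)
{
	return demo_write_offset_cached(c, data, 0, len);
}
```

Writing a single byte at a field's offset leaves the neighbouring bytes untouched, which is why no prior read of the struct is needed.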
[PATCH v6 03/11] kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
An over-committed guest with more vCPUs than pCPUs has a heavy overload in the two spin_on_owner functions. This is caused by the lock holder preemption issue.

The kernel has an interface bool vcpu_is_preempted(int cpu) to see if a vCPU is currently running or not. So break the spin loops when it returns true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown]         [H] 0xc00768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Signed-off-by: Pan Xinhui
Acked-by: Christian Borntraeger
Tested-by: Juergen Gross
---
 kernel/locking/mutex.c      | 15 +++++++++++++--
 kernel/locking/rwsem-xadd.c | 16 +++++++++++++---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..82108f5 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 		 */
 		barrier();
 
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Use vcpu_is_preempted to detech lock holder preemption issue
+		 * and break. vcpu_is_preempted is a macro defined by false if
+		 * arch does not support vcpu preempted check,
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			ret = false;
 			break;
 		}
@@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
 
 	rcu_read_lock();
 	owner = READ_ONCE(lock->owner);
+
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
 	if (owner)
-		retval = owner->on_cpu;
+		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 	rcu_read_unlock();
 	/*
 	 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..0897179 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		goto done;
 	}
 
-	ret = owner->on_cpu;
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
+	ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
 	rcu_read_unlock();
 	return ret;
@@ -362,8 +366,14 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		barrier();
 
-		/* abort spinning when need_resched or owner is not running */
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * abort spinning when need_resched or owner is not running or
+		 * owner's cpu is preempted. vcpu_is_preempted is a macro
+		 * defined by false if arch does not support vcpu preempted
+		 * check
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			rcu_read_unlock();
 			return false;
 		}
-- 
2.4.11
[PATCH v6 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock implementations do a spin loop before acquiring the lock itself. Currently the kernel has an interface bool vcpu_is_preempted(int cpu). It takes the cpu as a parameter and returns true if the cpu is preempted. The kernel can then break the spin loops based on the return value of vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

To deal with kernel and kvm/xen, add vcpu_is_preempted into struct pv_lock_ops.

Then kvm or xen can provide their own implementation to support vcpu_is_preempted.

Signed-off-by: Pan Xinhui
---
 arch/x86/include/asm/paravirt_types.h | 2 ++
 arch/x86/include/asm/spinlock.h       | 8 ++++++++
 arch/x86/kernel/paravirt-spinlocks.c  | 6 ++++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..38c3bb7 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -310,6 +310,8 @@ struct pv_lock_ops {
 
 	void (*wait)(u8 *ptr, u8 val);
 	void (*kick)(int cpu);
+
+	bool (*vcpu_is_preempted)(int cpu);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..0526f59 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return pv_lock_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
 #include
 
 /*
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 2c55a00..2f204dd 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -21,12 +21,18 @@ bool pv_is_native_spin_unlock(void)
 		__raw_callee_save___native_queued_spin_unlock;
 }
 
+static bool native_vcpu_is_preempted(int cpu)
+{
+	return 0;
+}
+
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
 	.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
 	.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
 	.wait = paravirt_nop,
 	.kick = paravirt_nop,
+	.vcpu_is_preempted = native_vcpu_is_preempted,
 #endif /* SMP */
 };
 EXPORT_SYMBOL(pv_lock_ops);
-- 
2.4.11
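The ops-table pattern this patch relies on can be tried in plain C. The sketch below is an illustration only: struct demo_lock_ops, demo_ops, demo_guest_init() and the demo_hv_preempted[] array are hypothetical names, and the array merely simulates state that a hypervisor would publish.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ops table with a single hook, modelled on pv_lock_ops. */
struct demo_lock_ops {
	bool (*vcpu_is_preempted)(int cpu);
};

/* Bare-metal default: a physical CPU is never "preempted". */
static bool demo_native_vcpu_is_preempted(int cpu)
{
	(void)cpu;
	return false;
}

static struct demo_lock_ops demo_ops = {
	.vcpu_is_preempted = demo_native_vcpu_is_preempted,
};

/* Simulated hypervisor-provided state, one flag per vCPU. */
static bool demo_hv_preempted[4];

static bool demo_hv_vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < 4);
	return demo_hv_preempted[cpu];
}

/* A guest that detects the feature swaps in its own implementation. */
static void demo_guest_init(void)
{
	demo_ops.vcpu_is_preempted = demo_hv_vcpu_is_preempted;
}

/* Callers always go through the table and never care who filled it in. */
static bool demo_vcpu_is_preempted(int cpu)
{
	return demo_ops.vcpu_is_preempted(cpu);
}
```

Until demo_guest_init() runs, every query returns false, which mirrors how the native default keeps bare-metal behaviour unchanged.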
[PATCH v6 10/11] x86, xen: support vcpu preempted check
From: Juergen Gross

Support the vcpu_is_preempted() functionality under Xen. This will enhance lock performance on overcommitted hosts (more runnable vcpus than physical cpus in the system) as doing busy waits for preempted vcpus will hurt system performance far worse than early yielding.

A quick test (4 vcpus on 1 physical cpu doing a parallel build job with "make -j 8") reduced system time by about 5% with this patch.

Signed-off-by: Juergen Gross
Signed-off-by: Pan Xinhui
---
 arch/x86/xen/spinlock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3d6e006..74756bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
 	per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
 	pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
 	pv_lock_ops.wait = xen_qlock_wait;
 	pv_lock_ops.kick = xen_qlock_kick;
+
+	pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
 }
 
 /*
-- 
2.4.11
[PATCH v6 08/11] x86, kvm/x86.c: support vcpu preempted check
Support the vcpu_is_preempted() functionality under KVM. This will enhance lock performance on overcommitted hosts (more runnable vcpus than physical cpus in the system) as doing busy waits for preempted vcpus will hurt system performance far worse than early yielding.

Use the preempted field of struct kvm_steal_time to indicate whether a vcpu is running or not.

Signed-off-by: Pan Xinhui
---
 arch/x86/include/uapi/asm/kvm_para.h |  4 +++-
 arch/x86/kvm/x86.c                   | 16 ++++++++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..1421a65 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,9 @@ struct kvm_steal_time {
 	__u64 steal;
 	__u32 version;
 	__u32 flags;
-	__u32 pad[12];
+	__u8  preempted;
+	__u8  u8_pad[3];
+	__u32 pad[11];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e375235..f06e115 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
 		return;
 
+	vcpu->arch.st.steal.preempted = 0;
+
 	if (vcpu->arch.st.steal.version & 1)
 		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
 
@@ -2810,8 +2812,22 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 }
 
+static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
+{
+	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	vcpu->arch.st.steal.preempted = 1;
+
+	kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
+			&vcpu->arch.st.steal.preempted,
+			offsetof(struct kvm_steal_time, preempted),
+			sizeof(vcpu->arch.st.steal.preempted));
+}
+
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+	kvm_steal_time_set_preempted(vcpu);
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();
-- 
2.4.11
[PATCH v6 04/11] powerpc/spinlock: support vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock implementations do a spin loop before acquiring the lock itself. Currently the kernel has an interface bool vcpu_is_preempted(int cpu). It takes the cpu as a parameter and returns true if the cpu is preempted. The kernel can then break the spin loops based on the return value of vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

Only pSeries needs to support it. powerNV is built into the same kernel image as pSeries, so we need to return false if we are running as powerNV. Another fact is that lppaca->yield_count stays zero on powerNV, so we can just skip the machine type check.

Suggested-by: Boqun Feng
Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/spinlock.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index fa37fe9..8c1b913 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif
 
+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return lock.slock == 0;
-- 
2.4.11
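The parity test in this patch can be illustrated in user space. This is a sketch under assumptions: demo_be32_to_cpu() and demo_vcpu_is_preempted() are hypothetical helpers, and the counter is passed in as four raw big-endian bytes rather than read from a real lppaca.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Convert a 32-bit big-endian value to host order by assembling it
 * byte by byte, so the result is the same on any host endianness.
 */
static uint32_t demo_be32_to_cpu(const unsigned char be[4])
{
	assert(be);
	return ((uint32_t)be[0] << 24) | ((uint32_t)be[1] << 16) |
	       ((uint32_t)be[2] << 8) | (uint32_t)be[3];
}

/* As in the patch: an odd yield count is treated as "preempted". */
static bool demo_vcpu_is_preempted(const unsigned char yield_count_be[4])
{
	return !!(demo_be32_to_cpu(yield_count_be) & 1);
}
```

A counter that stays at zero, as the commit message describes for powerNV, is even and therefore never reports preemption, which is why no machine type check is needed.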
[PATCH v6 11/11] Documentation: virtual: kvm: Support vcpu preempted check
Commit ("x86, kvm: support vcpu preempted check") adds one field "__u8 preempted" to struct kvm_steal_time. This field tells whether a vcpu is running or not.

It is zero if 1) an old KVM does not support this field, or 2) the vcpu is not preempted. Other values mean the vcpu has been preempted.

Signed-off-by: Pan Xinhui
Acked-by: Radim Krčmář
---
 Documentation/virtual/kvm/msr.txt | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..ab2ab76 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,9 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		__u64 steal;
 		__u32 version;
 		__u32 flags;
-		__u32 pad[12];
+		__u8  preempted;
+		__u8  u8_pad[3];
+		__u32 pad[11];
 	}
 
 	whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +234,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		nanoseconds. Time during which the vcpu is idle, will not be
 		reported as steal time.
 
+		preempted: indicate the VCPU who owns this struct is running or
+		not. Non-zero values mean the VCPU has been preempted. Zero
+		means the VCPU is not preempted. NOTE, it is always zero if the
+		the hypervisor doesn't support this field.
+
 MSR_KVM_EOI_EN: 0x4b564d04
 	data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
 	when disabled.  Bit 1 is reserved and must be zero.  When PV end of
-- 
2.4.11
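The layout change keeps the structure size unchanged: 8 + 4 + 4 + 1 + 3 + 11*4 = 64 bytes, the same as 8 + 4 + 4 + 12*4. The sketch below checks that arithmetic at compile time; the demo_ struct names and demo_is_preempted() are illustrative copies using fixed-width types, not the uapi header itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout before the change (illustrative copy). */
struct demo_steal_time_old {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

/* Layout after the change: one pad word is split into 1 + 3 bytes. */
struct demo_steal_time_new {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

_Static_assert(sizeof(struct demo_steal_time_old) == 64,
	       "old layout is 64 bytes");
_Static_assert(sizeof(struct demo_steal_time_new) ==
	       sizeof(struct demo_steal_time_old),
	       "new field must not change the structure size");
_Static_assert(offsetof(struct demo_steal_time_new, preempted) == 16,
	       "preempted takes the place of the first old pad word");

/* Guest-side reader: any non-zero value counts as preempted. */
static int demo_is_preempted(const struct demo_steal_time_new *st)
{
	assert(st);
	return st->preempted != 0;
}
```

Because the new byte lands in what used to be padding, a hypervisor that never writes it leaves it at zero, which matches the documented "always zero if unsupported" behaviour.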
[PATCH v6 02/11] locking/osq: Drop the overload of osq_lock()
An over-committed guest with more vCPUs than pCPUs has a heavy overload in osq_lock().

This is because vCPU A holds the osq lock and yields out, while vCPU B waits for the per_cpu node->locked to be set. IOW, vCPU B waits for vCPU A to run and unlock the osq lock.

The kernel has an interface bool vcpu_is_preempted(int cpu) to see if a vCPU is currently running or not. So break the spin loops when it returns true.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng
Signed-off-by: Pan Xinhui
Acked-by: Christian Borntraeger
Tested-by: Juergen Gross
---
 kernel/locking/osq_lock.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..39d1385 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
 	return cpu_nr + 1;
 }
 
+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+	return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
 	int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
+		 * Use vcpu_is_preempted to detech lock holder preemption issue
+		 * and break. vcpu_is_preempted is a macro defined by false if
+		 * arch does not support vcpu preempted check,
 		 */
-		if (need_resched())
+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
 			goto unqueue;
 
 		cpu_relax_lowlatency();
-- 
2.4.11
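The decision this patch adds to the wait loop can be sketched in isolation. The following is a simplified user-space model, not the kernel's queue: struct demo_spin_node, demo_should_unqueue() and the demo_preempted[] array are hypothetical, and only the encode/decode convention and the bail-out test are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_spin_node {
	struct demo_spin_node *prev;
	int locked;
	int cpu;	/* encoded CPU number: real number + 1 */
};

/* The node stores the CPU number plus one, so CPU n is kept as n + 1. */
static inline int encode_cpu(int cpu_nr)
{
	return cpu_nr + 1;
}

/* Recover the real CPU number of the CPU that owns a queue node. */
static inline int node_cpu(const struct demo_spin_node *node)
{
	return node->cpu - 1;
}

/* Simulated per-vCPU preemption state. */
static bool demo_preempted[8];

static bool demo_vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < 8);
	return demo_preempted[cpu];
}

/*
 * One evaluation of the wait loop's bail-out test: stop spinning if we
 * must reschedule, or if the CPU of the previous waiter is not running.
 */
static bool demo_should_unqueue(const struct demo_spin_node *node,
				bool need_resched)
{
	return need_resched || demo_vcpu_is_preempted(node_cpu(node->prev));
}
```

The check looks at node->prev because that is the CPU which must run before this waiter can be handed the lock; spinning while it is descheduled only burns the waiter's time slice.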
[PATCH v6 09/11] x86, kernel/kvm.c: support vcpu preempted check
Support the vcpu_is_preempted() functionality under KVM. This will enhance lock performance on overcommitted hosts (more runnable vcpus than physical cpus in the system) as doing busy waits for preempted vcpus will hurt system performance far worse than early yielding.

struct kvm_steal_time::preempted indicates whether a vcpu is running or not, as of commit ("x86, kvm/x86.c: support vcpu preempted check").

unix benchmark result:
host: kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case                              after-patch        before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
Process Creation                       |    29881.2 lps  |    28572.8 lps
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps

Signed-off-by: Pan Xinhui
---
 arch/x86/kernel/kvm.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0b48dd2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
 	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+static bool kvm_vcpu_is_preempted(int cpu)
+{
+	struct kvm_steal_time *src;
+
+	src = &per_cpu(steal_time, cpu);
+
+	return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -471,6 +480,9 @@ void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
 		has_steal_clock = 1;
 		pv_time_ops.steal_clock = kvm_steal_clock;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+		pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+#endif
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
-- 
2.4.11
[PATCH v6 01/11] kernel/sched: introduce vcpu preempted check interface
This patch helps to fix the lock holder preemption issue.

Kernel users can call bool vcpu_is_preempted(int cpu) to detect whether a vcpu is preempted or not.

The default implementation is a macro defined as false, so the compiler can optimize it out if the arch does not support such a vcpu preempted check.

Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
Acked-by: Christian Borntraeger
Tested-by: Juergen Gross
---
 include/linux/sched.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with a various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu)	false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
-- 
2.4.11
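The override convention used here (an arch defines a function and a macro of the same name, and the generic header supplies a constant fallback only when the macro is absent) can be tried in a single file. Everything prefixed demo_ or DEMO_ below is a hypothetical stand-in; the two vcpu_is_preempted definitions follow the shape of the patches in this series.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * "Arch header" part. Set DEMO_ARCH_HAS_PREEMPT_CHECK to 0 to model an
 * architecture that cannot tell whether a vCPU is running.
 */
#define DEMO_ARCH_HAS_PREEMPT_CHECK 1

#if DEMO_ARCH_HAS_PREEMPT_CHECK
static bool demo_preempted[4];	/* stand-in for hypervisor state */

/* The self-referential macro lets generic code detect the override. */
#define vcpu_is_preempted vcpu_is_preempted
static inline bool vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < 4);
	return demo_preempted[cpu];
}
#endif

/* "Generic header" part: constant fallback when the arch is silent. */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu)	false
#endif

/* A caller keeps spinning only while the owner's CPU is really running. */
static bool demo_should_keep_spinning(int owner_cpu, bool owner_on_cpu)
{
	return owner_on_cpu && !vcpu_is_preempted(owner_cpu);
}
```

With the fallback in effect the second operand is the constant false negated, so the test reduces to owner_on_cpu and the compiler can drop the extra check entirely.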
[PATCH v6 00/11] implement vcpu preempted check
change from v5:
	split x86/kvm patch into guest/host part.
	introduce kvm_write_guest_offset_cached.
	fix some typos.
	rebase patch onto 4.9.2
change from v4:
	split x86 kvm vcpu preempted check into two patches.
	add documentation patch.
	add x86 vcpu preempted check patch under xen
	add s390 vcpu preempted check patch
change from v3:
	add x86 vcpu preempted check patch
change from v2:
	no code change, fix typos, update some comments
change from v1:
	a simpler definition of default vcpu_is_preempted
	skip machine type check on ppc, and add config. remove dedicated macro.
	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
	add more comments
	thanks boqun and Peter's suggestion.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused RCU stalls before we applied this patch set.

We have also observed some performance improvements in unix benchmark tests.

PPC test result:
1 copy - 0.94%
2 copy - 7.17%
4 copy - 11.9%
8 copy - 3.04%
16 copy - 15.11%

details below:
Without patch:
1 copy - File Write 4096 bufsize 8000 maxblocks 2188223.0 KBps (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks 1804433.0 KBps (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks 1237257.0 KBps (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks 1032658.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks 768000.0 KBps (30.1 s, 1 samples)

With patch:
1 copy - File Write 4096 bufsize 8000 maxblocks 2209189.0 KBps (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks 1943816.0 KBps (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks 1405591.0 KBps (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks 1065080.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks 904762.0 KBps (30.0 s, 1 samples)

X86 test result:
test-case                              after-patch        before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
Process Creation                       |    29881.2 lps  |    28572.8 lps
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps

Christian Borntraeger (1):
  s390/spinlock: Provide vcpu_is_preempted

Juergen Gross (1):
  x86, xen: support vcpu preempted check

Pan Xinhui (9):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, paravirt: Add interface to support kvm/xen vcpu preempted check
  KVM: Introduce kvm_write_guest_offset_cached
  x86, kvm/x86.c: support vcpu preempted check
  x86, kernel/kvm.c: support vcpu preempted check
  Documentation: virtual: kvm: Support vcpu preempted check

 Documentation/virtual/kvm/msr.txt     |  9 ++++++++-
 arch/powerpc/include/asm/spinlock.h   |  8 ++++++++
 arch/s390/include/asm/spinlock.h      |  8 ++++++++
 arch/s390/kernel/smp.c                |  9 +++++++--
 arch/s390/lib/spinlock.c              | 25 ++++++++-----------------
 arch/x86/include/asm/paravirt_types.h |  2 ++
 arch/x86/include/asm/spinlock.h       |  8 ++++++++
 arch/x86/include/uapi/asm/kvm_para.h  |  4 +++-
 arch/x86/kernel/kvm.c                 | 12 ++++++++++++
 arch/x86/kernel/paravirt-spinlocks.c  |  6 ++++++
 arch/x86/kvm/x86.c                    | 16 ++++++++++++++++
 arch/x86/xen/spinlock.c               |  3 ++-
 include/linux/kvm_host.h              |  2 ++
 include/linux/sched.h
Re: [PATCH v4 5/5] x86, kvm: support vcpu preempted check
On 2016/10/24 23:18, Paolo Bonzini wrote:
On 24/10/2016 17:14, Radim Krčmář wrote:
2016-10-24 16:39+0200, Paolo Bonzini:
On 19/10/2016 19:24, Radim Krčmář wrote:

+	if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED)
+		if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+				&vcpu->arch.st.steal,
+				sizeof(struct kvm_steal_time)) == 0) {
+			vcpu->arch.st.steal.preempted = 1;
+			kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+					&vcpu->arch.st.steal,
+					sizeof(struct kvm_steal_time));
+		}

Please name this block of code. Something like kvm_steal_time_set_preempted(vcpu);

While at it: 1) the kvm_read_guest_cached is not necessary. You can rig the call to kvm_write_guest_cached so that it only writes vcpu->arch.st.steal.preempted.

I agree. kvm_write_guest_cached() always writes from offset 0, so we'd want a new function that allows specifying a starting offset.

Yeah, let's leave it for a follow-up then!

I think I can make a version that takes an offset. :)

Thanks, Paolo

Using cached vcpu->arch.st.steal to avoid the read wouldn't be as good.
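For readers following the offset discussion, here is a minimal user-space sketch of the idea: write only the one-byte preempted field at its offset instead of reading and rewriting the whole structure. This is not the kernel implementation. The demo_ names are hypothetical, the struct mirrors the layout from the later v6 posting, and a plain memory buffer stands in for guest memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User-space mirror of the steal-time layout from the v6 posting. */
struct demo_steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/*
 * Hypothetical stand-in for an offset-aware guest write: copy len bytes
 * from data into the area, starting offset bytes in. Returns 0 on
 * success and -1 if the range would run past the end of the area.
 */
static int demo_write_guest_offset(void *area, size_t area_len,
				   const void *data, size_t offset, size_t len)
{
	assert(area != NULL && data != NULL);
	if (offset > area_len || len > area_len - offset)
		return -1;
	memcpy((unsigned char *)area + offset, data, len);
	return 0;
}

/* Mark the vCPU preempted by writing exactly one byte, with no prior read. */
static int demo_set_preempted(struct demo_steal_time *guest_area)
{
	const uint8_t one = 1;

	return demo_write_guest_offset(guest_area, sizeof(*guest_area), &one,
				       offsetof(struct demo_steal_time, preempted),
				       sizeof(one));
}
```

Because only one byte is touched, the other fields in the shared area keep whatever values they already had, which is why the read in the quoted hunk becomes unnecessary.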
Re: [PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check
This is the new version: [PATCH v6 9/9] Documentation: virtual: kvm: Support vcpu preempted check

change:
	an explicit pad[3] after __u8 preempted.
	a typo fix in the commit log.

From defac64d7c6a50d5f18ef64a7c776af3e21e8b68 Mon Sep 17 00:00:00 2001
From: Pan Xinhui
Date: Thu, 20 Oct 2016 09:33:36 -0400
Subject: [PATCH v6 9/9] Documentation: virtual: kvm: Support vcpu preempted check

Commit ("x86, kvm: support vcpu preempted check") adds one field "__u8
preempted" to struct kvm_steal_time. This field tells whether a vcpu is
running or not. It is zero if 1) an old KVM does not support this field,
or 2) the vcpu is not preempted. Other values mean the vcpu has been
preempted.

Signed-off-by: Pan Xinhui
Acked-by: Radim Krčmář
---
 Documentation/virtual/kvm/msr.txt | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..ab2ab76 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,9 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		__u64 steal;
 		__u32 version;
 		__u32 flags;
-		__u32 pad[12];
+		__u8 preempted;
+		__u8 u8_pad[3];
+		__u32 pad[11];
 	}

 	whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +234,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		nanoseconds. Time during which the vcpu is idle, will not be
 		reported as steal time.

+	preempted: indicate the VCPU who owns this struct is running or
+		not. Non-zero values mean the VCPU has been preempted. Zero
+		means the VCPU is not preempted. NOTE, it is always zero if the
+		the hypervisor doesn't support this field.
+
 MSR_KVM_EOI_EN: 0x4b564d04
 	data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
 	when disabled. Bit 1 is reserved and must be zero. When PV end of
--
2.4.11
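The layout change in this patch can be checked mechanically. The following is a small user-space sketch, assuming fixed-width types from stdint.h in place of the kernel's __u8/__u32/__u64; the demo_ struct and function names are made up for illustration. It shows that carving preempted and three explicit pad bytes out of the old pad[12] leaves the total size unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout before the patch: 16 bytes of fields plus 48 bytes of padding. */
struct demo_steal_time_old {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

/* Layout after the patch, with the three pad bytes spelled out. */
struct demo_steal_time_new {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/* The new field reuses the first byte of the old padding. */
_Static_assert(sizeof(struct demo_steal_time_new) ==
	       sizeof(struct demo_steal_time_old), "size must not change");
_Static_assert(offsetof(struct demo_steal_time_new, preempted) ==
	       offsetof(struct demo_steal_time_old, pad), "preempted reuses pad[0]");
_Static_assert(offsetof(struct demo_steal_time_new, pad) ==
	       offsetof(struct demo_steal_time_old, pad) + 4, "pad keeps 4-byte alignment");

/* Guest-side reading of the field: any non-zero value counts as preempted. */
static int demo_is_preempted(const struct demo_steal_time_new *st)
{
	assert(st != NULL);
	return st->preempted != 0;
}
```

Since an old hypervisor never writes into the padding, a guest reading the byte sees zero there, which matches the "always zero if the hypervisor doesn't support this field" note in the documentation hunk.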
Re: [PATCH v5 6/9] x86, kvm: support vcpu preempted check
This is new version for [PATCH v6 6/9] x86, kvm: support vcpu preempted check change: an explicit pad[3] after __u8 preempted. From b876ea1a2a724c004b543b2c103a1f8faa5f106e Mon Sep 17 00:00:00 2001 From: Pan Xinhui Date: Thu, 20 Oct 2016 08:14:41 -0400 Subject: [PATCH v6 6/9] x86, kvm: support vcpu preempted check Support the vcpu_is_preempted() functionality under KVM. This will enhance lock performance on overcommitted hosts (more runnable vcpus than physical cpus in the system) as doing busy waits for preempted vcpus will hurt system performance far worse than early yielding. Use one field of struct kvm_steal_time to indicate that if one vcpu is running or not. unix benchmark result: host: kernel 4.8.1, i5-4570, 4 cpus guest: kernel 4.8.1, 8 vcpus test-case after-patch before-patch Execl Throughput |18307.9 lps |11701.6 lps File Copy 1024 bufsize 2000 maxblocks | 1352407.3 KBps | 790418.9 KBps File Copy 256 bufsize 500 maxblocks| 367555.6 KBps | 222867.7 KBps File Copy 4096 bufsize 8000 maxblocks | 3675649.7 KBps | 1780614.4 KBps Pipe Throughput| 11872208.7 lps | 11855628.9 lps Pipe-based Context Switching | 1495126.5 lps | 1490533.9 lps Process Creation |29881.2 lps |28572.8 lps Shell Scripts (1 concurrent) |23224.3 lpm |22607.4 lpm Shell Scripts (8 concurrent) | 3531.4 lpm | 3211.9 lpm System Call Overhead | 10385653.0 lps | 10419979.0 lps Signed-off-by: Pan Xinhui --- arch/x86/include/uapi/asm/kvm_para.h | 4 +++- arch/x86/kernel/kvm.c| 12 arch/x86/kvm/x86.c | 18 ++ 3 files changed, 33 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 94dc8ca..1421a65 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -45,7 +45,9 @@ struct kvm_steal_time { __u64 steal; __u32 version; __u32 flags; - __u32 pad[12]; + __u8 preempted; + __u8 u8_pad[3]; + __u32 pad[11]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c 
index edbbfc8..0b48dd2 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -415,6 +415,15 @@ void kvm_disable_steal_time(void) wrmsr(MSR_KVM_STEAL_TIME, 0, 0); } +static bool kvm_vcpu_is_preempted(int cpu) +{ + struct kvm_steal_time *src; + + src = &per_cpu(steal_time, cpu); + + return !!src->preempted; +} + #ifdef CONFIG_SMP static void __init kvm_smp_prepare_boot_cpu(void) { @@ -471,6 +480,9 @@ void __init kvm_guest_init(void) if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { has_steal_clock = 1; pv_time_ops.steal_clock = kvm_steal_clock; +#ifdef CONFIG_PARAVIRT_SPINLOCKS + pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted; +#endif } if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6c633de..a627537 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu) &vcpu->arch.st.steal, sizeof(struct kvm_steal_time return; + vcpu->arch.st.steal.preempted = 0; + if (vcpu->arch.st.steal.version & 1) vcpu->arch.st.steal.version += 1; /* first time write, random junk */ @@ -2810,8 +2812,24 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); } +static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu) +{ + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) + return; + + if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time + return; + + vcpu->arch.st.steal.preempted = 1; + + kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)); +} + void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) { + kvm_steal_time_set_preempted(vcpu); kvm_x86_ops->vcpu_put(vcpu); kvm_put_guest_fpu(vcpu); vcpu->arch.last_host_tsc = rdtsc(); -- 2.4.11
Re: [PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check
On 2016-10-22 02:39, rkrc...@redhat.com wrote:
2016-10-21 11:27+, David Laight:
From: Pan Xinhui
Sent: 20 October 2016 22:28

Commit ("x86, kvm: support vcpu preempted check") add one field "__u8
preempted" into struct kvm_steal_time. This field tells if one vcpu is
running or not. It is zero if 1) some old KVM does not support this field.
2) the vcpu is preempted. Other values means the vcpu has been preempted.

Signed-off-by: Pan Xinhui
---
 Documentation/virtual/kvm/msr.txt | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..3376f13 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		__u64 steal;
 		__u32 version;
 		__u32 flags;
-		__u32 pad[12];
+		__u8 preempted;
+		__u32 pad[11];
 	}

I think I'd be explicit about the 3 pad bytes you've left.

Seconded. With that change are all KVM bits

like below?
	__u8 preempted;
	__u8 kvm_pad[3];

Acked-by: Radim Krčmář

thanks!
Re: [PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check
On 2016-10-21 19:27, David Laight wrote:
From: Pan Xinhui
Sent: 20 October 2016 22:28

Commit ("x86, kvm: support vcpu preempted check") add one field "__u8
preempted" into struct kvm_steal_time. This field tells if one vcpu is
running or not. It is zero if 1) some old KVM does not support this field.
2) the vcpu is preempted. Other values means the vcpu has been preempted.

Signed-off-by: Pan Xinhui
---
 Documentation/virtual/kvm/msr.txt | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..3376f13 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		__u64 steal;
 		__u32 version;
 		__u32 flags;
-		__u32 pad[12];
+		__u8 preempted;
+		__u32 pad[11];
 	}

I think I'd be explicit about the 3 pad bytes you've left.

Yes, I will do it in the next version.

Thanks, David.
[PATCH v5 8/9] s390/spinlock: Provide vcpu_is_preempted
From: Christian Borntraeger this implements the s390 backend for commit "kernel/sched: introduce vcpu preempted check interface" by reworking the existing smp_vcpu_scheduled into arch_vcpu_is_preempted. We can then also get rid of the local cpu_is_preempted function by moving the CIF_ENABLED_WAIT test into arch_vcpu_is_preempted. Signed-off-by: Christian Borntraeger Acked-by: Heiko Carstens --- arch/s390/include/asm/spinlock.h | 8 arch/s390/kernel/smp.c | 9 +++-- arch/s390/lib/spinlock.c | 25 - 3 files changed, 23 insertions(+), 19 deletions(-) diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h index 7e9e09f..7ecd890 100644 --- a/arch/s390/include/asm/spinlock.h +++ b/arch/s390/include/asm/spinlock.h @@ -23,6 +23,14 @@ _raw_compare_and_swap(unsigned int *lock, unsigned int old, unsigned int new) return __sync_bool_compare_and_swap(lock, old, new); } +#ifndef CONFIG_SMP +static inline bool arch_vcpu_is_preempted(int cpu) { return false; } +#else +bool arch_vcpu_is_preempted(int cpu); +#endif + +#define vcpu_is_preempted arch_vcpu_is_preempted + /* * Simple spin lock operations. There are two variants, one clears IRQ's * on the local processor, one does not. 
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 35531fe..b988ed1 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -368,10 +368,15 @@ int smp_find_processor_id(u16 address) return -1; } -int smp_vcpu_scheduled(int cpu) +bool arch_vcpu_is_preempted(int cpu) { - return pcpu_running(pcpu_devices + cpu); + if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu)) + return false; + if (pcpu_running(pcpu_devices + cpu)) + return false; + return true; } +EXPORT_SYMBOL(arch_vcpu_is_preempted); void smp_yield_cpu(int cpu) { diff --git a/arch/s390/lib/spinlock.c b/arch/s390/lib/spinlock.c index e5f50a7..e48a48e 100644 --- a/arch/s390/lib/spinlock.c +++ b/arch/s390/lib/spinlock.c @@ -37,15 +37,6 @@ static inline void _raw_compare_and_delay(unsigned int *lock, unsigned int old) asm(".insn rsy,0xeb22,%0,0,%1" : : "d" (old), "Q" (*lock)); } -static inline int cpu_is_preempted(int cpu) -{ - if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu)) - return 0; - if (smp_vcpu_scheduled(cpu)) - return 0; - return 1; -} - void arch_spin_lock_wait(arch_spinlock_t *lp) { unsigned int cpu = SPINLOCK_LOCKVAL; @@ -62,7 +53,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp) continue; } /* First iteration: check if the lock owner is running. */ - if (first_diag && cpu_is_preempted(~owner)) { + if (first_diag && arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; continue; @@ -81,7 +72,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp) * yield the CPU unconditionally. For LPAR rely on the * sense running status. */ - if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) { + if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; } @@ -108,7 +99,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags) continue; } /* Check if the lock owner is running. 
*/ - if (first_diag && cpu_is_preempted(~owner)) { + if (first_diag && arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; continue; @@ -127,7 +118,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags) * yield the CPU unconditionally. For LPAR rely on the * sense running status. */ - if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) { + if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; } @@ -165,7 +156,7 @@ void _raw_read_lock_wait(arch_rwlock_t *rw) owner = 0; while (1) { if (count-- <= 0) { - if (owner && cpu_is_preempted(~owner)) + if (owner && arch_vcpu_is_preempted(~owner)) smp_yield_cpu(~owner); count = spin_retry; } @@ -211,7 +202,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw, unsigned int prev) owner = 0; while (1) { if (count-- <= 0) { - if (owner && cpu_is_preempted(~owner)) + if
[PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check
Commit ("x86, kvm: support vcpu preempted check") add one field "__u8
preempted" into struct kvm_steal_time. This field tells if one vcpu is
running or not. It is zero if 1) some old KVM does not support this field.
2) the vcpu is preempted. Other values means the vcpu has been preempted.

Signed-off-by: Pan Xinhui
---
 Documentation/virtual/kvm/msr.txt | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..3376f13 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		__u64 steal;
 		__u32 version;
 		__u32 flags;
-		__u32 pad[12];
+		__u8 preempted;
+		__u32 pad[11];
 	}

 	whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +233,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		nanoseconds. Time during which the vcpu is idle, will not be
 		reported as steal time.

+	preempted: indicate the VCPU who owns this struct is running or
+		not. Non-zero values mean the VCPU has been preempted. Zero
+		means the VCPU is not preempted. NOTE, it is always zero if the
+		the hypervisor doesn't support this field.
+
 MSR_KVM_EOI_EN: 0x4b564d04
 	data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
 	when disabled. Bit 1 is reserved and must be zero. When PV end of
--
2.4.11
[PATCH v5 2/9] locking/osq: Drop the overload of osq_lock()
An over-committed guest with more vCPUs than pCPUs has a heavy overload in osq_lock(). This is because vCPU A hold the osq lock and yield out, vCPU B wait per_cpu node->locked to be set. IOW, vCPU B wait vCPU A to run and unlock the osq lock. Kernel has an interface bool vcpu_is_preempted(int cpu) to see if a vCPU is currently running or not. So break the spin loops on true condition. test case: perf record -a perf bench sched messaging -g 400 -p && perf report before patch: 18.09% sched-messaging [kernel.vmlinux] [k] osq_lock 12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner 5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock 3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task 3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq 3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is 2.49% sched-messaging [kernel.vmlinux] [k] system_call after patch: 20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner 8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock 4.12% sched-messaging [kernel.vmlinux] [k] system_call 3.01% sched-messaging [kernel.vmlinux] [k] system_call_common 2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7 2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner 2.00% sched-messaging [kernel.vmlinux] [k] osq_lock Suggested-by: Boqun Feng Signed-off-by: Pan Xinhui Acked-by: Christian Borntraeger Tested-by: Juergen Gross --- kernel/locking/osq_lock.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c index 05a3785..39d1385 100644 --- a/kernel/locking/osq_lock.c +++ b/kernel/locking/osq_lock.c @@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr) return cpu_nr + 1; } +static inline int node_cpu(struct optimistic_spin_node *node) +{ + return node->cpu - 1; +} + static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val) { int cpu_nr = encoded_cpu_val - 1; @@ -118,8 +123,11 @@ bool 
osq_lock(struct optimistic_spin_queue *lock) while (!READ_ONCE(node->locked)) { /* * If we need to reschedule bail... so we can block. +* Use vcpu_is_preempted to detech lock holder preemption issue +* and break. vcpu_is_preempted is a macro defined by false if +* arch does not support vcpu preempted check, */ - if (need_resched()) + if (need_resched() || vcpu_is_preempted(node_cpu(node->prev))) goto unqueue; cpu_relax_lowlatency(); -- 2.4.11
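The bail-out condition added to the spin loop can be illustrated with a small user-space model. This is only a sketch: the node structure is cut down to three fields, a toy per-CPU array stands in for the real preemption state, and every demo_ name is hypothetical. What it keeps from the patch is the +1 CPU encoding and the test against the previous waiter's CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4

/* Cut-down queue node: just what the bail-out test needs. */
struct demo_spin_node {
	struct demo_spin_node *prev;
	int locked;
	int cpu; /* encoded as CPU number + 1, so 0 can mean "no CPU" */
};

/* Toy per-CPU preemption state, set by the test harness. */
static bool demo_preempted[DEMO_NR_CPUS];

static int demo_encode_cpu(int cpu_nr)
{
	return cpu_nr + 1;
}

/* Inverse of the encoding, as in the node_cpu() helper from the patch. */
static int demo_node_cpu(const struct demo_spin_node *node)
{
	return node->cpu - 1;
}

static bool demo_vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return demo_preempted[cpu];
}

/*
 * One evaluation of the loop's exit test: stop spinning (and unqueue) if
 * we need to reschedule or the previous waiter's vCPU is not running.
 */
static bool demo_should_unqueue(const struct demo_spin_node *node,
				bool need_resched)
{
	assert(node != NULL && node->prev != NULL);
	return need_resched ||
	       demo_vcpu_is_preempted(demo_node_cpu(node->prev));
}
```

The check targets node->prev because that is the waiter this node depends on to hand over node->locked; if its vCPU is not running, continuing to spin only burns the time slice.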
[PATCH v5 7/9] x86, xen: support vcpu preempted check
From: Juergen Gross

Support the vcpu_is_preempted() functionality under Xen. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

A quick test (4 vcpus on 1 physical cpu doing a parallel build job
with "make -j 8") reduced system time by about 5% with this patch.

Signed-off-by: Juergen Gross
Signed-off-by: Pan Xinhui
---
 arch/x86/xen/spinlock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3d6e006..74756bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
 	per_cpu(irq_name, cpu) = NULL;
 }

-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
 	pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
 	pv_lock_ops.wait = xen_qlock_wait;
 	pv_lock_ops.kick = xen_qlock_kick;
+
+	pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
 }

 /*
--
2.4.11
[PATCH v5 6/9] x86, kvm: support vcpu preempted check
Support the vcpu_is_preempted() functionality under KVM. This will enhance lock performance on overcommitted hosts (more runnable vcpus than physical cpus in the system) as doing busy waits for preempted vcpus will hurt system performance far worse than early yielding. Use one field of struct kvm_steal_time to indicate that if one vcpu is running or not. unix benchmark result: host: kernel 4.8.1, i5-4570, 4 cpus guest: kernel 4.8.1, 8 vcpus test-case after-patch before-patch Execl Throughput |18307.9 lps |11701.6 lps File Copy 1024 bufsize 2000 maxblocks | 1352407.3 KBps | 790418.9 KBps File Copy 256 bufsize 500 maxblocks| 367555.6 KBps | 222867.7 KBps File Copy 4096 bufsize 8000 maxblocks | 3675649.7 KBps | 1780614.4 KBps Pipe Throughput| 11872208.7 lps | 11855628.9 lps Pipe-based Context Switching | 1495126.5 lps | 1490533.9 lps Process Creation |29881.2 lps |28572.8 lps Shell Scripts (1 concurrent) |23224.3 lpm |22607.4 lpm Shell Scripts (8 concurrent) | 3531.4 lpm | 3211.9 lpm System Call Overhead | 10385653.0 lps | 10419979.0 lps Signed-off-by: Pan Xinhui --- arch/x86/include/uapi/asm/kvm_para.h | 3 ++- arch/x86/kernel/kvm.c| 12 arch/x86/kvm/x86.c | 18 ++ 3 files changed, 32 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 94dc8ca..b3fec56 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -45,7 +45,8 @@ struct kvm_steal_time { __u64 steal; __u32 version; __u32 flags; - __u32 pad[12]; + __u8 preempted; + __u32 pad[11]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index edbbfc8..0b48dd2 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -415,6 +415,15 @@ void kvm_disable_steal_time(void) wrmsr(MSR_KVM_STEAL_TIME, 0, 0); } +static bool kvm_vcpu_is_preempted(int cpu) +{ + struct kvm_steal_time *src; + + src = &per_cpu(steal_time, cpu); + + return !!src->preempted; +} + #ifdef 
CONFIG_SMP static void __init kvm_smp_prepare_boot_cpu(void) { @@ -471,6 +480,9 @@ void __init kvm_guest_init(void) if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { has_steal_clock = 1; pv_time_ops.steal_clock = kvm_steal_clock; +#ifdef CONFIG_PARAVIRT_SPINLOCKS + pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted; +#endif } if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6c633de..a627537 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu) &vcpu->arch.st.steal, sizeof(struct kvm_steal_time return; + vcpu->arch.st.steal.preempted = 0; + if (vcpu->arch.st.steal.version & 1) vcpu->arch.st.steal.version += 1; /* first time write, random junk */ @@ -2810,8 +2812,24 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); } +static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu) +{ + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) + return; + + if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time + return; + + vcpu->arch.st.steal.preempted = 1; + + kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)); +} + void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) { + kvm_steal_time_set_preempted(vcpu); kvm_x86_ops->vcpu_put(vcpu); kvm_put_guest_fpu(vcpu); vcpu->arch.last_host_tsc = rdtsc(); -- 2.4.11
[PATCH v5 3/9] kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
An over-committed guest with more vCPUs than pCPUs has a heavy overload in the two spin_on_owner. This blames on the lock holder preemption issue. Kernel has an interface bool vcpu_is_preempted(int cpu) to see if a vCPU is currently running or not. So break the spin loops on true condition. test-case: perf record -a perf bench sched messaging -g 400 -p && perf report before patch: 20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner 8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock 4.12% sched-messaging [kernel.vmlinux] [k] system_call 3.01% sched-messaging [kernel.vmlinux] [k] system_call_common 2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7 2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner 2.00% sched-messaging [kernel.vmlinux] [k] osq_lock after patch: 9.99% sched-messaging [kernel.vmlinux] [k] mutex_unlock 5.28% sched-messaging [unknown] [H] 0xc00768e0 4.27% sched-messaging [kernel.vmlinux] [k] __copy_tofrom_user_power7 3.77% sched-messaging [kernel.vmlinux] [k] copypage_power7 3.24% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq 3.02% sched-messaging [kernel.vmlinux] [k] system_call 2.69% sched-messaging [kernel.vmlinux] [k] wait_consider_task Signed-off-by: Pan Xinhui Acked-by: Christian Borntraeger Tested-by: Juergen Gross --- kernel/locking/mutex.c | 15 +-- kernel/locking/rwsem-xadd.c | 16 +--- 2 files changed, 26 insertions(+), 5 deletions(-) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index a70b90d..82108f5 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) */ barrier(); - if (!owner->on_cpu || need_resched()) { + /* +* Use vcpu_is_preempted to detech lock holder preemption issue +* and break. 
vcpu_is_preempted is a macro defined by false if +* arch does not support vcpu preempted check, +*/ + if (!owner->on_cpu || need_resched() || + vcpu_is_preempted(task_cpu(owner))) { ret = false; break; } @@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock) rcu_read_lock(); owner = READ_ONCE(lock->owner); + + /* +* As lock holder preemption issue, we both skip spinning if task is not +* on cpu or its cpu is preempted +*/ if (owner) - retval = owner->on_cpu; + retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner)); rcu_read_unlock(); /* * if lock->owner is not set, the mutex owner may have just acquired diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 2337b4b..0897179 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) goto done; } - ret = owner->on_cpu; + /* +* As lock holder preemption issue, we both skip spinning if task is not +* on cpu or its cpu is preempted +*/ + ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner)); done: rcu_read_unlock(); return ret; @@ -362,8 +366,14 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem) */ barrier(); - /* abort spinning when need_resched or owner is not running */ - if (!owner->on_cpu || need_resched()) { + /* +* abort spinning when need_resched or owner is not running or +* owner's cpu is preempted. vcpu_is_preempted is a macro +* defined by false if arch does not support vcpu preempted +* check +*/ + if (!owner->on_cpu || need_resched() || + vcpu_is_preempted(task_cpu(owner))) { rcu_read_unlock(); return false; } -- 2.4.11
[PATCH v5 1/9] kernel/sched: introduce vcpu preempted check interface
This patch helps fix the lock holder preemption issue.

Kernel users can call bool vcpu_is_preempted(int cpu) to detect whether a
vcpu is preempted or not.

The default implementation is a macro defined as false, so the compiler
can optimize it out if the arch does not support such a vcpu preempted
check.

Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
Acked-by: Christian Borntraeger
Tested-by: Juergen Gross
---
 include/linux/sched.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)

 #endif /* CONFIG_SMP */

+/*
+ * In order to deal with a various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu)	false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);

--
2.4.11
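The override pattern used by this patch may be easier to see in isolation. Below is a user-space sketch, not kernel code: the first half plays the role of an architecture header that supplies its own vcpu_is_preempted(), the second half is the generic fallback in the same shape as the hunk. The demo_ array and should_stop_spinning() are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* --- what an architecture header might provide (hypothetical) --- */
static bool demo_preempted[4]; /* toy per-CPU state, not a kernel structure */

/* Defining the name as a macro lets the generic header detect the override. */
#define vcpu_is_preempted vcpu_is_preempted
static inline bool vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < 4);
	return demo_preempted[cpu];
}

/* --- generic fallback, same shape as the hunk in the patch --- */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu)	false
#endif

/* Generic caller: compiles unchanged against either definition. */
static bool should_stop_spinning(int owner_cpu, bool need_resched)
{
	return need_resched || vcpu_is_preempted(owner_cpu);
}
```

Deleting the first half leaves the fallback in effect, and the call collapses to the constant false, so callers on architectures without support pay nothing for the extra test.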
[PATCH v5 0/9] implement vcpu preempted check
change from v4: spilt x86 kvm vcpu preempted check into two patches. add documentation patch. add x86 vcpu preempted check patch under xen add s390 vcpu preempted check patch change from v3: add x86 vcpu preempted check patch change from v2: no code change, fix typos, update some comments change from v1: a simplier definition of default vcpu_is_preempted skip mahcine type check on ppc, and add config. remove dedicated macro. add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner. add more comments thanks boqun and Peter's suggestion. This patch set aims to fix lock holder preemption issues. test-case: perf record -a perf bench sched messaging -g 400 -p && perf report 18.09% sched-messaging [kernel.vmlinux] [k] osq_lock 12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner 5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock 3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task 3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq 3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is 2.49% sched-messaging [kernel.vmlinux] [k] system_call We introduce interface bool vcpu_is_preempted(int cpu) and use it in some spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner. These spin_on_onwer variant also cause rcu stall before we apply this patch set We also have observed some performace improvements in uninx benchmark tests. 
PPC test result: 1 copy - 0.94% 2 copy - 7.17% 4 copy - 11.9% 8 copy - 3.04% 16 copy - 15.11% details below: Without patch: 1 copy - File Write 4096 bufsize 8000 maxblocks 2188223.0 KBps (30.0 s, 1 samples) 2 copy - File Write 4096 bufsize 8000 maxblocks 1804433.0 KBps (30.0 s, 1 samples) 4 copy - File Write 4096 bufsize 8000 maxblocks 1237257.0 KBps (30.0 s, 1 samples) 8 copy - File Write 4096 bufsize 8000 maxblocks 1032658.0 KBps (30.0 s, 1 samples) 16 copy - File Write 4096 bufsize 8000 maxblocks 768000.0 KBps (30.1 s, 1 samples) With patch: 1 copy - File Write 4096 bufsize 8000 maxblocks 2209189.0 KBps (30.0 s, 1 samples) 2 copy - File Write 4096 bufsize 8000 maxblocks 1943816.0 KBps (30.0 s, 1 samples) 4 copy - File Write 4096 bufsize 8000 maxblocks 1405591.0 KBps (30.0 s, 1 samples) 8 copy - File Write 4096 bufsize 8000 maxblocks 1065080.0 KBps (30.0 s, 1 samples) 16 copy - File Write 4096 bufsize 8000 maxblocks 904762.0 KBps (30.0 s, 1 samples) X86 test result: test-case after-patch before-patch Execl Throughput |18307.9 lps |11701.6 lps File Copy 1024 bufsize 2000 maxblocks | 1352407.3 KBps | 790418.9 KBps File Copy 256 bufsize 500 maxblocks| 367555.6 KBps | 222867.7 KBps File Copy 4096 bufsize 8000 maxblocks | 3675649.7 KBps | 1780614.4 KBps Pipe Throughput| 11872208.7 lps | 11855628.9 lps Pipe-based Context Switching | 1495126.5 lps | 1490533.9 lps Process Creation |29881.2 lps |28572.8 lps Shell Scripts (1 concurrent) |23224.3 lpm |22607.4 lpm Shell Scripts (8 concurrent) | 3531.4 lpm | 3211.9 lpm System Call Overhead | 10385653.0 lps | 10419979.0 lps Christian Borntraeger (1): s390/spinlock: Provide vcpu_is_preempted Juergen Gross (1): x86, xen: support vcpu preempted check Pan Xinhui (7): kernel/sched: introduce vcpu preempted check interface locking/osq: Drop the overload of osq_lock() kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner powerpc/spinlock: support vcpu preempted check x86, paravirt: Add interface to support kvm/xen vcpu 
preempted check x86, kvm: support vcpu preempted check Documentation: virtual: kvm: Support vcpu preempted check Documentation/virtual/kvm/msr.txt | 8 +++- arch/powerpc/include/asm/spinlock.h | 8 arch/s390/include/asm/spinlock.h | 8 arch/s390/kernel/smp.c| 9 +++-- arch/s390/lib/spinlock.c | 25 - arch/x86/include/asm/paravirt_types.h | 2 ++ arch/x86/include/asm/spinlock.h | 8 arch/x86/include/uapi/asm/kvm_para.h | 3 ++- arch/x86/kernel/kvm.c | 12 arch/x86/kernel/paravirt-spinlocks.c | 6 ++ arch/x86/kvm/x86.c| 18 ++ arch/x86/xen/spinlock.c | 3 ++- include/linux/sched.h | 12 kernel/locking/mutex.c| 15 +-- kernel/locking/osq_lock.c | 10 +- kernel/locking/rwsem-xadd.c | 16 +--- 16 files changed, 135 insertions(+), 28 deletions(-) -- 2.4.11
[PATCH v5 5/9] x86, paravirt: Add interface to support kvm/xen vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock implementations do a spin loop before acquiring the lock itself. Currently the kernel has an interface, bool vcpu_is_preempted(int cpu). It takes the cpu as a parameter and returns true if the cpu is preempted. The kernel can then break out of the spin loops based on the return value of vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

To deal with both the kernel and kvm/xen, add vcpu_is_preempted to struct pv_lock_ops. Then kvm or xen can provide their own implementation to support vcpu_is_preempted.

Signed-off-by: Pan Xinhui
---
 arch/x86/include/asm/paravirt_types.h | 2 ++
 arch/x86/include/asm/spinlock.h       | 8
 arch/x86/kernel/paravirt-spinlocks.c  | 6 ++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..38c3bb7 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -310,6 +310,8 @@ struct pv_lock_ops {
 	void (*wait)(u8 *ptr, u8 val);
 	void (*kick)(int cpu);
+
+	bool (*vcpu_is_preempted)(int cpu);
 };

 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..0526f59 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@ extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);

+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return pv_lock_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
 #include

 /*
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 2c55a00..2f204dd 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -21,12 +21,18 @@ bool pv_is_native_spin_unlock(void)
 		__raw_callee_save___native_queued_spin_unlock;
 }

+static bool native_vcpu_is_preempted(int cpu)
+{
+	return 0;
+}
+
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
 	.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
 	.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
 	.wait = paravirt_nop,
 	.kick = paravirt_nop,
+	.vcpu_is_preempted = native_vcpu_is_preempted,
 #endif /* SMP */
 };
 EXPORT_SYMBOL(pv_lock_ops);
--
2.4.11
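The dispatch pattern in this patch can be illustrated outside the kernel. The following is only a user-space sketch: `pv_lock_ops_sketch`, `demo_hv_vcpu_is_preempted` and `demo_guest_init` are hypothetical names invented for illustration, and the "odd vCPUs are preempted" rule is a made-up stand-in for real hypervisor state. It shows a native default that always reports "not preempted" and a backend that overrides the hook at init time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct pv_lock_ops. */
struct pv_lock_ops_sketch {
	bool (*vcpu_is_preempted)(int cpu);
};

/* Bare metal: a CPU is never a preempted vCPU. */
static bool native_vcpu_is_preempted(int cpu)
{
	(void)cpu;
	return false;
}

/* Hypothetical backend: pretend odd-numbered vCPUs are preempted. */
static bool demo_hv_vcpu_is_preempted(int cpu)
{
	return (cpu & 1) != 0;
}

/* The ops table starts out pointing at the native implementation. */
static struct pv_lock_ops_sketch pv_ops_sketch = {
	.vcpu_is_preempted = native_vcpu_is_preempted,
};

/* What a guest-init routine might do once it detects its hypervisor. */
static void demo_guest_init(void)
{
	pv_ops_sketch.vcpu_is_preempted = demo_hv_vcpu_is_preempted;
}

/* The wrapper that lock code calls; it only forwards to the hook. */
static inline bool vcpu_is_preempted(int cpu)
{
	assert(pv_ops_sketch.vcpu_is_preempted != NULL);
	return pv_ops_sketch.vcpu_is_preempted(cpu);
}
```

Before `demo_guest_init()` runs, every query goes to the native function and returns false, which matches the default installed in `pv_lock_ops` above.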
[PATCH v5 4/9] powerpc/spinlock: support vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock implementations do a spin loop before acquiring the lock itself. Currently the kernel has an interface, bool vcpu_is_preempted(int cpu). It takes the cpu as a parameter and returns true if the cpu is preempted. The kernel can then break out of the spin loops based on the return value of vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

Only pSeries needs to support it. However, powerNV is built into the same kernel image as pSeries, so we need to return false if we are running as powerNV. Another fact is that lppaca->yield_count stays zero on powerNV, so we can just skip the machine type check.

Suggested-by: Boqun Feng
Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/spinlock.h | 8
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index abb6b0f..f4a9524 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif

+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
 #define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
--
2.4.11
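The parity test on yield_count can be tried in user space. This is a sketch under assumptions: `load_be32` and `yield_count_says_preempted` are hypothetical helpers, and the byte array stands in for the big-endian field the kernel reads with be32_to_cpu(). It shows why the conversion has to happen before testing bit 0, and why a count that stays at zero (the powerNV case described above) always reads as "not preempted".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decode a 32-bit big-endian value from its in-memory bytes. */
static uint32_t load_be32(const uint8_t bytes[4])
{
	return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
	       ((uint32_t)bytes[2] << 8) | (uint32_t)bytes[3];
}

/* Mirror of the patch's test: an odd count means "preempted". */
static bool yield_count_says_preempted(const uint8_t yield_count_be[4])
{
	return (load_be32(yield_count_be) & 1) != 0;
}
```

Note that the bytes {1, 0, 0, 0} decode to 0x01000000, which is even. Testing the first byte in memory directly on a little-endian host would give the wrong answer.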
Re: [PATCH v4 5/5] x86, kvm: support vcpu preempted check
On 2016/10/20 01:24, Radim Krčmář wrote:
> 2016-10-19 06:20-0400, Pan Xinhui:
>> This is to fix some lock holder preemption issues. Some other locks
>> implementation do a spin loop before acquiring the lock itself.
>> Currently kernel has an interface of bool vcpu_is_preempted(int cpu). It
>> takes the cpu as parameter and return true if the cpu is preempted. Then
>> kernel can break the spin loops upon on the retval of vcpu_is_preempted.
>>
>> As kernel has used this interface, So lets support it.
>>
>> We use one field of struct kvm_steal_time to indicate that if one vcpu
>> is running or not.
>>
>> unix benchmark result:
>> host:  kernel 4.8.1, i5-4570, 4 cpus
>> guest: kernel 4.8.1, 8 vcpus
>>
>> test-case                              after-patch       before-patch
>> Execl Throughput                       |    18307.9 lps  |    11701.6 lps
>> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
>> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
>> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
>> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
>> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
>> Process Creation                       |    29881.2 lps  |    28572.8 lps
>> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
>> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
>> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps
>>
>> Signed-off-by: Pan Xinhui
>> ---
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>> @@ -98,6 +98,10 @@ struct pv_time_ops {
>>  	unsigned long long (*steal_clock)(int cpu);
>>  };
>>
>> +struct pv_vcpu_ops {
>> +	bool (*vcpu_is_preempted)(int cpu);
>> +};
>> +
>
> (I would put it into pv_lock_ops to save the plumbing.)

hi, Radim
thanks for your reply.

Yes, a new struct makes the patch change more lines than necessary. I did that only because I was not sure which existing xxx_ops I should place vcpu_is_preempted in.

>> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
>> @@ -45,7 +45,8 @@ struct kvm_steal_time {
>>  	__u64 steal;
>>  	__u32 version;
>>  	__u32 flags;
>> -	__u32 pad[12];
>> +	__u32 preempted;
>
> Why __u32 instead of __u8?

I thought it needed to be 32-bit aligned...
Yes, u8 is enough to store the preempt status.

>> +	__u32 pad[11];
>>  };
>
> Please document the change in Documentation/virtual/kvm/msr.txt, section
> MSR_KVM_STEAL_TIME.

Okay, I totally forgot to do that. Thanks!

>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> @@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
>> +static bool kvm_vcpu_is_preempted(int cpu)
>> +{
>> +	struct kvm_steal_time *src;
>> +
>> +	src = &per_cpu(steal_time, cpu);
>> +
>> +	return !!src->preempted;
>> +}
>> +
>>  #ifdef CONFIG_SMP
>>  static void __init kvm_smp_prepare_boot_cpu(void)
>>  {
>> @@ -488,6 +497,8 @@ void __init kvm_guest_init(void)
>>  	kvm_guest_cpu_init();
>>  #endif
>>
>> +	pv_vcpu_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
>
> Would be nicer to assign conditionally in the KVM_FEATURE_STEAL_TIME
> block. The steal_time structure has to be zeroed, so this code would
> work, but the native function (return false) is better if we know that
> the kvm_vcpu_is_preempted() would always return false anyway.

Yes, agreed. Will do that.

I once thought we could patch the code at runtime, replacing the binary code
"call 0x #pv_vcpu_ops.vcpu_is_preempted" with "xor eax, eax".
However, it is not worth doing that; the performance improvement might be very small.

> Old KVMs won't have the feature, so we could also assign only when KVM
> reports it, but that requires extra definitions and the performance gain
> is fairly small, so I'm ok with this.
>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> @@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
>>  		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
>>  		return;
>>
>> +	vcpu->arch.st.steal.preempted = 0;
>> +
>>  	if (vcpu->arch.st.steal.version & 1)
>>  		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
>>
>> @@ -2812,6 +2814,16 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>
>>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>  {
>> +	if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED)
>> +		if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
>> +					&vcpu->arch.st.steal,
>> +					sizeof(struct kvm_steal_time)) == 0) {
>> +			vcpu->arch.st.steal.preempted = 1;
>> +			kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
>> +					&vcpu->arch.st.steal,
>> +					sizeof(struct kvm_steal_time));
>> +		}
>
> Please name t
Re: [PATCH v4 0/5] implement vcpu preempted check
On 2016/10/19 23:58, Juergen Gross wrote:
> On 19/10/16 12:20, Pan Xinhui wrote:
>> change from v3:
>> 	add x86 vcpu preempted check patch
>> change from v2:
>> 	no code change, fix typos, update some comments
>> change from v1:
>> 	a simplier definition of default vcpu_is_preempted
>> 	skip mahcine type check on ppc, and add config. remove dedicated macro.
>> 	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>> 	add more comments
>> 	thanks boqun and Peter's suggestion.
>>
>> This patch set aims to fix lock holder preemption issues.
>>
>> test-case:
>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>
>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>
>> We introduce interface bool vcpu_is_preempted(int cpu) and use it in some
>> spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>> These spin_on_onwer variant also cause rcu stall before we apply this patch set
>>
>> We also have observed some performace improvements.
>>
>> PPC test result:
>>
>> 1 copy  -  0.94%
>> 2 copy  -  7.17%
>> 4 copy  - 11.9%
>> 8 copy  -  3.04%
>> 16 copy - 15.11%
>>
>> details below:
>> Without patch:
>>
>> 1 copy  - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps (30.0 s, 1 samples)
>> 2 copy  - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps (30.0 s, 1 samples)
>> 4 copy  - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps (30.0 s, 1 samples)
>> 8 copy  - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps (30.0 s, 1 samples)
>> 16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps (30.1 s, 1 samples)
>>
>> With patch:
>>
>> 1 copy  - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps (30.0 s, 1 samples)
>> 2 copy  - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps (30.0 s, 1 samples)
>> 4 copy  - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps (30.0 s, 1 samples)
>> 8 copy  - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps (30.0 s, 1 samples)
>> 16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps (30.0 s, 1 samples)
>>
>> X86 test result:
>>
>> test-case                              after-patch       before-patch
>> Execl Throughput                       |    18307.9 lps  |    11701.6 lps
>> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
>> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
>> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
>> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
>> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
>> Process Creation                       |    29881.2 lps  |    28572.8 lps
>> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
>> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
>> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps
>>
>> Pan Xinhui (5):
>>   kernel/sched: introduce vcpu preempted check interface
>>   locking/osq: Drop the overload of osq_lock()
>>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>>   powerpc/spinlock: support vcpu preempted check
>>   x86, kvm: support vcpu preempted check
>
> The attached patch adds Xen support for x86. Please tell me whether you
> want to add this patch to your series or if I should post it when your
> series has been accepted.

hi, Juergen
Your patch is pretty small and nice :) Thanks!
I can include your patch in my next patch set after this one has been reviewed. :)

> You can add my
>
> Tested-by: Juergen Gross
>
> for patches 1-3 and 5 (paravirt parts only).

Thanks a lot!

xinhui

> Juergen
>
>>  arch/powerpc/include/asm/spinlock.h   |  8
>>  arch/x86/include/asm/paravirt_types.h |  6 ++
>>  arch/x86/include/asm/spinlock.h       |  8
>>  arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
>>  arch/x86/kernel/kvm.c                 | 11 +++
>>  arch/x86/kernel/paravirt.c            | 11 +++
>>  arch/x86/kvm/x86.c                    | 12
>>  include/linux/sched.h                 | 12
>>  kernel/locking/mutex.c                | 15 +--
>>  kernel/locking/osq_lock.c             | 10 +-
>>  kernel/locking/rwsem-xadd.c           | 16 +---
>>  11 files changed, 105 insertions(+), 7 deletions(-)
[PATCH v4 4/5] powerpc/spinlock: support vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock implementations do a spin loop before acquiring the lock itself. Currently the kernel has an interface, bool vcpu_is_preempted(int cpu). It takes the cpu as a parameter and returns true if the cpu is preempted. The kernel can then break out of the spin loops based on the return value of vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

Only pSeries needs to support it. However, powerNV is built into the same kernel image as pSeries, so we need to return false if we are running as powerNV. Another fact is that lppaca->yield_count stays zero on powerNV, so we can just skip the machine type check.

Suggested-by: Boqun Feng
Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/spinlock.h | 8
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index abb6b0f..af4285b 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif

+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
 #define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
--
2.4.11
[PATCH v4 5/5] x86, kvm: support vcpu preempted check
This is to fix some lock holder preemption issues. Some other locks implementation do a spin loop before acquiring the lock itself. Currently kernel has an interface of bool vcpu_is_preempted(int cpu). It takes the cpu as parameter and return true if the cpu is preempted. Then kernel can break the spin loops upon on the retval of vcpu_is_preempted. As kernel has used this interface, So lets support it. We use one field of struct kvm_steal_time to indicate that if one vcpu is running or not. unix benchmark result: host: kernel 4.8.1, i5-4570, 4 cpus guest: kernel 4.8.1, 8 vcpus test-case after-patch before-patch Execl Throughput |18307.9 lps |11701.6 lps File Copy 1024 bufsize 2000 maxblocks | 1352407.3 KBps | 790418.9 KBps File Copy 256 bufsize 500 maxblocks| 367555.6 KBps | 222867.7 KBps File Copy 4096 bufsize 8000 maxblocks | 3675649.7 KBps | 1780614.4 KBps Pipe Throughput| 11872208.7 lps | 11855628.9 lps Pipe-based Context Switching | 1495126.5 lps | 1490533.9 lps Process Creation |29881.2 lps |28572.8 lps Shell Scripts (1 concurrent) |23224.3 lpm |22607.4 lpm Shell Scripts (8 concurrent) | 3531.4 lpm | 3211.9 lpm System Call Overhead | 10385653.0 lps | 10419979.0 lps Signed-off-by: Pan Xinhui --- arch/x86/include/asm/paravirt_types.h | 6 ++ arch/x86/include/asm/spinlock.h | 8 arch/x86/include/uapi/asm/kvm_para.h | 3 ++- arch/x86/kernel/kvm.c | 11 +++ arch/x86/kernel/paravirt.c| 11 +++ arch/x86/kvm/x86.c| 12 6 files changed, 50 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 0f400c0..b1c7937 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -98,6 +98,10 @@ struct pv_time_ops { unsigned long long (*steal_clock)(int cpu); }; +struct pv_vcpu_ops { + bool (*vcpu_is_preempted)(int cpu); +}; + struct pv_cpu_ops { /* hooks for various privileged instructions */ unsigned long (*get_debugreg)(int regno); @@ -318,6 +322,7 @@ struct pv_lock_ops { 
struct paravirt_patch_template { struct pv_init_ops pv_init_ops; struct pv_time_ops pv_time_ops; + struct pv_vcpu_ops pv_vcpu_ops; struct pv_cpu_ops pv_cpu_ops; struct pv_irq_ops pv_irq_ops; struct pv_mmu_ops pv_mmu_ops; @@ -327,6 +332,7 @@ struct paravirt_patch_template { extern struct pv_info pv_info; extern struct pv_init_ops pv_init_ops; extern struct pv_time_ops pv_time_ops; +extern struct pv_vcpu_ops pv_vcpu_ops; extern struct pv_cpu_ops pv_cpu_ops; extern struct pv_irq_ops pv_irq_ops; extern struct pv_mmu_ops pv_mmu_ops; diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 921bea7..52fd942 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -26,6 +26,14 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); +#ifdef CONFIG_PARAVIRT +#define vcpu_is_preempted vcpu_is_preempted +static inline bool vcpu_is_preempted(int cpu) +{ + return pv_vcpu_ops.vcpu_is_preempted(cpu); +} +#endif + #include /* diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 94dc8ca..e9c12a1 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -45,7 +45,8 @@ struct kvm_steal_time { __u64 steal; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 preempted; + __u32 pad[11]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index edbbfc8..0011bef 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -415,6 +415,15 @@ void kvm_disable_steal_time(void) wrmsr(MSR_KVM_STEAL_TIME, 0, 0); } +static bool kvm_vcpu_is_preempted(int cpu) +{ + struct kvm_steal_time *src; + + src = &per_cpu(steal_time, cpu); + + return !!src->preempted; +} + #ifdef CONFIG_SMP static void __init kvm_smp_prepare_boot_cpu(void) { @@ -488,6 +497,8 @@ void __init kvm_guest_init(void) kvm_guest_cpu_init(); #endif + pv_vcpu_ops.vcpu_is_preempted = 
kvm_vcpu_is_preempted; + /* * Hard lockup detection is enabled by default. Disable it, as guests * can get false positives too easily, for example if the host is diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index bbf3d59..7adb7e9 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -122,6 +122,7 @@ static void *get_call_destination(u8 type) struct paravirt_patch_templa
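The kvm_para.h hunk in this patch carves the new field out of the padding, so the shared structure keeps its size. A minimal user-space sketch can check that, assuming fixed-width stdint types in place of the kernel's __u64/__u32. `steal_time_old`, `steal_time_new` and `steal_time_is_preempted` are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Layout before the patch (field names follow the diff). */
struct steal_time_old {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

/* Layout after the patch: one pad word becomes "preempted". */
struct steal_time_new {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t preempted;
	uint32_t pad[11];
};

/* The guest/host shared area must not change size or move fields. */
_Static_assert(sizeof(struct steal_time_old) == 64, "old layout is 64 bytes");
_Static_assert(sizeof(struct steal_time_new) == sizeof(struct steal_time_old),
	       "new field is taken from the padding");
_Static_assert(offsetof(struct steal_time_new, preempted) == 16,
	       "preempted sits right after flags");

/* Guest-side read, mirroring the !!src->preempted test in the patch. */
static bool steal_time_is_preempted(const struct steal_time_new *st)
{
	return st->preempted != 0;
}
```

A zeroed structure reads as "not preempted", so a host that never writes the field leaves the guest with the same answer the native implementation would give.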
[PATCH v4 3/5] kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
An over-committed guest with more vCPUs than pCPUs suffers heavy overhead in the two spin_on_owner functions. This is caused by the lock holder preemption issue.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see whether a vCPU is currently running or not. So break out of the spin loops when it returns true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown]         [H] 0xc00768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Signed-off-by: Pan Xinhui
---
 kernel/locking/mutex.c      | 15 +--
 kernel/locking/rwsem-xadd.c | 16 +---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..8927e96 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 		 */
 		barrier();

-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Use vcpu_is_preempted to detech lock holder preemption issue
+		 * and break. vcpu_is_preempted is a macro defined by false if
+		 * arch does not support vcpu preempted check,
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			ret = false;
 			break;
 		}
@@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)

 	rcu_read_lock();
 	owner = READ_ONCE(lock->owner);
+
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
 	if (owner)
-		retval = owner->on_cpu;
+		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 	rcu_read_unlock();
 	/*
 	 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..ad0b5bb 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		goto done;
 	}

-	ret = owner->on_cpu;
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
+	ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
 	rcu_read_unlock();
 	return ret;
@@ -362,8 +366,14 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		barrier();

-		/* abort spinning when need_resched or owner is not running */
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * abort spinning when need_resched or owner is not running or
+		 * owner's cpu is preempted. vcpu_is_preempted is a macro
+		 * defined by false if arch does not support vcpu preempted
+		 * check
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			rcu_read_unlock();
 			return false;
 		}
--
2.4.11
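The two conditions this patch changes can be isolated in a small user-space sketch. Everything here is an assumption made for illustration: `task_sketch`, `preempted_map`, `need_resched_sketch`, `should_stop_spinning` and `can_spin_on_owner` are hypothetical, and the array lookup stands in for whatever the architecture reports. The sketch models the decision logic only, not the locking or RCU around it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4

/* Hypothetical stand-in for the fields of task_struct used here. */
struct task_sketch {
	int on_cpu;	/* non-zero while the task is running */
	int cpu;	/* CPU the task was last placed on */
};

/* Stand-in for hypervisor state: which vCPUs are currently preempted. */
static bool preempted_map[DEMO_NR_CPUS];

static bool vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return preempted_map[cpu];
}

/* This sketch never asks the spinner to reschedule. */
static bool need_resched_sketch(void)
{
	return false;
}

/* Per-iteration exit test of the spin loop, as extended by the patch. */
static bool should_stop_spinning(const struct task_sketch *owner)
{
	return !owner->on_cpu || need_resched_sketch() ||
	       vcpu_is_preempted(owner->cpu);
}

/* Entry test: only start spinning if the owner is really running. */
static bool can_spin_on_owner(const struct task_sketch *owner)
{
	if (owner == NULL)
		return true;	/* no owner recorded: spinning may still pay off */
	return owner->on_cpu && !vcpu_is_preempted(owner->cpu);
}
```

The point of the extra term is that `on_cpu` alone describes the guest's view. If the owner's vCPU has been descheduled by the host, `on_cpu` is still set while the owner makes no progress.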
[PATCH v4 0/5] implement vcpu preempted check
change from v3:
	add x86 vcpu preempted check patch
change from v2:
	no code change, fix typos, update some comments
change from v1:
	a simpler definition of default vcpu_is_preempted
	skip machine type check on ppc, and add config. remove dedicated macro.
	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
	add more comments
	thanks to Boqun and Peter for their suggestions.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused RCU stalls before we applied this patch set.

We have also observed some performance improvements.

PPC test result:

1 copy  -  0.94%
2 copy  -  7.17%
4 copy  - 11.9%
8 copy  -  3.04%
16 copy - 15.11%

details below:
Without patch:

1 copy  - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps (30.0 s, 1 samples)
2 copy  - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps (30.0 s, 1 samples)
4 copy  - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps (30.0 s, 1 samples)
8 copy  - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps (30.1 s, 1 samples)

With patch:

1 copy  - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps (30.0 s, 1 samples)
2 copy  - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps (30.0 s, 1 samples)
4 copy  - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps (30.0 s, 1 samples)
8 copy  - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps (30.0 s, 1 samples)

X86 test result:

test-case                              after-patch       before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
Process Creation                       |    29881.2 lps  |    28572.8 lps
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps

Pan Xinhui (5):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, kvm: support vcpu preempted check

 arch/powerpc/include/asm/spinlock.h   |  8
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/spinlock.h       |  8
 arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
 arch/x86/kernel/kvm.c                 | 11 +++
 arch/x86/kernel/paravirt.c            | 11 +++
 arch/x86/kvm/x86.c                    | 12
 include/linux/sched.h                 | 12
 kernel/locking/mutex.c                | 15 +--
 kernel/locking/osq_lock.c             | 10 +-
 kernel/locking/rwsem-xadd.c           | 16 +---
 11 files changed, 105 insertions(+), 7 deletions(-)

--
2.4.11
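The cover letter does not say how the PPC percentages were derived from the KBps figures. Working backwards from the numbers, they appear to match (with - without) / with, truncated to two decimals. That is an inference, not something the author states. The sketch below uses hypothetical helper names and computes both that ratio and the more common gain relative to the unpatched baseline, so the two can be compared.

```c
#include <assert.h>

/* Gain expressed relative to the patched throughput (inferred formula). */
static double gain_vs_patched(double without_kbps, double with_kbps)
{
	return (with_kbps - without_kbps) / with_kbps * 100.0;
}

/* Gain expressed relative to the unpatched baseline. */
static double gain_vs_baseline(double without_kbps, double with_kbps)
{
	return (with_kbps - without_kbps) / without_kbps * 100.0;
}
```

For the 2-copy row (1804433.0 KBps without, 1943816.0 KBps with), the first formula gives roughly 7.17 and the second roughly 7.72, so a reader comparing against baseline-relative numbers elsewhere should expect slightly larger values.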
[PATCH v4 2/5] locking/osq: Drop the overload of osq_lock()
An over-committed guest with more vCPUs than pCPUs suffers heavy overhead in osq_lock().

This is because vCPU A holds the osq lock and yields, while vCPU B waits for the per_cpu node->locked to be set. IOW, vCPU B waits for vCPU A to run and unlock the osq lock.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see whether a vCPU is currently running or not. So break out of the spin loops when it returns true.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng
Signed-off-by: Pan Xinhui
---
 kernel/locking/osq_lock.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..39d1385 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
 	return cpu_nr + 1;
 }

+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+	return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
 	int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
+		 * Use vcpu_is_preempted to detech lock holder preemption issue
+		 * and break. vcpu_is_preempted is a macro defined by false if
+		 * arch does not support vcpu preempted check,
 		 */
-		if (need_resched())
+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
 			goto unqueue;

 		cpu_relax_lowlatency();
--
2.4.11
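The new node_cpu() helper undoes the off-by-one encoding that encode_cpu() applies. A small user-space sketch of that round trip follows. `optimistic_spin_node_sketch` is a hypothetical, cut-down stand-in for the kernel structure, and the comment about 0 being reserved reflects my reading of the queue's "unlocked" tail value rather than anything stated in this patch.

```c
#include <assert.h>

/* Cut-down stand-in: only the encoded CPU field is modelled. */
struct optimistic_spin_node_sketch {
	int cpu;	/* encoded as CPU number + 1 (0 is assumed reserved) */
};

/* Same arithmetic as the kernel helper shown in the diff context. */
static inline int encode_cpu(int cpu_nr)
{
	return cpu_nr + 1;
}

/* Same arithmetic as the helper the patch adds. */
static inline int node_cpu(const struct optimistic_spin_node_sketch *node)
{
	return node->cpu - 1;
}
```

In the patched loop the helper is applied to node->prev, that is, the waiter asks whether the vCPU of the node queued in front of it has been preempted, since that is the one it depends on to make progress.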
[PATCH v4 1/5] kernel/sched: introduce vcpu preempted check interface
This patch helps fix the lock holder preemption issue.

Kernel users can call bool vcpu_is_preempted(int cpu) to detect whether a vcpu is preempted or not.

The default implementation is a macro defined as false, so the compiler can optimize it out if the arch does not support such a vcpu preempted check.

Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
---
 include/linux/sched.h | 12
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)

 #endif /* CONFIG_SMP */

+/*
+ * In order to deal with a various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu)	false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);

--
2.4.11
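The override idiom used across this series (an arch header defines both a function and a same-named macro, and the generic header supplies a fallback only if the macro is absent) can be shown in one file. This is a sketch: `DEMO_ARCH_HAS_PREEMPT_CHECK` and `spin_budget` are hypothetical, and the "CPU 3 is preempted" rule is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* What an architecture header might provide (hypothetical). */
#ifdef DEMO_ARCH_HAS_PREEMPT_CHECK
#define vcpu_is_preempted vcpu_is_preempted
static inline bool vcpu_is_preempted(int cpu)
{
	return cpu == 3;	/* made-up rule for the demo */
}
#endif

/* Generic fallback, same shape as the one added to sched.h. */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu)	false
#endif

/* A caller that works unchanged with either definition. */
static int spin_budget(int owner_cpu)
{
	return vcpu_is_preempted(owner_cpu) ? 0 : 100;
}
```

Built without `DEMO_ARCH_HAS_PREEMPT_CHECK`, the condition is the constant `false`, so the compiler can drop the zero branch entirely. Built with `-DDEMO_ARCH_HAS_PREEMPT_CHECK`, the same caller picks up the arch function because the self-referential `#define` suppresses the fallback.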