Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 08/21/2017 03:42 PM, Peter Zijlstra wrote:
> On Mon, Aug 21, 2017 at 09:25:50PM +0200, Peter Zijlstra wrote:
>> On Mon, Aug 21, 2017 at 07:00:02PM +0100, Will Deacon wrote:
>>>> No, I meant _from_ the LL load, not _to_ a later load.
>>> Sorry, I'm still not following enough to give you a definitive answer on
>>> that. Could you give an example, please? These sequences usually run in
>>> a loop, so the conditional branch back (based on the status flag) is where
>>> the read-after-read comes in.
>>>
>>> Any control dependencies from the loaded data exist regardless of the status
>>> flag.
>> Basically what Waiman ended up doing, something like:
>>
>>	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) !=
>>	    vcpu_halted)
>>		return;
>>
>>	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>>
>> Where the STORE depends on the LL value being 'complete'.

pn->state == vcpu_halted is the prerequisite of putting _Q_SLOW_VAL into
the lock. The order of writing vcpu_hashed into pn->state doesn't really
matter. The cmpxchg_relaxed() here should synchronize with the cmpxchg()
in pv_wait_node().

Cheers,
Longman
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Mon, Aug 21, 2017 at 09:25:50PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 21, 2017 at 07:00:02PM +0100, Will Deacon wrote:
> > > No, I meant _from_ the LL load, not _to_ a later load.
> >
> > Sorry, I'm still not following enough to give you a definitive answer on
> > that. Could you give an example, please? These sequences usually run in
> > a loop, so the conditional branch back (based on the status flag) is where
> > the read-after-read comes in.
> >
> > Any control dependencies from the loaded data exist regardless of the status
> > flag.
>
> Basically what Waiman ended up doing, something like:
>
>	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) !=
>	    vcpu_halted)
>		return;
>
>	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>
> Where the STORE depends on the LL value being 'complete'.

Yup, that's ordered as you would expect. Thanks for the example!

Will
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Mon, Aug 21, 2017 at 09:25:50PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 21, 2017 at 07:00:02PM +0100, Will Deacon wrote:
> > > No, I meant _from_ the LL load, not _to_ a later load.
> >
> > Sorry, I'm still not following enough to give you a definitive answer on
> > that. Could you give an example, please? These sequences usually run in
> > a loop, so the conditional branch back (based on the status flag) is where
> > the read-after-read comes in.
> >
> > Any control dependencies from the loaded data exist regardless of the status
> > flag.
>
> Basically what Waiman ended up doing, something like:
>
>	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) !=
>	    vcpu_halted)
>		return;
>
>	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>
> Where the STORE depends on the LL value being 'complete'.
>
> For any RmW we can only create a control dependency from the LOAD. The
> same could be done for something like:
>
>	if (atomic_inc_not_zero(&obj->refs))
>		WRITE_ONCE(obj->foo, 1);

Obviously I meant the hypothetical atomic_inc_not_zero_relaxed() here,
otherwise all the implied smp_mb() spoil the game.

> Where we only do the STORE if we acquire the reference. While the
> WRITE_ONCE() will not be ordered against the increment, it is ordered
> against the LL and we know it must not be 0.
>
> Per the LL/SC loop we'll have observed a !0 value and committed the SC
> (which need not be visible or ordered against any later store) but both
> STORES (SC and the WRITE_ONCE) must be after the ->refs LOAD.
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Mon, Aug 21, 2017 at 07:00:02PM +0100, Will Deacon wrote:
> > No, I meant _from_ the LL load, not _to_ a later load.
>
> Sorry, I'm still not following enough to give you a definitive answer on
> that. Could you give an example, please? These sequences usually run in
> a loop, so the conditional branch back (based on the status flag) is where
> the read-after-read comes in.
>
> Any control dependencies from the loaded data exist regardless of the status
> flag.

Basically what Waiman ended up doing, something like:

	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) !=
	    vcpu_halted)
		return;

	WRITE_ONCE(l->locked, _Q_SLOW_VAL);

Where the STORE depends on the LL value being 'complete'.

For any RmW we can only create a control dependency from the LOAD. The
same could be done for something like:

	if (atomic_inc_not_zero(&obj->refs))
		WRITE_ONCE(obj->foo, 1);

Where we only do the STORE if we acquire the reference. While the
WRITE_ONCE() will not be ordered against the increment, it is ordered
against the LL and we know it must not be 0.

Per the LL/SC loop we'll have observed a !0 value and committed the SC
(which need not be visible or ordered against any later store) but both
STORES (SC and the WRITE_ONCE) must be after the ->refs LOAD.
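[Editor's note: the refcount pattern above can be sketched with portable C11 atomics. This is an illustrative model only, not kernel code: `inc_not_zero_relaxed()` is a stand-in for the hypothetical relaxed `atomic_inc_not_zero()` variant, built from a relaxed CAS loop, and `get_and_mark()` shows the caller's conditional store.]

```c
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
    atomic_int refs;
    int foo;
};

/* Hypothetical relaxed analogue of the kernel's atomic_inc_not_zero():
 * increment refs unless it is zero, with no implied barriers.  The data
 * value returned by the load is what the caller's branch hangs its
 * control dependency on -- not the CAS success flag. */
static bool inc_not_zero_relaxed(atomic_int *refs)
{
    int old = atomic_load_explicit(refs, memory_order_relaxed);

    do {
        if (old == 0)
            return false;           /* object already dead; no reference taken */
    } while (!atomic_compare_exchange_weak_explicit(refs, &old, old + 1,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    return true;
}

/* The caller's pattern from the discussion above: the store to foo is
 * control-dependent on having observed refs != 0. */
static void get_and_mark(struct obj *o)
{
    if (inc_not_zero_relaxed(&o->refs))
        o->foo = 1;
}
```

Under this model the store to `foo` can only be issued once the load of `refs` has observed a non-zero value, though nothing orders it against the store that commits the increment.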
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Mon, Aug 21, 2017 at 12:55:08PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 15, 2017 at 07:40:35PM +0100, Will Deacon wrote:
> > On Mon, Aug 14, 2017 at 08:47:11PM +0200, Peter Zijlstra wrote:
> > > On Mon, Aug 14, 2017 at 01:01:22PM +0100, Will Deacon wrote:
> > > > Yeah, that's right, you can't use the STXR status flag to create control
> > > > dependencies.
> > >
> > > Just for my elucidation; you can't use it to create a control dependency
> > > on the store, but you can use it to create a control dependency on the
> > > corresponding load, right?
> >
> > Hmm, sort of, but I'd say that the reads are really ordered due to
> > read-after-read ordering in that case. Control dependencies to loads
> > don't give you order.
>
> No, I meant _from_ the LL load, not _to_ a later load.

Sorry, I'm still not following enough to give you a definitive answer on
that. Could you give an example, please? These sequences usually run in
a loop, so the conditional branch back (based on the status flag) is where
the read-after-read comes in.

Any control dependencies from the loaded data exist regardless of the status
flag.

> > > Now, IIRC, we've defined control dependencies as being LOAD->STORE
> > > ordering, so in that respect nothing is lost. But maybe we should
> > > explicitly mention that if the LOAD is part of an (otherwise) atomic RmW
> > > the STORE is not constrained.
> >
> > I could well be misreading your suggestion, but it feels like that's too
> > weak. You can definitely still have control dependencies off the LL part
> > of the LL/SC pair, just not off the SC part.
> >
> > E.g. this version of LB is forbidden on arm64:
> >
> > P0:
> > if (atomic_inc_return_relaxed(&x) == 2)
> >	atomic_set(&y, 1);
> >
> > P1:
> > if (atomic_inc_return_relaxed(&y) == 2)
> >	atomic_set(&x, 1);
> >
> > Perhaps when you say "the STORE", you mean the store in the atomic RmW,
> > rather than the store in the LOAD->STORE control dependency?
>
> Yes. So I was looking to exclude (SC) STORE -> STORE order through
> control dependencies.

Ok, good.

Will
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Tue, Aug 15, 2017 at 07:40:35PM +0100, Will Deacon wrote:
> On Mon, Aug 14, 2017 at 08:47:11PM +0200, Peter Zijlstra wrote:
> > On Mon, Aug 14, 2017 at 01:01:22PM +0100, Will Deacon wrote:
> > > Yeah, that's right, you can't use the STXR status flag to create control
> > > dependencies.
> >
> > Just for my elucidation; you can't use it to create a control dependency
> > on the store, but you can use it to create a control dependency on the
> > corresponding load, right?
>
> Hmm, sort of, but I'd say that the reads are really ordered due to
> read-after-read ordering in that case. Control dependencies to loads
> don't give you order.

No, I meant _from_ the LL load, not _to_ a later load.

> > Now, IIRC, we've defined control dependencies as being LOAD->STORE
> > ordering, so in that respect nothing is lost. But maybe we should
> > explicitly mention that if the LOAD is part of an (otherwise) atomic RmW
> > the STORE is not constrained.
>
> I could well be misreading your suggestion, but it feels like that's too
> weak. You can definitely still have control dependencies off the LL part
> of the LL/SC pair, just not off the SC part.
>
> E.g. this version of LB is forbidden on arm64:
>
> P0:
> if (atomic_inc_return_relaxed(&x) == 2)
>	atomic_set(&y, 1);
>
> P1:
> if (atomic_inc_return_relaxed(&y) == 2)
>	atomic_set(&x, 1);
>
> Perhaps when you say "the STORE", you mean the store in the atomic RmW,
> rather than the store in the LOAD->STORE control dependency?

Yes. So I was looking to exclude (SC) STORE -> STORE order through
control dependencies.
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Mon, Aug 14, 2017 at 08:47:11PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 14, 2017 at 01:01:22PM +0100, Will Deacon wrote:
> > Yeah, that's right, you can't use the STXR status flag to create control
> > dependencies.
>
> Just for my elucidation; you can't use it to create a control dependency
> on the store, but you can use it to create a control dependency on the
> corresponding load, right?

Hmm, sort of, but I'd say that the reads are really ordered due to
read-after-read ordering in that case. Control dependencies to loads
don't give you order.

> Now, IIRC, we've defined control dependencies as being LOAD->STORE
> ordering, so in that respect nothing is lost. But maybe we should
> explicitly mention that if the LOAD is part of an (otherwise) atomic RmW
> the STORE is not constrained.

I could well be misreading your suggestion, but it feels like that's too
weak. You can definitely still have control dependencies off the LL part
of the LL/SC pair, just not off the SC part.

E.g. this version of LB is forbidden on arm64:

P0:
if (atomic_inc_return_relaxed(&x) == 2)
	atomic_set(&y, 1);

P1:
if (atomic_inc_return_relaxed(&y) == 2)
	atomic_set(&x, 1);

Perhaps when you say "the STORE", you mean the store in the atomic RmW,
rather than the store in the LOAD->STORE control dependency?

Will
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Mon, Aug 14, 2017 at 01:01:22PM +0100, Will Deacon wrote:
> Yeah, that's right, you can't use the STXR status flag to create control
> dependencies.

Just for my elucidation; you can't use it to create a control dependency
on the store, but you can use it to create a control dependency on the
corresponding load, right?

Now, IIRC, we've defined control dependencies as being LOAD->STORE
ordering, so in that respect nothing is lost. But maybe we should
explicitly mention that if the LOAD is part of an (otherwise) atomic RmW
the STORE is not constrained.
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Mon, Aug 14, 2017 at 11:01:10AM -0400, Waiman Long wrote:
> On 08/14/2017 08:01 AM, Will Deacon wrote:
> > On Fri, Aug 11, 2017 at 11:06:01AM +0200, Peter Zijlstra wrote:
> > > On Thu, Aug 10, 2017 at 02:18:30PM -0400, Waiman Long wrote:
> > > > On 08/10/2017 12:22 PM, Waiman Long wrote:
> > > > > On 08/10/2017 12:15 PM, Peter Zijlstra wrote:
> > > > > > Might as well do an explicit:
> > > > > >
> > > > > >	smp_mb__before_atomic()
> > > > > >	cmpxchg_relaxed()
> > > > > >	smp_mb__after_atomic()
> > > > > >
> > > > > > I suppose and not introduce new primitives.
> > > > I think we don't need smp_mb__after_atomic(). The read has to be fully
> > > > ordered, but the write part may not need it as the control dependency of
> > > > the old value should guard against incorrect action. Right?
> > > You'd think that, but IIRC there was something funny about using the SC
> > > return flag for control dependencies. Will?
> > Yeah, that's right, you can't use the STXR status flag to create control
> > dependencies.
>
> Actually, the code sequence that I plan to use is:
>
>	smp_mb__before_atomic();
>	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed)
>			!= vcpu_halted)
>		return;
>
>	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>	(void)pv_hash(lock, pn);
>
> I am planning to use the comparison of the returned value (pn->state)
> against vcpu_halted as the control dependency. I don't see how the status
> flag of STXR is affecting this.

Thanks for the context. I agree that you've got a control dependency in
this case, so the WRITE_ONCE will be ordered after the LL part of the
cmpxchg. It could still be reordered with respect to the write part,
however.

Will
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 08/14/2017 08:01 AM, Will Deacon wrote:
> On Fri, Aug 11, 2017 at 11:06:01AM +0200, Peter Zijlstra wrote:
>> On Thu, Aug 10, 2017 at 02:18:30PM -0400, Waiman Long wrote:
>>> On 08/10/2017 12:22 PM, Waiman Long wrote:
>>>> On 08/10/2017 12:15 PM, Peter Zijlstra wrote:
>>>>> Might as well do an explicit:
>>>>>
>>>>>	smp_mb__before_atomic()
>>>>>	cmpxchg_relaxed()
>>>>>	smp_mb__after_atomic()
>>>>>
>>>>> I suppose and not introduce new primitives.
>>> I think we don't need smp_mb__after_atomic(). The read has to be fully
>>> ordered, but the write part may not need it as the control dependency of
>>> the old value should guard against incorrect action. Right?
>> You'd think that, but IIRC there was something funny about using the SC
>> return flag for control dependencies. Will?
> Yeah, that's right, you can't use the STXR status flag to create control
> dependencies.
>
> Will

Actually, the code sequence that I plan to use is:

	smp_mb__before_atomic();
	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed)
			!= vcpu_halted)
		return;

	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
	(void)pv_hash(lock, pn);

I am planning to use the comparison of the returned value (pn->state)
against vcpu_halted as the control dependency. I don't see how the status
flag of STXR is affecting this.

Cheers,
Longman
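[Editor's note: for illustration, the planned sequence above maps roughly onto C11 primitives as follows. This is a sketch, not the kernel implementation: `atomic_thread_fence(memory_order_seq_cst)` stands in for `smp_mb__before_atomic()`, the struct layouts are simplified stand-ins, and `_Q_SLOW_VAL` is given an arbitrary value here.]

```c
#include <stdatomic.h>

enum vcpu_state { vcpu_running, vcpu_halted, vcpu_hashed };

struct pv_node  { atomic_int state; };
struct qspinlock { atomic_int locked; };

#define _Q_SLOW_VAL 3   /* arbitrary stand-in value for this sketch */

/* Rough C11 rendering of the sequence above: a full fence standing in
 * for smp_mb__before_atomic(), then a fully relaxed cmpxchg.  The early
 * return on the observed old value is the control dependency that keeps
 * the store to l->locked after the load of pn->state. */
static void set_slow_if_halted(struct qspinlock *l, struct pv_node *pn)
{
    int expected = vcpu_halted;

    atomic_thread_fence(memory_order_seq_cst);  /* smp_mb__before_atomic() */
    if (!atomic_compare_exchange_strong_explicit(&pn->state, &expected,
                                                 vcpu_hashed,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed))
        return;     /* pn->state was not vcpu_halted */

    /* Control-dependent on having observed vcpu_halted above. */
    atomic_store_explicit(&l->locked, _Q_SLOW_VAL, memory_order_relaxed);
}
```

The branch hangs off the observed old value (the LL data), not the STXR status flag, which matches Will's reading that the store is ordered after the LL part of the cmpxchg but not necessarily after its write part.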
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Fri, Aug 11, 2017 at 11:06:01AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 02:18:30PM -0400, Waiman Long wrote:
> > On 08/10/2017 12:22 PM, Waiman Long wrote:
> > > On 08/10/2017 12:15 PM, Peter Zijlstra wrote:
> > > > Might as well do an explicit:
> > > >
> > > >	smp_mb__before_atomic()
> > > >	cmpxchg_relaxed()
> > > >	smp_mb__after_atomic()
> > > >
> > > > I suppose and not introduce new primitives.
> >
> > I think we don't need smp_mb__after_atomic(). The read has to be fully
> > ordered, but the write part may not need it as the control dependency of
> > the old value should guard against incorrect action. Right?
>
> You'd think that, but IIRC there was something funny about using the SC
> return flag for control dependencies. Will?

Yeah, that's right, you can't use the STXR status flag to create control
dependencies.

Will
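[Editor's note: for readers without the LL/SC background, a `cmpxchg()` on arm64 expands to a load-exclusive/store-exclusive (LDXR/STXR) loop roughly like the C model below. This is not real arm64 code; `ll()` and `sc()` are stand-ins modeled with ordinary atomics so the control flow is runnable. The "status flag" in question is the SC success/failure result, which reports only whether the exclusive store won, so branching on it creates no usable control dependency, whereas branching on the data returned by the LL does.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of load-exclusive (LDXR): just a relaxed load here. */
static int ll(atomic_int *p)
{
    return atomic_load_explicit(p, memory_order_relaxed);
}

/* Model of store-exclusive (STXR): succeeds only if *p is unchanged,
 * modeled with a relaxed CAS.  Its boolean result is the "status flag". */
static bool sc(atomic_int *p, int old, int new)
{
    return atomic_compare_exchange_strong_explicit(p, &old, new,
                                                   memory_order_relaxed,
                                                   memory_order_relaxed);
}

/* The shape of a cmpxchg() built on LL/SC: returns the old value. */
static int cmpxchg_model(atomic_int *p, int want, int new)
{
    int old;

    for (;;) {
        old = ll(p);            /* LL: the loaded *data* value can carry  */
        if (old != want)        /* a control dependency to later stores.  */
            return old;
        if (sc(p, old, new))    /* SC status flag: only says whether the  */
            return old;         /* exclusive store won; branching on it   */
    }                           /* orders nothing.                        */
}
```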
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Thu, Aug 10, 2017 at 02:18:30PM -0400, Waiman Long wrote:
> On 08/10/2017 12:22 PM, Waiman Long wrote:
> > On 08/10/2017 12:15 PM, Peter Zijlstra wrote:
> > > Might as well do an explicit:
> > >
> > >	smp_mb__before_atomic()
> > >	cmpxchg_relaxed()
> > >	smp_mb__after_atomic()
> > >
> > > I suppose and not introduce new primitives.
>
> I think we don't need smp_mb__after_atomic(). The read has to be fully
> ordered, but the write part may not need it as the control dependency of
> the old value should guard against incorrect action. Right?

You'd think that, but IIRC there was something funny about using the SC
return flag for control dependencies. Will?
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Thu, Aug 10, 2017 at 11:13:17AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 04:12:13PM +0800, Boqun Feng wrote:
> > > Or is the reason this doesn't work on PPC that its RCpc?
>
> So that :-)
>
> > Here is an example why PPC needs a sync() before the cmpxchg():
> >
> > https://marc.info/?l=linux-kernel&m=144485396224519&w=2
> >
> > and Paul McKenney's detailed explanation about why this could happen:
> >
> > https://marc.info/?l=linux-kernel&m=144485909826241&w=2
> >
> > (Somehow, I feel like he was answering to a similar question as
> > you ask here ;-))
>
> Yes, and I had vague memories of having gone over this before, but
> couldn't quickly find things. Thanks!
>
> > And I think aarch64 doesn't have a problem here because it is "(other)
> > multi-copy atomic". Will?
>
> Right, its the RCpc vs RCsc thing. The ARM64 release is as you say
> multi-copy atomic, whereas the PPC lwsync is not.
>
> This still leaves us with the situation that we need an smp_mb() between
> smp_store_release() and a possibly failing cmpxchg() if we want to
> guarantee the cmpxchg()'s load comes after the store-release.

For whatever it is worth, this is why C11 allows specifying one
memory-order strength for the success case and another for the failure
case. But it is not immediately clear that we need another level of
combinatorial API explosion...

							Thanx, Paul
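[Editor's note: the C11 facility Paul refers to is `atomic_compare_exchange_strong_explicit()`, which takes distinct memory orders for the success and failure cases. A minimal non-kernel sketch; the helper name and the particular orders chosen here are illustrative:]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Try to CAS *v from 'want' to 'new': a fully ordered RmW on success,
 * but only a relaxed load on failure -- the per-case strengths C11 lets
 * you specify.  Returns true on success; on failure *seen receives the
 * value actually observed. */
static bool cas_dual_order(atomic_int *v, int want, int new, int *seen)
{
    int expected = want;
    bool ok = atomic_compare_exchange_strong_explicit(v, &expected, new,
                                                      memory_order_seq_cst,   /* success order */
                                                      memory_order_relaxed);  /* failure order */
    *seen = expected;
    return ok;
}
```

C11 requires the failure order to be no stronger than the success order, which is exactly the asymmetry the thread is discussing: the load of a failed cmpxchg() need not pay for a full barrier.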
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 08/10/2017 12:22 PM, Waiman Long wrote:
> On 08/10/2017 12:15 PM, Peter Zijlstra wrote:
>> On Thu, Aug 10, 2017 at 09:58:57AM -0400, Waiman Long wrote:
>>> On 08/10/2017 09:27 AM, Waiman Long wrote:
>>>> On 08/10/2017 07:50 AM, Peter Zijlstra wrote:
>>>>> On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
>>>>>> # of thread     w/o patch       with patch      % Change
>>>>>> -----------     ---------       ----------      --------
>>>>>>      4          4053.3 Mop/s    4223.7 Mop/s     +4.2%
>>>>>>      8          3310.4 Mop/s    3406.0 Mop/s     +2.9%
>>>>>>     12          2576.4 Mop/s    2674.6 Mop/s     +3.8%
>>>>> Waiman, could you run those numbers again but with the below 'fixed'?
>>>>>
>>>>>> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>>>>>           * observe its next->locked value and advance itself.
>>>>>>           *
>>>>>>           * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>>>>> +         *
>>>>>> +         * The write to next->locked in arch_mcs_spin_unlock_contended()
>>>>>> +         * must be ordered before the read of pn->state in the cmpxchg()
>>>>>> +         * below for the code to work correctly. However, this is not
>>>>>> +         * guaranteed on all architectures when the cmpxchg() call fails.
>>>>>> +         * Both x86 and PPC can provide that guarantee, but other
>>>>>> +         * architectures not necessarily.
>>>>>>           */
>>>>>	smp_mb();
>>>>>
>>>>>> 	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>>>>> 		return;
>>>>> Ideally this Power CPU can optimize back-to-back SYNC instructions, but
>>>>> who knows...
>>>> Yes, I can run the numbers again. However, the changes here are in the
>>>> slowpath. My current patch optimizes the fast path only, and my original
>>>> test doesn't stress the slowpath at all, I think. I will have to make
>>>> some changes to stress the slowpath.
>>> Looking at past emails, I remember why I put the comment there. Putting
>>> an smp_mb() here will definitely have a negative performance impact on
>>> x86. So I put in the comment here to remind me that the current code may
>>> not work for ARM64.
>>>
>>> To fix that, my current thought is to have a cmpxchg variant that
>>> guarantees ordering for both success and failure, for example,
>>> cmpxchg_ordered(). In that way, we only need to insert the barrier for
>>> architectures that need it. That will be a separate patch instead of
>>> integrating into this one.
>> Might as well do an explicit:
>>
>>	smp_mb__before_atomic()
>>	cmpxchg_relaxed()
>>	smp_mb__after_atomic()
>>
>> I suppose and not introduce new primitives.

I think we don't need smp_mb__after_atomic(). The read has to be fully
ordered, but the write part may not need it as the control dependency of
the old value should guard against incorrect action. Right?

Cheers,
Longman
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 08/10/2017 12:15 PM, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 09:58:57AM -0400, Waiman Long wrote:
>> On 08/10/2017 09:27 AM, Waiman Long wrote:
>>> On 08/10/2017 07:50 AM, Peter Zijlstra wrote:
>>>> On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
>>>>> # of thread     w/o patch       with patch      % Change
>>>>> -----------     ---------       ----------      --------
>>>>>      4          4053.3 Mop/s    4223.7 Mop/s     +4.2%
>>>>>      8          3310.4 Mop/s    3406.0 Mop/s     +2.9%
>>>>>     12          2576.4 Mop/s    2674.6 Mop/s     +3.8%
>>>> Waiman, could you run those numbers again but with the below 'fixed'?
>>>>
>>>>> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>>>>          * observe its next->locked value and advance itself.
>>>>>          *
>>>>>          * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>>>> +        *
>>>>> +        * The write to next->locked in arch_mcs_spin_unlock_contended()
>>>>> +        * must be ordered before the read of pn->state in the cmpxchg()
>>>>> +        * below for the code to work correctly. However, this is not
>>>>> +        * guaranteed on all architectures when the cmpxchg() call fails.
>>>>> +        * Both x86 and PPC can provide that guarantee, but other
>>>>> +        * architectures not necessarily.
>>>>>          */
>>>>	smp_mb();
>>>>
>>>>> 	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>>>> 		return;
>>>> Ideally this Power CPU can optimize back-to-back SYNC instructions, but
>>>> who knows...
>>> Yes, I can run the numbers again. However, the changes here are in the
>>> slowpath. My current patch optimizes the fast path only, and my original
>>> test doesn't stress the slowpath at all, I think. I will have to make
>>> some changes to stress the slowpath.
>> Looking at past emails, I remember why I put the comment there. Putting
>> an smp_mb() here will definitely have a negative performance impact on
>> x86. So I put in the comment here to remind me that the current code may
>> not work for ARM64.
>>
>> To fix that, my current thought is to have a cmpxchg variant that
>> guarantees ordering for both success and failure, for example,
>> cmpxchg_ordered(). In that way, we only need to insert the barrier for
>> architectures that need it. That will be a separate patch instead of
>> integrating into this one.
> Might as well do an explicit:
>
>	smp_mb__before_atomic()
>	cmpxchg_relaxed()
>	smp_mb__after_atomic()
>
> I suppose and not introduce new primitives.

Right. I think that will work without impacting current x86 performance.
Will update my patch accordingly.

Thanks,
Longman
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Thu, Aug 10, 2017 at 09:58:57AM -0400, Waiman Long wrote:
> On 08/10/2017 09:27 AM, Waiman Long wrote:
> > On 08/10/2017 07:50 AM, Peter Zijlstra wrote:
> > > On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
> > > > # of thread     w/o patch       with patch      % Change
> > > > -----------     ---------       ----------      --------
> > > >      4          4053.3 Mop/s    4223.7 Mop/s     +4.2%
> > > >      8          3310.4 Mop/s    3406.0 Mop/s     +2.9%
> > > >     12          2576.4 Mop/s    2674.6 Mop/s     +3.8%
> > > Waiman, could you run those numbers again but with the below 'fixed'?
> > >
> > > > @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> > > >          * observe its next->locked value and advance itself.
> > > >          *
> > > >          * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > > > +        *
> > > > +        * The write to next->locked in arch_mcs_spin_unlock_contended()
> > > > +        * must be ordered before the read of pn->state in the cmpxchg()
> > > > +        * below for the code to work correctly. However, this is not
> > > > +        * guaranteed on all architectures when the cmpxchg() call fails.
> > > > +        * Both x86 and PPC can provide that guarantee, but other
> > > > +        * architectures not necessarily.
> > > >          */
> > >	smp_mb();
> > >
> > > > 	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> > > > 		return;
> > > Ideally this Power CPU can optimize back-to-back SYNC instructions, but
> > > who knows...
> > Yes, I can run the numbers again. However, the changes here are in the
> > slowpath. My current patch optimizes the fast path only, and my original
> > test doesn't stress the slowpath at all, I think. I will have to make
> > some changes to stress the slowpath.
> Looking at past emails, I remember why I put the comment there. Putting
> an smp_mb() here will definitely have a negative performance impact on
> x86. So I put in the comment here to remind me that the current code may
> not work for ARM64.
>
> To fix that, my current thought is to have a cmpxchg variant that
> guarantees ordering for both success and failure, for example,
> cmpxchg_ordered(). In that way, we only need to insert the barrier for
> architectures that need it. That will be a separate patch instead of
> integrating into this one.

Might as well do an explicit:

	smp_mb__before_atomic()
	cmpxchg_relaxed()
	smp_mb__after_atomic()

I suppose and not introduce new primitives.
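The barrier sandwich Peter suggests can be modeled in portable C11 as a full fence, a relaxed CAS, and another full fence. This is only a rough user-space analogue: in the kernel, smp_mb__before_atomic()/smp_mb__after_atomic() are no-ops on x86 (whose full-barrier atomics already order) and emit real barriers on weakly ordered architectures, which C11 fences do not capture. The `pstate` name and values are stand-ins.

```c
#include <stdatomic.h>
#include <assert.h>

static _Atomic int pstate = 1;   /* stand-in for pn->state; 1 == halted */

/* Fence / relaxed-CAS / fence: ordering is supplied explicitly around
 * the CAS, so the CAS itself can be relaxed, and success and failure
 * paths are ordered identically.  Returns the old value, cmpxchg-style. */
static int cmpxchg_fully_ordered(int old, int new_val)
{
    int expected = old;
    atomic_thread_fence(memory_order_seq_cst);   /* ~smp_mb__before_atomic() */
    atomic_compare_exchange_strong_explicit(&pstate, &expected, new_val,
                                            memory_order_relaxed,
                                            memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* ~smp_mb__after_atomic() */
    return expected;   /* CAS wrote back the observed value on failure */
}
```

Waiman's follow-up point is that the trailing fence may be droppable: the failing case only needs the load ordered, and the success case's store may be covered by the control dependency.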
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 08/10/2017 09:27 AM, Waiman Long wrote:
> On 08/10/2017 07:50 AM, Peter Zijlstra wrote:
>> On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
>>> # of thread     w/o patch       with patch      % Change
>>> -----------     ---------       ----------      --------
>>>      4          4053.3 Mop/s    4223.7 Mop/s     +4.2%
>>>      8          3310.4 Mop/s    3406.0 Mop/s     +2.9%
>>>     12          2576.4 Mop/s    2674.6 Mop/s     +3.8%
>> Waiman, could you run those numbers again but with the below 'fixed'?
>>
>>> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>>          * observe its next->locked value and advance itself.
>>>          *
>>>          * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>> +        *
>>> +        * The write to next->locked in arch_mcs_spin_unlock_contended()
>>> +        * must be ordered before the read of pn->state in the cmpxchg()
>>> +        * below for the code to work correctly. However, this is not
>>> +        * guaranteed on all architectures when the cmpxchg() call fails.
>>> +        * Both x86 and PPC can provide that guarantee, but other
>>> +        * architectures not necessarily.
>>>          */
>>	smp_mb();
>>
>>> 	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>> 		return;
>> Ideally this Power CPU can optimize back-to-back SYNC instructions, but
>> who knows...
> Yes, I can run the numbers again. However, the changes here are in the
> slowpath. My current patch optimizes the fast path only, and my original
> test doesn't stress the slowpath at all, I think. I will have to make
> some changes to stress the slowpath.

Looking at past emails, I remember why I put the comment there. Putting
an smp_mb() here will definitely have a negative performance impact on
x86. So I put in the comment here to remind me that the current code may
not work for ARM64.

To fix that, my current thought is to have a cmpxchg variant that
guarantees ordering for both success and failure, for example,
cmpxchg_ordered(). In that way, we only need to insert the barrier for
architectures that need it. That will be a separate patch instead of
integrating into this one.
Cheers, Longman
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 08/10/2017 07:50 AM, Peter Zijlstra wrote:
> On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
>> # of thread     w/o patch       with patch      % Change
>> -----------     ---------       ----------      --------
>>      4          4053.3 Mop/s    4223.7 Mop/s     +4.2%
>>      8          3310.4 Mop/s    3406.0 Mop/s     +2.9%
>>     12          2576.4 Mop/s    2674.6 Mop/s     +3.8%
> Waiman, could you run those numbers again but with the below 'fixed'?
>
>> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>          * observe its next->locked value and advance itself.
>>          *
>>          * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>> +        *
>> +        * The write to next->locked in arch_mcs_spin_unlock_contended()
>> +        * must be ordered before the read of pn->state in the cmpxchg()
>> +        * below for the code to work correctly. However, this is not
>> +        * guaranteed on all architectures when the cmpxchg() call fails.
>> +        * Both x86 and PPC can provide that guarantee, but other
>> +        * architectures not necessarily.
>>          */
>	smp_mb();
>
>> 	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>> 		return;
> Ideally this Power CPU can optimize back-to-back SYNC instructions, but
> who knows...

Yes, I can run the numbers again. However, the changes here are in the
slowpath. My current patch optimizes the fast path only, and my original
test doesn't stress the slowpath at all, I think. I will have to make
some changes to stress the slowpath.

Cheers,
Longman
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
>
> # of thread     w/o patch       with patch      % Change
> -----------     ---------       ----------      --------
>      4          4053.3 Mop/s    4223.7 Mop/s     +4.2%
>      8          3310.4 Mop/s    3406.0 Mop/s     +2.9%
>     12          2576.4 Mop/s    2674.6 Mop/s     +3.8%

Waiman, could you run those numbers again but with the below 'fixed'?

> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>          * observe its next->locked value and advance itself.
>          *
>          * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +        *
> +        * The write to next->locked in arch_mcs_spin_unlock_contended()
> +        * must be ordered before the read of pn->state in the cmpxchg()
> +        * below for the code to work correctly. However, this is not
> +        * guaranteed on all architectures when the cmpxchg() call fails.
> +        * Both x86 and PPC can provide that guarantee, but other
> +        * architectures not necessarily.
>          */

	smp_mb();

> 	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> 		return;

Ideally this Power CPU can optimize back-to-back SYNC instructions, but
who knows...
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Thu, Aug 10, 2017 at 04:12:13PM +0800, Boqun Feng wrote:
> > Or is the reason this doesn't work on PPC that it's RCpc?

So that :-)

> Here is an example why PPC needs a sync() before the cmpxchg():
>
> https://marc.info/?l=linux-kernel&m=144485396224519&w=2
>
> and Paul McKenney's detailed explanation about why this could happen:
>
> https://marc.info/?l=linux-kernel&m=144485909826241&w=2
>
> (Somehow, I feel like he was answering a similar question to the one
> you ask here ;-))

Yes, and I had vague memories of having gone over this before, but
couldn't quickly find things. Thanks!

> And I think aarch64 doesn't have a problem here because it is "(other)
> multi-copy atomic". Will?

Right, it's the RCpc vs RCsc thing. The ARM64 release is, as you say,
multi-copy atomic, whereas the PPC lwsync is not.

This still leaves us with the situation that we need an smp_mb() between
smp_store_release() and a possibly failing cmpxchg() if we want to
guarantee the cmpxchg()'s load comes after the store-release.
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Wed, Aug 09, 2017 at 05:15:33PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 09, 2017 at 05:06:03PM +0200, Peter Zijlstra wrote:
> > Now, ARM64 for instance plays funny games, it does something along the
> > lines of:
> >
> >	cmpxchg(ptr, old, new)
> >	{
> >		do {
> >			r = LL(ptr);
> >			if (r != old)
> >				return r; /* no barriers */
> >			r = new;
> >		} while (SC_release(ptr, r));
> >		smp_mb();
> >		return r;
> >	}
> >
> > Thereby ordering things relative to the store on ptr, but the load can
> > very much escape. The thinking is that if success, we must observe the
> > latest value of ptr, but even in that case the load is not ordered and
> > could happen before.
> >
> > However, since we're guaranteed to observe the latest value of ptr (on
> > success) it doesn't matter if we reordered the load, there is no newer
> > value possible.
> >
> > So heaps of tricky, but correct afaict. Will?
>
> And could not PPC do something similar:
>
>	cmpxchg(ptr, old, new)
>	{
>		lwsync();
>		do {
>			r = LL(ptr);
>			if (r != old)
>				return r;
>			r = new;
>		} while (SC(ptr, r));
>		sync();
>		return r;
>	}
>
> ?
>
> the lwsync would make it store-release on SC with similar reasoning as
> above.
>
> And lwsync allows 'stores reordered after loads', which allows the prior
> smp_store_release() to leak past.
>
> Or is the reason this doesn't work on PPC that it's RCpc?

Here is an example why PPC needs a sync() before the cmpxchg():

https://marc.info/?l=linux-kernel&m=144485396224519&w=2

and Paul McKenney's detailed explanation about why this could happen:

https://marc.info/?l=linux-kernel&m=144485909826241&w=2

(Somehow, I feel like he was answering a similar question to the one you
ask here ;-))

And I think aarch64 doesn't have a problem here because it is "(other)
multi-copy atomic". Will?

Regards,
Boqun
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Wed, Aug 09, 2017 at 05:06:03PM +0200, Peter Zijlstra wrote:
> Now, ARM64 for instance plays funny games; it does something along the
> lines of:
>
> 	cmpxchg(ptr, old, new)
> 	{
> 		do {
> 			r = LL(ptr);
> 			if (r != old)
> 				return r; /* no barriers */
> 			r = new;
> 		} while (SC_release(ptr, r));
> 		smp_mb();
> 		return r;
> 	}
>
> Thereby ordering things relative to the store on ptr, but the load can
> very much escape. The thinking is that on success we must observe the
> latest value of ptr, but even in that case the load is not ordered and
> could happen before.
>
> However, since we're guaranteed to observe the latest value of ptr (on
> success) it doesn't matter if we reordered the load; there is no newer
> value possible.
>
> So heaps of tricky, but correct afaict. Will?

And could not PPC do something similar:

	cmpxchg(ptr, old, new)
	{
		lwsync();
		do {
			r = LL(ptr);
			if (r != old)
				return r;
			r = new;
		} while (SC(ptr, r));
		sync();
		return r;
	}

?

The lwsync would make it store-release on the SC, with similar reasoning
as above.

And lwsync allows 'stores reordered after loads', which allows the prior
smp_store_release() to leak past.

Or is the reason this doesn't work on PPC that it's RCpc?
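As a rough C11 analogue (my own sketch, not kernel or ARM64 code): the ARM64 scheme described above, release ordering on the store-conditional plus a trailing full barrier on success, with a completely bare failure path, looks like:

```c
#include <stdatomic.h>
#include <assert.h>

/* Sketch of the ARM64-style scheme: release on the successful store plus
 * a trailing full barrier, and no barriers at all on the failure path.
 * Illustrative only; the real implementation is LL/SC assembly. */
static int cmpxchg_arm64_sketch(atomic_int *ptr, int old, int new_val)
{
	int expected = old;
	if (atomic_compare_exchange_strong_explicit(ptr, &expected, new_val,
						    memory_order_release,
						    memory_order_relaxed)) {
		atomic_thread_fence(memory_order_seq_cst); /* the smp_mb() */
	}
	return expected; /* on failure: current value, no ordering implied */
}
```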
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> 	 * observe its next->locked value and advance itself.
> 	 *
> 	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +	 *
> +	 * The write to next->locked in arch_mcs_spin_unlock_contended()
> +	 * must be ordered before the read of pn->state in the cmpxchg()
> +	 * below for the code to work correctly. However, this is not
> +	 * guaranteed on all architectures when the cmpxchg() call fails.
> +	 * Both x86 and PPC can provide that guarantee, but other
> +	 * architectures not necessarily.
> 	 */
> 	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> 		return;

Instead of documenting this, should we not fix it properly?

So what we want is to order:

	smp_store_release(&x, 1);
	cmpxchg(&y, 0, 1);

Such that the store to x is before the load of y.

Now, we document cmpxchg() to have smp_mb() before and smp_mb() after
(if success). So per that definition, there would appear to be no way
the load of y can be reordered before the store to x.

Now, ARM64 for instance plays funny games; it does something along the
lines of:

	cmpxchg(ptr, old, new)
	{
		do {
			r = LL(ptr);
			if (r != old)
				return r; /* no barriers */
			r = new;
		} while (SC_release(ptr, r));
		smp_mb();
		return r;
	}

Thereby ordering things relative to the store on ptr, but the load can
very much escape. The thinking is that on success we must observe the
latest value of ptr, but even in that case the load is not ordered and
could happen before.

However, since we're guaranteed to observe the latest value of ptr (on
success) it doesn't matter if we reordered the load; there is no newer
value possible.

So heaps of tricky, but correct afaict. Will?

Of course, since we need that load to be ordered even in case of a
failed cmpxchg() we _should_ add an unconditional smp_mb() here. Which
I understand you not wanting to do.
Even smp_mb__before_atomic() is no help, because that's smp_mb() for both PPC and ARM64.
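A minimal sketch (mine, with the hypothetical variable names x and y from the example above) of the unconditional-smp_mb() fix Peter mentions: a full fence between the release store and the cmpxchg orders the store before the load inside the CAS even when the CAS fails:

```c
#include <stdatomic.h>
#include <assert.h>

atomic_int x, y;	/* hypothetical names, as in the example above */

/* The pattern that must be ordered: publish x, then CAS on y. The full
 * fence makes the store to x visible before the load inside the CAS,
 * regardless of whether the CAS succeeds or fails. */
static void publish_then_cas(void)
{
	atomic_store_explicit(&x, 1, memory_order_release); /* smp_store_release(&x, 1) */
	atomic_thread_fence(memory_order_seq_cst);	    /* unconditional smp_mb() */
	int expected = 0;
	atomic_compare_exchange_strong_explicit(&y, &expected, 1,
						memory_order_relaxed,
						memory_order_relaxed);
}
```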
Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
On 05/24/2017 09:38 AM, Waiman Long wrote:
> All the locking related cmpxchg's in the following functions are
> replaced with the _acquire variants:
>  - pv_queued_spin_steal_lock()
>  - trylock_clear_pending()
>
> This change should help performance on architectures that use LL/SC.
>
> On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> enabled, the performance of a locking microbenchmark with and without
> this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> were as follows:
>
>   # of thread     w/o patch      with patch     % Change
>   -----------     ---------      ----------     --------
>        4        4053.3 Mop/s    4223.7 Mop/s     +4.2%
>        8        3310.4 Mop/s    3406.0 Mop/s     +2.9%
>       12        2576.4 Mop/s    2674.6 Mop/s     +3.8%
>
> Signed-off-by: Waiman Long
> ---
> v4->v5:
>  - Correct some grammatical issues in comment.
>
> v3->v4:
>  - Update the comment in pv_kick_node() to mention that the code
>    may not work in some archs.
>
> v2->v3:
>  - Reduce scope by relaxing cmpxchg's in fast path only.
>
> v1->v2:
>  - Add comments in changelog and code for the rationale of the change.
>
>  kernel/locking/qspinlock_paravirt.h | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index e6b2f7a..4614e39 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
> 	struct __qspinlock *l = (void *)lock;
>
> 	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> 		qstat_inc(qstat_pv_lock_stealing, true);
> 		return true;
> 	}
> @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
>
>  /*
>   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> - * just to be sure that it will get it.
> + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> + * lock just to be sure that it will get it.
>   */
>  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>  {
> 	struct __qspinlock *l = (void *)lock;
>
> 	return !READ_ONCE(l->locked) &&
> -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> -			== _Q_PENDING_VAL);
> +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> +				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
>  }
>  #else /* _Q_PENDING_BITS == 8 */
>  static __always_inline void set_pending(struct qspinlock *lock)
> @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> 	 */
> 	old = val;
> 	new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> -	val = atomic_cmpxchg(&lock->val, old, new);
> +	val = atomic_cmpxchg_acquire(&lock->val, old, new);
>
> 	if (val == old)
> 		return 1;
> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> 	 * observe its next->locked value and advance itself.
> 	 *
> 	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +	 *
> +	 * The write to next->locked in arch_mcs_spin_unlock_contended()
> +	 * must be ordered before the read of pn->state in the cmpxchg()
> +	 * below for the code to work correctly. However, this is not
> +	 * guaranteed on all architectures when the cmpxchg() call fails.
> +	 * Both x86 and PPC can provide that guarantee, but other
> +	 * architectures not necessarily.
> 	 */
> 	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> 		return;

Hi,

Is this patch OK to be merged, or are more testing or changes needed?

Cheers,
Longman
[RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
All the locking related cmpxchg's in the following functions are
replaced with the _acquire variants:
 - pv_queued_spin_steal_lock()
 - trylock_clear_pending()

This change should help performance on architectures that use LL/SC.

On a 2-core 16-thread Power8 system with pvqspinlock explicitly
enabled, the performance of a locking microbenchmark with and without
this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
were as follows:

  # of thread     w/o patch      with patch     % Change
  -----------     ---------      ----------     --------
       4        4053.3 Mop/s    4223.7 Mop/s     +4.2%
       8        3310.4 Mop/s    3406.0 Mop/s     +2.9%
      12        2576.4 Mop/s    2674.6 Mop/s     +3.8%

Signed-off-by: Waiman Long
---
v4->v5:
 - Correct some grammatical issues in comment.

v3->v4:
 - Update the comment in pv_kick_node() to mention that the code
   may not work in some archs.

v2->v3:
 - Reduce scope by relaxing cmpxchg's in fast path only.

v1->v2:
 - Add comments in changelog and code for the rationale of the change.

 kernel/locking/qspinlock_paravirt.h | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e6b2f7a..4614e39 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
 	struct __qspinlock *l = (void *)lock;
 
 	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
-	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
 		qstat_inc(qstat_pv_lock_stealing, true);
 		return true;
 	}
@@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
 
 /*
  * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
- * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
- * just to be sure that it will get it.
+ * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
+ * lock just to be sure that it will get it.
  */
 static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 {
 	struct __qspinlock *l = (void *)lock;
 
 	return !READ_ONCE(l->locked) &&
-	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
-			== _Q_PENDING_VAL);
+	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
+				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
 }
 #else /* _Q_PENDING_BITS == 8 */
 static __always_inline void set_pending(struct qspinlock *lock)
@@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 	 */
 	old = val;
 	new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-	val = atomic_cmpxchg(&lock->val, old, new);
+	val = atomic_cmpxchg_acquire(&lock->val, old, new);
 
 	if (val == old)
 		return 1;
@@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	 * observe its next->locked value and advance itself.
 	 *
 	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+	 *
+	 * The write to next->locked in arch_mcs_spin_unlock_contended()
+	 * must be ordered before the read of pn->state in the cmpxchg()
+	 * below for the code to work correctly. However, this is not
+	 * guaranteed on all architectures when the cmpxchg() call fails.
+	 * Both x86 and PPC can provide that guarantee, but other
+	 * architectures not necessarily.
 	 */
 	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
 		return;
--
1.8.3.1
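As a self-contained illustration of the relaxation in this patch (a userspace sketch of mine, not the kernel code): trylock_clear_pending() only needs acquire ordering on the successful lock transition, which C11 can express directly with a success/failure memory-order pair:

```c
#include <stdatomic.h>
#include <assert.h>

/* Assumed values mirroring the qspinlock byte layout for illustration. */
#define _Q_LOCKED_VAL	1U
#define _Q_PENDING_VAL	(1U << 8)

/* Userspace sketch of the trylock-clear-pending transition: atomically
 * turn "pending" into "locked". Acquire on success is all a lock
 * acquisition needs; the failure path can stay relaxed, which is the
 * whole point of using cmpxchg_acquire() in the patch. */
static int trylock_clear_pending_sketch(atomic_uint *locked_pending)
{
	unsigned int expected = _Q_PENDING_VAL;
	return atomic_compare_exchange_strong_explicit(locked_pending,
						       &expected,
						       _Q_LOCKED_VAL,
						       memory_order_acquire,
						       memory_order_relaxed);
}
```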
[RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
All the locking related cmpxchg's in the following functions are replaced with the _acquire variants: - pv_queued_spin_steal_lock() - trylock_clear_pending() This change should help performance on architectures that use LL/SC. On a 2-core 16-thread Power8 system with pvqspinlock explicitly enabled, the performance of a locking microbenchmark with and without this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch were as follows: # of thread w/o patchwith patch % Change --- --- 4 4053.3 Mop/s 4223.7 Mop/s +4.2% 8 3310.4 Mop/s 3406.0 Mop/s +2.9% 12 2576.4 Mop/s 2674.6 Mop/s +3.8% Signed-off-by: Waiman Long --- v4->v5: - Correct some grammatical issues in comment. v3->v4: - Update the comment in pv_kick_node() to mention that the code may not work in some archs. v2->v3: - Reduce scope by relaxing cmpxchg's in fast path only. v1->v2: - Add comments in changelog and code for the rationale of the change. kernel/locking/qspinlock_paravirt.h | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h index e6b2f7a..4614e39 100644 --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock) struct __qspinlock *l = (void *)lock; if (!(atomic_read(>val) & _Q_LOCKED_PENDING_MASK) && - (cmpxchg(>locked, 0, _Q_LOCKED_VAL) == 0)) { + (cmpxchg_acquire(>locked, 0, _Q_LOCKED_VAL) == 0)) { qstat_inc(qstat_pv_lock_stealing, true); return true; } @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock) /* * The pending bit check in pv_queued_spin_steal_lock() isn't a memory - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock - * just to be sure that it will get it. + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the + * lock just to be sure that it will get it. 
*/ static __always_inline int trylock_clear_pending(struct qspinlock *lock) { struct __qspinlock *l = (void *)lock; return !READ_ONCE(l->locked) && - (cmpxchg(>locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL) - == _Q_PENDING_VAL); + (cmpxchg_acquire(>locked_pending, _Q_PENDING_VAL, + _Q_LOCKED_VAL) == _Q_PENDING_VAL); } #else /* _Q_PENDING_BITS == 8 */ static __always_inline void set_pending(struct qspinlock *lock) @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock) */ old = val; new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; - val = atomic_cmpxchg(>val, old, new); + val = atomic_cmpxchg_acquire(>val, old, new); if (val == old) return 1; @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node) * observe its next->locked value and advance itself. * * Matches with smp_store_mb() and cmpxchg() in pv_wait_node() +* +* The write to next->locked in arch_mcs_spin_unlock_contended() +* must be ordered before the read of pn->state in the cmpxchg() +* below for the code to work correctly. However, this is not +* guaranteed on all architectures when the cmpxchg() call fails. +* Both x86 and PPC can provide that guarantee, but other +* architectures not necessarily. */ if (cmpxchg(>state, vcpu_halted, vcpu_hashed) != vcpu_halted) return; -- 1.8.3.1