Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-22 Thread Waiman Long
On 08/21/2017 03:42 PM, Peter Zijlstra wrote:
> On Mon, Aug 21, 2017 at 09:25:50PM +0200, Peter Zijlstra wrote:
>> On Mon, Aug 21, 2017 at 07:00:02PM +0100, Will Deacon wrote:
>>>> No, I meant _from_ the LL load, not _to_ a later load.
>>> Sorry, I'm still not following enough to give you a definitive answer on
>>> that. Could you give an example, please? These sequences usually run in
>>> a loop, so the conditional branch back (based on the status flag) is where
>>> the read-after-read comes in.
>>>
>>> Any control dependencies from the loaded data exist regardless of the status
>>> flag.
>> Basically what Waiman ended up doing, something like:
>>
>>         if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>                 return;
>>
>> WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>>
>> Where the STORE depends on the LL value being 'complete'.
>>

pn->state == vcpu_halted is the prerequisite for putting _Q_SLOW_VAL into
the lock. The order in which vcpu_hashed is written into pn->state doesn't
really matter. The cmpxchg_relaxed() here should synchronize with the
cmpxchg() in pv_wait_node().

Cheers,
Longman



Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-22 Thread Will Deacon
On Mon, Aug 21, 2017 at 09:25:50PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 21, 2017 at 07:00:02PM +0100, Will Deacon wrote:
> > > No, I meant _from_ the LL load, not _to_ a later load.
> > 
> > Sorry, I'm still not following enough to give you a definitive answer on
> > that. Could you give an example, please? These sequences usually run in
> > a loop, so the conditional branch back (based on the status flag) is where
> > the read-after-read comes in.
> > 
> > Any control dependencies from the loaded data exist regardless of the status
> > flag.
> 
> Basically what Waiman ended up doing, something like:
> 
> >         if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> >                 return;
> 
> WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> 
> Where the STORE depends on the LL value being 'complete'.

Yup, that's ordered as you would expect. Thanks for the example!

Will


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-21 Thread Peter Zijlstra
On Mon, Aug 21, 2017 at 09:25:50PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 21, 2017 at 07:00:02PM +0100, Will Deacon wrote:
> > > No, I meant _from_ the LL load, not _to_ a later load.
> > 
> > Sorry, I'm still not following enough to give you a definitive answer on
> > that. Could you give an example, please? These sequences usually run in
> > a loop, so the conditional branch back (based on the status flag) is where
> > the read-after-read comes in.
> > 
> > Any control dependencies from the loaded data exist regardless of the status
> > flag.
> 
> Basically what Waiman ended up doing, something like:
> 
>         if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>                 return;
> 
> WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> 
> Where the STORE depends on the LL value being 'complete'.
> 
> 
> For any RmW we can only create a control dependency from the LOAD. The
> same could be done for something like:
> 
>         if (atomic_inc_not_zero(&obj->refs))
>                 WRITE_ONCE(obj->foo, 1);

Obviously I meant the hypothetical atomic_inc_not_zero_relaxed() here,
otherwise all the implied smp_mb() spoil the game.

> Where we only do the STORE if we acquire the reference. While the
> WRITE_ONCE() will not be ordered against the increment, it is ordered
> against the LL and we know it must not be 0.
> 
> Per the LL/SC loop we'll have observed a !0 value and committed the SC
> (which need not be visible or ordered against any later store) but both
> STORES (SC and the WRITE_ONCE) must be after the ->refs LOAD.


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-21 Thread Peter Zijlstra
On Mon, Aug 21, 2017 at 07:00:02PM +0100, Will Deacon wrote:
> > No, I meant _from_ the LL load, not _to_ a later load.
> 
> Sorry, I'm still not following enough to give you a definitive answer on
> that. Could you give an example, please? These sequences usually run in
> a loop, so the conditional branch back (based on the status flag) is where
> the read-after-read comes in.
> 
> Any control dependencies from the loaded data exist regardless of the status
> flag.

Basically what Waiman ended up doing, something like:

        if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
                return;

WRITE_ONCE(l->locked, _Q_SLOW_VAL);

Where the STORE depends on the LL value being 'complete'.


For any RmW we can only create a control dependency from the LOAD. The
same could be done for something like:

        if (atomic_inc_not_zero(&obj->refs))
                WRITE_ONCE(obj->foo, 1);

Where we only do the STORE if we acquire the reference. While the
WRITE_ONCE() will not be ordered against the increment, it is ordered
against the LL and we know it must not be 0.

Per the LL/SC loop we'll have observed a !0 value and committed the SC
(which need not be visible or ordered against any later store) but both
STORES (SC and the WRITE_ONCE) must be after the ->refs LOAD.
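
For illustration, the LL/SC shape being described can be modelled in standalone
C11; ll() and sc() below are hypothetical helpers standing in for the
architecture's load-linked/store-conditional instructions, not kernel or libc
APIs, and the whole block is a sketch rather than any real implementation:

        #include <stdatomic.h>
        #include <stdbool.h>

        /* Hypothetical model of LL/SC on top of C11 atomics (illustration only). */
        static int ll(atomic_int *p)
        {
                return atomic_load_explicit(p, memory_order_relaxed);
        }

        static bool sc(atomic_int *p, int old, int new_val)
        {
                /* A weak CAS models an SC that may fail spuriously. */
                return atomic_compare_exchange_weak_explicit(p, &old, new_val,
                                memory_order_relaxed, memory_order_relaxed);
        }

        /* A relaxed cmpxchg built from the model above. */
        static int cmpxchg_relaxed_model(atomic_int *p, int old, int new_val)
        {
                int val;

                do {
                        val = ll(p);            /* LL: the load the control dependency hangs off */
                        if (val != old)
                                return val;     /* branch decided by the loaded value */
                } while (!sc(p, old, new_val)); /* SC: its success gives no extra LOAD->STORE order */

                return val;
        }

        /*
         * Caller: per the discussion above, the later store is ordered after the
         * LL load by the control dependency on the loaded value, but not against
         * the SC store itself.
         */
        static void example(atomic_int *state, atomic_int *locked,
                            int halted, int hashed, int slow_val)
        {
                if (cmpxchg_relaxed_model(state, halted, hashed) != halted)
                        return;

                atomic_store_explicit(locked, slow_val, memory_order_relaxed);
        }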


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-21 Thread Will Deacon
On Mon, Aug 21, 2017 at 12:55:08PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 15, 2017 at 07:40:35PM +0100, Will Deacon wrote:
> > On Mon, Aug 14, 2017 at 08:47:11PM +0200, Peter Zijlstra wrote:
> > > On Mon, Aug 14, 2017 at 01:01:22PM +0100, Will Deacon wrote:
> > > > Yeah, that's right, you can't use the STXR status flag to create control
> > > > dependencies.
> > > 
> > > Just for my elucidation; you can't use it to create a control dependency
> > > on the store, but you can use it to create a control dependency on the
> > > corresponding load, right?
> > 
> > Hmm, sort of, but I'd say that the reads are really ordered due to
> > read-after-read ordering in that case. Control dependencies to loads
> > don't give you order.
> 
> No, I meant _from_ the LL load, not _to_ a later load.

Sorry, I'm still not following enough to give you a definitive answer on
that. Could you give an example, please? These sequences usually run in
a loop, so the conditional branch back (based on the status flag) is where
the read-after-read comes in.

Any control dependencies from the loaded data exist regardless of the status
flag.

> > > Now, IIRC, we've defined control dependencies as being LOAD->STORE
> > > ordering, so in that respect nothing is lost. But maybe we should
> > > explicitly mention that if the LOAD is part of an (otherwise) atomic RmW
> > > the STORE is not constrained.
> > 
> > I could well be misreading your suggestion, but it feels like that's too
> > weak. You can definitely still have control dependencies off the LL part
> > of the LL/SC pair, just not off the SC part.
> > 
> > E.g. this version of LB is forbidden on arm64:
> > 
> > P0:
> > if (atomic_inc_return_relaxed(&x) == 2)
> >         atomic_set(&y, 1);
> > 
> > P1:
> > if (atomic_inc_return_relaxed(&y) == 2)
> >         atomic_set(&x, 1);
> > 
> > Perhaps when you say "the STORE", you mean the store in the atomic RmW,
> > rather than the store in the LOAD->STORE control dependency?
> 
> Yes. So I was looking to exclude (SC) STORE -> STORE order through
> control dependencies.

Ok, good.

Will


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-21 Thread Peter Zijlstra
On Tue, Aug 15, 2017 at 07:40:35PM +0100, Will Deacon wrote:
> On Mon, Aug 14, 2017 at 08:47:11PM +0200, Peter Zijlstra wrote:
> > On Mon, Aug 14, 2017 at 01:01:22PM +0100, Will Deacon wrote:
> > > Yeah, that's right, you can't use the STXR status flag to create control
> > > dependencies.
> > 
> > Just for my elucidation; you can't use it to create a control dependency
> > on the store, but you can use it to create a control dependency on the
> > corresponding load, right?
> 
> Hmm, sort of, but I'd say that the reads are really ordered due to
> read-after-read ordering in that case. Control dependencies to loads
> don't give you order.

No, I meant _from_ the LL load, not _to_ a later load.

> > Now, IIRC, we've defined control dependencies as being LOAD->STORE
> > ordering, so in that respect nothing is lost. But maybe we should
> > explicitly mention that if the LOAD is part of an (otherwise) atomic RmW
> > the STORE is not constrained.
> 
> I could well be misreading your suggestion, but it feels like that's too
> weak. You can definitely still have control dependencies off the LL part
> of the LL/SC pair, just not off the SC part.
> 
> E.g. this version of LB is forbidden on arm64:
> 
> P0:
> if (atomic_inc_return_relaxed(&x) == 2)
>         atomic_set(&y, 1);
> 
> P1:
> if (atomic_inc_return_relaxed(&y) == 2)
>         atomic_set(&x, 1);
> 
> Perhaps when you say "the STORE", you mean the store in the atomic RmW,
> rather than the store in the LOAD->STORE control dependency?

Yes. So I was looking to exclude (SC) STORE -> STORE order through
control dependencies.




Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-15 Thread Will Deacon
On Mon, Aug 14, 2017 at 08:47:11PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 14, 2017 at 01:01:22PM +0100, Will Deacon wrote:
> > Yeah, that's right, you can't use the STXR status flag to create control
> > dependencies.
> 
> Just for my elucidation; you can't use it to create a control dependency
> on the store, but you can use it to create a control dependency on the
> corresponding load, right?

Hmm, sort of, but I'd say that the reads are really ordered due to
read-after-read ordering in that case. Control dependencies to loads
don't give you order.

> Now, IIRC, we've defined control dependencies as being LOAD->STORE
> ordering, so in that respect nothing is lost. But maybe we should
> explicitly mention that if the LOAD is part of an (otherwise) atomic RmW
> the STORE is not constrained.

I could well be misreading your suggestion, but it feels like that's too
weak. You can definitely still have control dependencies off the LL part
of the LL/SC pair, just not off the SC part.

E.g. this version of LB is forbidden on arm64:

P0:
if (atomic_inc_return_relaxed(&x) == 2)
        atomic_set(&y, 1);

P1:
if (atomic_inc_return_relaxed(&y) == 2)
        atomic_set(&x, 1);

Perhaps when you say "the STORE", you mean the store in the atomic RmW,
rather than the store in the LOAD->STORE control dependency?

Will
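
Spelled out, with both counters initially zero:

        /*
         * Each atomic_inc_return_relaxed() can only return 2 if the other
         * CPU's atomic_set(..., 1) has already been observed, and each
         * atomic_set() runs only after its own increment returned 2. Both
         * increments returning 2 would therefore need each conditional store
         * to take effect before the load it is control-dependent on -- the
         * load-buffering cycle that the control dependency off the LL load
         * forbids on arm64.
         */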


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-14 Thread Peter Zijlstra
On Mon, Aug 14, 2017 at 01:01:22PM +0100, Will Deacon wrote:
> Yeah, that's right, you can't use the STXR status flag to create control
> dependencies.

Just for my elucidation; you can't use it to create a control dependency
on the store, but you can use it to create a control dependency on the
corresponding load, right?

Now, IIRC, we've defined control dependencies as being LOAD->STORE
ordering, so in that respect nothing is lost. But maybe we should
explicitly mention that if the LOAD is part of an (otherwise) atomic RmW
the STORE is not constrained.


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-14 Thread Will Deacon
On Mon, Aug 14, 2017 at 11:01:10AM -0400, Waiman Long wrote:
> On 08/14/2017 08:01 AM, Will Deacon wrote:
> > On Fri, Aug 11, 2017 at 11:06:01AM +0200, Peter Zijlstra wrote:
> >> On Thu, Aug 10, 2017 at 02:18:30PM -0400, Waiman Long wrote:
> >>> On 08/10/2017 12:22 PM, Waiman Long wrote:
> >>>> On 08/10/2017 12:15 PM, Peter Zijlstra wrote:
> >>>>> Might as well do an explicit:
> >>>>>
> >>>>>         smp_mb__before_atomic()
> >>>>>         cmpxchg_relaxed()
> >>>>>         smp_mb__after_atomic()
> >>>>>
> >>>>> I suppose and not introduce new primitives.
> >>> I think we don't need smp_mb__after_atomic(). The read has to be fully
> >>> ordered, but the write part may not need it as the control dependency of
> >>> the old value should guard against incorrect action. Right?
> >> You'd think that, but IIRC there was something funny about using the SC
> >> return flag for control dependencies. Will?
> > Yeah, that's right, you can't use the STXR status flag to create control
> > dependencies.
> >
> > Will
> 
> Actually, the code sequence that I plan to use is:
> 
>         smp_mb__before_atomic();
>         if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>                 return;
> 
>         WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>         (void)pv_hash(lock, pn);
> 
> I am planning to use the comparison of the returned value (pn->state)
> against vcpu_halted as the control dependency. I don't see how the status
> flag of STXR is affecting this.

Thanks for the context. I agree that you've got a control dependency in this
case, so the WRITE_ONCE will be ordered after the LL part of the cmpxchg. It
could still be reordered with respect to the write part, however.

Will
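
For reference, the ordering agreed on here, annotated onto the sequence quoted
above (an annotated restatement, not new code; the comments paraphrase this
exchange):

        smp_mb__before_atomic();        /* order the earlier next->locked write before the cmpxchg's load */
        if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
                return;                 /* branch on the value returned by the LL part */

        /*
         * Ordered after the LL load of pn->state by the control dependency on
         * the returned value; not necessarily ordered against the store (SC)
         * part of the cmpxchg.
         */
        WRITE_ONCE(l->locked, _Q_SLOW_VAL);
        (void)pv_hash(lock, pn);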


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-14 Thread Waiman Long
On 08/14/2017 08:01 AM, Will Deacon wrote:
> On Fri, Aug 11, 2017 at 11:06:01AM +0200, Peter Zijlstra wrote:
>> On Thu, Aug 10, 2017 at 02:18:30PM -0400, Waiman Long wrote:
>>> On 08/10/2017 12:22 PM, Waiman Long wrote:
>>>> On 08/10/2017 12:15 PM, Peter Zijlstra wrote:
>>>>> Might as well do an explicit:
>>>>>
>>>>>         smp_mb__before_atomic()
>>>>>         cmpxchg_relaxed()
>>>>>         smp_mb__after_atomic()
>>>>>
>>>>> I suppose and not introduce new primitives.
>>> I think we don't need smp_mb__after_atomic(). The read has to be fully
>>> ordered, but the write part may not need it as the control dependency of
>>> the old value should guard against incorrect action. Right?
>> You'd think that, but IIRC there was something funny about using the SC
>> return flag for control dependencies. Will?
> Yeah, that's right, you can't use the STXR status flag to create control
> dependencies.
>
> Will

Actually, the code sequence that I plan to use is:

        smp_mb__before_atomic();
        if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
                return;

        WRITE_ONCE(l->locked, _Q_SLOW_VAL);
        (void)pv_hash(lock, pn);

I am planning to use the comparison of the returned value (pn->state)
against vcpu_halted as the control dependency. I don't see how the status
flag of STXR is affecting this.

Cheers,
Longman




Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-14 Thread Will Deacon
On Fri, Aug 11, 2017 at 11:06:01AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 02:18:30PM -0400, Waiman Long wrote:
> > On 08/10/2017 12:22 PM, Waiman Long wrote:
> > > On 08/10/2017 12:15 PM, Peter Zijlstra wrote:
> 
> > >> Might as well do an explicit:
> > >>
> > >>  smp_mb__before_atomic()
> > >>  cmpxchg_relaxed()
> > >>  smp_mb__after_atomic()
> > >>
> > >> I suppose and not introduce new primitives.
> > 
> > I think we don't need smp_mb__after_atomic(). The read has to be fully
> > ordered, but the write part may not need it as the control dependency of
> > the old value should guard against incorrect action. Right?
> 
> You'd think that, but IIRC there was something funny about using the SC
> return flag for control dependencies. Will?

Yeah, that's right, you can't use the STXR status flag to create control
dependencies.

Will


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-11 Thread Peter Zijlstra
On Thu, Aug 10, 2017 at 02:18:30PM -0400, Waiman Long wrote:
> On 08/10/2017 12:22 PM, Waiman Long wrote:
> > On 08/10/2017 12:15 PM, Peter Zijlstra wrote:

> >> Might as well do an explicit:
> >>
> >>smp_mb__before_atomic()
> >>cmpxchg_relaxed()
> >>smp_mb__after_atomic()
> >>
> >> I suppose and not introduce new primitives.
> 
> I think we don't need smp_mb__after_atomic(). The read has to be fully
> ordered, but the write part may not need it as the control dependency of
> the old value should guard against incorrect action. Right?

You'd think that, but IIRC there was something funny about using the SC
return flag for control dependencies. Will?


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-10 Thread Paul E. McKenney
On Thu, Aug 10, 2017 at 11:13:17AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 04:12:13PM +0800, Boqun Feng wrote:
> 
> > > Or is the reason this doesn't work on PPC that it's RCpc?
> 
> So that :-)
> 
> > Here is an example why PPC needs a sync() before the cmpxchg():
> > 
> > https://marc.info/?l=linux-kernel&m=144485396224519&w=2
> > 
> > and Paul McKenney's detailed explanation about why this could happen:
> > 
> > https://marc.info/?l=linux-kernel&m=144485909826241&w=2
> > 
> > (Somehow, I feel like he was answering a similar question to the one you
> > ask here ;-))
> 
> Yes, and I had vague memories of having gone over this before, but
> couldn't quickly find things. Thanks!
> 
> > And I think aarch64 doesn't have a problem here because it is "(other)
> > multi-copy atomic". Will?
> 
> Right, it's the RCpc vs RCsc thing. The ARM64 release is, as you say,
> multi-copy atomic, whereas the PPC lwsync is not.
> 
> This still leaves us with the situation that we need an smp_mb() between
> smp_store_release() and a possibly failing cmpxchg() if we want to
> guarantee the cmpxchg()'s load comes after the store-release.

For whatever it is worth, this is why C11 allows specifying one
memory-order strength for the success case and another for the failure
case.  But it is not immediately clear that we need another level
of combinatorial API explosion...

Thanx, Paul
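
For reference, a minimal standalone C11 sketch of the facility mentioned above
(hypothetical function name, not kernel code):

        #include <stdatomic.h>
        #include <stdbool.h>

        /* C11 compare-exchange takes one memory order for the success path and
         * a (no stronger) one for the failure path. */
        static bool try_mark_hashed(atomic_int *state, int halted, int hashed)
        {
                int expected = halted;

                return atomic_compare_exchange_strong_explicit(state, &expected, hashed,
                                memory_order_seq_cst,   /* success */
                                memory_order_relaxed);  /* failure */
        }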



Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-10 Thread Waiman Long
On 08/10/2017 12:22 PM, Waiman Long wrote:
> On 08/10/2017 12:15 PM, Peter Zijlstra wrote:
>> On Thu, Aug 10, 2017 at 09:58:57AM -0400, Waiman Long wrote:
>>> On 08/10/2017 09:27 AM, Waiman Long wrote:
>>>> On 08/10/2017 07:50 AM, Peter Zijlstra wrote:
>>>>> On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
>>>>>>   # of thread     w/o patch      with patch      % Change
>>>>>>   -----------     ---------      ----------      --------
>>>>>>        4          4053.3 Mop/s   4223.7 Mop/s     +4.2%
>>>>>>        8          3310.4 Mop/s   3406.0 Mop/s     +2.9%
>>>>>>       12          2576.4 Mop/s   2674.6 Mop/s     +3.8%
>>>>> Waiman, could you run those numbers again but with the below 'fixed' ?
>>>>>
>>>>>> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock,
>>>>>> struct mcs_spinlock *node)
>>>>>>    * observe its next->locked value and advance itself.
>>>>>>    *
>>>>>>    * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>>>>> +   *
>>>>>> +   * The write to next->locked in arch_mcs_spin_unlock_contended()
>>>>>> +   * must be ordered before the read of pn->state in the cmpxchg()
>>>>>> +   * below for the code to work correctly. However, this is not
>>>>>> +   * guaranteed on all architectures when the cmpxchg() call fails.
>>>>>> +   * Both x86 and PPC can provide that guarantee, but other
>>>>>> +   * architectures not necessarily.
>>>>>>    */
>>>>>         smp_mb();
>>>>>
>>>>>>    if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>>>>>            return;
>>>>> Ideally this Power CPU can optimize back-to-back SYNC instructions, but
>>>>> who knows...
>>>> Yes, I can run the numbers again. However, the changes here are in the
>>>> slowpath. My current patch optimizes the fast path only and my original
>>>> test doesn't stress the slowpath at all, I think. I will have to make
>>>> some changes to stress the slowpath.
>>> Looking at past emails, I remember why I put the comment there. Putting
>>> an smp_mb() here will definitely have a negative performance impact on
>>> x86. So I put in the comment here to remind me that the current code may
>>> not work for ARM64.
>>>
>>> To fix that, my current thought is to have a cmpxchg variant that
>>> guarantees ordering for both success and failure, for example,
>>> cmpxchg_ordered(). In that way, we only need to insert the barrier for
>>> architectures that need it. That will be a separate patch instead of
>>> integrating into this one.
>> Might as well do an explicit:
>>
>>  smp_mb__before_atomic()
>>  cmpxchg_relaxed()
>>  smp_mb__after_atomic()
>>
>> I suppose and not introduce new primitives.

I think we don't need smp_mb__after_atomic(). The read has to be fully
ordered, but the write part may not need it as the control dependency of
the old value should guard against incorrect action. Right?

Cheers,
Longman



Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-10 Thread Waiman Long
On 08/10/2017 12:15 PM, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 09:58:57AM -0400, Waiman Long wrote:
>> On 08/10/2017 09:27 AM, Waiman Long wrote:
>>> On 08/10/2017 07:50 AM, Peter Zijlstra wrote:
>>>> On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
>>>>>   # of thread     w/o patch      with patch      % Change
>>>>>   -----------     ---------      ----------      --------
>>>>>        4          4053.3 Mop/s   4223.7 Mop/s     +4.2%
>>>>>        8          3310.4 Mop/s   3406.0 Mop/s     +2.9%
>>>>>       12          2576.4 Mop/s   2674.6 Mop/s     +3.8%
>>>> Waiman, could you run those numbers again but with the below 'fixed' ?
>>>>
>>>>> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock,
>>>>> struct mcs_spinlock *node)
>>>>>    * observe its next->locked value and advance itself.
>>>>>    *
>>>>>    * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>>>> +   *
>>>>> +   * The write to next->locked in arch_mcs_spin_unlock_contended()
>>>>> +   * must be ordered before the read of pn->state in the cmpxchg()
>>>>> +   * below for the code to work correctly. However, this is not
>>>>> +   * guaranteed on all architectures when the cmpxchg() call fails.
>>>>> +   * Both x86 and PPC can provide that guarantee, but other
>>>>> +   * architectures not necessarily.
>>>>>    */
>>>>        smp_mb();
>>>>
>>>>>    if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>>>>            return;
>>>> Ideally this Power CPU can optimize back-to-back SYNC instructions, but
>>>> who knows...
>>> Yes, I can run the numbers again. However, the changes here are in the
>>> slowpath. My current patch optimizes the fast path only and my original
>>> test doesn't stress the slowpath at all, I think. I will have to make
>>> some changes to stress the slowpath.
>> Looking at past emails, I remember why I put the comment there. Putting
>> an smp_mb() here will definitely have a negative performance impact on
>> x86. So I put in the comment here to remind me that the current code may
>> not work for ARM64.
>>
>> To fix that, my current thought is to have a cmpxchg variant that
>> guarantees ordering for both success and failure, for example,
>> cmpxchg_ordered(). In that way, we only need to insert the barrier for
>> architectures that need it. That will be a separate patch instead of
>> integrating into this one.
> Might as well do an explicit:
>
>   smp_mb__before_atomic()
>   cmpxchg_relaxed()
>   smp_mb__after_atomic()
>
> I suppose and not introduce new primitives.


Right. I think that will work without impacting current x86 performance.
Will update my patch accordingly.

Thanks,
Longman



Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-10 Thread Peter Zijlstra
On Thu, Aug 10, 2017 at 09:58:57AM -0400, Waiman Long wrote:
> On 08/10/2017 09:27 AM, Waiman Long wrote:
> > On 08/10/2017 07:50 AM, Peter Zijlstra wrote:
> >> On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
> >>>   # of thread     w/o patch      with patch      % Change
> >>>   -----------     ---------      ----------      --------
> >>>        4          4053.3 Mop/s   4223.7 Mop/s     +4.2%
> >>>        8          3310.4 Mop/s   3406.0 Mop/s     +2.9%
> >>>       12          2576.4 Mop/s   2674.6 Mop/s     +3.8%
> >> Waiman, could you run those numbers again but with the below 'fixed' ?
> >>
> >>> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, 
> >>> struct mcs_spinlock *node)
> >>>    * observe its next->locked value and advance itself.
> >>>    *
> >>>    * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> >>> +   *
> >>> +   * The write to next->locked in arch_mcs_spin_unlock_contended()
> >>> +   * must be ordered before the read of pn->state in the cmpxchg()
> >>> +   * below for the code to work correctly. However, this is not
> >>> +   * guaranteed on all architectures when the cmpxchg() call fails.
> >>> +   * Both x86 and PPC can provide that guarantee, but other
> >>> +   * architectures not necessarily.
> >>>    */
> >>        smp_mb();
> >>
> >>>    if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> >>>            return;
> >> Ideally this Power CPU can optimize back-to-back SYNC instructions, but
> >> who knows...
> > Yes, I can run the numbers again. However, the changes here are in the
> > slowpath. My current patch optimizes the fast path only and my original
> > test doesn't stress the slowpath at all, I think. I will have to make
> > some changes to stress the slowpath.
> 
> Looking at past emails, I remember why I put the comment there. Putting
> an smp_mb() here will definitely have a negative performance impact on
> x86. So I put in the comment here to remind me that the current code may
> not work for ARM64.
> 
> To fix that, my current thought is to have a cmpxchg variant that
> guarantees ordering for both success and failure, for example,
> cmpxchg_ordered(). In that way, we only need to insert the barrier for
> architectures that need it. That will be a separate patch instead of
> integrating into this one.

Might as well do an explicit:

smp_mb__before_atomic()
cmpxchg_relaxed()
smp_mb__after_atomic()

I suppose and not introduce new primitives.


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-10 Thread Waiman Long
On 08/10/2017 09:27 AM, Waiman Long wrote:
> On 08/10/2017 07:50 AM, Peter Zijlstra wrote:
>> On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
>>>   # of thread     w/o patch      with patch      % Change
>>>   -----------     ---------      ----------      --------
>>>        4          4053.3 Mop/s   4223.7 Mop/s     +4.2%
>>>        8          3310.4 Mop/s   3406.0 Mop/s     +2.9%
>>>       12          2576.4 Mop/s   2674.6 Mop/s     +3.8%
>> Waiman, could you run those numbers again but with the below 'fixed' ?
>>
>>> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, 
>>> struct mcs_spinlock *node)
>>>  * observe its next->locked value and advance itself.
>>>  *
>>>  * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>> +*
>>> +* The write to next->locked in arch_mcs_spin_unlock_contended()
>>> +* must be ordered before the read of pn->state in the cmpxchg()
>>> +* below for the code to work correctly. However, this is not
>>> +* guaranteed on all architectures when the cmpxchg() call fails.
>>> +* Both x86 and PPC can provide that guarantee, but other
>>> +* architectures not necessarily.
>>>  */
>>  smp_mb();
>>
>>>     if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>>             return;
>> Ideally this Power CPU can optimize back-to-back SYNC instructions, but
>> who knows...
> Yes, I can run the numbers again. However, the changes here are in the
> slowpath. My current patch optimizes the fast path only and my original
> test doesn't stress the slowpath at all, I think. I will have to make
> some changes to stress the slowpath.

Looking at past emails, I remember why I put the comment there. Putting
>> an smp_mb() here will definitely have a negative performance impact on
x86. So I put in the comment here to remind me that the current code may
not work for ARM64.

To fix that, my current thought is to have a cmpxchg variant that
guarantees ordering for both success and failure, for example,
cmpxchg_ordered(). In that way, we only need to insert the barrier for
architectures that need it. That will be a separate patch instead of
integrating into this one.

Cheers,
Longman



Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-10 Thread Waiman Long
On 08/10/2017 07:50 AM, Peter Zijlstra wrote:
> On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
>>   # of thread     w/o patch      with patch      % Change
>>   -----------     ---------      ----------      --------
>>        4          4053.3 Mop/s   4223.7 Mop/s     +4.2%
>>        8          3310.4 Mop/s   3406.0 Mop/s     +2.9%
>>       12          2576.4 Mop/s   2674.6 Mop/s     +3.8%
> Waiman, could you run those numbers again but with the below 'fixed' ?
>
>> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct 
>> mcs_spinlock *node)
>>   * observe its next->locked value and advance itself.
>>   *
>>   * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>> + *
>> + * The write to next->locked in arch_mcs_spin_unlock_contended()
>> + * must be ordered before the read of pn->state in the cmpxchg()
>> + * below for the code to work correctly. However, this is not
>> + * guaranteed on all architectures when the cmpxchg() call fails.
>> + * Both x86 and PPC can provide that guarantee, but other
>> + * architectures not necessarily.
>>   */
>   smp_mb();
>
>>      if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>              return;
> Ideally this Power CPU can optimize back-to-back SYNC instructions, but
> who knows...

Yes, I can run the numbers again. However, the changes here are in the
slowpath. My current patch optimizes the fast path only and my original
test doesn't stress the slowpath at all, I think. I will have to make
some changes to stress the slowpath.

Cheers,
Longman



Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-10 Thread Peter Zijlstra
On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:
> 
>   # of thread w/o patchwith patch  % Change
>   --- ---  
>4 4053.3 Mop/s  4223.7 Mop/s +4.2%
>8 3310.4 Mop/s  3406.0 Mop/s +2.9%
>   12 2576.4 Mop/s  2674.6 Mop/s +3.8%

Waiman, could you run those numbers again but with the below 'fixed' ?

> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct 
> mcs_spinlock *node)
>* observe its next->locked value and advance itself.
>*
>* Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +  *
> +  * The write to next->locked in arch_mcs_spin_unlock_contended()
> +  * must be ordered before the read of pn->state in the cmpxchg()
> +  * below for the code to work correctly. However, this is not
> +  * guaranteed on all architectures when the cmpxchg() call fails.
> +  * Both x86 and PPC can provide that guarantee, but other
> +  * architectures not necessarily.
>*/

smp_mb();

>   if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>   return;

Ideally this Power CPU can optimize back-to-back SYNC instructions, but
who knows...


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-10 Thread Peter Zijlstra
On Thu, Aug 10, 2017 at 04:12:13PM +0800, Boqun Feng wrote:

> > Or is the reason this doesn't work on PPC that it's RCpc?

So that :-)

> Here is an example why PPC needs a sync() before the cmpxchg():
> 
>   https://marc.info/?l=linux-kernel&m=144485396224519&w=2
> 
> and Paul McKenney's detailed explanation about why this could happen:
> 
>   https://marc.info/?l=linux-kernel&m=144485909826241&w=2
> 
> (Somehow, I feel like he was answering a similar question to the one
> you ask here ;-))

Yes, and I had vague memories of having gone over this before, but
couldn't quickly find things. Thanks!

> And I think aarch64 doesn't have a problem here because it is "(other)
> multi-copy atomic". Will?

Right, it's the RCpc vs RCsc thing. The ARM64 release is, as you say,
multi-copy atomic, whereas the PPC lwsync is not.


This still leaves us with the situation that we need an smp_mb() between
smp_store_release() and a possibly failing cmpxchg() if we want to
guarantee the cmpxchg()'s load comes after the store-release.
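
As a concrete sketch of that pattern (hypothetical names; 'state' and
'locked' stand in for pn->state and next->locked), the kicker side would
have to look like:

	smp_store_release(&locked, 1);	/* publish next->locked */
	smp_mb();			/* order the store above before the
					 * load below, even when the
					 * cmpxchg() fails */
	if (cmpxchg(&state, vcpu_halted, vcpu_hashed) != vcpu_halted)
		return;

which is exactly the extra barrier shown in the hunk quoted earlier in
the thread.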


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-10 Thread Boqun Feng
On Wed, Aug 09, 2017 at 05:15:33PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 09, 2017 at 05:06:03PM +0200, Peter Zijlstra wrote:
> > Now, ARM64 for instance plays funny games, it does something along the
> > lines of:
> > 
> > cmpxchg(ptr, old, new)
> > {
> > do {
> > r = LL(ptr);
> > if (r != old)
> > return r; /* no barriers */
> > r = new;
> > } while (SC_release(ptr, r));
> > smp_mb();
> > return r;
> > }
> > 
> > Thereby ordering things relative to the store on ptr, but the load can
> > very much escape. The thinking is that if success, we must observe the
> > latest value of ptr, but even in that case the load is not ordered and
> > could happen before.
> > 
> > However, since we're guaranteed to observe the latest value of ptr (on
> > success) it doesn't matter if we reordered the load, there is no newer
> > value possible.
> > 
> > So heaps of tricky, but correct afaict. Will?
> 
> And could not PPC do something similar:
> 
> cmpxchg(ptr, old, new)
> {
>   lwsync();
>   do {
>   r = LL(ptr);
>   if (r != old)
>   return r;
>   r = new;
>   } while (SC(ptr, r));
>   sync();
>   return r;
> }
> 
> ?
> 
> the lwsync would make it store-release on SC with similar reasoning as
> above.
> 
> And lwsync allows 'stores reordered after loads', which allows the prior
> smp_store_release() to leak past.
> 
> Or is the reason this doesn't work on PPC that it's RCpc?

Here is an example why PPC needs a sync() before the cmpxchg():

https://marc.info/?l=linux-kernel&m=144485396224519&w=2

and Paul McKenney's detailed explanation about why this could happen:

https://marc.info/?l=linux-kernel&m=144485909826241&w=2

(Somehow, I feel like he was answering a similar question to the one you
ask here ;-))

And I think aarch64 doesn't have a problem here because it is "(other)
multi-copy atomic". Will?

Regards,
Boqun




Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-09 Thread Peter Zijlstra
On Wed, Aug 09, 2017 at 05:06:03PM +0200, Peter Zijlstra wrote:
> Now, ARM64 for instance plays funny games, it does something along the
> lines of:
> 
> cmpxchg(ptr, old, new)
> {
>   do {
>   r = LL(ptr);
>   if (r != old)
>   return r; /* no barriers */
>   r = new;
>   } while (SC_release(ptr, r));
>   smp_mb();
>   return r;
> }
> 
> Thereby ordering things relative to the store on ptr, but the load can
> very much escape. The thinking is that if success, we must observe the
> latest value of ptr, but even in that case the load is not ordered and
> could happen before.
> 
> However, since we're guaranteed to observe the latest value of ptr (on
> success) it doesn't matter if we reordered the load, there is no newer
> value possible.
> 
> So heaps of tricky, but correct afaict. Will?

And could not PPC do something similar:

cmpxchg(ptr, old, new)
{
lwsync();
do {
r = LL(ptr);
if (r != old)
return r;
r = new;
} while (SC(ptr, r));
sync();
return r;
}

?

the lwsync would make it store-release on SC with similar reasoning as
above.

And lwsync allows 'stores reordered after loads', which allows the prior
smp_store_release() to leak past.

Or is the reason this doesn't work on PPC that it's RCpc?


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-09 Thread Peter Zijlstra
On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:

> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct 
> mcs_spinlock *node)
>* observe its next->locked value and advance itself.
>*
>* Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +  *
> +  * The write to next->locked in arch_mcs_spin_unlock_contended()
> +  * must be ordered before the read of pn->state in the cmpxchg()
> +  * below for the code to work correctly. However, this is not
> +  * guaranteed on all architectures when the cmpxchg() call fails.
> +  * Both x86 and PPC can provide that guarantee, but other
> +  * architectures not necessarily.
>*/
>   if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>   return;

Instead of documenting this, should we not fix it properly?

So what we want is to order:

smp_store_release(&x, 1);
cmpxchg(&y, 0, 1);

Such that the store to x is before the load of y. Now, we document
cmpxchg() to have smp_mb() before and smp_mb() after (if success). So
per that definition, there would appear no way the load of y can be
reordered before the store to x.

Now, ARM64 for instance plays funny games, it does something along the
lines of:

cmpxchg(ptr, old, new)
{
do {
r = LL(ptr);
if (r != old)
return r; /* no barriers */
r = new;
} while (SC_release(ptr, r));
smp_mb();
return r;
}

Thereby ordering things relative to the store on ptr, but the load can
very much escape. The thinking is that if success, we must observe the
latest value of ptr, but even in that case the load is not ordered and
could happen before.

However, since we're guaranteed to observe the latest value of ptr (on
success) it doesn't matter if we reordered the load, there is no newer
value possible.

So heaps of tricky, but correct afaict. Will?


Of course, since we need that load to be ordered even in case of a
failed cmpxchg() we _should_ add an unconditional smp_mb() here. Which I
understand you not wanting to do. Even smp_mb__before_atomic() is no
help, because that's smp_mb() for both PPC and ARM64.
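
To make the failure-path requirement concrete, the problematic
interleaving can be written as a store-buffering style sketch
(hypothetical values; x and y play the roles of next->locked and
pn->state, with 1/2 standing in for vcpu_halted/vcpu_hashed):

	/* Initially x == 0 and y == 0. */

	CPU0 (kicker):
		smp_store_release(&x, 1);	/* next->locked */
		r0 = cmpxchg(&y, 1, 2);		/* pn->state: halted -> hashed */

	CPU1 (waiter):
		WRITE_ONCE(y, 1);		/* pn->state = vcpu_halted */
		smp_mb();
		r1 = READ_ONCE(x);		/* next->locked */

The outcome "r0 == 0 && r1 == 0" (the cmpxchg fails on a stale y *and*
the waiter misses the store to x) is the one that must be forbidden; it
is exactly what becomes possible when a failed cmpxchg() does not order
its load after the earlier release store.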


Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-08-09 Thread Waiman Long
On 05/24/2017 09:38 AM, Waiman Long wrote:
> All the locking related cmpxchg's in the following functions are
> replaced with the _acquire variants:
>  - pv_queued_spin_steal_lock()
>  - trylock_clear_pending()
>
> This change should help performance on architectures that use LL/SC.
>
> On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> enabled, the performance of a locking microbenchmark with and without
> this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> were as follows:
>
>   # of thread w/o patchwith patch  % Change
>   --- ---  
>4 4053.3 Mop/s  4223.7 Mop/s +4.2%
>8 3310.4 Mop/s  3406.0 Mop/s +2.9%
>   12 2576.4 Mop/s  2674.6 Mop/s +3.8%
>
> Signed-off-by: Waiman Long 
> ---
>  v4->v5:
>   - Correct some grammatical issues in comment.
>
>  v3->v4:
>   - Update the comment in pv_kick_node() to mention that the code
> may not work in some archs.
>
>  v2->v3:
>   - Reduce scope by relaxing cmpxchg's in fast path only.
>
>  v1->v2:
>   - Add comments in changelog and code for the rationale of the change.
>
>  kernel/locking/qspinlock_paravirt.h | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/locking/qspinlock_paravirt.h 
> b/kernel/locking/qspinlock_paravirt.h
> index e6b2f7a..4614e39 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct 
> qspinlock *lock)
>   struct __qspinlock *l = (void *)lock;
>  
>   if (!(atomic_read(&l->val) & _Q_LOCKED_PENDING_MASK) &&
> - (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> + (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>   qstat_inc(qstat_pv_lock_stealing, true);
>   return true;
>   }
> @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct 
> qspinlock *lock)
>  
>  /*
>   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> - * just to be sure that it will get it.
> + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> + * lock just to be sure that it will get it.
>   */
>  static __always_inline int trylock_clear_pending(struct qspinlock 
> *lock)
>  {
>   struct __qspinlock *l = (void *)lock;
>  
>   return !READ_ONCE(l->locked) &&
> -(cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> - == _Q_PENDING_VAL);
> +(cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> + _Q_LOCKED_VAL) == _Q_PENDING_VAL);
>  }
>  #else /* _Q_PENDING_BITS == 8 */
>  static __always_inline void set_pending(struct qspinlock *lock)
> @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct 
> qspinlock *lock)
>*/
>   old = val;
>   new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> - val = atomic_cmpxchg(&l->val, old, new);
> + val = atomic_cmpxchg_acquire(&l->val, old, new);
>  
>   if (val == old)
>   return 1;
> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct 
> mcs_spinlock *node)
>* observe its next->locked value and advance itself.
>*
>* Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +  *
> +  * The write to next->locked in arch_mcs_spin_unlock_contended()
> +  * must be ordered before the read of pn->state in the cmpxchg()
> +  * below for the code to work correctly. However, this is not
> +  * guaranteed on all architectures when the cmpxchg() call fails.
> +  * Both x86 and PPC can provide that guarantee, but other
> +  * architectures not necessarily.
>*/
> if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>   return;

Hi,

Is this patch OK to be merged, or are more testing or changes needed?

Cheers,
Longman



[RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

2017-05-24 Thread Waiman Long
All the locking related cmpxchg's in the following functions are
replaced with the _acquire variants:
 - pv_queued_spin_steal_lock()
 - trylock_clear_pending()

This change should help performance on architectures that use LL/SC.

On a 2-core 16-thread Power8 system with pvqspinlock explicitly
enabled, the performance of a locking microbenchmark with and without
this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
were as follows:

  # of thread w/o patchwith patch  % Change
  --- ---  
   4 4053.3 Mop/s  4223.7 Mop/s +4.2%
   8 3310.4 Mop/s  3406.0 Mop/s +2.9%
  12 2576.4 Mop/s  2674.6 Mop/s +3.8%

Signed-off-by: Waiman Long 
---
 v4->v5:
  - Correct some grammatical issues in comment.

 v3->v4:
  - Update the comment in pv_kick_node() to mention that the code
may not work in some archs.

 v2->v3:
  - Reduce scope by relaxing cmpxchg's in fast path only.

 v1->v2:
  - Add comments in changelog and code for the rationale of the change.

 kernel/locking/qspinlock_paravirt.h | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index e6b2f7a..4614e39 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock 
*lock)
struct __qspinlock *l = (void *)lock;
 
if (!(atomic_read(&l->val) & _Q_LOCKED_PENDING_MASK) &&
-   (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+   (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
qstat_inc(qstat_pv_lock_stealing, true);
return true;
}
@@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct 
qspinlock *lock)
 
 /*
  * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
- * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
- * just to be sure that it will get it.
+ * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
+ * lock just to be sure that it will get it.
  */
 static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 {
struct __qspinlock *l = (void *)lock;
 
return !READ_ONCE(l->locked) &&
-  (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
-   == _Q_PENDING_VAL);
+  (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
+   _Q_LOCKED_VAL) == _Q_PENDING_VAL);
 }
 #else /* _Q_PENDING_BITS == 8 */
 static __always_inline void set_pending(struct qspinlock *lock)
@@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct 
qspinlock *lock)
 */
old = val;
new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-   val = atomic_cmpxchg(&l->val, old, new);
+   val = atomic_cmpxchg_acquire(&l->val, old, new);
 
if (val == old)
return 1;
@@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct 
mcs_spinlock *node)
 * observe its next->locked value and advance itself.
 *
 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+*
+* The write to next->locked in arch_mcs_spin_unlock_contended()
+* must be ordered before the read of pn->state in the cmpxchg()
+* below for the code to work correctly. However, this is not
+* guaranteed on all architectures when the cmpxchg() call fails.
+* Both x86 and PPC can provide that guarantee, but other
+* architectures not necessarily.
 */
if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
return;
-- 
1.8.3.1
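
For context on why the _acquire form is cheaper on LL/SC machines, a
rough sketch in the same pseudo-code style used earlier in the thread
(an assumed arm64-style implementation, not part of this patch):

cmpxchg_acquire(ptr, old, new)
{
	do {
		r = LL_acquire(ptr);	/* load-acquire exclusive */
		if (r != old)
			return r;	/* no trailing full barrier */
		r = new;
	} while (SC(ptr, r));		/* plain store-conditional */
	return r;
}

Compared with the fully ordered cmpxchg() discussed earlier, the heavy
barriers around the LL/SC loop are gone, which is where the improvement
in the numbers above is expected to come from.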


