Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-20 Thread Peter Zijlstra
On Thu, Nov 19, 2015 at 06:01:52PM +, Will Deacon wrote:
> For completeness, here's what I've currently got. I've failed to measure
> any performance impact on my 8-core systems, but that's not surprising.

> +static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> +{
> +	unsigned int tmp;
> +	arch_spinlock_t lockval;
>
> +	asm volatile(
> +"	sevl\n"
> +"1:	wfe\n"

Using WFE here would lower the cacheline bouncing pressure a bit, I
imagine. Sure, we still pull the line over into S(hared) after every store,
but we don't keep banging on it, making the initial e(X)clusive grab
hard.

> +"2:  ldaxr   %w0, %2\n"
> +"eor %w1, %w0, %w0, ror #16\n"
> +"cbnz%w1, 1b\n"
> + ARM64_LSE_ATOMIC_INSN(
> + /* LL/SC */
> +"stxr%w1, %w0, %2\n"
> + /* Serialise against any concurrent lockers */
> +"cbnz%w1, 2b\n",
> + /* LSE atomics */
> +"nop\n"
> +"nop\n")

I find these ARM64_LSE macro thingies aren't always easy to read; it's
fairly easy to overlook the ',' separating the v8 and v8.1 parts, especially
if you have further interleaving comments like in the above.

> + : "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
> + :
> + : "memory");
> +}


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-19 Thread Will Deacon
On Wed, Nov 18, 2015 at 11:25:14AM +, Will Deacon wrote:
> On Tue, Nov 17, 2015 at 01:01:09PM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 17, 2015 at 11:51:10AM +, Will Deacon wrote:
> > > On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote:
> > > > On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon  
> > > > wrote:
> > > > >
> > > > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > > > > slightly cheaper than spin_lock()+spin_unlock().
> > > > 
> > > > So traditionally the real concern has been the cacheline ping-pong
> > > > part of spin_unlock_wait(). I think adding a memory barrier (that
> > > > doesn't force any exclusive states, just ordering) to it is fine, but
> > > > I don't think we want to necessarily have it have to get the cacheline
> > > > into exclusive state.
> > > 
> > > The problem is, I don't think the memory-barrier buys you anything in
> > > the context of Boqun's example. In fact, he already had smp_mb() either
> > > side of the spin_unlock_wait() and it's still broken on arm64 and ppc.
> > > 
> > > Paul is proposing adding a memory barrier after spin_lock() in the racing
> > > thread, but I personally think people will forget to add that.
> > 
> > A mechanical check would certainly make me feel better about it, so that
> > any lock that was passed to spin_unlock_wait() was required to have all
> > acquisitions followed by smp_mb__after_unlock_lock() or some such.
> > But I haven't yet given up on finding a better solution.
> 
> Right-o. I'll hack together the arm64 spin_unlock_wait fix, but hold off
> merging it for a few weeks in case we get struck by a sudden flash of
> inspiration.

For completeness, here's what I've currently got. I've failed to measure
any performance impact on my 8-core systems, but that's not surprising.

Will

--->8

From da14adc1aef2f12b7a7def4d6b7dde254a91ebf1 Mon Sep 17 00:00:00 2001
From: Will Deacon 
Date: Thu, 19 Nov 2015 17:48:31 +
Subject: [PATCH] arm64: spinlock: serialise spin_unlock_wait against
 concurrent lockers

Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
on architectures implementing spin_lock with LL/SC sequences and acquire
semantics:

 | CPU 1                 CPU 2                 CPU 3
 | ==================    ==================    ==================
 |                                             spin_unlock(&lock);
 |                       spin_lock(&lock):
 |                         r1 = *lock; // r1 == 0;
 |                       o = READ_ONCE(object); // reordered here
 | object = NULL;
 | smp_mb();
 | spin_unlock_wait(&lock);
 |                         *lock = 1;
 | smp_mb();
 | o->dead = true;
 |                       if (o) // true
 |                         BUG_ON(o->dead); // true!!

The crux of the problem is that spin_unlock_wait(&lock) can return on
CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
to serialise against a concurrent locker and giving it acquire semantics
in the process (although it is not at all clear whether this is needed -
different callers seem to assume different things about the barrier
semantics and architectures are similarly disjoint in their
implementations of the macro).

This patch implements spin_unlock_wait using an LL/SC sequence with
acquire semantics on arm64. For v8.1 systems with the LSE atomics, the
exclusive writeback is omitted, since the spin_lock operation is
indivisible and no intermediate state can be observed.

Signed-off-by: Will Deacon 
---
 arch/arm64/include/asm/spinlock.h | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index c85e96d174a5..b531791a75ff 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,9 +26,29 @@
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	unsigned int tmp;
+	arch_spinlock_t lockval;
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+	asm volatile(
+"	sevl\n"
+"1:	wfe\n"
+"2:	ldaxr	%w0, %2\n"
+"	eor	%w1, %w0, %w0, ror #16\n"
+"	cbnz	%w1, 1b\n"
+	ARM64_LSE_ATOMIC_INSN(
+	/* LL/SC */
+"	stxr	%w1, %w0, %2\n"
+	/* Serialise against any concurrent lockers */
+"	cbnz	%w1, 2b\n",
+	/* LSE atomics */
+"	nop\n"
+"	nop\n")
+	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
+	:
+	: "memory");
+}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
-- 
2.1.4
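
For readers unfamiliar with the arm64 lock encoding: the "eor %w1, %w0, %w0, ror #16"
test in the patch above is zero exactly when the two 16-bit halves of the lock word
(the owner and next-ticket counters) are equal, i.e. when the lock is free. A minimal
C sketch of the same check, assuming that layout (the sketch_ name and types are
illustrative, not the actual arm64 definitions):

#include <stdint.h>

/*
 * Illustrative only: assumes the lock word packs two 16-bit ticket
 * counters (owner and next), which is what the "ror #16" trick implies.
 */
static inline int sketch_lock_is_free(uint32_t lockval)
{
	/* rotating by 16 swaps the halves; the XOR is zero iff owner == next */
	uint32_t rotated = (lockval << 16) | (lockval >> 16);

	return (lockval ^ rotated) == 0;
}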


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-18 Thread Will Deacon
On Tue, Nov 17, 2015 at 01:01:09PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 17, 2015 at 11:51:10AM +, Will Deacon wrote:
> > On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote:
> > > On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon  wrote:
> > > >
> > > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > > > slightly cheaper than spin_lock()+spin_unlock().
> > > 
> > > So traditionally the real concern has been the cacheline ping-pong
> > > part of spin_unlock_wait(). I think adding a memory barrier (that
> > > doesn't force any exclusive states, just ordering) to it is fine, but
> > > I don't think we want to necessarily have it have to get the cacheline
> > > into exclusive state.
> > 
> > The problem is, I don't think the memory-barrier buys you anything in
> > the context of Boqun's example. In fact, he already had smp_mb() either
> > side of the spin_unlock_wait() and it's still broken on arm64 and ppc.
> > 
> > Paul is proposing adding a memory barrier after spin_lock() in the racing
> > thread, but I personally think people will forget to add that.
> 
> A mechanical check would certainly make me feel better about it, so that
> any lock that was passed to spin_unlock_wait() was required to have all
> acquisitions followed by smp_mb__after_unlock_lock() or some such.
> But I haven't yet given up on finding a better solution.

Right-o. I'll hack together the arm64 spin_unlock_wait fix, but hold off
merging it for a few weeks in case we get struck by a sudden flash of
inspiration.

Will


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-17 Thread Paul E. McKenney
On Tue, Nov 17, 2015 at 11:51:10AM +, Will Deacon wrote:
> Hi Linus,
> 
> On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote:
> > On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon  wrote:
> > >
> > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > > slightly cheaper than spin_lock()+spin_unlock().
> > 
> > So traditionally the real concern has been the cacheline ping-pong
> > part of spin_unlock_wait(). I think adding a memory barrier (that
> > doesn't force any exclusive states, just ordering) to it is fine, but
> > I don't think we want to necessarily have it have to get the cacheline
> > into exclusive state.
> 
> The problem is, I don't think the memory-barrier buys you anything in
> the context of Boqun's example. In fact, he already had smp_mb() either
> side of the spin_unlock_wait() and it's still broken on arm64 and ppc.
> 
> Paul is proposing adding a memory barrier after spin_lock() in the racing
> thread, but I personally think people will forget to add that.

A mechanical check would certainly make me feel better about it, so that
any lock that was passed to spin_unlock_wait() was required to have all
acquisitions followed by smp_mb__after_unlock_lock() or some such.
But I haven't yet given up on finding a better solution.

Thanx, Paul



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-17 Thread Will Deacon
Hi Linus,

On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon  wrote:
> >
> > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > slightly cheaper than spin_lock()+spin_unlock().
> 
> So traditionally the real concern has been the cacheline ping-pong
> part of spin_unlock_wait(). I think adding a memory barrier (that
> doesn't force any exclusive states, just ordering) to it is fine, but
> I don't think we want to necessarily have it have to get the cacheline
> into exclusive state.

The problem is, I don't think the memory-barrier buys you anything in
the context of Boqun's example. In fact, he already had smp_mb() either
side of the spin_unlock_wait() and it's still broken on arm64 and ppc.

Paul is proposing adding a memory barrier after spin_lock() in the racing
thread, but I personally think people will forget to add that.

> Because if spin_unlock_wait() ends up having to get the spinlock
> cacheline (for example, by writing the same value back with a SC), I
> don't think spin_unlock_wait() will really be all that much cheaper
> than just getting the spinlock, and in that case we shouldn't play
> complicated ordering games.

It was the lock-fairness guarantees that I was concerned about. A
spin_lock could place you into a queue, so you're no longer waiting for
a single spin_unlock(), you're now waiting for *all* the spin_unlocks
by the CPUs preceding you in the queue.
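
For concreteness, a minimal sketch of a generic ticket lock (not the actual arm64
code; memory ordering is omitted and the sketch_ names are made up) showing why
spin_lock() has to queue while an unlock_wait-style primitive only has to see the
lock momentarily free:

#include <stdatomic.h>

/* Sketch only: a generic ticket lock, ordering omitted for brevity. */
struct sketch_ticket_lock {
	atomic_ushort owner;	/* ticket currently being served */
	atomic_ushort next;	/* next ticket to hand out */
};

static void sketch_lock(struct sketch_ticket_lock *l)
{
	unsigned short ticket = atomic_fetch_add(&l->next, 1);

	/* queue: wait for *every* unlock by the CPUs ahead of us */
	while (atomic_load(&l->owner) != ticket)
		;
}

static void sketch_unlock_wait(struct sketch_ticket_lock *l)
{
	/* only wait until the lock is momentarily free (owner == next) */
	while (atomic_load(&l->owner) != atomic_load(&l->next))
		;
}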

> On another issue:
> 
> I'm also looking at the ARM documentation for strx, and the
> _documentation_ says that it has no stronger ordering than a "store
> release", but I'm starting to wonder if that is actually true.
> 
> Because I do end up thinking that it does have the same "control
> dependency" to all subsequent writes (but not reads). So reads after
> the SC can percolate up, but I think writes are restricted.
> 
> Why? In order for the SC to be able to return success, the write
> itself may not have been actually done yet, but the cacheline for the
> write must have successfully be turned into exclusive ownership.
> Agreed?

If the LL/SC logic hangs off the coherency logic, then yes, but there
are other ways to build this (using a separate "exclusive monitor" block)
and the architecture caters for this, too. See below.

> That means that by the time a SC returns success, no other CPU can see
> the old value of the spinlock any more. So by the time any subsequent
> stores in the locked region can be visible to any other CPU's, the
> locked value of the lock itself has to be visible too.
> 
> Agreed?

No. A successful SC is *not* multi-copy atomic, but you're right to
point out that it provides more guarantees than a plain store. In
particular, a successful SC cannot return its success value until its
corresponding write has fixed its place in the coherence order for the
location that it is updating.

To be more concrete (simplified AArch64 asm, X1 and X2 hold addresses of
zero-initialised locations):


P0
LDXR	X0, [X1]
ADD	X0, X0, #1
STXR	X0, [X1]	// Succeeds

P1
LDR	X0, [X1]	// Reads 1

STR	#1, [X2]

P2
LDR	X0, [X2]	// Reads 1

LDR	X0, [X1]	// **Not required to read 1**


However:

P0
LDXR	X0, [X1]
ADD	X0, X0, #1
STXR	X0, [X1]	// Succeeds

P1
LDR	X0, [X1]	// Reads 1

STR	#1, [X2]

P2
LDR	X0, [X2]	// Reads 1

STR	#2, [X1]	// Location at [X1] must be ordered {0->1->2}


We can also extend this example so that P2 instead does:

P2
LDR	X0, [X2]	// Reads 1

LDXR	X0, [X1]
ADD	X0, X0, #1
STXR	X0, [X1]	// Succeeds; location at [x1] must be ordered {0->1->2}

i.e. the STXR cannot succeed until the coherence order has been resolved,
which also requires the LDXR to return the up-to-date value.

> So I think that in effect, when a spinlock is implemented with LL/SC,
> the loads inside the locked region are only ordered wrt the acquire on
> the LL, but the stores can be considered ordered wrt the SC.
> 
> No?

I initially fell into the same trap because a control dependency between
a load and a store is sufficient to create order (i.e. we don't speculate
writes). An SC is different, though, because the control dependency can
be resolved without the write being multi-copy atomic, whereas a read is
required to return its data (i.e. complete) before the control hazard can
be resolved.

I think that all of this means I should either:

  (1) Update the arm64 spin_unlock_wait to use LDXR/STXR (perhaps with
  acquire semantics?)

- or -

  (2) Replace spin_unlock_wait with spin_lock; spin_unlock, my worries
  about queuing notwithstanding.

The cacheline ping-pong in (1) can be mitigated somewhat by the use of
wfe, so it won't be any worse than a spin_lock().

Will

Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-16 Thread Linus Torvalds
On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon  wrote:
>
> ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> slightly cheaper than spin_lock()+spin_unlock().

So traditionally the real concern has been the cacheline ping-pong
part of spin_unlock_wait(). I think adding a memory barrier (that
doesn't force any exclusive states, just ordering) to it is fine, but
I don't think we want to necessarily have it have to get the cacheline
into exclusive state.

Because if spin_unlock_wait() ends up having to get the spinlock
cacheline (for example, by writing the same value back with a SC), I
don't think spin_unlock_wait() will really be all that much cheaper
than just getting the spinlock, and in that case we shouldn't play
complicated ordering games.

On another issue:

I'm also looking at the ARM documentation for strx, and the
_documentation_ says that it has no stronger ordering than a "store
release", but I'm starting to wonder if that is actually true.

Because I do end up thinking that it does have the same "control
dependency" to all subsequent writes (but not reads). So reads after
the SC can percolate up, but I think writes are restricted.

Why? In order for the SC to be able to return success, the write
itself may not have been actually done yet, but the cacheline for the
write must have successfully been turned into exclusive ownership.
Agreed?

That means that by the time a SC returns success, no other CPU can see
the old value of the spinlock any more. So by the time any subsequent
stores in the locked region can be visible to any other CPUs, the
locked value of the lock itself has to be visible too.

Agreed?

So I think that in effect, when a spinlock is implemented with LL/SC,
the loads inside the locked region are only ordered wrt the acquire on
the LL, but the stores can be considered ordered wrt the SC.

No?

So I think a _successful_ SC - is still more ordered than just any
random store with release consistency.

Of course, I'm not sure that actually *helps* us, because I think the
problem tends to be loads in the locked region moving up earlier than
the actual store that sets the lock, but maybe it makes some
difference.

Linus


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-16 Thread Paul E. McKenney
On Mon, Nov 16, 2015 at 04:46:36PM +, Will Deacon wrote:
> On Mon, Nov 16, 2015 at 08:44:43AM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 16, 2015 at 04:24:53PM +, Will Deacon wrote:
> > > On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote:
> > > > On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote:
> > > > > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> > > > > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> > > > > > generally be that you have some external ordering guarantee that
> > > > > > guarantees that the lock has been taken. For example, for the IPC
> > > > > > semaphores, we do either one of:
> > > > > > 
> > > > > >  (a) get large lock, then - once you hold that lock - wait for each 
> > > > > > small lock
> > > > > > 
> > > > > > or
> > > > > > 
> > > > > >  (b) get small lock, then - once you hold that lock - check that the
> > > > > > large lock is unlocked
> > > > > > 
> > > > > > and that's the case we should really worry about.  The other uses of
> > > > > > spin_unlock_wait() should have similar "I have other reasons to know
> > > > > > I've seen that the lock was taken, or will never be taken after this
> > > > > > because XYZ".
> > > > > 
> > > > > I don't think this is true for the usage in do_exit(), we have no
> > > > > knowledge on if pi_lock is taken or not. We just want to make sure 
> > > > > that
> > > > > _if_ it were taken, we wait until it is released.
> > > > 
> > > > And unless PPC would move to using RCsc locks with a SYNC in
> > > > spin_lock(), I don't think it makes sense to add
> > > > smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this.
> > > > As that is far more expensive than flipping the exit path to do
> > > > spin_lock()+spin_unlock().
> > > 
> > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > > slightly cheaper than spin_lock()+spin_unlock().
> > 
> > Or we supply a heavyweight version of spin_unlock_wait() that forces
> > the cache miss.  But I bet that the difference in overhead between
> > spin_lock()+spin_unlock() and the heavyweight version would be down in
> > the noise.
> 
> I'm not so sure. If the lock is ticket-based, then spin_lock() has to
> queue for its turn, whereas spin_unlock_wait could just wait for the
> next unlock.

Fair point, and it actually applies to high-contention spinlocks as well,
just a bit less deterministically.  OK, given that I believe that we do
see high contention on the lock in question, I withdraw any objections
to a heavy-weight form of spin_unlock_wait().

Thanx, Paul



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-16 Thread Will Deacon
On Mon, Nov 16, 2015 at 08:44:43AM -0800, Paul E. McKenney wrote:
> On Mon, Nov 16, 2015 at 04:24:53PM +, Will Deacon wrote:
> > On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote:
> > > > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> > > > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> > > > > generally be that you have some external ordering guarantee that
> > > > > guarantees that the lock has been taken. For example, for the IPC
> > > > > semaphores, we do either one of:
> > > > > 
> > > > >  (a) get large lock, then - once you hold that lock - wait for each 
> > > > > small lock
> > > > > 
> > > > > or
> > > > > 
> > > > >  (b) get small lock, then - once you hold that lock - check that the
> > > > > large lock is unlocked
> > > > > 
> > > > > and that's the case we should really worry about.  The other uses of
> > > > > spin_unlock_wait() should have similar "I have other reasons to know
> > > > > I've seen that the lock was taken, or will never be taken after this
> > > > > because XYZ".
> > > > 
> > > > I don't think this is true for the usage in do_exit(), we have no
> > > > knowledge on if pi_lock is taken or not. We just want to make sure that
> > > > _if_ it were taken, we wait until it is released.
> > > 
> > > And unless PPC would move to using RCsc locks with a SYNC in
> > > spin_lock(), I don't think it makes sense to add
> > > smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this.
> > > As that is far more expensive than flipping the exit path to do
> > > spin_lock()+spin_unlock().
> > 
> > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > slightly cheaper than spin_lock()+spin_unlock().
> 
> Or we supply a heavyweight version of spin_unlock_wait() that forces
> the cache miss.  But I bet that the difference in overhead between
> spin_lock()+spin_unlock() and the heavyweight version would be down in
> the noise.

I'm not so sure. If the lock is ticket-based, then spin_lock() has to
queue for its turn, whereas spin_unlock_wait could just wait for the
next unlock.

Will


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-16 Thread Paul E. McKenney
On Mon, Nov 16, 2015 at 04:24:53PM +, Will Deacon wrote:
> On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> > > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> > > > generally be that you have some external ordering guarantee that
> > > > guarantees that the lock has been taken. For example, for the IPC
> > > > semaphores, we do either one of:
> > > > 
> > > >  (a) get large lock, then - once you hold that lock - wait for each 
> > > > small lock
> > > > 
> > > > or
> > > > 
> > > >  (b) get small lock, then - once you hold that lock - check that the
> > > > large lock is unlocked
> > > > 
> > > > and that's the case we should really worry about.  The other uses of
> > > > spin_unlock_wait() should have similar "I have other reasons to know
> > > > I've seen that the lock was taken, or will never be taken after this
> > > > because XYZ".
> > > 
> > > I don't think this is true for the usage in do_exit(), we have no
> > > knowledge on if pi_lock is taken or not. We just want to make sure that
> > > _if_ it were taken, we wait until it is released.
> > 
> > And unless PPC would move to using RCsc locks with a SYNC in
> > spin_lock(), I don't think it makes sense to add
> > smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this.
> > As that is far more expensive than flipping the exit path to do
> > spin_lock()+spin_unlock().
> 
> ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> slightly cheaper than spin_lock()+spin_unlock().

Or we supply a heavyweight version of spin_unlock_wait() that forces
the cache miss.  But I bet that the difference in overhead between
spin_lock()+spin_unlock() and the heavyweight version would be down in
the noise.

Thanx, Paul



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-16 Thread Will Deacon
On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> > > generally be that you have some external ordering guarantee that
> > > guarantees that the lock has been taken. For example, for the IPC
> > > semaphores, we do either one of:
> > > 
> > >  (a) get large lock, then - once you hold that lock - wait for each small 
> > > lock
> > > 
> > > or
> > > 
> > >  (b) get small lock, then - once you hold that lock - check that the
> > > large lock is unlocked
> > > 
> > > and that's the case we should really worry about.  The other uses of
> > > spin_unlock_wait() should have similar "I have other reasons to know
> > > I've seen that the lock was taken, or will never be taken after this
> > > because XYZ".
> > 
> > I don't think this is true for the usage in do_exit(), we have no
> > knowledge on if pi_lock is taken or not. We just want to make sure that
> > _if_ it were taken, we wait until it is released.
> 
> And unless PPC would move to using RCsc locks with a SYNC in
> spin_lock(), I don't think it makes sense to add
> smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this.
> As that is far more expensive than flipping the exit path to do
> spin_lock()+spin_unlock().

... or we upgrade spin_unlock_wait to a LOCK operation, which might be
slightly cheaper than spin_lock()+spin_unlock().

Will


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-16 Thread Peter Zijlstra
On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> > generally be that you have some external ordering guarantee that
> > guarantees that the lock has been taken. For example, for the IPC
> > semaphores, we do either one of:
> > 
> >  (a) get large lock, then - once you hold that lock - wait for each small 
> > lock
> > 
> > or
> > 
> >  (b) get small lock, then - once you hold that lock - check that the
> > large lock is unlocked
> > 
> > and that's the case we should really worry about.  The other uses of
> > spin_unlock_wait() should have similar "I have other reasons to know
> > I've seen that the lock was taken, or will never be taken after this
> > because XYZ".
> 
> I don't think this is true for the usage in do_exit(), we have no
> knowledge on if pi_lock is taken or not. We just want to make sure that
> _if_ it were taken, we wait until it is released.

And unless PPC would move to using RCsc locks with a SYNC in
spin_lock(), I don't think it makes sense to add
smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this.
As that is far more expensive than flipping the exit path to do
spin_lock()+spin_unlock().


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-16 Thread Peter Zijlstra
On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> generally be that you have some external ordering guarantee that
> guarantees that the lock has been taken. For example, for the IPC
> semaphores, we do either one of:
> 
>  (a) get large lock, then - once you hold that lock - wait for each small lock
> 
> or
> 
>  (b) get small lock, then - once you hold that lock - check that the
> large lock is unlocked
> 
> and that's the case we should really worry about.  The other uses of
> spin_unlock_wait() should have similar "I have other reasons to know
> I've seen that the lock was taken, or will never be taken after this
> because XYZ".

I don't think this is true for the usage in do_exit(), we have no
knowledge on if pi_lock is taken or not. We just want to make sure that
_if_ it were taken, we wait until it is released.
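
For reference, a rough sketch of the do_exit() pattern in question (simplified; not
the exact kernel/exit.c code, and the sketch_ name is made up): the exiting task does
not want the lock itself, it only needs to wait out any concurrent try_to_wake_up()
that may currently hold it.

#include <linux/sched.h>
#include <linux/spinlock.h>

/* Rough sketch only, abstracted in the same way as Boqun's example. */
static void sketch_exit_path(struct task_struct *tsk)
{
	/*
	 * A concurrent try_to_wake_up() may or may not hold tsk->pi_lock;
	 * we don't take the lock ourselves, we only wait until any current
	 * holder has released it.
	 */
	smp_mb();
	raw_spin_unlock_wait(&tsk->pi_lock);
	smp_mb();

	/* ... continue tearing the task down ... */
}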

But I'm not sure where task_work_run() sits; at first reading it appears
to also not be true -- there doesn't appear to be a reason we know a
lock to be held.

It does, however, appear true for the usage in completion_done(), where,
by having tested x->done, we know a pi_lock _was_ held.



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-16 Thread Will Deacon
Hi Paul,

On Thu, Nov 12, 2015 at 03:43:51PM -0800, Paul E. McKenney wrote:
> On Thu, Nov 12, 2015 at 09:33:39PM +, Will Deacon wrote:
> > I think we ended up concluding that smp_mb__after_unlock_lock is indeed
> > required, but I don't think we should just resurrect the old definition,
> > which doesn't keep UNLOCK -> LOCK distinct from RELEASE -> ACQUIRE. I'm
> > still working on documenting the different types of transitivity that we
> > identified in that thread, but it's slow going.
> > 
> > Also, as far as spin_unlock_wait is concerned, it is neither a LOCK nor
> > an UNLOCK and this barrier doesn't offer us anything. Sure, it might
> > work because PPC defines it as smp_mb(), but it doesn't help on arm64
> > and defining the macro is overkill for us in most places (i.e. RCU).
> > 
> > If we decide that the current usage of spin_unlock_wait is valid, then I
> > would much rather implement a version of it in the arm64 backend that
> > does something like:
> > 
> >  1:  ldrex r1, [&lock]
> >  if r1 indicates that lock is taken, branch back to 1b
> >  strex r1, [&lock]
> >  if store failed, branch back to 1b
> > 
> > i.e. we don't just test the lock, but we also write it back atomically
> > if we discover that it's free. That would then clear the exclusive monitor
> > on any cores in the process of taking the lock and restore the ordering
> > that we need.
> 
> We could clearly do something similar in PowerPC, but I suspect that this
> would hurt really badly on large systems, given that there are PowerPC
> systems with more than a thousand hardware threads.  So one approach
> is ARM makes spin_unlock_wait() do the write, similar to spin_lock();
> spin_lock(), but PowerPC relies on smp_mb__after_unlock_lock().

Sure, I'm certainly not trying to tell you how to do this for PPC, but
the above would be better for arm64 (any huge system should be using the
8.1 atomics anyway).

> Or does someone have a better proposal?

I don't think I'm completely clear on the current proposal wrt
smp_mb__after_unlock_lock. To summarise my understanding (in the context
of Boqun's original example, which I've duplicated at the end of this
mail):

  * Putting smp_mb__after_unlock_lock() after the LOCK on CPU2 creates
global order, by upgrading the UNLOCK -> LOCK to a full barrier. If
we extend this to include the accesses made by the UNLOCK and LOCK
as happening *before* the notional full barrier, then a from-read
edge from CPU2 to CPU1 on `object' implies that the LOCK operation
is observed by CPU1 before it writes object = NULL. So we can add
this barrier and fix the test for PPC.

  * Upgrading spin_unlock_wait to a LOCK operation (which is basically
what I'm proposing for arm64) means that we now rely on LOCK -> LOCK
being part of a single, total order (i.e. all CPUs agree on the
order in which a lock was taken).

Assuming PPC has this global LOCK -> LOCK ordering, then we end up in a
sticky situation defining the kernel's memory model because we don't
have an architecture-agnostic semantic. The only way I can see to fix
this is by adding something like smp_mb__between_lock_unlock_wait, but
that's grim.

Do you see a way forward?

Will

--->8

CPU 1                   CPU 2                   CPU 3
==================      ==================      ==================
                                                spin_unlock(&lock);
                        spin_lock(&lock):
                          r1 = *lock; // r1 == 0;
                        o = READ_ONCE(object); // reordered here
object = NULL;
smp_mb();
spin_unlock_wait(&lock);
                          *lock = 1;
smp_mb();
o->dead = true;
                        if (o) // true
                          BUG_ON(o->dead); // true!!
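
Annotating CPU 2 of the example above with the first option (the
smp_mb__after_unlock_lock() proposal), purely as a sketch: the sketch_ name is made
up, and the types/globals are redeclared here only so the fragment is self-contained.

#include <linux/bug.h>
#include <linux/compiler.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* Types/globals from the example above, redeclared for the sketch: */
struct object { bool dead; };
static struct object *object;
static DEFINE_SPINLOCK(lock);

/*
 * Sketch only, per the first bullet above: CPU 2 with the proposed
 * barrier after the lock acquisition, upgrading the UNLOCK -> LOCK
 * sequence to a full barrier so that the BUG_ON can no longer fire
 * on PPC.
 */
static void sketch_cpu2(void)
{
	struct object *o;

	spin_lock(&lock);
	smp_mb__after_unlock_lock();	/* proposed addition */
	o = READ_ONCE(object);
	if (o)
		BUG_ON(o->dead);
	spin_unlock(&lock);
}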


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Paul E. McKenney
On Thu, Nov 12, 2015 at 09:33:39PM +, Will Deacon wrote:
> On Thu, Nov 12, 2015 at 10:59:06AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 12, 2015 at 08:33:02PM +0100, Oleg Nesterov wrote:
> > > On 11/12, Peter Zijlstra wrote:
> > > >
> > > > On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote:
> > > > > It seems that PPC needs to define smp_mb__before_spinlock() as full 
> > > > > mb(),
> > > > > no?
> > > >
> > > > It does:
> > > >
> > > > arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() 
> > > > smp_mb()
> > > 
> > > Ah, indeed, thanks.
> > > 
> > > And given that it also defines smp_mb__after_unlock_lock() as smp_mb(),
> > > I am starting to understand how it can help to avoid the races with
> > > spin_unlock_wait() in (for example) do_exit().
> > > 
> > > But as Boqun has already mentioned, this means that 
> > > mb__after_unlock_lock()
> > > has the new meaning which should be documented.
> > > 
> > > Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be 
> > > reverted
> > > then ;)
> > 
> > Surprisingly, this reverts cleanly against today's mainline, please see
> > the patch below.  Against my -rcu stack, not so much, but so it goes.  ;-)
> 
> I think we ended up concluding that smp_mb__after_unlock_lock is indeed
> required, but I don't think we should just resurrect the old definition,
> which doesn't keep UNLOCK -> LOCK distinct from RELEASE -> ACQUIRE. I'm
> still working on documenting the different types of transitivity that we
> identified in that thread, but it's slow going.
> 
> Also, as far as spin_unlock_wait is concerned, it is neither a LOCK nor
> an UNLOCK and this barrier doesn't offer us anything. Sure, it might
> work because PPC defines it as smp_mb(), but it doesn't help on arm64
> and defining the macro is overkill for us in most places (i.e. RCU).
> 
> If we decide that the current usage of spin_unlock_wait is valid, then I
> would much rather implement a version of it in the arm64 backend that
> does something like:
> 
>  1:  ldrex r1, [&lock]
>  if r1 indicates that lock is taken, branch back to 1b
>  strex r1, [&lock]
>  if store failed, branch back to 1b
> 
> i.e. we don't just test the lock, but we also write it back atomically
> if we discover that it's free. That would then clear the exclusive monitor
> on any cores in the process of taking the lock and restore the ordering
> that we need.

We could clearly do something similar in PowerPC, but I suspect that this
would hurt really badly on large systems, given that there are PowerPC
systems with more than a thousand hardware threads.  So one approach
is ARM makes spin_unlock_wait() do the write, similar to spin_lock();
spin_lock(), but PowerPC relies on smp_mb__after_unlock_lock().

Or does someone have a better proposal?

Thanx, Paul



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Will Deacon
On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> On Wed, Nov 11, 2015 at 11:14 PM, Boqun Feng  wrote:
> >
> > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> > only guarantees that the memory operations following spin_lock() can't
> > be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> > i.e. the case below can happen(assuming the spin_lock() is implemented
> > as ll/sc loop)
> >
> > spin_lock(&lock):
> >   r1 = *lock; // LL, r1 == 0
> > o = READ_ONCE(object); // could be reordered here.
> >   *lock = 1; // SC
> 
> It may be worth noting that at least in theory, not only reads may
> pass the store. If the spin-lock is done as
> 
> r1 = *lock   // read-acquire, r1 == 0
> *lock = 1    // SC
> 
> then even *writes* inside the locked region might pass up through the
> "*lock = 1".

Right, but only if we didn't have the control dependency to branch back
in the case that the SC failed. In that case, the lock would be broken
because you'd dive into the critical section even if you failed to take
the lock.
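
For concreteness, a minimal C-level sketch of that loop structure, with a weak
compare-and-swap standing in for the store-exclusive (the real lock is the LL/SC asm
quoted elsewhere in the thread; this only shows the branch back on failure, and the
sketch_ name is made up):

#include <stdatomic.h>

/* Sketch only: 0 = unlocked, 1 = locked. */
static void sketch_spin_lock(atomic_uint *lock)
{
	unsigned int expected;

	do {
		/* wait until the lock is observed free (the LL side) */
		while (atomic_load_explicit(lock, memory_order_acquire) != 0)
			;
		expected = 0;
		/*
		 * The retry below is the "branch back" being discussed: if
		 * the store-conditional (here, a weak CAS) fails, we must
		 * not fall into the critical section.
		 */
	} while (!atomic_compare_exchange_weak_explicit(lock, &expected, 1,
							memory_order_acquire,
							memory_order_relaxed));
}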

> So other CPU's - that haven't taken the spinlock - could see the
> modifications inside the critical region before they actually see the
> lock itself change.
> 
> Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> generally be that you have some external ordering guarantee that
> guarantees that the lock has been taken. For example, for the IPC
> semaphores, we do either one of:
> 
>  (a) get large lock, then - once you hold that lock - wait for each small lock
> 
> or
> 
>  (b) get small lock, then - once you hold that lock - check that the
> large lock is unlocked
> 
> and that's the case we should really worry about.  The other uses of
> spin_unlock_wait() should have similar "I have other reasons to know
> I've seen that the lock was taken, or will never be taken after this
> because XYZ".
> 
> This is why powerpc has a memory barrier in "arch_spin_is_locked()".
> Exactly so that the "check that the other lock is unlocked" is
> guaranteed to be ordered wrt the store that gets the first lock.
> 
> It looks like ARM64 gets this wrong and is fundamentally buggy wrt
> "spin_is_locked()" (and, as a result, "spin_unlock_wait()").

I don't see how a memory barrier would help us here, and Boqun's example
had smp_mb() on either side of the spin_unlock_wait(). What we actually need
is to make spin_unlock_wait more like a LOCK operation, so that it forces
parallel lockers to replay their LL/SC sequences.

I'll write a patch once I've heard more back from Paul about
smp_mb__after_unlock_lock().

> BUT! And this is a bug BUT:
> 
> It should be noted that that is purely an ARM64 bug. Not a bug in our
> users. If you have a spinlock where the "get lock write" part of the
> lock can be delayed, then you have to have a "arch_spin_is_locked()"
> that has the proper memory barriers.
> 
> Of course, ARM still hides their architecture manuals in odd places,
> so I can't double-check. But afaik, ARM64 store-conditional is "store
> exclusive with release", and it has only release semantics, and ARM64
> really does have the above bug.

The store-conditional in our spin_lock routine has no barrier semantics;
they are enforced by the load-exclusive-acquire that pairs with it.

> On that note: can anybody point me to the latest ARM64 8.1
> architecture manual in pdf form, without the "you have to register"
> crap? I thought ARM released it, but all my googling just points to
> the idiotic ARM service center that wants me to sign away something
> just to see the docs. Which I don't do.

We haven't yet released a document for the 8.1 instructions, but I can
certainly send you a copy when it's made available.

Will


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Will Deacon
On Thu, Nov 12, 2015 at 07:02:51AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 12, 2015 at 10:49:02PM +0800, Boqun Feng wrote:
> > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> > [snip]
> > > 
> > > I cannot resist suggesting that any lock that interacts with
> > > spin_unlock_wait() must have all relevant acquisitions followed by
> > > smp_mb__after_unlock_lock().
> > > 
> > 
> > But
> > 
> > 1.  This would expand the purpose of smp_mb__after_unlock_lock(),
> > right? smp_mb__after_unlock_lock() is for making UNLOCK-LOCK
> > pair global transitive rather than guaranteeing no operations
> > can be reorder before the STORE part of LOCK/ACQUIRE.
> 
> Indeed it would.  Which might be OK.
> 
> > 2.  If ARM64 has the same problem as PPC now,
> > smp_mb__after_unlock_lock() can't help, as it's a no-op on
> > ARM64.
> 
> Agreed, and that is why we need Will to weigh in.

I really don't want to implement smp_mb__after_unlock_lock, because we
don't need it based on its current definition and I think there's a better
way to fix spin_unlock_wait (see my other post).

Will


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Will Deacon
On Thu, Nov 12, 2015 at 10:59:06AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 12, 2015 at 08:33:02PM +0100, Oleg Nesterov wrote:
> > On 11/12, Peter Zijlstra wrote:
> > >
> > > On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote:
> > > > It seems that PPC needs to define smp_mb__before_spinlock() as full 
> > > > mb(),
> > > > no?
> > >
> > > It does:
> > >
> > > arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() 
> > > smp_mb()
> > 
> > Ah, indeed, thanks.
> > 
> > And given that it also defines smp_mb__after_unlock_lock() as smp_mb(),
> > I am starting to understand how it can help to avoid the races with
> > spin_unlock_wait() in (for example) do_exit().
> > 
> > But as Boqun has already mentioned, this means that mb__after_unlock_lock()
> > has the new meaning which should be documented.
> > 
> > Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be reverted
> > then ;)
> 
> Surprisingly, this reverts cleanly against today's mainline, please see
> the patch below.  Against my -rcu stack, not so much, but so it goes.  ;-)

I think we ended up concluding that smp_mb__after_unlock_lock is indeed
required, but I don't think we should just resurrect the old definition,
which doesn't keep UNLOCK -> LOCK distinct from RELEASE -> ACQUIRE. I'm
still working on documenting the different types of transitivity that we
identified in that thread, but it's slow going.

Also, as far as spin_unlock_wait is concerned, it is neither a LOCK nor
an UNLOCK and this barrier doesn't offer us anything. Sure, it might
work because PPC defines it as smp_mb(), but it doesn't help on arm64
and defining the macro is overkill for us in most places (i.e. RCU).

If we decide that the current usage of spin_unlock_wait is valid, then I
would much rather implement a version of it in the arm64 backend that
does something like:

 1:  ldrex r1, [&lock]
 if r1 indicates that lock is taken, branch back to 1b
 strex r1, [&lock]
 if store failed, branch back to 1b

i.e. we don't just test the lock, but we also write it back atomically
if we discover that it's free. That would then clear the exclusive monitor
on any cores in the process of taking the lock and restore the ordering
that we need.
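
A C-level rendering of the same idea, purely as a sketch (a weak compare-and-swap
stands in for the strex, the lock word is treated as simply 0 when free rather than
comparing ticket halves, and the sketch_ name is made up):

#include <stdatomic.h>

/*
 * Sketch only: spin until the lock is observed free, then write the
 * observed value back so that any CPU sitting between the LL and SC of
 * its lock acquisition is forced to replay.
 */
static void sketch_spin_unlock_wait(atomic_uint *lock)
{
	unsigned int val;

	for (;;) {
		val = atomic_load_explicit(lock, memory_order_acquire);
		if (val != 0)		/* lock looks taken: keep waiting */
			continue;
		/* failure means a concurrent locker got in; go round again */
		if (atomic_compare_exchange_weak_explicit(lock, &val, val,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return;
	}
}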

Will


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Will Deacon
[sorry for the late reply, I'm away from my desk until Monday since I'm
 busy with family issues]

On Thu, Nov 12, 2015 at 07:20:42AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 12, 2015 at 04:08:22PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 12, 2015 at 07:01:49AM -0800, Paul E. McKenney wrote:
> > > On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote:
> > > > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> > > > > I cannot resist suggesting that any lock that interacts with
> > > > > spin_unlock_wait() must have all relevant acquisitions followed by
> > > > > smp_mb__after_unlock_lock().
> > > > 
> > > > Ha! that would certainly help here. But it would mean that argh64v8 also
> > > > needs to define that, even though that is already RCsc.
> > > 
> > > Maybe.  It could also be that arm64 avoids the need somehow, for example
> > > via their RCsc behavior.  Their memory model is similar to PPC, but not
> > > exactly the same.
> > > 
> > > Will?
> > 
> > So when I spoke to Will earlier today, we agreed that LDAXR+STXR is
> > susceptible to the same problem. The STXR will allow loads to pass up
> > over that store.
> > 
> > On v8.1, which is using LDADDA, this is not an issue, for as the ACQUIRE
> > is part of the LOAD, the Read-Modify-Write is indivisible as a whole,
> > and therefore a subsequent load has nothing to pass over.
> 
> So one approach required for one level of hardware and another for the
> next level.  I can relate to that all too well...  :-/

Just to confirm, Peter's correct in that Boqun's litmus test is permitted
by the arm64 architecture when the ll/sc spinlock definitions are in use.

However, I don't think that strengthening smp_mb__after_unlock_lock is
the right way to solve this. I'll reply to the other part of the thread...

Will


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Paul E. McKenney
On Thu, Nov 12, 2015 at 08:33:02PM +0100, Oleg Nesterov wrote:
> On 11/12, Peter Zijlstra wrote:
> >
> > On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote:
> > > It seems that PPC needs to define smp_mb__before_spinlock() as full mb(),
> > > no?
> >
> > It does:
> >
> > arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() 
> > smp_mb()
> 
> Ah, indeed, thanks.
> 
> And given that it also defines smp_mb__after_unlock_lock() as smp_mb(),
> I am starting to understand how it can help to avoid the races with
> spin_unlock_wait() in (for example) do_exit().
> 
> But as Boqun has already mentioned, this means that mb__after_unlock_lock()
> has the new meaning which should be documented.
> 
> Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be reverted
> then ;)

Surprisingly, this reverts cleanly against today's mainline, please see
the patch below.  Against my -rcu stack, not so much, but so it goes.  ;-)

Thanx, Paul



commit eff0632b4181f91f2596d56f7c73194e1a869aff
Author: Paul E. McKenney 
Date:   Thu Nov 12 10:54:23 2015 -0800

Revert "rcu,locking: Privatize smp_mb__after_unlock_lock()"

This reverts commit 12d560f4ea87030667438a169912380be00cea4b.

The reason for this revert is that smp_mb__after_unlock_lock() might
prove useful outside of RCU after all for interactions between
the locking primitives and spin_unlock_wait().

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index aef9487303d0..d4501664d49f 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1855,10 +1855,16 @@ RELEASE are to the same lock variable, but only from the perspective of
 another CPU not holding that lock.  In short, a ACQUIRE followed by an
 RELEASE may -not- be assumed to be a full memory barrier.
 
-Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
-not imply a full memory barrier.  Therefore, the CPU's execution of the
-critical sections corresponding to the RELEASE and the ACQUIRE can cross,
-so that:
+Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
+imply a full memory barrier.  If it is necessary for a RELEASE-ACQUIRE
+pair to produce a full barrier, the ACQUIRE can be followed by an
+smp_mb__after_unlock_lock() invocation.  This will produce a full barrier
+(including transitivity) if either (a) the RELEASE and the ACQUIRE are
+executed by the same CPU or task, or (b) the RELEASE and ACQUIRE act on
+the same variable.  The smp_mb__after_unlock_lock() primitive is free
+on many architectures.  Without smp_mb__after_unlock_lock(), the CPU's
+execution of the critical sections corresponding to the RELEASE and the
+ACQUIRE can cross, so that:
 
*A = a;
RELEASE M
@@ -1896,6 +1902,29 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
a sleep-unlock race, but the locking primitive needs to resolve
such races properly in any case.
 
+With smp_mb__after_unlock_lock(), the two critical sections cannot overlap.
+For example, with the following code, the store to *A will always be
+seen by other CPUs before the store to *B:
+
+   *A = a;
+   RELEASE M
+   ACQUIRE N
+   smp_mb__after_unlock_lock();
+   *B = b;
+
+The operations will always occur in one of the following orders:
+
+   STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B
+   STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B
+   ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B
+
+If the RELEASE and ACQUIRE were instead both operating on the same lock
+variable, only the first of these alternatives can occur.  In addition,
+the more strongly ordered systems may rule out some of the above orders.
+But in any case, as noted earlier, the smp_mb__after_unlock_lock()
+ensures that the store to *A will always be seen as happening before
+the store to *B.
+
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
 anything at all - especially with respect to I/O accesses - unless combined
@@ -2126,6 +2155,40 @@ But it won't see any of:
*E, *F or *G following RELEASE Q
 
 
+However, if the following occurs:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	WRITE_ONCE(*A, a);
+	ACQUIRE M [1]
+	WRITE_ONCE(*B, b);
+	WRITE_ONCE(*C, c);
+	RELEASE M [1]
+	WRITE_ONCE(*D, d);		WRITE_ONCE(*E, e);
+					ACQUIRE M [2]
+					smp_mb__after_unlock_lock();
+					WRITE_ONCE(*F, f);
+  

Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Oleg Nesterov
On 11/12, Peter Zijlstra wrote:
>
> On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote:
> > It seems that PPC needs to define smp_mb__before_spinlock() as full mb(),
> > no?
>
> It does:
>
> arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() smp_mb()

Ah, indeed, thanks.

And given that it also defines smp_mb__after_unlock_lock() as smp_mb(),
I am starting to understand how it can help to avoid the races with
spin_unlock_wait() in (for example) do_exit().

But as Boqun has already mentioned, this means that mb__after_unlock_lock()
has the new meaning which should be documented.

Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be reverted
then ;)

Oleg.



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Linus Torvalds
On Wed, Nov 11, 2015 at 11:14 PM, Boqun Feng  wrote:
>
> Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> only guarantees that the memory operations following spin_lock() can't
> be reordered before the *LOAD* part of spin_lock(), not the *STORE* part,
> i.e. the case below can happen (assuming the spin_lock() is implemented
> as ll/sc loop)
>
> spin_lock(&lock):
>   r1 = *lock; // LL, r1 == 0
> o = READ_ONCE(object); // could be reordered here.
>   *lock = 1; // SC

It may be worth noting that at least in theory, not only reads may
pass the store. If the spin-lock is done as

r1 = *lock   // read-acquire, r1 == 0
*lock = 1    // SC

then even *writes* inside the locked region might pass up through the
"*lock = 1".

So other CPUs - that haven't taken the spinlock - could see the
modifications inside the critical region before they actually see the
lock itself change.

Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
generally be that you have some external ordering guarantee that
guarantees that the lock has been taken. For example, for the IPC
semaphores, we do either one of:

 (a) get large lock, then - once you hold that lock - wait for each small lock

or

 (b) get small lock, then - once you hold that lock - check that the
large lock is unlocked

and that's the case we should really worry about.  The other uses of
spin_unlock_wait() should have similar "I have other reasons to know
I've seen that the lock was taken, or will never be taken after this
because XYZ".

This is why powerpc has a memory barrier in "arch_spin_is_locked()".
Exactly so that the "check that the other lock is unlocked" is
guaranteed to be ordered wrt the store that gets the first lock.

It looks like ARM64 gets this wrong and is fundamentally buggy wrt
"spin_is_locked()" (and, as a result, "spin_unlock_wait()").

BUT! And this is a bug BUT:

It should be noted that that is purely an ARM64 bug. Not a bug in our
users. If you have a spinlock where the "get lock write" part of the
lock can be delayed, then you have to have a "arch_spin_is_locked()"
that has the proper memory barriers.

Of course, ARM still hides their architecture manuals in odd places,
so I can't double-check. But afaik, ARM64 store-conditional is "store
exclusive with release", and it has only release semantics, and ARM64
really does have the above bug.

On that note: can anybody point me to the latest ARM64 8.1
architecture manual in pdf form, without the "you have to register"
crap? I thought ARM released it, but all my googling just points to
the idiotic ARM service center that wants me to sign away something
just to see the docs. Which I don't do.

   Linus


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Peter Zijlstra
On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote:
> It seems that PPC needs to define smp_mb__before_spinlock() as full mb(),
> no?

It does:

arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() smp_mb()



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Oleg Nesterov
On 11/12, Boqun Feng wrote:
>
> On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote:
> >
> > No?
> >
>
> do_exit() is surely buggy if spin_lock() could work in this way.

OK, good ;) so we need to fix spin_lock() on PPC? Or add
mb__after_unlock_lock(), but this leads to other questions... Or
change do_exit() to do lock() + unlock().

>
> > And smp_mb__before_spinlock() looks wrong too then.
> >
>
> Maybe not? As smp_mb__before_spinlock() is used before a LOCK operation,
> which has both LOAD part and STORE part unlike spin_unlock_wait()?

Maybe not.

But let me remind that the original purpose of this mb__before_spinlock()
was to ensure that "CONDITION = true" before ttwu() can not be reordered
with

if (!(p->state & state))
goto out;   // do not wakeup


inside try_to_wake_up(). Otherwise

CONDITION = true;
try_to_wake_up(p);

can race with "p" doing

set_current_state(...); // implies mb();
if (CONDITION)
return;
schedule();

because try_to_wake_up() can read p->state before it sets CONDITION = 1
and then it won't wakeup "p" which has already checked this CONDITION.


Now. If try_to_wake_up() can read p->state before it writes to *pi_lock,
then how smp_mb__before_spinlock() == wmb() can help to serialize STORE
and LOAD?

It seems that PPC needs to define smp_mb__before_spinlock() as full mb(),
no?

Oleg.



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Paul E. McKenney
On Thu, Nov 12, 2015 at 10:49:02PM +0800, Boqun Feng wrote:
> On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> [snip]
> > 
> > I cannot resist suggesting that any lock that interacts with
> > spin_unlock_wait() must have all relevant acquisitions followed by
> > smp_mb__after_unlock_lock().
> > 
> 
> But
> 
> 1.This would expand the purpose of smp_mb__after_unlock_lock(),
>   right? smp_mb__after_unlock_lock() is for making UNLOCK-LOCK
>   pair global transitive rather than guaranteeing no operations
>   can be reorder before the STORE part of LOCK/ACQUIRE.

Indeed it would.  Which might be OK.

> 2.If ARM64 has the same problem as PPC now,
>   smp_mb__after_unlock_lock() can't help, as it's a no-op on
>   ARM64.

Agreed, and that is why we need Will to weigh in.

Thanx, Paul



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Paul E. McKenney
On Thu, Nov 12, 2015 at 04:08:22PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 12, 2015 at 07:01:49AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> > > > I cannot resist suggesting that any lock that interacts with
> > > > spin_unlock_wait() must have all relevant acquisitions followed by
> > > > smp_mb__after_unlock_lock().
> > > 
> > > Ha! that would certainly help here. But it would mean that argh64v8 also
> > > needs to define that, even though that is already RCsc.
> > 
> > Maybe.  It could also be that arm64 avoids the need somehow, for example
> > via their RCsc behavior.  Their memory model is similar to PPC, but not
> > exactly the same.
> > 
> > Will?
> 
> So when I spoke to Will earlier today, we agreed that LDAXR+STXR is
> susceptible to the same problem. The STXR will allow loads to pass up
> over that store.
> 
> On v8.1, which is using LDADDA, this is not an issue, for as the ACQUIRE
> is part of the LOAD, the Read-Modify-Write is indivisible as a whole,
> and therefore a subsequent load has nothing to pass over.

So one approach is required for one level of hardware and another for the
next.  I can relate to that all too well...  :-/

Thanx, Paul



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Boqun Feng
On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote:
> On 11/12, Boqun Feng wrote:
[snip]
> >
> > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> > only guarantees that the memory operations following spin_lock() can't
> > be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> > i.e. the case below can happen(assuming the spin_lock() is implemented
> > as ll/sc loop)
> >
> > spin_lock(&lock):
> >   r1 = *lock; // LL, r1 == 0
> > o = READ_ONCE(object); // could be reordered here.
> >   *lock = 1; // SC
> >
> > This could happen because of the ACQUIRE semantics of spin_lock(), and
> > the current implementation of spin_lock() on PPC allows this happen.
> >
> > (Cc PPC maintainers for their opinions on this one)
> 
> In this case the code above is obviously wrong. And I do not understand
> how we can rely on spin_unlock_wait() then.
> 
> And afaics do_exit() is buggy too then, see below.
> 
> > I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
> > a control dependency to pair with spin_unlock(), for example, the
> > following snippet in do_exit() is OK, except the smp_mb() is redundant,
> > unless I'm missing something subtle:
> >
> > /*
> >  * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> >  * when the following two conditions become true.
> >  *   - There is race condition of mmap_sem (It is acquired by
> >  * exit_mm()), and
> >  *   - SMI occurs before setting TASK_RUNINNG.
> >  * (or hypervisor of virtual machine switches to other guest)
> >  *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> >  *
> >  * To avoid it, we have to wait for releasing tsk->pi_lock which
> >  * is held by try_to_wake_up()
> >  */
> > smp_mb();
> > raw_spin_unlock_wait(&tsk->pi_lock);
> 
> Perhaps it is me who missed something. But I don't think we can remove
> this mb(). And at the same time it can't help on PPC if I understand

You are right, we need this smp_mb() to order the previous STORE of
->state with the LOAD of ->pi_lock. I missed that part because I saw all
the explicit STOREs of ->state in do_exit() are set_current_state()
which has a smp_mb() following the STOREs.

> your explanation above correctly.
> 
> To simplify, lets ignore exit_mm/down_read/etc. The exiting task does
> 
> 
>   current->state = TASK_UNINTERRUPTIBLE;
>   // without schedule() in between
>   current->state = TASK_RUNNING;
> 
>   smp_mb();
>   spin_unlock_wait(pi_lock);
> 
>   current->state = TASK_DEAD;
>   schedule();
> 
> and we need to ensure that if we race with try_to_wake_up(TASK_UNINTERRUPTIBLE)
> it can't change TASK_DEAD back to RUNNING.
> 
> Without smp_mb() this can be reordered, spin_unlock_wait(pi_locked) can
> read the old "unlocked" state of pi_lock before we set UNINTERRUPTIBLE,
> so in fact we could have
> 
>   current->state = TASK_UNINTERRUPTIBLE;
>   
>   spin_unlock_wait(pi_lock);
> 
>   current->state = TASK_RUNNING;
> 
>   current->state = TASK_DEAD;
> 
> and this can obviously race with ttwu() which can take pi_lock and see
> state == TASK_UNINTERRUPTIBLE after spin_unlock_wait().
> 

Yep, my mistake ;-)

> And, if I understand you correctly, this smp_mb() can't help on PPC.
> try_to_wake_up() can read task->state before it writes to *pi_lock.
> To me this doesn't really differ from the code above,
> 
>   CPU 1 (do_exit)                       CPU_2 (ttwu)
> 
>                                         spin_lock(pi_lock):
>                                           r1 = *pi_lock; // r1 == 0;
>   p->state = TASK_UNINTERRUPTIBLE;
>                                         state = p->state;
>   p->state = TASK_RUNNING;
>   mb();
>   spin_unlock_wait();
>                                         *pi_lock = 1;
> 
>   p->state = TASK_DEAD;
>                                         if (state & TASK_UNINTERRUPTIBLE) // true
>                                                 p->state = RUNNING;
> 
> No?
> 

do_exit() is surely buggy if spin_lock() could work in this way.

> And smp_mb__before_spinlock() looks wrong too then.
> 

Maybe not? As smp_mb__before_spinlock() is used before a LOCK operation,
which has both LOAD part and STORE part unlike spin_unlock_wait()?

> Oleg.
> 




Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Peter Zijlstra
On Thu, Nov 12, 2015 at 07:01:49AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> > > I cannot resist suggesting that any lock that interacts with
> > > spin_unlock_wait() must have all relevant acquisitions followed by
> > > smp_mb__after_unlock_lock().
> > 
> > Ha! that would certainly help here. But it would mean that argh64v8 also
> > needs to define that, even though that is already RCsc.
> 
> Maybe.  It could also be that arm64 avoids the need somehow, for example
> via their RCsc behavior.  Their memory model is similar to PPC, but not
> exactly the same.
> 
> Will?

So when I spoke to Will earlier today, we agreed that LDAXR+STXR is
susceptible to the same problem. The STXR will allow loads to pass up
over that store.

On v8.1, which is using LDADDA, this is not an issue, for as the ACQUIRE
is part of the LOAD, the Read-Modify-Write is indivisible as a whole,
and therefore a subsequent load has nothing to pass over.


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Paul E. McKenney
On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> > I cannot resist suggesting that any lock that interacts with
> > spin_unlock_wait() must have all relevant acquisitions followed by
> > smp_mb__after_unlock_lock().
> 
> Ha! that would certainly help here. But it would mean that argh64v8 also
> needs to define that, even though that is already RCsc.

Maybe.  It could also be that arm64 avoids the need somehow, for example
via their RCsc behavior.  Their memory model is similar to PPC, but not
exactly the same.

Will?

Thanx, Paul



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Peter Zijlstra
On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> I cannot resist suggesting that any lock that interacts with
> spin_unlock_wait() must have all relevant acquisitions followed by
> smp_mb__after_unlock_lock().

Ha! that would certainly help here. But it would mean that argh64v8 also
needs to define that, even though that is already RCsc.


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Boqun Feng
On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
[snip]
> 
> I cannot resist suggesting that any lock that interacts with
> spin_unlock_wait() must have all relevant acquisitions followed by
> smp_mb__after_unlock_lock().
> 

But

1.  This would expand the purpose of smp_mb__after_unlock_lock(),
right? smp_mb__after_unlock_lock() is for making an UNLOCK-LOCK
pair globally transitive rather than guaranteeing that no operations
can be reordered before the STORE part of LOCK/ACQUIRE.

2.  If ARM64 has the same problem as PPC now,
smp_mb__after_unlock_lock() can't help, as it's a no-op on
ARM64.

Regards,
Boqun




Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Paul E. McKenney
On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote:
> On 11/12, Boqun Feng wrote:
> >
> > On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> > >
> > >   object_t *object;
> > >   spinlock_t lock;
> > >
> > >   void update(void)
> > >   {
> > >   object_t *o;
> > >
> > >   spin_lock(&lock);
> > >   o = READ_ONCE(object);
> > >   if (o) {
> > >   BUG_ON(o->dead);
> > >   do_something(o);
> > >   }
> > >   spin_unlock(&lock);
> > >   }
> > >
> > >   void destroy(void) // can be called only once, can't race with itself
> > >   {
> > >   object_t *o;
> > >
> > >   o = object;
> > >   object = NULL;
> > >
> > >   /*
> > >* pairs with lock/ACQUIRE. The next update() must see
> > >* object == NULL after spin_lock();
> > >*/
> > >   smp_mb();
> > >
> > >   spin_unlock_wait(&lock);
> > >
> > >   /*
> > >* pairs with unlock/RELEASE. The previous update() has
> > >* already passed BUG_ON(o->dead).
> > >*
> > >* (Yes, yes, in this particular case it is not needed,
> > >*  we can rely on the control dependency).
> > >*/
> > >   smp_mb();
> > >
> > >   o->dead = true;
> > >   }
> > >
> > > I believe the code above is correct and it needs the barriers on both 
> > > sides.
> > >
> >
> > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> > only guarantees that the memory operations following spin_lock() can't
> > be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> > i.e. the case below can happen(assuming the spin_lock() is implemented
> > as ll/sc loop)
> >
> > spin_lock(&lock):
> >   r1 = *lock; // LL, r1 == 0
> > o = READ_ONCE(object); // could be reordered here.
> >   *lock = 1; // SC
> >
> > This could happen because of the ACQUIRE semantics of spin_lock(), and
> > the current implementation of spin_lock() on PPC allows this happen.
> >
> > (Cc PPC maintainers for their opinions on this one)
> 
> In this case the code above is obviously wrong. And I do not understand
> how we can rely on spin_unlock_wait() then.
> 
> And afaics do_exit() is buggy too then, see below.
> 
> > I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
> > a control dependency to pair with spin_unlock(), for example, the
> > following snippet in do_exit() is OK, except the smp_mb() is redundant,
> > unless I'm missing something subtle:
> >
> > /*
> >  * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> >  * when the following two conditions become true.
> >  *   - There is race condition of mmap_sem (It is acquired by
> >  * exit_mm()), and
> >  *   - SMI occurs before setting TASK_RUNINNG.
> >  * (or hypervisor of virtual machine switches to other guest)
> >  *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> >  *
> >  * To avoid it, we have to wait for releasing tsk->pi_lock which
> >  * is held by try_to_wake_up()
> >  */
> > smp_mb();
> > raw_spin_unlock_wait(&tsk->pi_lock);
> 
> Perhaps it is me who missed something. But I don't think we can remove
> this mb(). And at the same time it can't help on PPC if I understand
> your explanation above correctly.

I cannot resist suggesting that any lock that interacts with
spin_unlock_wait() must have all relevant acquisitions followed by
smp_mb__after_unlock_lock().
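
(A sketch of what that rule would look like at an acquisition site; 'p'
is just a stand-in for whatever structure owns the lock:)

	raw_spin_lock(&p->pi_lock);
	/*
	 * Upgrade the ACQUIRE so the store that took the lock is a full
	 * barrier against the accesses below; a concurrent
	 * spin_unlock_wait() on this lock can then rely on it.
	 */
	smp_mb__after_unlock_lock();
	long state = READ_ONCE(p->state);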

Thanx, Paul



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Oleg Nesterov
On 11/12, Boqun Feng wrote:
>
> On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> >
> > object_t *object;
> > spinlock_t lock;
> >
> > void update(void)
> > {
> > object_t *o;
> >
> > spin_lock(&lock);
> > o = READ_ONCE(object);
> > if (o) {
> > BUG_ON(o->dead);
> > do_something(o);
> > }
> > spin_unlock(&lock);
> > }
> >
> > void destroy(void) // can be called only once, can't race with itself
> > {
> > object_t *o;
> >
> > o = object;
> > object = NULL;
> >
> > /*
> >  * pairs with lock/ACQUIRE. The next update() must see
> >  * object == NULL after spin_lock();
> >  */
> > smp_mb();
> >
> > spin_unlock_wait(&lock);
> >
> > /*
> >  * pairs with unlock/RELEASE. The previous update() has
> >  * already passed BUG_ON(o->dead).
> >  *
> >  * (Yes, yes, in this particular case it is not needed,
> >  *  we can rely on the control dependency).
> >  */
> > smp_mb();
> >
> > o->dead = true;
> > }
> >
> > I believe the code above is correct and it needs the barriers on both sides.
> >
>
> Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> only guarantees that the memory operations following spin_lock() can't
> be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> i.e. the case below can happen(assuming the spin_lock() is implemented
> as ll/sc loop)
>
>   spin_lock(&lock):
> r1 = *lock; // LL, r1 == 0
>   o = READ_ONCE(object); // could be reordered here.
> *lock = 1; // SC
>
> This could happen because of the ACQUIRE semantics of spin_lock(), and
> the current implementation of spin_lock() on PPC allows this happen.
>
> (Cc PPC maintainers for their opinions on this one)

In this case the code above is obviously wrong. And I do not understand
how we can rely on spin_unlock_wait() then.

And afaics do_exit() is buggy too then, see below.

> I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
> a control dependency to pair with spin_unlock(), for example, the
> following snippet in do_exit() is OK, except the smp_mb() is redundant,
> unless I'm missing something subtle:
>
>   /*
>* The setting of TASK_RUNNING by try_to_wake_up() may be delayed
>* when the following two conditions become true.
>*   - There is race condition of mmap_sem (It is acquired by
>* exit_mm()), and
>*   - SMI occurs before setting TASK_RUNINNG.
>* (or hypervisor of virtual machine switches to other guest)
>*  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
>*
>* To avoid it, we have to wait for releasing tsk->pi_lock which
>* is held by try_to_wake_up()
>*/
>   smp_mb();
>   raw_spin_unlock_wait(&tsk->pi_lock);

Perhaps it is me who missed something. But I don't think we can remove
this mb(). And at the same time it can't help on PPC if I understand
your explanation above correctly.

To simplify, lets ignore exit_mm/down_read/etc. The exiting task does


current->state = TASK_UNINTERRUPTIBLE;
// without schedule() in between
current->state = TASK_RUNNING;

smp_mb();
spin_unlock_wait(pi_lock);

current->state = TASK_DEAD;
schedule();

and we need to ensure that if we race with try_to_wake_up(TASK_UNINTERRUPTIBLE)
it can't change TASK_DEAD back to RUNNING.

Without smp_mb() this can be reordered, spin_unlock_wait(pi_locked) can
read the old "unlocked" state of pi_lock before we set UNINTERRUPTIBLE,
so in fact we could have

current->state = TASK_UNINTERRUPTIBLE;

spin_unlock_wait(pi_lock);

current->state = TASK_RUNNING;

current->state = TASK_DEAD;

and this can obviously race with ttwu() which can take pi_lock and see
state == TASK_UNINTERRUPTIBLE after spin_unlock_wait().

And, if I understand you correctly, this smp_mb() can't help on PPC.
try_to_wake_up() can read task->state before it writes to *pi_lock.
To me this doesn't really differ from the code above,

CPU 1 (do_exit)                         CPU_2 (ttwu)

                                        spin_lock(pi_lock):
                                          r1 = *pi_lock; // r1 == 0;
p->state = TASK_UNINTERRUPTIBLE;
                                        state = p->state;
p->state = TASK_RUNNING;
mb();
spin_unlock_wait();
                                        *pi_lock = 1;

p->state = TASK_DEAD;
                                        if (state & TASK_UNINTERRUPTIBLE) // true
                                                p->state = RUNNING;

No?

And smp_mb__before_spinlock() looks wrong too then.

Oleg.

Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Paul E. McKenney
On Wed, Nov 11, 2015 at 06:34:56PM +0800, Boqun Feng wrote:
> On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> > Hi Oleg,
> > 
> > On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote:
> > [snip]
> > > 
> > > Unfortunately this doesn't look exactly right...
> > > 
> > > spin_unlock_wait() is not equal to "while (locked) relax", the latter
> > > is live-lockable or at least sub-optimal: we do not really need to spin
> > 
> > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
> 
> Hmm.. I guess I was wrong, it doesn't need to be an ACQUIRE, it needs
> only to use the control dependency to order the load of lock state and
> stores following it.

I must say that spin_unlock_wait() and friends are a bit tricky.
There are only a few uses in drivers/ata/libata-eh.c, ipc/sem.c,
kernel/exit.c, kernel/sched/completion.c, and kernel/task_work.c.
Almost all of these seem to assume that spin_unlock_wait() provides
no ordering.  We might have just barely enough uses to produce useful
abstractions, but my guess is that it would not hurt to wait.

> > Because spin_unlock_wait() is used for us to wait for a certain lock to
> > RELEASE so that we can do something *after* we observe the RELEASE.
> > Considering the follow example:
> > 
> > CPU 0   CPU 1
> > ===
> > { X = 0 }
> > WRITE_ONCE(X, 1);
> > spin_unlock(&lock);
> > spin_unlock_wait(&lock)
> > r1 = READ_ONCE(X);
> > 
> > If spin_unlock_wait() is not an ACQUIRE, r1 can be 0 in this case,
> > right? Am I missing something subtle here? Or spin_unlock_wait() itself
> > doesn't have the ACQUIRE semantics, but it should always come with a
> > smp_mb() *following* it to achieve the ACQUIRE semantics? However in
> > do_exit(), an smp_mb() is preceding raw_spin_unlock_wait() rather than
> > following, which makes me confused... could you explain that? Thank you
> > ;-)
> > 
> 
> But still, there is one suspicious use of smp_mb() in do_exit():
> 
>   /*
>* The setting of TASK_RUNNING by try_to_wake_up() may be delayed
>* when the following two conditions become true.
>*   - There is race condition of mmap_sem (It is acquired by
>* exit_mm()), and
>*   - SMI occurs before setting TASK_RUNINNG.
>* (or hypervisor of virtual machine switches to other guest)
>*  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
>*
>* To avoid it, we have to wait for releasing tsk->pi_lock which
>* is held by try_to_wake_up()
>*/
>   smp_mb();
>   raw_spin_unlock_wait(&tsk->pi_lock);
> 
>   /* causes final put_task_struct in finish_task_switch(). */
>   tsk->state = TASK_DEAD;
>   tsk->flags |= PF_NOFREEZE;  /* tell freezer to ignore us */
>   schedule();
> 
> Seems like smp_mb() doesn't need here? Because the control dependency
> already orders load of tsk->pi_lock and store of tsk->state, and this
> control dependency order guarantee pairs with the spin_unlock(->pi_lock)
> in try_to_wake_up() to avoid data race on ->state.

The exit() path is pretty heavyweight, so I suspect that an extra smp_mb()
is down in the noise.  Or are you saying that this is somehow unsafe?

Thanx, Paul

> Regards,
> Boqun
> 
> > Regards,
> > Boqun
> > 
> > > until we observe !spin_is_locked(), we only need to synchronize with the
> > > current owner of this lock. Once it drops the lock we can proceed, we
> > > do not care if another thread takes the same lock right after that.
> > > 
> > > Oleg.
> > > 
> 
> 




Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-12 Thread Peter Zijlstra
On Thu, Nov 12, 2015 at 03:14:51PM +0800, Boqun Feng wrote:
> Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> only guarantees that the memory operations following spin_lock() can't
> be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> i.e. the case below can happen(assuming the spin_lock() is implemented
> as ll/sc loop)
> 
>   spin_lock(&lock):
> r1 = *lock; // LL, r1 == 0
>   o = READ_ONCE(object); // could be reordered here.
> *lock = 1; // SC
> 
> This could happen because of the ACQUIRE semantics of spin_lock(), and 
> the current implementation of spin_lock() on PPC allows this happen.

Urgh, you just _had_ to send an email like this, didn't you ;-)

I think AARGH64 does the same thing. They either use LDAXR/STXR, which
also places the ACQUIRE on the load, or use LDADDA (v8.1) which I
suspect does the same, but I'm not entirely sure how to decode these 8.1
atomics yet.

Let me think a little more on this..




Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-11 Thread Boqun Feng
On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> On 11/11, Peter Zijlstra wrote:
> >
> > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> >
> > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
> >
> > I did wonder the same thing, it would simplify a number of things if
> > this were so.
> 
> Yes, me too.
> 
> Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
> IOW, it should be "equivalent" to spin_lock() + spin_unlock().
> 
> Consider this code:
> 
>   object_t *object;
>   spinlock_t lock;
> 
>   void update(void)
>   {
>   object_t *o;
> 
>   spin_lock(&lock);
>   o = READ_ONCE(object);
>   if (o) {
>   BUG_ON(o->dead);
>   do_something(o);
>   }
>   spin_unlock(&lock);
>   }
> 
>   void destroy(void) // can be called only once, can't race with itself
>   {
>   object_t *o;
> 
>   o = object;
>   object = NULL;
> 
>   /*
>* pairs with lock/ACQUIRE. The next update() must see
>* object == NULL after spin_lock();
>*/
>   smp_mb();
> 
>   spin_unlock_wait(&lock);
> 
>   /*
>* pairs with unlock/RELEASE. The previous update() has
>* already passed BUG_ON(o->dead).
>*
>* (Yes, yes, in this particular case it is not needed,
>*  we can rely on the control dependency).
>*/
>   smp_mb();
> 
>   o->dead = true;
>   }
> 
> I believe the code above is correct and it needs the barriers on both sides.
> 

Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
only guarantees that the memory operations following spin_lock() can't
be reordered before the *LOAD* part of spin_lock(), not the *STORE* part,
i.e. the case below can happen (assuming the spin_lock() is implemented
as an ll/sc loop)

spin_lock(&lock):
  r1 = *lock; // LL, r1 == 0
o = READ_ONCE(object); // could be reordered here.
  *lock = 1; // SC

This could happen because of the ACQUIRE semantics of spin_lock(), and 
the current implementation of spin_lock() on PPC allows this to happen.

(Cc PPC maintainers for their opinions on this one)

Therefore the case below can happen:

CPU 1                           CPU 2                           CPU 3
==============                  ==============                  ==============
                                                                spin_unlock(&lock);
spin_lock(&lock):
  r1 = *lock; // r1 == 0;
o = READ_ONCE(object); // reordered here
                                object = NULL;
                                smp_mb();
                                spin_unlock_wait(&lock);
  *lock = 1;
                                smp_mb();
                                o->dead = true;
if (o) // true
  BUG_ON(o->dead); // true!!


To show this, I also translate this situation into a PPC litmus for
herd[1]:

PPC spin-lock-wait
"
r1: local variable of lock
r2: constant 1
r3: constant 0 or NULL
r4: local variable of object, i.e. o
r5: local variable of *o (simulate ->dead as I didn't know 
how to write fields of structure in herd ;-()
r13: the address of lock, i.e. &lock
r14: the address of object, i.e. &object
"
{
0:r1=0;0:r2=1;0:r3=0;0:r13=lock;0:r14=object;
1:r1=0;1:r2=1;1:r3=0;1:r4=0;1:r5=0;1:r13=lock;1:r14=object;
2:r1=0;2:r13=lock;
lock=1; object=old; old=0;
}

P0  | P1 | P2 ;
ld r4,0(r14)| Lock:  | stw r1,0(r13);
std r3,0(r14)   | lwarx r1,r3,r13| ;
| cmpwi r1,0 | ;
sync| bne Lock   | ;
| stwcx. r2,r3,r13   | ;
Wait:   | bne Lock   | ;
lwz r1,0(r13)   | lwsync | ;
cmpwi r1,0  | ld r4,0(r14)   | ;
bne Wait| cmpwi r4,0 | ;
| beq Fail   | ;
sync| lwz r5, 0(r4)  | ;
stw r2,0(r4)| Fail:  | ;
| lwsync | ;
| stw r3, 0(r13) | ;

exists
(1:r4=old /\ 1:r5=1)

,whose result says that (1:r4=old /\ 1:r5=1) can happen:

Test spin-lock-wait Allowed
States 3
1:r4=0; 1:r5=0;
1:r4=old; 1:r5=0;
1:r4=old; 1:r5=1;
Loop Ok
Witnesses
Positive: 18 Negative: 108
Condition exists (1:r4=old /\ 1:r5=1)
Observation spin-lock-wait Sometimes 18 108
Hash=244f8

Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-11 Thread Linus Torvalds
On Wed, Nov 11, 2015 at 11:39 AM, Oleg Nesterov  wrote:
> On 11/11, Peter Zijlstra wrote:
>
> Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
> IOW, it should be "equivalent" to spin_lock() + spin_unlock().

That's insane.

"Release" semantics are - by definition - about stuff that *predeces* it.

It is simply not sensible to have a "wait_for_unlock()" that then
synchronizes loads or stores that happened *before* the wait. That's
some crazy voodoo programming.

And if you do need to have model where you do "store something, then
make sure that the unlock we wait for happens after the store", then
you need to just add a "smp_mb()" in between that store and the
waiting for unlock.

Or just take the damn lock, and don't play any games at all.

   Linus


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-11 Thread Oleg Nesterov
Hi Boqun,

Let me first state that I can't answer authoritatively when it comes to
barriers. That said,

On 11/11, Boqun Feng wrote:
>
> But still, there is one suspicious use of smp_mb() in do_exit():
>
>   /*
>* The setting of TASK_RUNNING by try_to_wake_up() may be delayed
>* when the following two conditions become true.
>*   - There is race condition of mmap_sem (It is acquired by
>* exit_mm()), and
>*   - SMI occurs before setting TASK_RUNINNG.
>* (or hypervisor of virtual machine switches to other guest)
>*  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
>*
>* To avoid it, we have to wait for releasing tsk->pi_lock which
>* is held by try_to_wake_up()
>*/
>   smp_mb();
>   raw_spin_unlock_wait(&tsk->pi_lock);
>
>   /* causes final put_task_struct in finish_task_switch(). */
>   tsk->state = TASK_DEAD;
>   tsk->flags |= PF_NOFREEZE;  /* tell freezer to ignore us */
>   schedule();
>
> Seems like smp_mb() doesn't need here?

Please see my reply to peterz's email.

AFAICS, we need the barriers on both sides. But, since we only need to
STORE into tsk->state after unlock_wait(), we can rely on the control
dependency and avoid the 2nd mb().

Oleg.



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-11 Thread Oleg Nesterov
On 11/11, Peter Zijlstra wrote:
>
> On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
>
> > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
>
> I did wonder the same thing, it would simplify a number of things if
> this were so.

Yes, me too.

Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
IOW, it should be "equivalent" to spin_lock() + spin_unlock().

Consider this code:

object_t *object;
spinlock_t lock;

void update(void)
{
object_t *o;

spin_lock(&lock);
o = READ_ONCE(object);
if (o) {
BUG_ON(o->dead);
do_something(o);
}
spin_unlock(&lock);
}

void destroy(void) // can be called only once, can't race with itself
{
object_t *o;

o = object;
object = NULL;

/*
 * pairs with lock/ACQUIRE. The next update() must see
 * object == NULL after spin_lock();
 */
smp_mb();

spin_unlock_wait(&lock);

/*
 * pairs with unlock/RELEASE. The previous update() has
 * already passed BUG_ON(o->dead).
 *
 * (Yes, yes, in this particular case it is not needed,
 *  we can rely on the control dependency).
 */
smp_mb();

o->dead = true;
}

I believe the code above is correct and it needs the barriers on both sides.

If it is wrong, then task_work_run() is buggy: it relies on mb() implied by
cmpxchg() before spin_unlock_wait() the same way: the next task_work_cancel()
should see the result of our cmpxchg(), it must not try to read work->next or
work->func.

Hmm. Not sure I really understand what I am trying to say... Perhaps in fact
I mean that unlock_wait() should be removed because it is too subtle for me ;)

Oleg.



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-11 Thread Peter Zijlstra
On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:

> Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?

I did wonder the same thing, it would simplify a number of things if
this were so.


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-11 Thread Boqun Feng
On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> Hi Oleg,
> 
> On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote:
> [snip]
> > 
> > Unfortunately this doesn't look exactly right...
> > 
> > spin_unlock_wait() is not equal to "while (locked) relax", the latter
> > is live-lockable or at least sub-optimal: we do not really need to spin
> 
> Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?

Hmm.. I guess I was wrong, it doesn't need to be an ACQUIRE, it needs
only to use the control dependency to order the load of lock state and
stores following it.
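
(A sketch of that idea, assuming raw_spin_is_locked() is a plain load of
the lock word and 'task' is just a stand-in:)

	/*
	 * The conditional branch on the loaded lock value is a control
	 * dependency: the store below cannot be made visible before the
	 * loads in the loop have resolved.
	 */
	while (raw_spin_is_locked(&task->pi_lock))
		cpu_relax();
	WRITE_ONCE(task->state, TASK_DEAD);	/* ordered after the loads above */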

> Because spin_unlock_wait() is used for us to wait for a certain lock to
> RELEASE so that we can do something *after* we observe the RELEASE.
> Considering the follow example:
> 
>   CPU 0   CPU 1
>   ===
>   { X = 0 }
>   WRITE_ONCE(X, 1);
>   spin_unlock(&lock);
>   spin_unlock_wait(&lock)
>   r1 = READ_ONCE(X);
> 
> If spin_unlock_wait() is not an ACQUIRE, r1 can be 0 in this case,
> right? Am I missing something subtle here? Or spin_unlock_wait() itself
> doesn't have the ACQUIRE semantics, but it should always come with a
> smp_mb() *following* it to achieve the ACQUIRE semantics? However in
> do_exit(), an smp_mb() is preceding raw_spin_unlock_wait() rather than
> following, which makes me confused... could you explain that? Thank you
> ;-)
> 

But still, there is one suspicious use of smp_mb() in do_exit():

/*
 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
 * when the following two conditions become true.
 *   - There is race condition of mmap_sem (It is acquired by
 * exit_mm()), and
 *   - SMI occurs before setting TASK_RUNINNG.
 * (or hypervisor of virtual machine switches to other guest)
 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
 *
 * To avoid it, we have to wait for releasing tsk->pi_lock which
 * is held by try_to_wake_up()
 */
smp_mb();
raw_spin_unlock_wait(&tsk->pi_lock);

/* causes final put_task_struct in finish_task_switch(). */
tsk->state = TASK_DEAD;
tsk->flags |= PF_NOFREEZE;  /* tell freezer to ignore us */
schedule();

Seems like smp_mb() doesn't need here? Because the control dependency
already orders load of tsk->pi_lock and store of tsk->state, and this
control dependency order guarantee pairs with the spin_unlock(->pi_lock)
in try_to_wake_up() to avoid data race on ->state.

Regards,
Boqun

> Regards,
> Boqun
> 
> > until we observe !spin_is_locked(), we only need to synchronize with the
> > current owner of this lock. Once it drops the lock we can proceed, we
> > do not care if another thread takes the same lock right after that.
> > 
> > Oleg.
> > 






Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-11 Thread Boqun Feng
Hi Oleg,

On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote:
[snip]
> 
> Unfortunately this doesn't look exactly right...
> 
> spin_unlock_wait() is not equal to "while (locked) relax", the latter
> is live-lockable or at least sub-optimal: we do not really need to spin

Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
Because spin_unlock_wait() is used for us to wait for a certain lock to
RELEASE so that we can do something *after* we observe the RELEASE.
Considering the follow example:

CPU 0                           CPU 1
==============                  ==============
{ X = 0 }
WRITE_ONCE(X, 1);
spin_unlock(&lock);
                                spin_unlock_wait(&lock)
                                r1 = READ_ONCE(X);

If spin_unlock_wait() is not an ACQUIRE, r1 can be 0 in this case,
right? Am I missing something subtle here? Or spin_unlock_wait() itself
doesn't have the ACQUIRE semantics, but it should always come with a
smp_mb() *following* it to achieve the ACQUIRE semantics? However in
do_exit(), an smp_mb() is preceding raw_spin_unlock_wait() rather than
following, which makes me confused... could you explain that? Thank you
;-)

Regards,
Boqun

> until we observe !spin_is_locked(), we only need to synchronize with the
> current owner of this lock. Once it drops the lock we can proceed, we
> do not care if another thread takes the same lock right after that.
> 
> Oleg.
> 




Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-04 Thread Paul E. McKenney
On Tue, Nov 03, 2015 at 08:43:22PM -0800, Linus Torvalds wrote:
> On Tue, Nov 3, 2015 at 7:57 PM, Paul E. McKenney
>  wrote:
> >
> > Thank you, and yes, it clearly states that read-to-write dependencies
> > are ordered.
> 
> Well, I wouldn't say that it's exactly "clear".
> 
> The fact that they explicitly say "Note that the DP relation does not
> directly impose a BEFORE (⇐) ordering between accesses u and v" makes
> it all clear as mud.
> 
> They *then* go on to talk about how the DP relationship *together*
> with the odd "is source of" ordering (which in turn is defined in
> terms of BEFORE ordering) cannot have cycles.
> 
> I have no idea why they do it that way, but the reason seems to be
> that they wanted to make "BEFORE" be purely about barriers and
> accesses, and make the other orderings be described separately. So the
> "BEFORE" ordering is used to define how memory must act, which is then
> used as a basis for that storage definition and the "is source of"
> thing.
> 
> But none of that seems to make much sense to a *user*.
> 
> The fact that they seem to equate "BEFORE" with "Processor Issue
> Constraints" also makes me think that the whole logic was written by a
> CPU designer, and part of why they document it that way is that the
> CPU designer literally thought of "can I issue this access" as being
> very different from "is there some inherent ordering that just results
> from issues outside of my design".
> 
> I really don't know. That whole series of memory ordering rules makes
> my head hurt.

The honest answer is that for Alpha, I don't know either.  But if your
head isn't hurting enough yet, feel free to read on...



My guess, based loosely on ARM and PowerPC, is that memory barriers
provide global ordering but that the load-to-store dependency ordering is
strictly local.  Which as you say does not necessarily make much sense
to a user.  One guess is that the following would never trigger the
BUG_ON() given x, y, and z initially zero (and ignoring the possibility
of compiler mischief):

CPU 0           CPU 1           CPU 2
r1 = x;         r2 = y;         r3 = z;
if (r1)         if (r2)         if (r3)
  y = 1;          z = 1;          x = 1;
BUG_ON(r1 == 1 && r2 == 1 && r3 == 1); /* after the dust settles */

The write-to-read relationships prevent misordering.  The copy of the
Alpha manual I downloaded hints that this BUG_ON() could not fire.

However, the following might well trigger:

CPU 0           CPU 1           CPU 2
x = 1;          r1 = x;         r2 = y;
                if (r1)         if (r2)
                  y = 1;          x = 2;
BUG_ON(r1 == 1 && r2 == 1 && x == 1); /* after the dust settles */

The dependency ordering orders each CPU individually, but might not force
CPU 0's write to reach CPU 1 and CPU 2 at the same time.  So the BUG_ON()
case would happen if CPU 0's write to x reached CPU 1 before it reached
CPU 2, in which case the x==2 value might not be seen outside of CPU 2,
so that everyone agrees on the order of values taken on by x.

And this could be prevented by enforcing global (rather than local)
ordering by placing a memory barrier in CPU 1:

CPU 0           CPU 1           CPU 2
x = 1;          r1 = x;         r2 = y;
                smp_mb();       if (r2)
                if (r1)           x = 2;
                  y = 1;
BUG_ON(r1 == 1 && r2 == 1 && x == 1); /* after the dust settles */

But the reference manual doesn't have this sort of litmus test, so who
knows?  CCing the Alpha maintainers in case they know.

> But I do think the only thing that matters in the end is that they do
> have that DP relationship between reads and subsequently dependent
> writes, but  basically not for *any* other relationship.
> 
> So on alpha read-vs-read, write-vs-later-read, and write-vs-write all
> have to have memory barriers, unless the accesses physically overlap.

Agreed.

Thanx, Paul



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-03 Thread Linus Torvalds
On Tue, Nov 3, 2015 at 7:57 PM, Paul E. McKenney
 wrote:
>
> Thank you, and yes, it clearly states that read-to-write dependencies
> are ordered.

Well, I wouldn't say that it's exactly "clear".

The fact that they explicitly say "Note that the DP relation does not
directly impose a BEFORE (⇐) ordering between accesses u and v" makes
it all clear as mud.

They *then* go on to talk about how the DP relationship *together*
with the odd "is source of" ordering (which in turn is defined in
terms of BEFORE ordering) cannot have cycles.

I have no idea why they do it that way, but the reason seems to be
that they wanted to make "BEFORE" be purely about barriers and
accesses, and make the other orderings be described separately. So the
"BEFORE" ordering is used to define how memory must act, which is then
used as a basis for that storage definition and the "is source of"
thing.

But none of that seems to make much sense to a *user*.

The fact that they seem to equate "BEFORE" with "Processor Issue
Constraints" also makes me think that the whole logic was written by a
CPU designer, and part of why they document it that way is that the
CPU designer literally thought of "can I issue this access" as being
very different from "is there some inherent ordering that just results
from issues outside of my design".

I really don't know. That whole series of memory ordering rules makes
my head hurt.

But I do think the only thing that matters in the end is that they do
have that DP relationship between reads and subsequently dependent
writes, but  basically not for *any* other relationship.

So on alpha read-vs-read, write-vs-later-read, and write-vs-write all
have to have memory barriers, unless the accesses physically overlap.

 Linus


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-03 Thread Paul E. McKenney
On Tue, Nov 03, 2015 at 11:40:24AM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 5:57 PM, Paul E. McKenney
>  wrote:

[ . . . ]

> > I am in India and my Alpha Architecture Manual is in the USA.
> 
> I sent you a link to something that should work, and that has the section.

Thank you, and yes, it clearly states that read-to-write dependencies
are ordered.  Color me slow and stupid.

> > And they did post a clarification on the web:
> 
> So for alpha, you trust a random web posting by a unknown person that
> talks about some random problem in an application that we don't even
> know what it is.

In my defense, the Alpha architects pointed me at that web posting, but
yes, it appears to be pushing for overly conservative safety rather than
accuracy.  Please accept my apologies for my confusion.

Thanx, Paul



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-03 Thread Linus Torvalds
On Mon, Nov 2, 2015 at 5:57 PM, Paul E. McKenney
 wrote:
>>
>> Alpha isn't special. And smp_read_barrier_depends() hasn't magically
>> become something new.
>
> The existing control dependencies (READ_ONCE_CTRL() and friends) only
> guarantee ordering against later stores, and not against later loads.

Right. And using "smp_read_barrier_depends()" for them is WRONG.

That's my argument.

Your arguments make no sense.

> Of the weakly ordered architectures, only Alpha fails to respect
> load-to-store control dependencies.

NO IT DOES NOT.

Christ, Paul. I even sent you the alpha architecture manual
information where it explicitly says that there's a dependency
ordering for that case.

There's a reason that "smp_read_barrier_depends()" is called
smp_READ_barrier_depends().

It's a rmb. Really

You have turned it into something else in your mind. But your mind is WRONG.

> I am in India and my Alpha Architecture Manual is in the USA.

I sent you a link to something that should work, and that has the section.

> And they did post a clarification on the web:

So for alpha, you trust a random web posting by an unknown person that
talks about some random problem in an application that we don't even
know what it is.

But you don't trust the architecture manual, and you don't trust the
fact that it si *physically impossible* to not have the
load-to-control-to-store ordering without some kind of magical
nullifying stores that we know that alpha didn't have?

But then magically, you trust the architecture manuals for other
architectures, or just take the "you cannot have smp-visible
speculative stores" on faith.

But alpha is different, and lives in a universe where causality
suddenly doesn't exist.

I really don't understand your logic.

> So 5.6.1.7 apparently does not sanction data dependency ordering.

Exactly why are you arguing against the architecture manual?

>> Paul? I really disagree with how you have totally tried to re-purpose
>> smp_read_barrier_depends() in ways that are insane and make no sense.
>
> I did consider adding a new name, but apparently was lazy that day.
> I would of course be happy to add a separate name.

That is NOT WHAT I WANT AT ALL.

Get rid of the smp_read_barrier_depends(). It doesn't do control
barriers against stores, it has never done that, and they aren't
needed in the first place.

There is no need for a new name. The only thing that there is a need
for is to just realize that alpha isn't magical.

Alpha does have a stupid segmented cache which means that even when
the core is constrained by load-to-load data address dependencies
(because no alpha actually did value predication), the cachelines that
core loads may not be ordered without the memory barrier.

But that "second load may get a stale value from the past" is purely
about loads. You cannot have the same issue happen for stores, because
there's no way a store buffer somehow magically goes backwards in time
and exposes the store before the load that it depended on.

And the architecture manual even *EXPLICITLY* says so. Alpha does
actually have a dependency chain from loads to dependent stores -
through addresses, through data, _and_ through control. It's
documented, but equally importantly, it's just basically impossible to
not have an architecture that does that.

Even if you do some really fancy things like actual value prediction
(which the alpha architecture _allows_, although no alpha ever did,
afaik), that cannot break the load->store dependency. Because even if
the load value was predicted (or any control flow after the load was
predicted), the store cannot be committed and made visible to other
CPU's until after that prediction has been verified.

So there may be bogus values in a store buffer, but those values will
have to be killed with the store instructions that caused them, if any
prediction failed. They won't be visible to other CPU's.

And I say that not because the architecture manual says so (although
it does), but because because such a CPU wouldn't *work*. It wouldn't
be a CPU, it would at most be a fuzzy neural network that generates
cool results that may be interesting, but they won't be dependable or
reliable in the sense that the Linux kernel depends on.

>> That is not a control dependency. If it was, then PPC and ARM would
>> need to make it a real barrier too. I really don't see why you have
>> singled out alpha as the victim of your "let's just randomly change
>> the rules".
>
> Just trying to make this all work on Alpha as well as the other
> architectures...  But if the actual Alpha hardware that is currently
> running recent Linux kernels is more strict than the architecture

.. really. This is specified in the architecture manual. The fact that
you have found some random posting by some random person that says
"you need barriers everywhere" is immaterial. That support blog may
well have been simply a "I don't know what I am talking about, and I
don't know what problem you have, but I do know

Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-03 Thread Peter Zijlstra
On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote:

> > -   raw_spin_unlock_wait(&task->pi_lock);
> > -   smp_mb();
> > +   smp_cond_acquire(!raw_spin_is_locked(&task->pi_lock));
> 
> Unfortunately this doesn't look exactly right...
> 
> spin_unlock_wait() is not equal to "while (locked) relax", the latter
> is live-lockable or at least sub-optimal: we do not really need to spin
> until we observe !spin_is_locked(), we only need to synchronize with the
> current owner of this lock. Once it drops the lock we can proceed, we
> do not care if another thread takes the same lock right after that.

Ah indeed. And while every use of spin_unlock_wait() has 'interesting'
barriers associated, they all seem different.


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-03 Thread Oleg Nesterov
On 11/02, Peter Zijlstra wrote:
>
> +#define smp_cond_acquire(cond)   do {\
> + while (!(cond)) \
> + cpu_relax();\
> + smp_read_barrier_depends(); /* ctrl */  \
> + smp_rmb(); /* ctrl + rmb := acquire */  \
> +} while (0)

...

> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -102,13 +102,13 @@ void task_work_run(void)
>
>   if (!work)
>   break;
> +
>   /*
>* Synchronize with task_work_cancel(). It can't remove
>* the first entry == work, cmpxchg(task_works) should
>* fail, but it can play with *work and other entries.
>*/
> - raw_spin_unlock_wait(&task->pi_lock);
> - smp_mb();
> + smp_cond_acquire(!raw_spin_is_locked(&task->pi_lock));

Unfortunately this doesn't look exactly right...

spin_unlock_wait() is not equal to "while (locked) relax", the latter
is live-lockable or at least sub-optimal: we do not really need to spin
until we observe !spin_is_locked(), we only need to synchronize with the
current owner of this lock. Once it drops the lock we can proceed, we
do not care if another thread takes the same lock right after that.
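
(To make that difference concrete, a sketch only, assuming a ticket lock
with hypothetical 'owner' and 'next' halves:)

	arch_spinlock_t old = READ_ONCE(*lock);	/* sample the lock once */

	/*
	 * If it was locked, wait only until the owner we sampled moves
	 * on; new lockers arriving afterwards are of no interest, so
	 * this cannot livelock the way "while (locked) relax" can.
	 */
	while (old.owner != old.next &&
	       READ_ONCE(lock->owner) == old.owner)
		cpu_relax();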

Oleg.



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Paul E. McKenney
On Mon, Nov 02, 2015 at 11:17:17AM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 10:37 AM, Will Deacon  wrote:
> > On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote:
> >> On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra  
> >> wrote:
> >> > +#define smp_cond_acquire(cond) do {\
> >> > +   while (!(cond)) \
> >> > +   cpu_relax();\
> >> > +   smp_read_barrier_depends(); /* ctrl */  \
> >> > +   smp_rmb(); /* ctrl + rmb := acquire */  \
> >> > +} while (0)
> >>
> >> This code makes absolutely no sense.
> >>
> >> smp_read_barrier_depends() is about a memory barrier where there is a
> >> data dependency between two accesses. The "depends" is very much about
> >> the data dependency, and very much about *nothing* else.
> >
> > Paul wasn't so sure, which I think is why smp_read_barrier_depends()
> > is already used in, for example, READ_ONCE_CTRL:
> >
> >   http://lkml.kernel.org/r/20151007154003.gj3...@linux.vnet.ibm.com
> 
> Quoting the alpha architecture manual is kind of pointless, when NO
> OTHER ARCHITECTURE OUT THERE guarantees that whole "read +
> conditional orders wrt subsequent writes" _either_.
> 
> (Again, with the exception of x86, which has the sane "we honor causality")
> 
> Alpha isn't special. And smp_read_barrier_depends() hasn't magically
> become something new.

The existing control dependencies (READ_ONCE_CTRL() and friends) only
guarantee ordering against later stores, and not against later loads.
Of the weakly ordered architectures, only Alpha fails to respect
load-to-store control dependencies.

> If people think that control dependency needs a memory barrier on
> alpha, then it damn well needs on on all other weakly ordered
> architectuers too, afaik.

For load-to-load control dependencies, agreed, from what I can see,
all the weakly ordered architectures do need an explicit barrier.

> Either that "you cannot finalize a write unless all conditionals it
> depends on are finalized" is true or it is not. That argument has
> *never* been about some architecture-specific memory ordering model,
> as far as I know.
> 
> As to READ_ONCE_CTRL - two wrongs don't make a right.
> 
> That smp_read_barrier_depends() there doesn't make any sense either.
> 
> And finally, the alpha architecture manual actually does have the
> notion of "Dependence Constraint" (5.6.1.7) that talks about writes
> that depend on previous reads (where "depends" is explicitly spelled
> out to be about conditionals, write data _or_ write address). They are
> actually constrained on alpha too.

I am in India and my Alpha Architecture Manual is in the USA.  Google
Books has a PDF, but it conveniently omits that particular section.
I do remember quoting that section at the Alpha architects back in the
late 1990s and being told that it didn't mean what I thought it meant.

And they did post a clarification on the web:

http://h71000.www7.hp.com/wizard/wiz_2637.html

For instance, your producer must issue a "memory barrier"
instruction after writing the data to shared memory and before
inserting it on the queue; likewise, your consumer must issue a
memory barrier instruction after removing an item from the queue
and before reading from its memory.  Otherwise, you risk seeing
stale data, since, while the Alpha processor does provide coherent
memory, it does not provide implicit ordering of reads and writes.
(That is, the write of the producer's data might reach memory
after the write of the queue, such that the consumer might read
the new item from the queue but get the previous values from
the item's memory.

So 5.6.1.7 apparently does not sanction data dependency ordering.

> Note that a "Dependence Constraint" is not a memory barrier, because
> it only affects that particular chain of dependencies. So it doesn't
> order other things in *general*, but it does order a particular read
> with a particular set of subsequent writes. Which is all we guarantee
> on anything else too wrt the whole control dependencies.
> 
> The smp_read_barrier_depends() is a *READ BARRIER*. It's about a
> *read* that depends on a previous read. Now, it so happens that alpha
> doesn't have a "read-to-read" barrier, and it's implemented as a full
> barrier, but it really should be read as a "smp_rmb().
> 
> Yes, yes, alpha has the worst memory ordering ever, and the worst
> barriers for that insane memory ordering, but that still does not make
> alpha "magically able to reorder more than physically or logically
> possible". We don't add barriers that don't make sense just because
> the alpha memory orderings are insane.
> 
> Paul? I really disagree with how you have totally tried to re-purpose
> smp_read_barrier_depends() in ways that are insane and make no sense.

I did consider adding a new name, but apparently was lazy that day.
I would of course be happy to add one.

Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Linus Torvalds
On Mon, Nov 2, 2015 at 5:14 PM, Paul E. McKenney
 wrote:
>
> You would ask that question when I am many thousands of miles from my
> copy of the Alpha reference manual!  ;-)

I don't think I've touched a paper manual in years. Too much effort
to index and search. Look here

   http://download.majix.org/dec/alpha_arch_ref.pdf

> There is explicit wording in that manual that says that no multi-variable
> ordering is implied without explicit memory-barrier instructions.

Right. But the whole "read -> conditional -> write" ends up being a
dependency chain to *every* single write that happens after the
conditional.

So there may be no "multi-variable ordering" implied in any individual
access, but despite that a simple "read + conditional" orders every
single store that comes after it. Even if it orders them
"individually".

Linus


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Paul E. McKenney
On Mon, Nov 02, 2015 at 05:43:48PM +, Will Deacon wrote:
> On Mon, Nov 02, 2015 at 02:57:26PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote:
> > 
> > > Note that while smp_cond_acquire() has an explicit
> > > smp_read_barrier_depends() for Alpha, neither of the sites it gets used in
> > > was actually buggy on Alpha for lack of it. The first uses
> > > smp_rmb(), which on Alpha is a full barrier too and therefore serves
> > > its purpose. The second had an explicit full barrier.
> > 
> > > +/**
> > > + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> > > + * @cond: boolean expression to wait for
> > > + *
> > > + * Equivalent to using smp_load_acquire() on the condition variable but employs
> > > + * the control dependency of the wait to reduce the barrier on many platforms.
> > > + *
> > > + * The control dependency provides a LOAD->STORE order, the additional RMB
> > > + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> > > + * aka. ACQUIRE.
> > > + */
> > > +#define smp_cond_acquire(cond)   do {\
> > > + while (!(cond)) \
> > > + cpu_relax();\
> > > + smp_read_barrier_depends(); /* ctrl */  \
> > > + smp_rmb(); /* ctrl + rmb := acquire */  \
> > > +} while (0)
> > 
> > So per the above argument we could leave out the
> > smp_read_barrier_depends() for Alpha, although that would break
> > consistency with all the other control dependency primitives we have. It
> > would avoid issuing a double barrier.
> > 
> > Thoughts?
> 
> Do we even know that Alpha needs a barrier for control-dependencies in
> the first place?

You would ask that question when I am many thousands of miles from my
copy of the Alpha reference manual!  ;-)

There is explicit wording in that manual that says that no multi-variable
ordering is implied without explicit memory-barrier instructions.

Thanx, Paul



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Peter Zijlstra
On Mon, Nov 02, 2015 at 11:17:17AM -0800, Linus Torvalds wrote:
> As to READ_ONCE_CTRL - two wrongs don't make a right.
> 
> That smp_read_barrier_depends() there doesn't make any sense either.
> 
> And finally, the alpha architecture manual actually does have the
> notion of "Dependence Constraint" (5.6.1.7) that talks about writes
> that depend on previous reads (where "depends" is explicitly spelled
> out to be about conditionals, write data _or_ write address). They are
> actually constrained on alpha too.
> 
> Note that a "Dependence Constraint" is not a memory barrier, because
> it only affects that particular chain of dependencies. So it doesn't
> order other things in *general*, but it does order a particular read
> with a particular set of subsequent writes. Which is all we guarantee
> on anything else too wrt the whole control dependencies.

Something like so then, Paul?

---
Subject: locking: Alpha honours control dependencies too

The alpha architecture manual actually does have the notion of
"Dependence Constraint" (5.6.1.7) that talks about writes that depend on
previous reads (where "depends" is explicitly spelled out to be about
conditionals, write data _or_ write address). They are actually
constrained on alpha too.

Which means we can remove the smp_read_barrier_depends() abuse from the
various control dependency primitives.

Retain the primitives, as they are a useful documentation aid.

Maybe-Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/atomic.h   | 2 --
 include/linux/compiler.h | 1 -
 2 files changed, 3 deletions(-)

diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 27e580d232ca..f16b1dedd909 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -8,7 +8,6 @@
 static inline int atomic_read_ctrl(const atomic_t *v)
 {
int val = atomic_read(v);
-   smp_read_barrier_depends(); /* Enforce control dependency. */
return val;
 }
 #endif
@@ -565,7 +564,6 @@ static inline int atomic_dec_if_positive(atomic_t *v)
 static inline long long atomic64_read_ctrl(const atomic64_t *v)
 {
long long val = atomic64_read(v);
-   smp_read_barrier_depends(); /* Enforce control dependency. */
return val;
 }
 #endif
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 3d7810341b57..1d1f5902189d 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -311,7 +311,6 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 #define READ_ONCE_CTRL(x) \
 ({ \
typeof(x) __val = READ_ONCE(x); \
-   smp_read_barrier_depends(); /* Enforce control dependency. */ \
__val; \
 })
 


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Linus Torvalds
On Mon, Nov 2, 2015 at 12:36 PM, David Howells  wrote:
> Linus Torvalds  wrote:
>
>> > +   smp_read_barrier_depends(); /* ctrl */  \
>> > +   smp_rmb(); /* ctrl + rmb := acquire */  \
>
> Doesn't smp_rmb() imply an smp_read_barrier_depends() anyway?

Yes, it does. But that "smp_read_barrier_depends()" is actually
mis-used as a "barrier against subsequent dependent writes, thanks to
the control flow". It's not protecting against subsequent reads -
which is what the smp_rmb() is about.

Which is completely bogus, but that's what the comment implies.

Of course, on alpha (which is where smp_read_barrier_depends() makes a
difference), both that and smp_rmb() are just full memory barriers,
because alpha is some crazy sh*t. So yes, a "smp_rmb()" is sufficient
everywhere, but that is actually not where the confusion comes from in
the first place.
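
For reference, Alpha currently defines both as the same full "mb"
instruction, roughly:

	#define rmb()			__asm__ __volatile__("mb" : : : "memory")
	#define read_barrier_depends()	__asm__ __volatile__("mb" : : : "memory")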

   Linus


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Peter Zijlstra
On Mon, Nov 02, 2015 at 08:36:49PM +, David Howells wrote:
> Linus Torvalds  wrote:
> 
> > > +   smp_read_barrier_depends(); /* ctrl */  \
> > > +   smp_rmb(); /* ctrl + rmb := acquire */  \
> 
> Doesn't smp_rmb() imply an smp_read_barrier_depends() anyway?  In
> memory-barriers.txt, it says:
> 
>  Read memory barriers imply data dependency barriers, and so can
>  substitute for them.

Yes, I noted that in a follow-up email: Alpha implements both as MB, so
just the smp_rmb() is sufficient.



Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread David Howells
Linus Torvalds  wrote:

> > +   smp_read_barrier_depends(); /* ctrl */  \
> > +   smp_rmb(); /* ctrl + rmb := acquire */  \

Doesn't smp_rmb() imply an smp_read_barrier_depends() anyway?  In
memory-barriers.txt, it says:

 Read memory barriers imply data dependency barriers, and so can
 substitute for them.

David


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Peter Zijlstra
On Mon, Nov 02, 2015 at 07:57:09PM +, Will Deacon wrote:
> > >> smp_read_barrier_depends() is about a memory barrier where there is a
> > >> data dependency between two accesses. The "depends" is very much about
> > >> the data dependency, and very much about *nothing* else.
> > >
> > > Paul wasn't so sure, which I think is why smp_read_barrier_depends()
> > > is already used in, for example, READ_ONCE_CTRL:
> > >
> > >   http://lkml.kernel.org/r/20151007154003.gj3...@linux.vnet.ibm.com
> > 
> > Quoting the alpha architecture manual is kind of pointless, when NO
> > OTHER ARCHITECTURE OUT THERE guarantees that whole "read +
> > conditional orders wrt subsequent writes" _either_.
> > 
> > (Again, with the exception of x86, which has the sane "we honor causality")
> > 
> > Alpha isn't special. And smp_read_barrier_depends() hasn't magically
> > become something new.
> > 
> > If people think that control dependency needs a memory barrier on
> > alpha, then it damn well needs one on all other weakly ordered
> > architectures too, afaik.
> > 
> > Either that "you cannot finalize a write unless all conditionals it
> > depends on are finalized" is true or it is not. That argument has
> > *never* been about some architecture-specific memory ordering model,
> > as far as I know.
> 
> You can imagine a (horribly broken) value-speculating architecture that
> would permit this re-ordering, but then you'd have speculative stores
> and thin-air values, which Alpha doesn't have.

I've tried arguing this point with Paul on a number of occasions, and I
think he at one point constructed some elaborate scheme that depended on
the split cache where this might require the full barrier.

> > As to READ_ONCE_CTRL - two wrongs don't make a right.
> 
> Sure, I just wanted to point out the precedent and related discussion.

I purely followed 'convention' here.

> > That smp_read_barrier_depends() there doesn't make any sense either.
> > 
> > And finally, the alpha architecture manual actually does have the
> > notion of "Dependence Constraint" (5.6.1.7) that talks about writes
> > that depend on previous reads (where "depends" is explicitly spelled
> > out to be about conditionals, write data _or_ write address). They are
> > actually constrained on alpha too.
> 
> In which case, it looks like we can remove the smp_read_barrier_depends
> instances from all of the control-dependency macros, but I'll defer to
> Paul in case he has some further insight. I assume you're ok with this
> patch if the smp_read_barrier_depends() is removed?

That would be good indeed.


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Will Deacon
On Mon, Nov 02, 2015 at 11:17:17AM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 10:37 AM, Will Deacon  wrote:
> > On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote:
> >> On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra  
> >> wrote:
> >> > +#define smp_cond_acquire(cond) do {\
> >> > +   while (!(cond)) \
> >> > +   cpu_relax();\
> >> > +   smp_read_barrier_depends(); /* ctrl */  \
> >> > +   smp_rmb(); /* ctrl + rmb := acquire */  \
> >> > +} while (0)
> >>
> >> This code makes absolutely no sense.
> >>
> >> smp_read_barrier_depends() is about a memory barrier where there is a
> >> data dependency between two accesses. The "depends" is very much about
> >> the data dependency, and very much about *nothing* else.
> >
> > Paul wasn't so sure, which I think is why smp_read_barrier_depends()
> > is already used in, for example, READ_ONCE_CTRL:
> >
> >   http://lkml.kernel.org/r/20151007154003.gj3...@linux.vnet.ibm.com
> 
> Quoting the alpha architecture manual is kind of pointless, when NO
> OTHER ARCHITECTURE OUT THERE guarantees that whole "read +
> conditional orders wrt subsequent writes" _either_.
> 
> (Again, with the exception of x86, which has the sane "we honor causality")
> 
> Alpha isn't special. And smp_read_barrier_depends() hasn't magically
> become something new.
> 
> If people think that control dependency needs a memory barrier on
> alpha, then it damn well needs one on all other weakly ordered
> architectures too, afaik.
> 
> Either that "you cannot finalize a write unless all conditionals it
> depends on are finalized" is true or it is not. That argument has
> *never* been about some architecture-specific memory ordering model,
> as far as I know.

You can imagine a (horribly broken) value-speculating architecture that
would permit this re-ordering, but then you'd have speculative stores
and thin-air values, which Alpha doesn't have.

> As to READ_ONCE_CTRL - two wrongs don't make a right.

Sure, I just wanted to point out the precedent and related discussion.

> That smp_read_barrier_depends() there doesn't make any sense either.
> 
> And finally, the alpha architecture manual actually does have the
> notion of "Dependence Constraint" (5.6.1.7) that talks about writes
> that depend on previous reads (where "depends" is explicitly spelled
> out to be about conditionals, write data _or_ write address). They are
> actually constrained on alpha too.

In which case, it looks like we can remove the smp_read_barrier_depends
instances from all of the control-dependency macros, but I'll defer to
Paul in case he has some further insight. I assume you're ok with this
patch if the smp_read_barrier_depends() is removed?

> > In this case, control dependencies are only referring to READ -> WRITE
> > ordering, so they are honoured by ARM and PowerPC.
> 
> Do ARM and PPC actually guarantee the generic "previous reads always
> order before subsequent writes"?

Only to *dependent* subsequent writes, but Peter's patch makes all
subsequent writes dependent on cond, so I think that's the right way to
achieve the ordering we want.

Will


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Linus Torvalds
On Mon, Nov 2, 2015 at 10:37 AM, Will Deacon  wrote:
> On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote:
>> On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra  wrote:
>> > +#define smp_cond_acquire(cond) do {\
>> > +   while (!(cond)) \
>> > +   cpu_relax();\
>> > +   smp_read_barrier_depends(); /* ctrl */  \
>> > +   smp_rmb(); /* ctrl + rmb := acquire */  \
>> > +} while (0)
>>
>> This code makes absolutely no sense.
>>
>> smp_read_barrier_depends() is about a memory barrier where there is a
>> data dependency between two accesses. The "depends" is very much about
>> the data dependency, and very much about *nothing* else.
>
> Paul wasn't so sure, which I think is why smp_read_barrier_depends()
> is already used in, for example, READ_ONCE_CTRL:
>
>   http://lkml.kernel.org/r/20151007154003.gj3...@linux.vnet.ibm.com

Quoting the alpha architecture manual is kind of pointless, when NO
OTHER ARCHITECTURE OUT THERE guarantees that whole "read +
conditional orders wrt subsequent writes" _either_.

(Again, with the exception of x86, which has the sane "we honor causality")

Alpha isn't special. And smp_read_barrier_depends() hasn't magically
become something new.

If people think that control dependency needs a memory barrier on
alpha, then it damn well needs one on all other weakly ordered
architectures too, afaik.

Either that "you cannot finalize a write unless all conditionals it
depends on are finalized" is true or it is not. That argument has
*never* been about some architecture-specific memory ordering model,
as far as I know.

As to READ_ONCE_CTRL - two wrongs don't make a right.

That smp_read_barrier_depends() there doesn't make any sense either.

And finally, the alpha architecture manual actually does have the
notion of "Dependence Constraint" (5.6.1.7) that talks about writes
that depend on previous reads (where "depends" is explicitly spelled
out to be about conditionals, write data _or_ write address). They are
actually constrained on alpha too.

Note that a "Dependence Constraint" is not a memory barrier, because
it only affects that particular chain of dependencies. So it doesn't
order other things in *general*, but it does order a particular read
with a particular set of subsequent writes. Which is all we guarantee
on anything else too wrt the whole control dependencies.

The smp_read_barrier_depends() is a *READ BARRIER*. It's about a
*read* that depends on a previous read. Now, it so happens that alpha
doesn't have a "read-to-read" barrier, and it's implemented as a full
barrier, but it really should be read as an "smp_rmb()".

Yes, yes, alpha has the worst memory ordering ever, and the worst
barriers for that insane memory ordering, but that still does not make
alpha "magically able to reorder more than physically or logically
possible". We don't add barriers that don't make sense just because
the alpha memory orderings are insane.

Paul? I really disagree with how you have totally tried to re-purpose
smp_read_barrier_depends() in ways that are insane and make no sense.
That is not a control dependency. If it was, then PPC and ARM would
need to make it a real barrier too. I really don't see why you have
singled out alpha as the victim of your "let's just randomly change
the rules".

> In this case, control dependencies are only referring to READ -> WRITE
> ordering, so they are honoured by ARM and PowerPC.

Do ARM and PPC actually guarantee the generic "previous reads always
order before subsequent writes"?

Because if that's the issue, then we should perhaps add a new ordering
primitive that says that (ie "smp_read_to_write_barrier()"). That
could be a useful thing in general, where we currently use "smp_mb()"
to order an earlier read with a later write.
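
A strawman of such a primitive, purely as a sketch (no such thing
exists in the tree today):

/*
 * Order prior reads against subsequent writes. Generic fallback is a
 * full barrier; x86 could make it a plain compiler barrier since reads
 * are never reordered with later writes there.
 */
#ifndef smp_read_to_write_barrier
#define smp_read_to_write_barrier()	smp_mb()
#endif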

   Linus


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Will Deacon
On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra  wrote:
> > +#define smp_cond_acquire(cond) do {\
> > +   while (!(cond)) \
> > +   cpu_relax();\
> > +   smp_read_barrier_depends(); /* ctrl */  \
> > +   smp_rmb(); /* ctrl + rmb := acquire */  \
> > +} while (0)
> 
> This code makes absolutely no sense.
> 
> smp_read_barrier_depends() is about a memory barrier where there is a
> data dependency between two accesses. The "depends" is very much about
> the data dependency, and very much about *nothing* else.

Paul wasn't so sure, which I think is why smp_read_barrier_depends()
is already used in, for example, READ_ONCE_CTRL:

  http://lkml.kernel.org/r/20151007154003.gj3...@linux.vnet.ibm.com

although I agree that this would pave the way for speculative stores on
Alpha and that seems like a heavy accusation to make.

> Your comment talks about control dependencies, but
> smp_read_barrier_depends() has absolutely nothing to do with a control
> dependency. In fact, it is explicitly a no-op on architectures like
> ARM and PowerPC that violate control dependencies.

In this case, control dependencies are only referring to READ -> WRITE
ordering, so they are honoured by ARM and PowerPC.

Will


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Linus Torvalds
On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra  wrote:
> +#define smp_cond_acquire(cond) do {\
> +   while (!(cond)) \
> +   cpu_relax();\
> +   smp_read_barrier_depends(); /* ctrl */  \
> +   smp_rmb(); /* ctrl + rmb := acquire */  \
> +} while (0)

This code makes absolutely no sense.

smp_read_barrier_depends() is about a memory barrier where there is a
data dependency between two accesses. The "depends" is very much about
the data dependency, and very much about *nothing* else.

Your comment talks about control dependencies, but
smp_read_barrier_depends() has absolutely nothing to do with a control
dependency. In fact, it is explicitly a no-op on architectures like
ARM and PowerPC that violate control dependencies.

So the code may end up *working*, but the comments in it are
misleading, insane, and nonsensical. And I think the code is actively
wrong (although in the sense that it has a barrier too *many*, not
lacking one).

Because smp_read_barrier_depends() definitely has *nothing* to do with
control barriers in any way, shape or form. The comment is actively
and entirely wrong.

As far as I know, even on alpha that 'smp_read_barrier_depends()' is
unnecessary. Not because of any barrier semantics (whether
smp_read_barrier_depends() or any other barrier), but simply because a
store cannot finalize before the conditional has been resolved, which
means that the read must have been done.

So for alpha, you end up depending on the exact same logic that you
depend on for every other architecture, since there is no
"architectural" memory ordering guarantee of it anywhere else either
(ok, so x86 has the explicit "memory ordering is causal" guarantee, so
x86 does kind of make that conditional ordering explicit).

Adding that "smp_read_barrier_depends()" doesn't add anything to the
whole ctrl argument.

So the code looks insane to me. On everything but alpha,
"smp_read_barrier_depends()" is a no-op. And on alpha, a combination
of "smp_read_barrier_depends()" and "smp_rmb()" is nonsensical. So in
no case can that code make sense, as far as I can tell.

   Linus


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Will Deacon
On Mon, Nov 02, 2015 at 02:57:26PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote:
> 
> > Note that while smp_cond_acquire() has an explicit
> > smp_read_barrier_depends() for Alpha, neither of the sites it gets used in
> > was actually buggy on Alpha for lack of it. The first uses
> > smp_rmb(), which on Alpha is a full barrier too and therefore serves
> > its purpose. The second had an explicit full barrier.
> 
> > +/**
> > + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> > + * @cond: boolean expression to wait for
> > + *
> > + * Equivalent to using smp_load_acquire() on the condition variable but employs
> > + * the control dependency of the wait to reduce the barrier on many platforms.
> > + *
> > + * The control dependency provides a LOAD->STORE order, the additional RMB
> > + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> > + * aka. ACQUIRE.
> > + */
> > +#define smp_cond_acquire(cond) do {\
> > +   while (!(cond)) \
> > +   cpu_relax();\
> > +   smp_read_barrier_depends(); /* ctrl */  \
> > +   smp_rmb(); /* ctrl + rmb := acquire */  \
> > +} while (0)
> 
> So per the above argument we could leave out the
> smp_read_barrier_depends() for Alpha, although that would break
> consistency with all the other control dependency primitives we have. It
> would avoid issuing a double barrier.
> 
> Thoughts?

Do we even know that Alpha needs a barrier for control-dependencies in
the first place?

Will


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Will Deacon
Hi Peter,

On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote:
> Introduce smp_cond_acquire() which combines a control dependency and a
> read barrier to form acquire semantics.
> 
> This primitive has two benefits:
>  - it documents control dependencies,
>  - it's typically cheaper than using smp_load_acquire() in a loop.

I'm not sure that's necessarily true on arm64, where we have a native
load-acquire instruction, but not a READ -> READ barrier (smp_rmb()
orders prior loads against subsequent loads and stores for us).

Perhaps we could allow architectures to provide their own definition of
smp_cond_acquire in case they can implement it more efficiently?
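
The usual override pattern would be enough for that: let asm/barrier.h
supply an optimised version and fall back to the generic ctrl + rmb
form otherwise (just a sketch, not part of this patch):

#ifndef smp_cond_acquire
#define smp_cond_acquire(cond)	do {				\
	while (!(cond))						\
		cpu_relax();					\
	smp_read_barrier_depends(); /* ctrl */			\
	smp_rmb(); /* ctrl + rmb := acquire */			\
} while (0)
#endif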

> Note that while smp_cond_acquire() has an explicit
> smp_read_barrier_depends() for Alpha, neither of the sites it gets used in
> was actually buggy on Alpha for lack of it. The first uses
> smp_rmb(), which on Alpha is a full barrier too and therefore serves
> its purpose. The second had an explicit full barrier.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/compiler.h |   18 ++
>  kernel/sched/core.c  |8 +---
>  kernel/task_work.c   |4 ++--
>  3 files changed, 21 insertions(+), 9 deletions(-)
> 
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -275,6 +275,24 @@ static __always_inline void __write_once
>   __val; \
>  })
>  
> +/**
> + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> + * @cond: boolean expression to wait for
> + *
> + * Equivalent to using smp_load_acquire() on the condition variable but employs
> + * the control dependency of the wait to reduce the barrier on many platforms.
> + *
> + * The control dependency provides a LOAD->STORE order, the additional RMB
> + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> + * aka. ACQUIRE.
> + */
> +#define smp_cond_acquire(cond)   do {\

I think the previous version that you posted/discussed had the actual
address of the variable being loaded passed in here too? That would be
useful for arm64, where we can wait-until-memory-location-has-changed
to save us re-evaluating cond prematurely.

> + while (!(cond)) \
> + cpu_relax();\
> + smp_read_barrier_depends(); /* ctrl */  \
> + smp_rmb(); /* ctrl + rmb := acquire */  \

It's actually stronger than acquire, I think, because accesses before the
smp_cond_acquire cannot be moved across it.

> +} while (0)
> +
>  #endif /* __KERNEL__ */
>  
>  #endif /* __ASSEMBLY__ */
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2111,19 +2111,13 @@ try_to_wake_up(struct task_struct *p, un
>   /*
>* If the owning (remote) cpu is still in the middle of schedule() with
>* this task as prev, wait until its done referencing the task.
> -  */
> - while (p->on_cpu)
> - cpu_relax();
> - /*
> -  * Combined with the control dependency above, we have an effective
> -  * smp_load_acquire() without the need for full barriers.
>*
>* Pairs with the smp_store_release() in finish_lock_switch().
>*
>* This ensures that tasks getting woken will be fully ordered against
>* their previous state and preserve Program Order.
>*/
> - smp_rmb();
> + smp_cond_acquire(!p->on_cpu);
>  
>   p->sched_contributes_to_load = !!task_contributes_to_load(p);
>   p->state = TASK_WAKING;
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -102,13 +102,13 @@ void task_work_run(void)
>  
>   if (!work)
>   break;
> +
>   /*
>* Synchronize with task_work_cancel(). It can't remove
>* the first entry == work, cmpxchg(task_works) should
>* fail, but it can play with *work and other entries.
>*/
> - raw_spin_unlock_wait(&task->pi_lock);
> - smp_mb();
> + smp_cond_acquire(!raw_spin_is_locked(&task->pi_lock));

Hmm, there's some sort of release equivalent in kernel/exit.c, but I
couldn't easily figure out whether we could do anything there. If we
could, we could kill raw_spin_unlock_wait :)

Will


Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Peter Zijlstra
On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote:

> Note that while smp_cond_acquire() has an explicit
> smp_read_barrier_depends() for Alpha, neither of the sites it gets used in
> was actually buggy on Alpha for lack of it. The first uses
> smp_rmb(), which on Alpha is a full barrier too and therefore serves
> its purpose. The second had an explicit full barrier.

> +/**
> + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> + * @cond: boolean expression to wait for
> + *
> + * Equivalent to using smp_load_acquire() on the condition variable but employs
> + * the control dependency of the wait to reduce the barrier on many platforms.
> + *
> + * The control dependency provides a LOAD->STORE order, the additional RMB
> + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> + * aka. ACQUIRE.
> + */
> +#define smp_cond_acquire(cond)   do {\
> + while (!(cond)) \
> + cpu_relax();\
> + smp_read_barrier_depends(); /* ctrl */  \
> + smp_rmb(); /* ctrl + rmb := acquire */  \
> +} while (0)

So per the above argument we could leave out the
smp_read_barrier_depends() for Alpha, although that would break
consistency with all the other control dependency primitives we have. It
would avoid issuing a double barrier.

Thoughts?


[PATCH 4/4] locking: Introduce smp_cond_acquire()

2015-11-02 Thread Peter Zijlstra
Introduce smp_cond_acquire() which combines a control dependency and a
read barrier to form acquire semantics.

This primitive has two benefits:
 - it documents control dependencies,
 - it's typically cheaper than using smp_load_acquire() in a loop.

Note that while smp_cond_acquire() has an explicit
smp_read_barrier_depends() for Alpha, neither of the sites it gets used in
was actually buggy on Alpha for lack of it. The first uses
smp_rmb(), which on Alpha is a full barrier too and therefore serves
its purpose. The second had an explicit full barrier.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/compiler.h |   18 ++
 kernel/sched/core.c  |8 +---
 kernel/task_work.c   |4 ++--
 3 files changed, 21 insertions(+), 9 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -275,6 +275,24 @@ static __always_inline void __write_once
__val; \
 })
 
+/**
+ * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using smp_load_acquire() on the condition variable but employs
+ * the control dependency of the wait to reduce the barrier on many platforms.
+ *
+ * The control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. ACQUIRE.
+ */
+#define smp_cond_acquire(cond) do {\
+   while (!(cond)) \
+   cpu_relax();\
+   smp_read_barrier_depends(); /* ctrl */  \
+   smp_rmb(); /* ctrl + rmb := acquire */  \
+} while (0)
+
 #endif /* __KERNEL__ */
 
 #endif /* __ASSEMBLY__ */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2111,19 +2111,13 @@ try_to_wake_up(struct task_struct *p, un
/*
 * If the owning (remote) cpu is still in the middle of schedule() with
 * this task as prev, wait until its done referencing the task.
-*/
-   while (p->on_cpu)
-   cpu_relax();
-   /*
-* Combined with the control dependency above, we have an effective
-* smp_load_acquire() without the need for full barriers.
 *
 * Pairs with the smp_store_release() in finish_lock_switch().
 *
 * This ensures that tasks getting woken will be fully ordered against
 * their previous state and preserve Program Order.
 */
-   smp_rmb();
+   smp_cond_acquire(!p->on_cpu);
 
p->sched_contributes_to_load = !!task_contributes_to_load(p);
p->state = TASK_WAKING;
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -102,13 +102,13 @@ void task_work_run(void)
 
if (!work)
break;
+
/*
 * Synchronize with task_work_cancel(). It can't remove
 * the first entry == work, cmpxchg(task_works) should
 * fail, but it can play with *work and other entries.
 */
-   raw_spin_unlock_wait(&task->pi_lock);
-   smp_mb();
+   smp_cond_acquire(!raw_spin_is_locked(&task->pi_lock));
 
do {
next = work->next;

