Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 19, 2015 at 06:01:52PM +, Will Deacon wrote:
> For completeness, here's what I've currently got. I've failed to measure
> any performance impact on my 8-core systems, but that's not surprising.

> +static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> +{
> +	unsigned int tmp;
> +	arch_spinlock_t lockval;
>
> +	asm volatile(
> +"	sevl\n"
> +"1:	wfe\n"

Using WFE here would lower the cacheline bouncing pressure a bit I
imagine. Sure, we still pull it over into S(hared) after every store,
but we don't keep banging on it, making the initial e(X)clusive grab
hard.

> +"2:	ldaxr	%w0, %2\n"
> +"	eor	%w1, %w0, %w0, ror #16\n"
> +"	cbnz	%w1, 1b\n"
> +	ARM64_LSE_ATOMIC_INSN(
> +	/* LL/SC */
> +"	stxr	%w1, %w0, %2\n"
> +	/* Serialise against any concurrent lockers */
> +"	cbnz	%w1, 2b\n",
> +	/* LSE atomics */
> +"	nop\n"
> +"	nop\n")

I find these ARM64_LSE macro thingies aren't always easy to read; it's
fairly easy to overlook the ',' separating the v8 and v8.1 parts, esp.
if you have further interleaving comments like in the above.

> +	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
> +	:
> +	: "memory");
> +}
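For readers unfamiliar with the asm, the LL/SC path above behaves roughly
like the C-level sketch below. This is an illustrative approximation only:
it ignores the SEVL/WFE event mechanism and the LSE alternative, it assumes
arm64's ticket-lock layout with 16-bit owner/next halves (which is what the
EOR with a 16-bit rotate compares), and try_store_exclusive() is a made-up
helper standing in for STXR.

	/*
	 * Sketch only, not the real implementation: spin until the lock
	 * is observed free, then write the unlocked value straight back
	 * with a store-exclusive so that any concurrent locker's LL/SC
	 * sequence is forced to replay.
	 */
	static inline void unlock_wait_sketch(arch_spinlock_t *lock)
	{
		arch_spinlock_t lockval;

		do {
			do {
				lockval = READ_ONCE(*lock);	/* the LDAXR */
			} while (lockval.owner != lockval.next);/* still held */
		} while (!try_store_exclusive(lock, lockval));	/* the STXR */
	}

The store-back is what provides the serialisation: a locker whose LL has
already read the lock word loses its exclusive monitor when the STXR above
succeeds, so it must re-run its LL/SC sequence.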
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Wed, Nov 18, 2015 at 11:25:14AM +, Will Deacon wrote:
> On Tue, Nov 17, 2015 at 01:01:09PM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 17, 2015 at 11:51:10AM +, Will Deacon wrote:
> > > On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote:
> > > > On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon
> > > > wrote:
> > > > >
> > > > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > > > > slightly cheaper than spin_lock()+spin_unlock().
> > > >
> > > > So traditionally the real concern has been the cacheline ping-pong
> > > > part of spin_unlock_wait(). I think adding a memory barrier (that
> > > > doesn't force any exclusive states, just ordering) to it is fine, but
> > > > I don't think we want to necessarily have it have to get the cacheline
> > > > into exclusive state.
> > >
> > > The problem is, I don't think the memory-barrier buys you anything in
> > > the context of Boqun's example. In fact, he already had smp_mb() either
> > > side of the spin_unlock_wait() and its still broken on arm64 and ppc.
> > >
> > > Paul is proposing adding a memory barrier after spin_lock() in the racing
> > > thread, but I personally think people will forget to add that.
> >
> > A mechanical check would certainly make me feel better about it, so that
> > any lock that was passed to spin_unlock_wait() was required to have all
> > acquisitions followed by smp_mb__after_unlock_lock() or some such.
> > But I haven't yet given up on finding a better solution.
>
> Right-o. I'll hack together the arm64 spin_unlock_wait fix, but hold off
> merging it for a few weeks in case we get struck by a sudden flash of
> inspiration.

For completeness, here's what I've currently got. I've failed to measure
any performance impact on my 8-core systems, but that's not surprising.

Will

--->8

From da14adc1aef2f12b7a7def4d6b7dde254a91ebf1 Mon Sep 17 00:00:00 2001
From: Will Deacon
Date: Thu, 19 Nov 2015 17:48:31 +
Subject: [PATCH] arm64: spinlock: serialise spin_unlock_wait against
 concurrent lockers

Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
on architectures implementing spin_lock with LL/SC sequences and acquire
semantics:

 |   CPU 1                   CPU 2                     CPU 3
 |   ==================      ====================      ==============
 |                                                     spin_unlock(&lock);
 |                           spin_lock(&lock):
 |                             r1 = *lock; // r1 == 0;
 |                           o = READ_ONCE(object); // reordered here
 |   object = NULL;
 |   smp_mb();
 |   spin_unlock_wait(&lock);
 |                           *lock = 1;
 |   smp_mb();
 |   o->dead = true;
 |                           if (o) // true
 |                             BUG_ON(o->dead); // true!!

The crux of the problem is that spin_unlock_wait(&lock) can return on
CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
to serialise against a concurrent locker and giving it acquire semantics
in the process (although it is not at all clear whether this is needed -
different callers seem to assume different things about the barrier
semantics and architectures are similarly disjoint in their
implementations of the macro).

This patch implements spin_unlock_wait using an LL/SC sequence with
acquire semantics on arm64. For v8.1 systems with the LSE atomics, the
exclusive writeback is omitted, since the spin_lock operation is
indivisible and no intermediate state can be observed.

Signed-off-by: Will Deacon
---
 arch/arm64/include/asm/spinlock.h | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index c85e96d174a5..b531791a75ff 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,9 +26,29 @@
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	unsigned int tmp;
+	arch_spinlock_t lockval;
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+	asm volatile(
+"	sevl\n"
+"1:	wfe\n"
+"2:	ldaxr	%w0, %2\n"
+"	eor	%w1, %w0, %w0, ror #16\n"
+"	cbnz	%w1, 1b\n"
+	ARM64_LSE_ATOMIC_INSN(
+	/* LL/SC */
+"	stxr	%w1, %w0, %2\n"
+	/* Serialise against any concurrent lockers */
+"	cbnz	%w1, 2b\n",
+	/* LSE atomics */
+"	nop\n"
+"	nop\n")
+	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
+	:
+	: "memory");
+}
 
 #define arch_spin_lock_flags(lock, flags)   arch_spin_lock(lock)
-- 
2.1.4
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Tue, Nov 17, 2015 at 01:01:09PM -0800, Paul E. McKenney wrote: > On Tue, Nov 17, 2015 at 11:51:10AM +, Will Deacon wrote: > > On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote: > > > On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon wrote: > > > > > > > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be > > > > slightly cheaper than spin_lock()+spin_unlock(). > > > > > > So traditionally the real concern has been the cacheline ping-pong > > > part of spin_unlock_wait(). I think adding a memory barrier (that > > > doesn't force any exclusive states, just ordering) to it is fine, but > > > I don't think we want to necessarily have it have to get the cacheline > > > into exclusive state. > > > > The problem is, I don't think the memory-barrier buys you anything in > > the context of Boqun's example. In fact, he already had smp_mb() either > > side of the spin_unlock_wait() and its still broken on arm64 and ppc. > > > > Paul is proposing adding a memory barrier after spin_lock() in the racing > > thread, but I personally think people will forget to add that. > > A mechanical check would certainly make me feel better about it, so that > any lock that was passed to spin_unlock_wait() was required to have all > acquisitions followed by smp_mb__after_unlock_lock() or some such. > But I haven't yet given up on finding a better solution. Right-o. I'll hack together the arm64 spin_unlock_wait fix, but hold off merging it for a few weeks in case we get struck by a sudden flash of inspiration. Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Tue, Nov 17, 2015 at 11:51:10AM +, Will Deacon wrote: > Hi Linus, > > On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote: > > On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon wrote: > > > > > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be > > > slightly cheaper than spin_lock()+spin_unlock(). > > > > So traditionally the real concern has been the cacheline ping-pong > > part of spin_unlock_wait(). I think adding a memory barrier (that > > doesn't force any exclusive states, just ordering) to it is fine, but > > I don't think we want to necessarily have it have to get the cacheline > > into exclusive state. > > The problem is, I don't think the memory-barrier buys you anything in > the context of Boqun's example. In fact, he already had smp_mb() either > side of the spin_unlock_wait() and its still broken on arm64 and ppc. > > Paul is proposing adding a memory barrier after spin_lock() in the racing > thread, but I personally think people will forget to add that. A mechanical check would certainly make me feel better about it, so that any lock that was passed to spin_unlock_wait() was required to have all acquisitions followed by smp_mb__after_unlock_lock() or some such. But I haven't yet given up on finding a better solution. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
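Concretely, the convention Paul describes would make every acquisition of a
lock that is also passed to spin_unlock_wait() look something like the
sketch below. This is only an illustration of the proposed rule, using
->pi_lock (the lock discussed elsewhere in this thread) as the example; it
is not existing kernel code.

	/* Racing locker (e.g. try_to_wake_up()): promote the ACQUIRE to a
	 * full barrier so that accesses in the critical section cannot be
	 * reordered before the store that makes the lock visibly held. */
	raw_spin_lock(&p->pi_lock);
	smp_mb__after_unlock_lock();
	/* ... critical section ... */
	raw_spin_unlock(&p->pi_lock);

	/* Waiter, as in do_exit(): wait for any such locker to drop the lock. */
	smp_mb();
	raw_spin_unlock_wait(&current->pi_lock);

A mechanical check would then only need to verify that every acquisition of
such a lock is immediately followed by the smp_mb__after_unlock_lock().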
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Hi Linus,

On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon wrote:
> >
> > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > slightly cheaper than spin_lock()+spin_unlock().
>
> So traditionally the real concern has been the cacheline ping-pong
> part of spin_unlock_wait(). I think adding a memory barrier (that
> doesn't force any exclusive states, just ordering) to it is fine, but
> I don't think we want to necessarily have it have to get the cacheline
> into exclusive state.

The problem is, I don't think the memory-barrier buys you anything in
the context of Boqun's example. In fact, he already had smp_mb() either
side of the spin_unlock_wait() and it's still broken on arm64 and ppc.

Paul is proposing adding a memory barrier after spin_lock() in the
racing thread, but I personally think people will forget to add that.

> Because if spin_unlock_wait() ends up having to get the spinlock
> cacheline (for example, by writing the same value back with a SC), I
> don't think spin_unlock_wait() will really be all that much cheaper
> than just getting the spinlock, and in that case we shouldn't play
> complicated ordering games.

It was the lock-fairness guarantees that I was concerned about. A
spin_lock could place you into a queue, so you're no longer waiting for
a single spin_unlock(), you're now waiting for *all* the spin_unlocks
by the CPUs preceding you in the queue.

> On another issue:
>
> I'm also looking at the ARM documentation for strx, and the
> _documentation_ says that it has no stronger ordering than a "store
> release", but I'm starting to wonder if that is actually true.
>
> Because I do end up thinking that it does have the same "control
> dependency" to all subsequent writes (but not reads). So reads after
> the SC can percolate up, but I think writes are restricted.
>
> Why? In order for the SC to be able to return success, the write
> itself may not have been actually done yet, but the cacheline for the
> write must have successfully be turned into exclusive ownership.
> Agreed?

If the LL/SC logic hangs off the coherency logic, then yes, but there
are other ways to build this (using a separate "exclusive monitor"
block) and the architecture caters for this, too. See below.

> That means that by the time a SC returns success, no other CPU can see
> the old value of the spinlock any more. So by the time any subsequent
> stores in the locked region can be visible to any other CPU's, the
> locked value of the lock itself has to be visible too.
>
> Agreed?

No. A successful SC is *not* multi-copy atomic, but you're right to
point out that it provides more guarantees than a plain store. In
particular, a successful SC cannot return its success value until its
corresponding write has fixed its place in the coherence order for the
location that it is updating.

To be more concrete (simplified AArch64 asm, X1 and X2 hold addresses
of zero-initialised locations):

P0
	LDXR	X0, [X1]
	ADD	X0, X0, #1
	STXR	X0, [X1]	// Succeeds

P1
	LDR	X0, [X1]	// Reads 1
	STR	#1, [X2]

P2
	LDR	X0, [X2]	// Reads 1
	LDR	X0, [X1]	// **Not required to read 1**

However:

P0
	LDXR	X0, [X1]
	ADD	X0, X0, #1
	STXR	X0, [X1]	// Succeeds

P1
	LDR	X0, [X1]	// Reads 1
	STR	#1, [X2]

P2
	LDR	X0, [X2]	// Reads 1
	STR	#2, [X1]	// Location at [X1] must be ordered {0->1->2}

We can also extend this example so that P2 instead does:

P2
	LDR	X0, [X2]	// Reads 1
	LDXR	X0, [X1]
	ADD	X0, X0, #1
	STXR	X0, [X1]	// Succeeds; location at [x1] must be ordered {0->1->2}

i.e. the STXR cannot succeed until the coherence order has been
resolved, which also requires the LDXR to return the up-to-date value.

> So I think that in effect, when a spinlock is implemnted with LL/SC,
> the loads inside the locked region are only ordered wrt the acquire on
> the LL, but the stores can be considered ordered wrt the SC.
>
> No?

I initially fell into the same trap because a control dependency between
a load and a store is sufficient to create order (i.e. we don't
speculate writes). An SC is different, though, because the control
dependency can be resolved without the write being multi-copy atomic,
whereas a read is required to return its data (i.e. complete) before the
control hazard can be resolved.

I think that all of this means I should either:

  (1) Update the arm64 spin_unlock_wait to use LDXR/STXR (perhaps with
      acquire semantics?)

- or -

  (2) Replace spin_unlock_wait with spin_lock; spin_unlock, my worries
      about queuing notwithstanding.

The cacheline ping-pong in (1) can be mitigated somewhat by the use of
wfe, so it won't be any worse than a spin_lock().

Will
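For reference, option (2) amounts to the following generic fallback. This
is just a sketch of the idea (spin_unlock_wait() is really a
per-architecture macro, not a function defined like this):

	/*
	 * Option (2) as a sketch: wait for the lock to be released by
	 * briefly becoming a locker ourselves. This serialises against
	 * all concurrent lockers, but inherits the queuing/fairness cost
	 * described above for ticket locks.
	 */
	static inline void spin_unlock_wait(spinlock_t *lock)
	{
		spin_lock(lock);
		spin_unlock(lock);
	}

Option (1) is essentially what the arm64 patch at the top of this thread
implements, with wfe used to keep the cacheline traffic down.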
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon wrote: > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be > slightly cheaper than spin_lock()+spin_unlock(). So traditionally the real concern has been the cacheline ping-pong part of spin_unlock_wait(). I think adding a memory barrier (that doesn't force any exclusive states, just ordering) to it is fine, but I don't think we want to necessarily have it have to get the cacheline into exclusive state. Because if spin_unlock_wait() ends up having to get the spinlock cacheline (for example, by writing the same value back with a SC), I don't think spin_unlock_wait() will really be all that much cheaper than just getting the spinlock, and in that case we shouldn't play complicated ordering games. On another issue: I'm also looking at the ARM documentation for strx, and the _documentation_ says that it has no stronger ordering than a "store release", but I'm starting to wonder if that is actually true. Because I do end up thinking that it does have the same "control dependency" to all subsequent writes (but not reads). So reads after the SC can percolate up, but I think writes are restricted. Why? In order for the SC to be able to return success, the write itself may not have been actually done yet, but the cacheline for the write must have successfully be turned into exclusive ownership. Agreed? That means that by the time a SC returns success, no other CPU can see the old value of the spinlock any more. So by the time any subsequent stores in the locked region can be visible to any other CPU's, the locked value of the lock itself has to be visible too. Agreed? So I think that in effect, when a spinlock is implemnted with LL/SC, the loads inside the locked region are only ordered wrt the acquire on the LL, but the stores can be considered ordered wrt the SC. No? So I think a _successful_ SC - is still more ordered than just any random store with release consistency. Of course, I'm not sure that actually *helps* us, because I think the problem tends to be loads in the locked region moving up earlier than the actual store that sets the lock, but maybe it makes some difference. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 16, 2015 at 04:46:36PM +, Will Deacon wrote: > On Mon, Nov 16, 2015 at 08:44:43AM -0800, Paul E. McKenney wrote: > > On Mon, Nov 16, 2015 at 04:24:53PM +, Will Deacon wrote: > > > On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote: > > > > On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote: > > > > > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote: > > > > > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should > > > > > > generally be that you have some external ordering guarantee that > > > > > > guarantees that the lock has been taken. For example, for the IPC > > > > > > semaphores, we do either one of: > > > > > > > > > > > > (a) get large lock, then - once you hold that lock - wait for each > > > > > > small lock > > > > > > > > > > > > or > > > > > > > > > > > > (b) get small lock, then - once you hold that lock - check that the > > > > > > largo lock is unlocked > > > > > > > > > > > > and that's the case we should really worry about. The other uses of > > > > > > spin_unlock_wait() should have similar "I have other reasons to know > > > > > > I've seen that the lock was taken, or will never be taken after this > > > > > > because XYZ". > > > > > > > > > > I don't think this is true for the usage in do_exit(), we have no > > > > > knowledge on if pi_lock is taken or not. We just want to make sure > > > > > that > > > > > _if_ it were taken, we wait until it is released. > > > > > > > > And unless PPC would move to using RCsc locks with a SYNC in > > > > spin_lock(), I don't think it makes sense to add > > > > smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this. > > > > As that is far more expensive than flipping the exit path to do > > > > spin_lock()+spin_unlock(). > > > > > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be > > > slightly cheaper than spin_lock()+spin_unlock(). > > > > Or we supply a heavyweight version of spin_unlock_wait() that forces > > the cache miss. But I bet that the difference in overhead between > > spin_lock()+spin_unlock() and the heavyweight version would be down in > > the noise. > > I'm not so sure. If the lock is ticket-based, then spin_lock() has to > queue for its turn, whereas spin_unlock_wait could just wait for the > next unlock. Fair point, and it actually applies to high-contention spinlocks as well, just a bit less deterministically. OK, given that I believe that we do see high contention on the lock in question, I withdraw any objections to a heavy-weight form of spin_unlock_wait(). Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 16, 2015 at 08:44:43AM -0800, Paul E. McKenney wrote: > On Mon, Nov 16, 2015 at 04:24:53PM +, Will Deacon wrote: > > On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote: > > > On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote: > > > > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote: > > > > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should > > > > > generally be that you have some external ordering guarantee that > > > > > guarantees that the lock has been taken. For example, for the IPC > > > > > semaphores, we do either one of: > > > > > > > > > > (a) get large lock, then - once you hold that lock - wait for each > > > > > small lock > > > > > > > > > > or > > > > > > > > > > (b) get small lock, then - once you hold that lock - check that the > > > > > largo lock is unlocked > > > > > > > > > > and that's the case we should really worry about. The other uses of > > > > > spin_unlock_wait() should have similar "I have other reasons to know > > > > > I've seen that the lock was taken, or will never be taken after this > > > > > because XYZ". > > > > > > > > I don't think this is true for the usage in do_exit(), we have no > > > > knowledge on if pi_lock is taken or not. We just want to make sure that > > > > _if_ it were taken, we wait until it is released. > > > > > > And unless PPC would move to using RCsc locks with a SYNC in > > > spin_lock(), I don't think it makes sense to add > > > smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this. > > > As that is far more expensive than flipping the exit path to do > > > spin_lock()+spin_unlock(). > > > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be > > slightly cheaper than spin_lock()+spin_unlock(). > > Or we supply a heavyweight version of spin_unlock_wait() that forces > the cache miss. But I bet that the difference in overhead between > spin_lock()+spin_unlock() and the heavyweight version would be down in > the noise. I'm not so sure. If the lock is ticket-based, then spin_lock() has to queue for its turn, whereas spin_unlock_wait could just wait for the next unlock. Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
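To illustrate the ticket-lock point: spin_unlock_wait() only has to observe
the lock become free, whereas spin_lock() takes a ticket and must wait for
every holder ahead of it. A rough sketch follows (field and helper names
are made up; real implementations are per-architecture):

	struct ticket_lock {
		u16 owner;	/* ticket currently holding the lock */
		u16 next;	/* next ticket to be handed out */
	};

	static void ticket_unlock_wait(struct ticket_lock *lock)
	{
		struct ticket_lock val;

		/* No ticket is taken: return as soon as the lock is seen free */
		do {
			val = READ_ONCE(*lock);
		} while (val.owner != val.next);
	}

	static void ticket_lock(struct ticket_lock *lock)
	{
		/* fetch_and_add() is a made-up atomic helper */
		u16 ticket = fetch_and_add(&lock->next, 1);

		/* Must wait for every ticket ahead of ours to be served */
		while (READ_ONCE(lock->owner) != ticket)
			cpu_relax();
	}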
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 16, 2015 at 04:24:53PM +, Will Deacon wrote: > On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote: > > On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote: > > > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote: > > > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should > > > > generally be that you have some external ordering guarantee that > > > > guarantees that the lock has been taken. For example, for the IPC > > > > semaphores, we do either one of: > > > > > > > > (a) get large lock, then - once you hold that lock - wait for each > > > > small lock > > > > > > > > or > > > > > > > > (b) get small lock, then - once you hold that lock - check that the > > > > largo lock is unlocked > > > > > > > > and that's the case we should really worry about. The other uses of > > > > spin_unlock_wait() should have similar "I have other reasons to know > > > > I've seen that the lock was taken, or will never be taken after this > > > > because XYZ". > > > > > > I don't think this is true for the usage in do_exit(), we have no > > > knowledge on if pi_lock is taken or not. We just want to make sure that > > > _if_ it were taken, we wait until it is released. > > > > And unless PPC would move to using RCsc locks with a SYNC in > > spin_lock(), I don't think it makes sense to add > > smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this. > > As that is far more expensive than flipping the exit path to do > > spin_lock()+spin_unlock(). > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be > slightly cheaper than spin_lock()+spin_unlock(). Or we supply a heavyweight version of spin_unlock_wait() that forces the cache miss. But I bet that the difference in overhead between spin_lock()+spin_unlock() and the heavyweight version would be down in the noise. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote: > On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote: > > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote: > > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should > > > generally be that you have some external ordering guarantee that > > > guarantees that the lock has been taken. For example, for the IPC > > > semaphores, we do either one of: > > > > > > (a) get large lock, then - once you hold that lock - wait for each small > > > lock > > > > > > or > > > > > > (b) get small lock, then - once you hold that lock - check that the > > > largo lock is unlocked > > > > > > and that's the case we should really worry about. The other uses of > > > spin_unlock_wait() should have similar "I have other reasons to know > > > I've seen that the lock was taken, or will never be taken after this > > > because XYZ". > > > > I don't think this is true for the usage in do_exit(), we have no > > knowledge on if pi_lock is taken or not. We just want to make sure that > > _if_ it were taken, we wait until it is released. > > And unless PPC would move to using RCsc locks with a SYNC in > spin_lock(), I don't think it makes sense to add > smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this. > As that is far more expensive than flipping the exit path to do > spin_lock()+spin_unlock(). ... or we upgrade spin_unlock_wait to a LOCK operation, which might be slightly cheaper than spin_lock()+spin_unlock(). Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote: > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote: > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should > > generally be that you have some external ordering guarantee that > > guarantees that the lock has been taken. For example, for the IPC > > semaphores, we do either one of: > > > > (a) get large lock, then - once you hold that lock - wait for each small > > lock > > > > or > > > > (b) get small lock, then - once you hold that lock - check that the > > largo lock is unlocked > > > > and that's the case we should really worry about. The other uses of > > spin_unlock_wait() should have similar "I have other reasons to know > > I've seen that the lock was taken, or will never be taken after this > > because XYZ". > > I don't think this is true for the usage in do_exit(), we have no > knowledge on if pi_lock is taken or not. We just want to make sure that > _if_ it were taken, we wait until it is released. And unless PPC would move to using RCsc locks with a SYNC in spin_lock(), I don't think it makes sense to add smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this. As that is far more expensive than flipping the exit path to do spin_lock()+spin_unlock(). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote: > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should > generally be that you have some external ordering guarantee that > guarantees that the lock has been taken. For example, for the IPC > semaphores, we do either one of: > > (a) get large lock, then - once you hold that lock - wait for each small lock > > or > > (b) get small lock, then - once you hold that lock - check that the > largo lock is unlocked > > and that's the case we should really worry about. The other uses of > spin_unlock_wait() should have similar "I have other reasons to know > I've seen that the lock was taken, or will never be taken after this > because XYZ". I don't think this is true for the usage in do_exit(), we have no knowledge on if pi_lock is taken or not. We just want to make sure that _if_ it were taken, we wait until it is released. But I'm not sure where task_work_run() sits, at first reading it appears to also not be true -- there doesn't appear to be a reason we know a lock to be held. It does however appear true for the usage in completion_done(), where by having tested x->done, we know a pi_lock _was_ held. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
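For reference, the do_exit() usage being discussed is essentially the
following (condensed from kernel/exit.c of that era; the smp_mb() /
raw_spin_unlock_wait() pair and the TASK_DEAD transition are both quoted
verbatim further down this thread):

	/*
	 * We do not know whether a concurrent try_to_wake_up() holds
	 * ->pi_lock; we only need to wait for it to be released if it is
	 * held, so that ttwu() cannot set the task back to TASK_RUNNING
	 * after the TASK_DEAD write below.
	 */
	smp_mb();
	raw_spin_unlock_wait(&tsk->pi_lock);

	tsk->state = TASK_DEAD;
	schedule();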
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Hi Paul,

On Thu, Nov 12, 2015 at 03:43:51PM -0800, Paul E. McKenney wrote:
> On Thu, Nov 12, 2015 at 09:33:39PM +, Will Deacon wrote:
> > I think we ended up concluding that smp_mb__after_unlock_lock is indeed
> > required, but I don't think we should just resurrect the old definition,
> > which doesn't keep UNLOCK -> LOCK distinct from RELEASE -> ACQUIRE. I'm
> > still working on documenting the different types of transitivity that we
> > identified in that thread, but it's slow going.
> >
> > Also, as far as spin_unlock_wait is concerned, it is neither a LOCK or
> > an UNLOCK and this barrier doesn't offer us anything. Sure, it might
> > work because PPC defines it as smp_mb(), but it doesn't help on arm64
> > and defining the macro is overkill for us in most places (i.e. RCU).
> >
> > If we decide that the current usage of spin_unlock_wait is valid, then I
> > would much rather implement a version of it in the arm64 backend that
> > does something like:
> >
> > 1:	ldrex	r1, [&lock]
> > 	if r1 indicates that lock is taken, branch back to 1b
> > 	strex	r1, [&lock]
> > 	if store failed, branch back to 1b
> > i.e. we don't just test the lock, but we also write it back atomically
> > if we discover that it's free. That would then clear the exclusive monitor
> > on any cores in the process of taking the lock and restore the ordering
> > that we need.
>
> We could clearly do something similar in PowerPC, but I suspect that this
> would hurt really badly on large systems, given that there are PowerPC
> systems with more than a thousand hardware threads. So one approach
> is ARM makes spin_unlock_wait() do the write, similar to spin_lock();
> spin_lock(), but PowerPC relies on smp_mb__after_unlock_lock().

Sure, I'm certainly not trying to tell you how to do this for PPC, but
the above would be better for arm64 (any huge system should be using the
8.1 atomics anyway).

> Or does someone have a better proposal?

I don't think I'm completely clear on the current proposal wrt
smp_mb__after_unlock_lock. To summarise my understanding (in the context
of Boqun's original example, which I've duplicated at the end of this
mail):

  * Putting smp_mb__after_unlock_lock() after the LOCK on CPU2 creates
    global order, by upgrading the UNLOCK -> LOCK to a full barrier. If
    we extend this to include the accesses made by the UNLOCK and LOCK
    as happening *before* the notional full barrier, then a from-read
    edge from CPU2 to CPU1 on `object' implies that the LOCK operation
    is observed by CPU1 before it writes object = NULL. So we can add
    this barrier and fix the test for PPC.

  * Upgrading spin_unlock_wait to a LOCK operation (which is basically
    what I'm proposing for arm64) means that we now rely on LOCK -> LOCK
    being part of a single, total order (i.e. all CPUs agree on the
    order in which a lock was taken).

Assuming PPC has this global LOCK -> LOCK ordering, then we end up in a
sticky situation defining the kernel's memory model because we don't
have an architecture-agnostic semantic. The only way I can see to fix
this is by adding something like smp_mb__between_lock_unlock_wait, but
that's grim.

Do you see a way forward?

Will

--->8

CPU 1                   CPU 2                     CPU 3
==================      ====================      ==============
                                                  spin_unlock(&lock);
                        spin_lock(&lock):
                          r1 = *lock; // r1 == 0;
                        o = READ_ONCE(object); // reordered here
object = NULL;
smp_mb();
spin_unlock_wait(&lock);
                        *lock = 1;
smp_mb();
o->dead = true;
                        if (o) // true
                          BUG_ON(o->dead); // true!!
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 09:33:39PM +, Will Deacon wrote: > On Thu, Nov 12, 2015 at 10:59:06AM -0800, Paul E. McKenney wrote: > > On Thu, Nov 12, 2015 at 08:33:02PM +0100, Oleg Nesterov wrote: > > > On 11/12, Peter Zijlstra wrote: > > > > > > > > On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote: > > > > > It seems that PPC needs to define smp_mb__before_spinlock() as full > > > > > mb(), > > > > > no? > > > > > > > > It does: > > > > > > > > arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() > > > > smp_mb() > > > > > > Ah, indeed, thanks. > > > > > > And given that it also defines smp_mb__after_unlock_lock() as smp_mb(), > > > I am starting to understand how it can help to avoid the races with > > > spin_unlock_wait() in (for example) do_exit(). > > > > > > But as Boqun has already mentioned, this means that > > > mb__after_unlock_lock() > > > has the new meaning which should be documented. > > > > > > Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be > > > reverted > > > then ;) > > > > Surprisingly, this reverts cleanly against today's mainline, please see > > the patch below. Against my -rcu stack, not so much, but so it goes. ;-) > > I think we ended up concluding that smp_mb__after_unlock_lock is indeed > required, but I don't think we should just resurrect the old definition, > which doesn't keep UNLOCK -> LOCK distinct from RELEASE -> ACQUIRE. I'm > still working on documenting the different types of transitivity that we > identified in that thread, but it's slow going. > > Also, as far as spin_unlock_wait is concerned, it is neither a LOCK or > an UNLOCK and this barrier doesn't offer us anything. Sure, it might > work because PPC defines it as smp_mb(), but it doesn't help on arm64 > and defining the macro is overkill for us in most places (i.e. RCU). > > If we decide that the current usage of spin_unlock_wait is valid, then I > would much rather implement a version of it in the arm64 backend that > does something like: > > 1: ldrex r1, [&lock] > if r1 indicates that lock is taken, branch back to 1b > strex r1, [&lock] > if store failed, branch back to 1b > > i.e. we don't just test the lock, but we also write it back atomically > if we discover that it's free. That would then clear the exclusive monitor > on any cores in the process of taking the lock and restore the ordering > that we need. We could clearly do something similar in PowerPC, but I suspect that this would hurt really badly on large systems, given that there are PowerPC systems with more than a thousand hardware threads. So one approach is ARM makes spin_unlock_wait() do the write, similar to spin_lock(); spin_lock(), but PowerPC relies on smp_mb__after_unlock_lock(). Or does someone have a better proposal? Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote: > On Wed, Nov 11, 2015 at 11:14 PM, Boqun Feng wrote: > > > > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock() > > only guarantees that the memory operations following spin_lock() can't > > be reorder before the *LOAD* part of spin_lock() not the *STORE* part, > > i.e. the case below can happen(assuming the spin_lock() is implemented > > as ll/sc loop) > > > > spin_lock(&lock): > > r1 = *lock; // LL, r1 == 0 > > o = READ_ONCE(object); // could be reordered here. > > *lock = 1; // SC > > It may be worth noting that at least in theory, not only reads may > pass the store. If the spin-lock is done as > > r1 = *lock // read-acquire, r1 == 0 > *lock = 1// SC > > then even *writes* inside the locked region might pass up through the > "*lock = 1". Right, but only if we didn't have the control dependency to branch back in the case that the SC failed. In that case, the lock would be broken because you'd dive into the critical section even if you failed to take the lock. > So other CPU's - that haven't taken the spinlock - could see the > modifications inside the critical region before they actually see the > lock itself change. > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should > generally be that you have some external ordering guarantee that > guarantees that the lock has been taken. For example, for the IPC > semaphores, we do either one of: > > (a) get large lock, then - once you hold that lock - wait for each small lock > > or > > (b) get small lock, then - once you hold that lock - check that the > largo lock is unlocked > > and that's the case we should really worry about. The other uses of > spin_unlock_wait() should have similar "I have other reasons to know > I've seen that the lock was taken, or will never be taken after this > because XYZ". > > This is why powerpc has a memory barrier in "arch_spin_is_locked()". > Exactly so that the "check that the other lock is unlocked" is > guaranteed to be ordered wrt the store that gets the first lock. > > It looks like ARM64 gets this wrong and is fundamentally buggy wrt > "spin_is_locked()" (and, as a result, "spin_unlock_wait()"). I don't see how a memory barrier would help us here, and Boqun's example had smp_mb() either sides of the spin_unlock_wait(). What we actually need is to make spin_unlock_wait more like a LOCK operation, so that it forces parallel lockers to replay their LL/SC sequences. I'll write a patch once I've heard more back from Paul about smp_mb__after_unlock_lock(). > BUT! And this is a bug BUT: > > It should be noted that that is purely an ARM64 bug. Not a bug in our > users. If you have a spinlock where the "get lock write" part of the > lock can be delayed, then you have to have a "arch_spin_is_locked()" > that has the proper memory barriers. > > Of course, ARM still hides their architecture manuals in odd places, > so I can't double-check. But afaik, ARM64 store-conditional is "store > exclusive with release", and it has only release semantics, and ARM64 > really does have the above bug. The store-conditional in our spin_lock routine has no barrier semantics; they are enforced by the load-exclusive-acquire that pairs with it. > On that note: can anybody point me to the latest ARM64 8.1 > architecture manual in pdf form, without the "you have to register" > crap? 
I thought ARM released it, but all my googling just points to > the idiotic ARM service center that wants me to sign away something > just to see the docs. Which I don't do. We haven't yet released a document for the 8.1 instructions, but I can certainly send you a copy when it's made available. Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 07:02:51AM -0800, Paul E. McKenney wrote: > On Thu, Nov 12, 2015 at 10:49:02PM +0800, Boqun Feng wrote: > > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote: > > [snip] > > > > > > I cannot resist suggesting that any lock that interacts with > > > spin_unlock_wait() must have all relevant acquisitions followed by > > > smp_mb__after_unlock_lock(). > > > > > > > But > > > > 1. This would expand the purpose of smp_mb__after_unlock_lock(), > > right? smp_mb__after_unlock_lock() is for making UNLOCK-LOCK > > pair global transitive rather than guaranteeing no operations > > can be reorder before the STORE part of LOCK/ACQUIRE. > > Indeed it would. Which might be OK. > > > 2. If ARM64 has the same problem as PPC now, > > smp_mb__after_unlock_lock() can't help, as it's a no-op on > > ARM64. > > Agreed, and that is why we need Will to weigh in. I really don't want to implement smp_mb__after_unlock_lock, because we don't need it based on its current definition and I think there's a better way to fix spin_unlock_wait (see my other post). Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 10:59:06AM -0800, Paul E. McKenney wrote: > On Thu, Nov 12, 2015 at 08:33:02PM +0100, Oleg Nesterov wrote: > > On 11/12, Peter Zijlstra wrote: > > > > > > On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote: > > > > It seems that PPC needs to define smp_mb__before_spinlock() as full > > > > mb(), > > > > no? > > > > > > It does: > > > > > > arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() > > > smp_mb() > > > > Ah, indeed, thanks. > > > > And given that it also defines smp_mb__after_unlock_lock() as smp_mb(), > > I am starting to understand how it can help to avoid the races with > > spin_unlock_wait() in (for example) do_exit(). > > > > But as Boqun has already mentioned, this means that mb__after_unlock_lock() > > has the new meaning which should be documented. > > > > Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be reverted > > then ;) > > Surprisingly, this reverts cleanly against today's mainline, please see > the patch below. Against my -rcu stack, not so much, but so it goes. ;-) I think we ended up concluding that smp_mb__after_unlock_lock is indeed required, but I don't think we should just resurrect the old definition, which doesn't keep UNLOCK -> LOCK distinct from RELEASE -> ACQUIRE. I'm still working on documenting the different types of transitivity that we identified in that thread, but it's slow going. Also, as far as spin_unlock_wait is concerned, it is neither a LOCK or an UNLOCK and this barrier doesn't offer us anything. Sure, it might work because PPC defines it as smp_mb(), but it doesn't help on arm64 and defining the macro is overkill for us in most places (i.e. RCU). If we decide that the current usage of spin_unlock_wait is valid, then I would much rather implement a version of it in the arm64 backend that does something like: 1: ldrex r1, [&lock] if r1 indicates that lock is taken, branch back to 1b strex r1, [&lock] if store failed, branch back to 1b i.e. we don't just test the lock, but we also write it back atomically if we discover that it's free. That would then clear the exclusive monitor on any cores in the process of taking the lock and restore the ordering that we need. Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
[sorry for the late reply, I'm away from my desk until Monday since I'm busy with family issues] On Thu, Nov 12, 2015 at 07:20:42AM -0800, Paul E. McKenney wrote: > On Thu, Nov 12, 2015 at 04:08:22PM +0100, Peter Zijlstra wrote: > > On Thu, Nov 12, 2015 at 07:01:49AM -0800, Paul E. McKenney wrote: > > > On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote: > > > > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote: > > > > > I cannot resist suggesting that any lock that interacts with > > > > > spin_unlock_wait() must have all relevant acquisitions followed by > > > > > smp_mb__after_unlock_lock(). > > > > > > > > Ha! that would certainly help here. But it would mean that argh64v8 also > > > > needs to define that, even though that is already RCsc. > > > > > > Maybe. It could also be that arm64 avoids the need somehow, for example > > > via their RCsc behavior. Their memory model is similar to PPC, but not > > > exactly the same. > > > > > > Will? > > > > So when I spoke to Will earlier today, we agreed that LDAXR+STXR is > > susceptible to the same problem. The STXR will allow loads to pass up > > over that store. > > > > On v8.1, which is using LDADDA, this is not an issue, for as the ACQUIRE > > is part of the LOAD, the Read-Modify-Write is indivisible as a whole, > > and therefore a subsequent load has nothing to pass over. > > So one approach required for one level of hardware and another for the > next level. I can relate to that all too well... :-/ Just to confirm, Peter's correct in that Boqun's litmus test is permitted by the arm64 architecture when the ll/sc spinlock definitions are in use. However, I don't think that strengthening smp_mb__after_unlock_lock is the right way to solve this. I'll reply to the other part of the thread... Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 08:33:02PM +0100, Oleg Nesterov wrote: > On 11/12, Peter Zijlstra wrote: > > > > On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote: > > > It seems that PPC needs to define smp_mb__before_spinlock() as full mb(), > > > no? > > > > It does: > > > > arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() > > smp_mb() > > Ah, indeed, thanks. > > And given that it also defines smp_mb__after_unlock_lock() as smp_mb(), > I am starting to understand how it can help to avoid the races with > spin_unlock_wait() in (for example) do_exit(). > > But as Boqun has already mentioned, this means that mb__after_unlock_lock() > has the new meaning which should be documented. > > Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be reverted > then ;) Surprisingly, this reverts cleanly against today's mainline, please see the patch below. Against my -rcu stack, not so much, but so it goes. ;-) Thanx, Paul commit eff0632b4181f91f2596d56f7c73194e1a869aff Author: Paul E. McKenney Date: Thu Nov 12 10:54:23 2015 -0800 Revert "rcu,locking: Privatize smp_mb__after_unlock_lock()" This reverts commit 12d560f4ea87030667438a169912380be00cea4b. The reason for this revert is that smp_mb__after_unlock_lock() might prove useful outside of RCU after all for interactions between the locking primitives and spin_unlock_wait(). diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index aef9487303d0..d4501664d49f 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -1855,10 +1855,16 @@ RELEASE are to the same lock variable, but only from the perspective of another CPU not holding that lock. In short, a ACQUIRE followed by an RELEASE may -not- be assumed to be a full memory barrier. -Similarly, the reverse case of a RELEASE followed by an ACQUIRE does -not imply a full memory barrier. Therefore, the CPU's execution of the -critical sections corresponding to the RELEASE and the ACQUIRE can cross, -so that: +Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not +imply a full memory barrier. If it is necessary for a RELEASE-ACQUIRE +pair to produce a full barrier, the ACQUIRE can be followed by an +smp_mb__after_unlock_lock() invocation. This will produce a full barrier +(including transitivity) if either (a) the RELEASE and the ACQUIRE are +executed by the same CPU or task, or (b) the RELEASE and ACQUIRE act on +the same variable. The smp_mb__after_unlock_lock() primitive is free +on many architectures. Without smp_mb__after_unlock_lock(), the CPU's +execution of the critical sections corresponding to the RELEASE and the +ACQUIRE can cross, so that: *A = a; RELEASE M @@ -1896,6 +1902,29 @@ the RELEASE would simply complete, thereby avoiding the deadlock. a sleep-unlock race, but the locking primitive needs to resolve such races properly in any case. +With smp_mb__after_unlock_lock(), the two critical sections cannot overlap. 
+For example, with the following code, the store to *A will always be +seen by other CPUs before the store to *B: + + *A = a; + RELEASE M + ACQUIRE N + smp_mb__after_unlock_lock(); + *B = b; + +The operations will always occur in one of the following orders: + + STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B + STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B + ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B + +If the RELEASE and ACQUIRE were instead both operating on the same lock +variable, only the first of these alternatives can occur. In addition, +the more strongly ordered systems may rule out some of the above orders. +But in any case, as noted earlier, the smp_mb__after_unlock_lock() +ensures that the store to *A will always be seen as happening before +the store to *B. + Locks and semaphores may not provide any guarantee of ordering on UP compiled systems, and so cannot be counted on in such a situation to actually achieve anything at all - especially with respect to I/O accesses - unless combined @@ -2126,6 +2155,40 @@ But it won't see any of: *E, *F or *G following RELEASE Q +However, if the following occurs: + + CPU 1 CPU 2 + === === + WRITE_ONCE(*A, a); + ACQUIRE M[1] + WRITE_ONCE(*B, b); + WRITE_ONCE(*C, c); + RELEASE M[1] + WRITE_ONCE(*D, d); WRITE_ONCE(*E, e); + ACQUIRE M[2] + smp_mb__after_unlock_lock(); + WRITE_ONCE(*F, f); +
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On 11/12, Peter Zijlstra wrote: > > On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote: > > It seems that PPC needs to define smp_mb__before_spinlock() as full mb(), > > no? > > It does: > > arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() smp_mb() Ah, indeed, thanks. And given that it also defines smp_mb__after_unlock_lock() as smp_mb(), I am starting to understand how it can help to avoid the races with spin_unlock_wait() in (for example) do_exit(). But as Boqun has already mentioned, this means that mb__after_unlock_lock() has the new meaning which should be documented. Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be reverted then ;) Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Wed, Nov 11, 2015 at 11:14 PM, Boqun Feng wrote: > > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock() > only guarantees that the memory operations following spin_lock() can't > be reorder before the *LOAD* part of spin_lock() not the *STORE* part, > i.e. the case below can happen(assuming the spin_lock() is implemented > as ll/sc loop) > > spin_lock(&lock): > r1 = *lock; // LL, r1 == 0 > o = READ_ONCE(object); // could be reordered here. > *lock = 1; // SC It may be worth noting that at least in theory, not only reads may pass the store. If the spin-lock is done as r1 = *lock // read-acquire, r1 == 0 *lock = 1// SC then even *writes* inside the locked region might pass up through the "*lock = 1". So other CPU's - that haven't taken the spinlock - could see the modifications inside the critical region before they actually see the lock itself change. Now, the point of spin_unlock_wait() (and "spin_is_locked()") should generally be that you have some external ordering guarantee that guarantees that the lock has been taken. For example, for the IPC semaphores, we do either one of: (a) get large lock, then - once you hold that lock - wait for each small lock or (b) get small lock, then - once you hold that lock - check that the largo lock is unlocked and that's the case we should really worry about. The other uses of spin_unlock_wait() should have similar "I have other reasons to know I've seen that the lock was taken, or will never be taken after this because XYZ". This is why powerpc has a memory barrier in "arch_spin_is_locked()". Exactly so that the "check that the other lock is unlocked" is guaranteed to be ordered wrt the store that gets the first lock. It looks like ARM64 gets this wrong and is fundamentally buggy wrt "spin_is_locked()" (and, as a result, "spin_unlock_wait()"). BUT! And this is a bug BUT: It should be noted that that is purely an ARM64 bug. Not a bug in our users. If you have a spinlock where the "get lock write" part of the lock can be delayed, then you have to have a "arch_spin_is_locked()" that has the proper memory barriers. Of course, ARM still hides their architecture manuals in odd places, so I can't double-check. But afaik, ARM64 store-conditional is "store exclusive with release", and it has only release semantics, and ARM64 really does have the above bug. On that note: can anybody point me to the latest ARM64 8.1 architecture manual in pdf form, without the "you have to register" crap? I thought ARM released it, but all my googling just points to the idiotic ARM service center that wants me to sign away something just to see the docs. Which I don't do. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
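The IPC semaphore cases (a) and (b) above look roughly like this in
ipc/sem.c (heavily paraphrased from memory - see sem_lock() and
sem_wait_array() for the real code; field names may not match exactly):

	/* (a) complex operation: take the global lock, then wait for every
	 *     per-semaphore lock currently held to be released. */
	spin_lock(&sma->sem_perm.lock);
	for (i = 0; i < sma->sem_nsems; i++)
		spin_unlock_wait(&sma->sem_base[i].lock);

	/* (b) simple operation: take the per-semaphore lock, then check
	 *     that the global lock is not held. */
	spin_lock(&sem->lock);
	if (spin_is_locked(&sma->sem_perm.lock)) {
		/* someone holds the global lock: back off to the slow path */
	}

In both cases one of the two locks is already held before the other lock
is examined, which is the external ordering guarantee Linus describes.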
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote: > It seems that PPC needs to define smp_mb__before_spinlock() as full mb(), > no? It does: arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() smp_mb() -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On 11/12, Boqun Feng wrote: > > On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote: > > > > No? > > > > do_exit() is surely buggy if spin_lock() could work in this way. OK, good ;) so we need to fix spin_lock() on PPC ? Or add mb__after_unlock_lock() but this leads to other questions... Or change do_exit() to do lock() + unlock(). > > > And smp_mb__before_spinlock() looks wrong too then. > > > > Maybe not? As smp_mb__before_spinlock() is used before a LOCK operation, > which has both LOAD part and STORE part unlike spin_unlock_wait()? Maybe not. But let me remind that the original purpose of this mb__before_spinlock() was to ensure that "CONDITION = true" before ttwu() can not be reordered with if (!(p->state & state)) goto out; // do not wakeup inside try_to_wake_up(). Otherwise CONDITION = true; try_to_wake_up(p); can race with "p" doing set_current_state(...); // implies mb(); if (CONDITION) return; schedule(); because try_to_wake_up() can read p->state before it sets CONDITION = 1 and then it won't wakeup "p" which has already checked this CONDITION. Now. If try_to_wake_up() can read p->state before it writes to *pi_lock, then how smp_mb__before_spinlock() == wmb() can help to serialize STORE and LOAD? It seems that PPC needs to define smp_mb__before_spinlock() as full mb(), no? Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 10:49:02PM +0800, Boqun Feng wrote: > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote: > [snip] > > > > I cannot resist suggesting that any lock that interacts with > > spin_unlock_wait() must have all relevant acquisitions followed by > > smp_mb__after_unlock_lock(). > > > > But > > 1.This would expand the purpose of smp_mb__after_unlock_lock(), > right? smp_mb__after_unlock_lock() is for making UNLOCK-LOCK > pair global transitive rather than guaranteeing no operations > can be reorder before the STORE part of LOCK/ACQUIRE. Indeed it would. Which might be OK. > 2.If ARM64 has the same problem as PPC now, > smp_mb__after_unlock_lock() can't help, as it's a no-op on > ARM64. Agreed, and that is why we need Will to weigh in. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 04:08:22PM +0100, Peter Zijlstra wrote: > On Thu, Nov 12, 2015 at 07:01:49AM -0800, Paul E. McKenney wrote: > > On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote: > > > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote: > > > > I cannot resist suggesting that any lock that interacts with > > > > spin_unlock_wait() must have all relevant acquisitions followed by > > > > smp_mb__after_unlock_lock(). > > > > > > Ha! that would certainly help here. But it would mean that argh64v8 also > > > needs to define that, even though that is already RCsc. > > > > Maybe. It could also be that arm64 avoids the need somehow, for example > > via their RCsc behavior. Their memory model is similar to PPC, but not > > exactly the same. > > > > Will? > > So when I spoke to Will earlier today, we agreed that LDAXR+STXR is > susceptible to the same problem. The STXR will allow loads to pass up > over that store. > > On v8.1, which is using LDADDA, this is not an issue, for as the ACQUIRE > is part of the LOAD, the Read-Modify-Write is indivisible as a whole, > and therefore a subsequent load has nothing to pass over. So one approach required for one level of hardware and another for the next level. I can relate to that all too well... :-/ Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote: > On 11/12, Boqun Feng wrote: [snip] > > > > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock() > > only guarantees that the memory operations following spin_lock() can't > > be reorder before the *LOAD* part of spin_lock() not the *STORE* part, > > i.e. the case below can happen(assuming the spin_lock() is implemented > > as ll/sc loop) > > > > spin_lock(&lock): > > r1 = *lock; // LL, r1 == 0 > > o = READ_ONCE(object); // could be reordered here. > > *lock = 1; // SC > > > > This could happen because of the ACQUIRE semantics of spin_lock(), and > > the current implementation of spin_lock() on PPC allows this happen. > > > > (Cc PPC maintainers for their opinions on this one) > > In this case the code above is obviously wrong. And I do not understand > how we can rely on spin_unlock_wait() then. > > And afaics do_exit() is buggy too then, see below. > > > I think it's OK for it as an ACQUIRE(with a proper barrier) or even just > > a control dependency to pair with spin_unlock(), for example, the > > following snippet in do_exit() is OK, except the smp_mb() is redundant, > > unless I'm missing something subtle: > > > > /* > > * The setting of TASK_RUNNING by try_to_wake_up() may be delayed > > * when the following two conditions become true. > > * - There is race condition of mmap_sem (It is acquired by > > * exit_mm()), and > > * - SMI occurs before setting TASK_RUNINNG. > > * (or hypervisor of virtual machine switches to other guest) > > * As a result, we may become TASK_RUNNING after becoming TASK_DEAD > > * > > * To avoid it, we have to wait for releasing tsk->pi_lock which > > * is held by try_to_wake_up() > > */ > > smp_mb(); > > raw_spin_unlock_wait(&tsk->pi_lock); > > Perhaps it is me who missed something. But I don't think we can remove > this mb(). And at the same time it can't help on PPC if I understand You are right, we need this smp_mb() to order the previous STORE of ->state with the LOAD of ->pi_lock. I missed that part because I saw all the explicit STOREs of ->state in do_exit() are set_current_state() which has a smp_mb() following the STOREs. > your explanation above correctly. > > To simplify, lets ignore exit_mm/down_read/etc. The exiting task does > > > current->state = TASK_UNINTERRUPTIBLE; > // without schedule() in between > current->state = TASK_RUNNING; > > smp_mb(); > spin_unlock_wait(pi_lock); > > current->state = TASK_DEAD; > schedule(); > > and we need to ensure that if we race with > try_to_wake_up(TASK_UNINTERRUPTIBLE) > it can't change TASK_DEAD back to RUNNING. > > Without smp_mb() this can be reordered, spin_unlock_wait(pi_locked) can > read the old "unlocked" state of pi_lock before we set UNINTERRUPTIBLE, > so in fact we could have > > current->state = TASK_UNINTERRUPTIBLE; > > spin_unlock_wait(pi_lock); > > current->state = TASK_RUNNING; > > current->state = TASK_DEAD; > > and this can obviously race with ttwu() which can take pi_lock and see > state == TASK_UNINTERRUPTIBLE after spin_unlock_wait(). > Yep, my mistake ;-) > And, if I understand you correctly, this smp_mb() can't help on PPC. > try_to_wake_up() can read task->state before it writes to *pi_lock. 
> To me this doesn't really differ from the code above,
>
>	CPU 1 (do_exit)                         CPU_2 (ttwu)
>
>	                                        spin_lock(pi_lock):
>	                                          r1 = *pi_lock; // r1 == 0;
>	p->state = TASK_UNINTERRUPTIBLE;
>	                                          state = p->state;
>	p->state = TASK_RUNNING;
>	mb();
>	spin_unlock_wait();
>	                                          *pi_lock = 1;
>
>	p->state = TASK_DEAD;
>	                                          if (state & TASK_UNINTERRUPTIBLE) // true
>	                                            p->state = RUNNING;
>
> No?
> do_exit() is surely buggy if spin_lock() could work in this way.
> And smp_mb__before_spinlock() looks wrong too then.
>

Maybe not? As smp_mb__before_spinlock() is used before a LOCK operation, which has both LOAD part and STORE part unlike spin_unlock_wait()?

> Oleg.
>

signature.asc Description: PGP signature
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 07:01:49AM -0800, Paul E. McKenney wrote: > On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote: > > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote: > > > I cannot resist suggesting that any lock that interacts with > > > spin_unlock_wait() must have all relevant acquisitions followed by > > > smp_mb__after_unlock_lock(). > > > > Ha! that would certainly help here. But it would mean that argh64v8 also > > needs to define that, even though that is already RCsc. > > Maybe. It could also be that arm64 avoids the need somehow, for example > via their RCsc behavior. Their memory model is similar to PPC, but not > exactly the same. > > Will? So when I spoke to Will earlier today, we agreed that LDAXR+STXR is susceptible to the same problem. The STXR will allow loads to pass up over that store. On v8.1, which is using LDADDA, this is not an issue, for as the ACQUIRE is part of the LOAD, the Read-Modify-Write is indivisible as a whole, and therefore a subsequent load has nothing to pass over. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
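The contrast Peter draws can be restated with a small C11 sketch (illustrative only; the function names are invented and the mapping to arm64 instructions lives purely in the comments):

#include <stdatomic.h>

/*
 * Indivisible RMW lock, roughly the v8.1 LDADDA shape: the ACQUIRE is
 * part of a single atomic read-modify-write, so there is no window in
 * which a later load can be satisfied ahead of the lock-taking store.
 */
static void rmw_lock(atomic_int *lock)
{
	while (atomic_exchange_explicit(lock, 1, memory_order_acquire))
		;	/* spin until we read back 0 while storing 1 */
}

static void rmw_unlock(atomic_int *lock)
{
	atomic_store_explicit(lock, 0, memory_order_release);
}

/*
 * An LL/SC (LDAXR/STXR) expansion of the same lock splits the RMW into a
 * load-acquire plus a conditional store.  The ACQUIRE then attaches to
 * the load only, which is what lets loads inside the critical section be
 * satisfied before the STXR is visible to a CPU spinning in
 * spin_unlock_wait().
 */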
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote: > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote: > > I cannot resist suggesting that any lock that interacts with > > spin_unlock_wait() must have all relevant acquisitions followed by > > smp_mb__after_unlock_lock(). > > Ha! that would certainly help here. But it would mean that argh64v8 also > needs to define that, even though that is already RCsc. Maybe. It could also be that arm64 avoids the need somehow, for example via their RCsc behavior. Their memory model is similar to PPC, but not exactly the same. Will? Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote: > I cannot resist suggesting that any lock that interacts with > spin_unlock_wait() must have all relevant acquisitions followed by > smp_mb__after_unlock_lock(). Ha! that would certainly help here. But it would mean that argh64v8 also needs to define that, even though that is already RCsc. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote: [snip] > > I cannot resist suggesting that any lock that interacts with > spin_unlock_wait() must have all relevant acquisitions followed by > smp_mb__after_unlock_lock(). > But 1. This would expand the purpose of smp_mb__after_unlock_lock(), right? smp_mb__after_unlock_lock() is for making UNLOCK-LOCK pair global transitive rather than guaranteeing no operations can be reorder before the STORE part of LOCK/ACQUIRE. 2. If ARM64 has the same problem as PPC now, smp_mb__after_unlock_lock() can't help, as it's a no-op on ARM64. Regards, Boqun signature.asc Description: PGP signature
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote: > On 11/12, Boqun Feng wrote: > > > > On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote: > > > > > > object_t *object; > > > spinlock_t lock; > > > > > > void update(void) > > > { > > > object_t *o; > > > > > > spin_lock(&lock); > > > o = READ_ONCE(object); > > > if (o) { > > > BUG_ON(o->dead); > > > do_something(o); > > > } > > > spin_unlock(&lock); > > > } > > > > > > void destroy(void) // can be called only once, can't race with itself > > > { > > > object_t *o; > > > > > > o = object; > > > object = NULL; > > > > > > /* > > >* pairs with lock/ACQUIRE. The next update() must see > > >* object == NULL after spin_lock(); > > >*/ > > > smp_mb(); > > > > > > spin_unlock_wait(&lock); > > > > > > /* > > >* pairs with unlock/RELEASE. The previous update() has > > >* already passed BUG_ON(o->dead). > > >* > > >* (Yes, yes, in this particular case it is not needed, > > >* we can rely on the control dependency). > > >*/ > > > smp_mb(); > > > > > > o->dead = true; > > > } > > > > > > I believe the code above is correct and it needs the barriers on both > > > sides. > > > > > > > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock() > > only guarantees that the memory operations following spin_lock() can't > > be reorder before the *LOAD* part of spin_lock() not the *STORE* part, > > i.e. the case below can happen(assuming the spin_lock() is implemented > > as ll/sc loop) > > > > spin_lock(&lock): > > r1 = *lock; // LL, r1 == 0 > > o = READ_ONCE(object); // could be reordered here. > > *lock = 1; // SC > > > > This could happen because of the ACQUIRE semantics of spin_lock(), and > > the current implementation of spin_lock() on PPC allows this happen. > > > > (Cc PPC maintainers for their opinions on this one) > > In this case the code above is obviously wrong. And I do not understand > how we can rely on spin_unlock_wait() then. > > And afaics do_exit() is buggy too then, see below. > > > I think it's OK for it as an ACQUIRE(with a proper barrier) or even just > > a control dependency to pair with spin_unlock(), for example, the > > following snippet in do_exit() is OK, except the smp_mb() is redundant, > > unless I'm missing something subtle: > > > > /* > > * The setting of TASK_RUNNING by try_to_wake_up() may be delayed > > * when the following two conditions become true. > > * - There is race condition of mmap_sem (It is acquired by > > * exit_mm()), and > > * - SMI occurs before setting TASK_RUNINNG. > > * (or hypervisor of virtual machine switches to other guest) > > * As a result, we may become TASK_RUNNING after becoming TASK_DEAD > > * > > * To avoid it, we have to wait for releasing tsk->pi_lock which > > * is held by try_to_wake_up() > > */ > > smp_mb(); > > raw_spin_unlock_wait(&tsk->pi_lock); > > Perhaps it is me who missed something. But I don't think we can remove > this mb(). And at the same time it can't help on PPC if I understand > your explanation above correctly. I cannot resist suggesting that any lock that interacts with spin_unlock_wait() must have all relevant acquisitions followed by smp_mb__after_unlock_lock(). Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On 11/12, Boqun Feng wrote: > > On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote: > > > > object_t *object; > > spinlock_t lock; > > > > void update(void) > > { > > object_t *o; > > > > spin_lock(&lock); > > o = READ_ONCE(object); > > if (o) { > > BUG_ON(o->dead); > > do_something(o); > > } > > spin_unlock(&lock); > > } > > > > void destroy(void) // can be called only once, can't race with itself > > { > > object_t *o; > > > > o = object; > > object = NULL; > > > > /* > > * pairs with lock/ACQUIRE. The next update() must see > > * object == NULL after spin_lock(); > > */ > > smp_mb(); > > > > spin_unlock_wait(&lock); > > > > /* > > * pairs with unlock/RELEASE. The previous update() has > > * already passed BUG_ON(o->dead). > > * > > * (Yes, yes, in this particular case it is not needed, > > * we can rely on the control dependency). > > */ > > smp_mb(); > > > > o->dead = true; > > } > > > > I believe the code above is correct and it needs the barriers on both sides. > > > > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock() > only guarantees that the memory operations following spin_lock() can't > be reorder before the *LOAD* part of spin_lock() not the *STORE* part, > i.e. the case below can happen(assuming the spin_lock() is implemented > as ll/sc loop) > > spin_lock(&lock): > r1 = *lock; // LL, r1 == 0 > o = READ_ONCE(object); // could be reordered here. > *lock = 1; // SC > > This could happen because of the ACQUIRE semantics of spin_lock(), and > the current implementation of spin_lock() on PPC allows this happen. > > (Cc PPC maintainers for their opinions on this one) In this case the code above is obviously wrong. And I do not understand how we can rely on spin_unlock_wait() then. And afaics do_exit() is buggy too then, see below. > I think it's OK for it as an ACQUIRE(with a proper barrier) or even just > a control dependency to pair with spin_unlock(), for example, the > following snippet in do_exit() is OK, except the smp_mb() is redundant, > unless I'm missing something subtle: > > /* >* The setting of TASK_RUNNING by try_to_wake_up() may be delayed >* when the following two conditions become true. >* - There is race condition of mmap_sem (It is acquired by >* exit_mm()), and >* - SMI occurs before setting TASK_RUNINNG. >* (or hypervisor of virtual machine switches to other guest) >* As a result, we may become TASK_RUNNING after becoming TASK_DEAD >* >* To avoid it, we have to wait for releasing tsk->pi_lock which >* is held by try_to_wake_up() >*/ > smp_mb(); > raw_spin_unlock_wait(&tsk->pi_lock); Perhaps it is me who missed something. But I don't think we can remove this mb(). And at the same time it can't help on PPC if I understand your explanation above correctly. To simplify, lets ignore exit_mm/down_read/etc. The exiting task does current->state = TASK_UNINTERRUPTIBLE; // without schedule() in between current->state = TASK_RUNNING; smp_mb(); spin_unlock_wait(pi_lock); current->state = TASK_DEAD; schedule(); and we need to ensure that if we race with try_to_wake_up(TASK_UNINTERRUPTIBLE) it can't change TASK_DEAD back to RUNNING. 
Without smp_mb() this can be reordered, spin_unlock_wait(pi_locked) can read the old "unlocked" state of pi_lock before we set UNINTERRUPTIBLE, so in fact we could have

	current->state = TASK_UNINTERRUPTIBLE;

	spin_unlock_wait(pi_lock);

	current->state = TASK_RUNNING;

	current->state = TASK_DEAD;

and this can obviously race with ttwu() which can take pi_lock and see state == TASK_UNINTERRUPTIBLE after spin_unlock_wait().

And, if I understand you correctly, this smp_mb() can't help on PPC. try_to_wake_up() can read task->state before it writes to *pi_lock.

To me this doesn't really differ from the code above,

	CPU 1 (do_exit)                         CPU_2 (ttwu)

	                                        spin_lock(pi_lock):
	                                          r1 = *pi_lock; // r1 == 0;
	p->state = TASK_UNINTERRUPTIBLE;
	                                          state = p->state;
	p->state = TASK_RUNNING;
	mb();
	spin_unlock_wait();
	                                          *pi_lock = 1;

	p->state = TASK_DEAD;
	                                          if (state & TASK_UNINTERRUPTIBLE) // true
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Wed, Nov 11, 2015 at 06:34:56PM +0800, Boqun Feng wrote: > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote: > > Hi Oleg, > > > > On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote: > > [snip] > > > > > > Unfortunately this doesn't look exactly right... > > > > > > spin_unlock_wait() is not equal to "while (locked) relax", the latter > > > is live-lockable or at least sub-optimal: we do not really need to spin > > > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE? > > Hmm.. I guess I was wrong, it doesn't need to be an ACQUIRE, it needs > only to use the control dependency to order the load of lock state and > stores following it. I must say that spin_unlock_wait() and friends are a bit tricky. There are only a few uses in drivers/ata/libata-eh.c, ipc/sem.c, kernel/exit.c, kernel/sched/completion.c, and kernel/task_work.c. Almost all of these seem to assume that spin_unlock_wait() provides no ordering. We might have just barely enough uses to produce useful abstractions, but my guess is that it would not hurt to wait. > > Because spin_unlock_wait() is used for us to wait for a certain lock to > > RELEASE so that we can do something *after* we observe the RELEASE. > > Considering the follow example: > > > > CPU 0 CPU 1 > > === > > { X = 0 } > > WRITE_ONCE(X, 1); > > spin_unlock(&lock); > > spin_unlock_wait(&lock) > > r1 = READ_ONCE(X); > > > > If spin_unlock_wait() is not an ACQUIRE, r1 can be 0 in this case, > > right? Am I missing something subtle here? Or spin_unlock_wait() itself > > doesn't have the ACQUIRE semantics, but it should always come with a > > smp_mb() *following* it to achieve the ACQUIRE semantics? However in > > do_exit(), an smp_mb() is preceding raw_spin_unlock_wait() rather than > > following, which makes me confused... could you explain that? Thank you > > ;-) > > > > But still, there is one suspicious use of smp_mb() in do_exit(): > > /* >* The setting of TASK_RUNNING by try_to_wake_up() may be delayed >* when the following two conditions become true. >* - There is race condition of mmap_sem (It is acquired by >* exit_mm()), and >* - SMI occurs before setting TASK_RUNINNG. >* (or hypervisor of virtual machine switches to other guest) >* As a result, we may become TASK_RUNNING after becoming TASK_DEAD >* >* To avoid it, we have to wait for releasing tsk->pi_lock which >* is held by try_to_wake_up() >*/ > smp_mb(); > raw_spin_unlock_wait(&tsk->pi_lock); > > /* causes final put_task_struct in finish_task_switch(). */ > tsk->state = TASK_DEAD; > tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */ > schedule(); > > Seems like smp_mb() doesn't need here? Because the control dependency > already orders load of tsk->pi_lock and store of tsk->state, and this > control dependency order guarantee pairs with the spin_unlock(->pi_lock) > in try_to_wake_up() to avoid data race on ->state. The exit() path is pretty heavyweight, so I suspect that an extra smp_mb() is down in the noise. Or are you saying that this is somehow unsafe? Thanx, Paul > Regards, > Boqun > > > Regards, > > Boqun > > > > > until we observe !spin_is_locked(), we only need to synchronize with the > > > current owner of this lock. Once it drops the lock we can proceed, we > > > do not care if another thread takes the same lock right after that. > > > > > > Oleg. 
> > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Thu, Nov 12, 2015 at 03:14:51PM +0800, Boqun Feng wrote: > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock() > only guarantees that the memory operations following spin_lock() can't > be reorder before the *LOAD* part of spin_lock() not the *STORE* part, > i.e. the case below can happen(assuming the spin_lock() is implemented > as ll/sc loop) > > spin_lock(&lock): > r1 = *lock; // LL, r1 == 0 > o = READ_ONCE(object); // could be reordered here. > *lock = 1; // SC > > This could happen because of the ACQUIRE semantics of spin_lock(), and > the current implementation of spin_lock() on PPC allows this happen. Urgh, you just _had_ to send an email like this, didn't you ;-) I think AARGH64 does the same thing. They either use LDAXR/STXR, which also places the ACQUIRE on the load, or use LDADDA (v8.1) which I suspect does the same, but I'm not entirely sure how to decode these 8.1 atomics yet. Let me think a little more on this.. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote: > On 11/11, Peter Zijlstra wrote: > > > > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote: > > > > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE? > > > > I did wonder the same thing, it would simplify a number of things if > > this were so. > > Yes, me too. > > Sometimes I even think it should have both ACQUIRE + RELEASE semantics. > IOW, it should be "equivalent" to spin_lock() + spin_unlock(). > > Consider this code: > > object_t *object; > spinlock_t lock; > > void update(void) > { > object_t *o; > > spin_lock(&lock); > o = READ_ONCE(object); > if (o) { > BUG_ON(o->dead); > do_something(o); > } > spin_unlock(&lock); > } > > void destroy(void) // can be called only once, can't race with itself > { > object_t *o; > > o = object; > object = NULL; > > /* >* pairs with lock/ACQUIRE. The next update() must see >* object == NULL after spin_lock(); >*/ > smp_mb(); > > spin_unlock_wait(&lock); > > /* >* pairs with unlock/RELEASE. The previous update() has >* already passed BUG_ON(o->dead). >* >* (Yes, yes, in this particular case it is not needed, >* we can rely on the control dependency). >*/ > smp_mb(); > > o->dead = true; > } > > I believe the code above is correct and it needs the barriers on both sides. > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock() only guarantees that the memory operations following spin_lock() can't be reorder before the *LOAD* part of spin_lock() not the *STORE* part, i.e. the case below can happen(assuming the spin_lock() is implemented as ll/sc loop) spin_lock(&lock): r1 = *lock; // LL, r1 == 0 o = READ_ONCE(object); // could be reordered here. *lock = 1; // SC This could happen because of the ACQUIRE semantics of spin_lock(), and the current implementation of spin_lock() on PPC allows this happen. (Cc PPC maintainers for their opinions on this one) Therefore the case below can happen: CPU 1 CPU 2 CPU 3 == == spin_unlock(&lock); spin_lock(&lock): r1 = *lock; // r1 == 0; o = READ_ONCE(object); // reordered here object = NULL; smp_mb(); spin_unlock_wait(&lock); *lock = 1; smp_mb(); o->dead = true; if (o) // true BUG_ON(o->dead); // true!! To show this, I also translate this situation into a PPC litmus for herd[1]: PPC spin-lock-wait " r1: local variable of lock r2: constant 1 r3: constant 0 or NULL r4: local variable of object, i.e. o r5: local variable of *o (simulate ->dead as I didn't know how to write fields of structure in herd ;-() r13: the address of lock, i.e. &lock r14: the address of object, i.e. &object " { 0:r1=0;0:r2=1;0:r3=0;0:r13=lock;0:r14=object; 1:r1=0;1:r2=1;1:r3=0;1:r4=0;1:r5=0;1:r13=lock;1:r14=object; 2:r1=0;2:r13=lock; lock=1; object=old; old=0; } P0 | P1 | P2 ; ld r4,0(r14)| Lock: | stw r1,0(r13); std r3,0(r14) | lwarx r1,r3,r13| ; | cmpwi r1,0 | ; sync| bne Lock | ; | stwcx. r2,r3,r13 | ; Wait: | bne Lock | ; lwz r1,0(r13) | lwsync | ; cmpwi r1,0 | ld r4,0(r14) | ; bne Wait| cmpwi r4,0 | ; | beq Fail | ; sync| lwz r5, 0(r4) | ; stw r2,0(r4)| Fail: | ; | lwsync | ; | stw r3, 0(r13) | ; exists (1:r4=old /\ 1:r5=1) ,whose result says that (1:r4=old /\ 1:r5=1) can happen: Test spin-lock-wait Allowed States 3 1:r4=0; 1:r5=0; 1:r4=old; 1:r5=0; 1:r4=old; 1:r5=1; Loop Ok Witnesses Positive: 18 Negative: 108 Condition exists (1:r4=old /\ 1:r5=1) Observation spin-lock-wait Sometimes 18 108 Hash=244f8
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Wed, Nov 11, 2015 at 11:39 AM, Oleg Nesterov wrote: > On 11/11, Peter Zijlstra wrote: > > Sometimes I even think it should have both ACQUIRE + RELEASE semantics. > IOW, it should be "equivalent" to spin_lock() + spin_unlock(). That's insane. "Release" semantics are - by definition - about stuff that *predeces* it. It is simply not sensible to have a "wait_for_unlock()" that then synchronizes loads or stores that happened *before* the wait. That's some crazy voodoo programming. And if you do need to have model where you do "store something, then make sure that the unlock we wait for happens after the store", then you need to just add a "smp_mb()" in between that store and the waiting for unlock. Or just take the damn lock, and don't play any games at all. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Hi Boqun,

Let me first state that I can't answer authoritatively when it comes to barriers. That said,

On 11/11, Boqun Feng wrote:
>
> But still, there is one suspicious use of smp_mb() in do_exit():
>
>	/*
>	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
>	 * when the following two conditions become true.
>	 *   - There is race condition of mmap_sem (It is acquired by
>	 *     exit_mm()), and
>	 *   - SMI occurs before setting TASK_RUNINNG.
>	 *     (or hypervisor of virtual machine switches to other guest)
>	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
>	 *
>	 * To avoid it, we have to wait for releasing tsk->pi_lock which
>	 * is held by try_to_wake_up()
>	 */
>	smp_mb();
>	raw_spin_unlock_wait(&tsk->pi_lock);
>
>	/* causes final put_task_struct in finish_task_switch(). */
>	tsk->state = TASK_DEAD;
>	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
>	schedule();
>
> Seems like smp_mb() doesn't need here?

Please see my reply to peterz's email. AFAICS, we need the barriers on both sides.

But, since we only need to STORE into tsk->state after unlock_wait(), we can rely on the control dependency and avoid the 2nd mb().

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On 11/11, Peter Zijlstra wrote: > > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote: > > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE? > > I did wonder the same thing, it would simplify a number of things if > this were so. Yes, me too. Sometimes I even think it should have both ACQUIRE + RELEASE semantics. IOW, it should be "equivalent" to spin_lock() + spin_unlock(). Consider this code: object_t *object; spinlock_t lock; void update(void) { object_t *o; spin_lock(&lock); o = READ_ONCE(object); if (o) { BUG_ON(o->dead); do_something(o); } spin_unlock(&lock); } void destroy(void) // can be called only once, can't race with itself { object_t *o; o = object; object = NULL; /* * pairs with lock/ACQUIRE. The next update() must see * object == NULL after spin_lock(); */ smp_mb(); spin_unlock_wait(&lock); /* * pairs with unlock/RELEASE. The previous update() has * already passed BUG_ON(o->dead). * * (Yes, yes, in this particular case it is not needed, * we can rely on the control dependency). */ smp_mb(); o->dead = true; } I believe the code above is correct and it needs the barriers on both sides. If it is wrong, then task_work_run() is buggy: it relies on mb() implied by cmpxchg() before spin_unlock_wait() the same way: the next task_work_cancel() should see the result of our cmpxchg(), it must not try to read work->next or work->func. Hmm. Not sure I really understand what I am trying to say... Perhaps in fact I mean that unlock_wait() should be removed because it is too subtle for me ;) Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote: > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE? I did wonder the same thing, it would simplify a number of things if this were so. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote: > Hi Oleg, > > On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote: > [snip] > > > > Unfortunately this doesn't look exactly right... > > > > spin_unlock_wait() is not equal to "while (locked) relax", the latter > > is live-lockable or at least sub-optimal: we do not really need to spin > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE? Hmm.. I guess I was wrong, it doesn't need to be an ACQUIRE, it needs only to use the control dependency to order the load of lock state and stores following it. > Because spin_unlock_wait() is used for us to wait for a certain lock to > RELEASE so that we can do something *after* we observe the RELEASE. > Considering the follow example: > > CPU 0 CPU 1 > === > { X = 0 } > WRITE_ONCE(X, 1); > spin_unlock(&lock); > spin_unlock_wait(&lock) > r1 = READ_ONCE(X); > > If spin_unlock_wait() is not an ACQUIRE, r1 can be 0 in this case, > right? Am I missing something subtle here? Or spin_unlock_wait() itself > doesn't have the ACQUIRE semantics, but it should always come with a > smp_mb() *following* it to achieve the ACQUIRE semantics? However in > do_exit(), an smp_mb() is preceding raw_spin_unlock_wait() rather than > following, which makes me confused... could you explain that? Thank you > ;-) > But still, there is one suspicious use of smp_mb() in do_exit(): /* * The setting of TASK_RUNNING by try_to_wake_up() may be delayed * when the following two conditions become true. * - There is race condition of mmap_sem (It is acquired by * exit_mm()), and * - SMI occurs before setting TASK_RUNINNG. * (or hypervisor of virtual machine switches to other guest) * As a result, we may become TASK_RUNNING after becoming TASK_DEAD * * To avoid it, we have to wait for releasing tsk->pi_lock which * is held by try_to_wake_up() */ smp_mb(); raw_spin_unlock_wait(&tsk->pi_lock); /* causes final put_task_struct in finish_task_switch(). */ tsk->state = TASK_DEAD; tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */ schedule(); Seems like smp_mb() doesn't need here? Because the control dependency already orders load of tsk->pi_lock and store of tsk->state, and this control dependency order guarantee pairs with the spin_unlock(->pi_lock) in try_to_wake_up() to avoid data race on ->state. Regards, Boqun > Regards, > Boqun > > > until we observe !spin_is_locked(), we only need to synchronize with the > > current owner of this lock. Once it drops the lock we can proceed, we > > do not care if another thread takes the same lock right after that. > > > > Oleg. > > signature.asc Description: PGP signature
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Hi Oleg, On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote: [snip] > > Unfortunately this doesn't look exactly right... > > spin_unlock_wait() is not equal to "while (locked) relax", the latter > is live-lockable or at least sub-optimal: we do not really need to spin Just be curious, should spin_unlock_wait() semantically be an ACQUIRE? Because spin_unlock_wait() is used for us to wait for a certain lock to RELEASE so that we can do something *after* we observe the RELEASE. Considering the follow example: CPU 0 CPU 1 === { X = 0 } WRITE_ONCE(X, 1); spin_unlock(&lock); spin_unlock_wait(&lock) r1 = READ_ONCE(X); If spin_unlock_wait() is not an ACQUIRE, r1 can be 0 in this case, right? Am I missing something subtle here? Or spin_unlock_wait() itself doesn't have the ACQUIRE semantics, but it should always come with a smp_mb() *following* it to achieve the ACQUIRE semantics? However in do_exit(), an smp_mb() is preceding raw_spin_unlock_wait() rather than following, which makes me confused... could you explain that? Thank you ;-) Regards, Boqun > until we observe !spin_is_locked(), we only need to synchronize with the > current owner of this lock. Once it drops the lock we can proceed, we > do not care if another thread takes the same lock right after that. > > Oleg. > signature.asc Description: PGP signature
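Boqun's example can be written as a small compilable C11 model (names invented for the sketch; an acquire fence plays the role of the ACQUIRE being asked about):

#include <stdatomic.h>

static atomic_int lock_val = ATOMIC_VAR_INIT(1);	/* lock initially held by CPU 0 */
static atomic_int X;

/* CPU 0 */
static void cpu0(void)
{
	atomic_store_explicit(&X, 1, memory_order_relaxed);		/* WRITE_ONCE(X, 1)   */
	atomic_store_explicit(&lock_val, 0, memory_order_release);	/* spin_unlock(&lock) */
}

/* CPU 1 */
static int cpu1(void)
{
	/* spin_unlock_wait(&lock): wait until the lock looks free ... */
	while (atomic_load_explicit(&lock_val, memory_order_relaxed))
		;
	/*
	 * ... if this acquire fence is omitted, nothing orders the load
	 * below against the spin above, and r1 == 0 stays a legal
	 * outcome even though the unlock has been observed.
	 */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&X, memory_order_relaxed);		/* r1 = READ_ONCE(X) */
}

With the fence, observing lock_val == 0 means observing CPU 0's release store, so the earlier store to X is also visible and r1 must be 1; without it, r1 == 0 remains allowed, which is the question raised above.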
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Tue, Nov 03, 2015 at 08:43:22PM -0800, Linus Torvalds wrote: > On Tue, Nov 3, 2015 at 7:57 PM, Paul E. McKenney > wrote: > > > > Thank you, and yes, it clearly states that read-to-write dependencies > > are ordered. > > Well, I wouldn't say that it's exactly "clear". > > The fact that they explicitly say "Note that the DP relation does not > directly impose a BEFORE (⇐) ordering between accesses u and v" makes > it all clear as mud. > > They *then* go on to talk about how the DP relationship *together* > with the odd "is source of" ordering (which in turn is defined in > terms of BEFORE ordering) cannot have cycles. > > I have no idea why they do it that way, but the reason seems to be > that they wanted to make "BEFORE" be purely about barriers and > accesses, and make the other orderings be described separately. So the > "BEFORE" ordering is used to define how memory must act, which is then > used as a basis for that storage definition and the "is source of" > thing. > > But none of that seems to make much sense to a *user*. > > The fact that they seem to equate "BEFORE" with "Processor Issue > Constraints" also makes me think that the whole logic was written by a > CPU designer, and part of why they document it that way is that the > CPU designer literally thought of "can I issue this access" as being > very different from "is there some inherent ordering that just results > from issues outside of my design". > > I really don't know. That whole series of memory ordering rules makes > my head hurt. The honest answer is that for Alpha, I don't know either. But if your head isn't hurting enough yet, feel free to read on... My guess, based loosely on ARM and PowerPC, is that memory barriers provide global ordering but that the load-to-store dependency ordering is strictly local. Which as you say does not necessarily make much sense to a user. One guess is that the following would never trigger the BUG_ON() given x, y, and z initially zero (and ignoring the possibility of compiler mischief): CPU 0 CPU 1 CPU 2 r1 = x; r2 = y; r3 = z; if (r1) if (r2) if (r3) y = 1; z = 1; x = 1; BUG_ON(r1 == 1 && r2 == 1 && r3 == 1); /* after the dust settles */ The write-to-read relationships prevent misordering. The copy of the Alpha manual I downloaded hints that this BUG_ON() could not fire. However, the following might well trigger: CPU 0 CPU 1 CPU 2 x = 1; r1 = x; r2 = y; if (r1) if (r2) y = 1; x = 2; BUG_ON(r1 == 1 && r2 == 1 && x == 1); /* after the dust settles */ The dependency ordering orders each CPU individually, but might not force CPU 0's write to reach CPU 1 and CPU 2 at the same time. So the BUG_ON() case would happen if CPU 0's write to x reach CPU 1 before it reached CPU 2, in which case the x==2 value might not be seen outside of CPU 2, so that everyone agrees on the order of values taken on by x. And this could be prevented by enforcing global (rather than local) ordering by placing a memory barrier in CPU 1: CPU 0 CPU 1 CPU 2 x = 1; r1 = x; r2 = y; smp_mb(); if (r2) if (r1) x = 2; y = 1; BUG_ON(r1 == 1 && r2 == 1 && x == 1); /* after the dust settles */ But the reference manual doesn't have this sort of litmus test, so who knows? CCing the Alpha maintainers in case they know. > But I do think the only thing that matters in the end is that they do > have that DP relationship between reads and subsequently dependent > writes, but basically not for *any* other relationship. 
> > So on alpha read-vs-read, write-vs-later-read, and write-vs-write all > have to have memory barriers, unless the accesses physically overlap. Agreed. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Tue, Nov 3, 2015 at 7:57 PM, Paul E. McKenney wrote: > > Thank you, and yes, it clearly states that read-to-write dependencies > are ordered. Well, I wouldn't say that it's exactly "clear". The fact that they explicitly say "Note that the DP relation does not directly impose a BEFORE (⇐) ordering between accesses u and v" makes it all clear as mud. They *then* go on to talk about how the DP relationship *together* with the odd "is source of" ordering (which in turn is defined in terms of BEFORE ordering) cannot have cycles. I have no idea why they do it that way, but the reason seems to be that they wanted to make "BEFORE" be purely about barriers and accesses, and make the other orderings be described separately. So the "BEFORE" ordering is used to define how memory must act, which is then used as a basis for that storage definition and the "is source of" thing. But none of that seems to make much sense to a *user*. The fact that they seem to equate "BEFORE" with "Processor Issue Constraints" also makes me think that the whole logic was written by a CPU designer, and part of why they document it that way is that the CPU designer literally thought of "can I issue this access" as being very different from "is there some inherent ordering that just results from issues outside of my design". I really don't know. That whole series of memory ordering rules makes my head hurt. But I do think the only thing that matters in the end is that they do have that DP relationship between reads and subsequently dependent writes, but basically not for *any* other relationship. So on alpha read-vs-read, write-vs-later-read, and write-vs-write all have to have memory barriers, unless the accesses physically overlap. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Tue, Nov 03, 2015 at 11:40:24AM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2015 at 5:57 PM, Paul E. McKenney > wrote: [ . . . ] > > I am in India and my Alpha Architecture Manual is in the USA. > > I sent you a link to something that should work, and that has the section. Thank you, and yes, it clearly states that read-to-write dependencies are ordered. Color me slow and stupid. > > And they did post a clarification on the web: > > So for alpha, you trust a random web posting by a unknown person that > talks about some random problem in an application that we don't even > know what it is. In my defense, the Alpha architects pointed me at that web posting, but yes, it appears to be pushing for overly conservative safety rather than accuracy. Please accept my apologies for my confusion. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 2, 2015 at 5:57 PM, Paul E. McKenney wrote: >> >> Alpha isn't special. And smp_read_barrier_depends() hasn't magically >> become something new. > > The existing control dependencies (READ_ONCE_CTRL() and friends) only > guarantee ordering against later stores, and not against later loads. Right. And using "smp_read_barrier_depends()" for them is WRONG. That's my argument. Your arguments make no sense. > Of the weakly ordered architectures, only Alpha fails to respect > load-to-store control dependencies. NO IT DOES NOT. Christ, Paul. I even sent you the alpha architecture manual information where it explicitly says that there's a dependency ordering for that case. There's a reason that "smp_read_barrier_depends()" is called smp_READ_barrier_depends(). It's a rmb. Really You have turned it into something else in your mind. But your mind is WRONG. > I am in India and my Alpha Architecture Manual is in the USA. I sent you a link to something that should work, and that has the section. > And they did post a clarification on the web: So for alpha, you trust a random web posting by a unknown person that talks about some random problem in an application that we don't even know what it is. But you don't trust the architecture manual, and you don't trust the fact that it si *physically impossible* to not have the load-to-control-to-store ordering without some kind of magical nullifying stores that we know that alpha didn't have? But then magically, you trust the architecture manuals for other architectures, or just take the "you cannot have smp-visible speculative stores" on faith. But alpha is different, and lives in a universe where causality suddenly doesn't exist. I really don't understand your logic. > So 5.6.1.7 apparently does not sanction data dependency ordering. Exactly why are you arguing against he architecture manual? >> Paul? I really disagree with how you have totally tried to re-purpose >> smp_read_barrier_depends() in ways that are insane and make no sense. > > I did consider adding a new name, but apparently was lazy that day. > I would of course be happy to add a separate name. That is NOT WHAT I WANT AT ALL. Get rid of the smp_read_barrier_depends(). It doesn't do control barriers against stores, it has never done that, and they aren't needed in the first place. There is no need for a new name. The only thing that there is a need for is to just realize that alpha isn't magical. Alpha does have a stupid segmented cache which means that even when the core is constrained by load-to-load data address dependencies (because no alpha actually did value predication), the cachelines that core loads may not be ordered without the memory barrier. But that "second load may get a stale value from the past" is purely about loads. You cannot have the same issue happen for stores, because there's no way a store buffer somehow magically goes backwards in time and exposes the store before the load that it depended on. And the architecture manual even *EXPLICITLY* says so. Alpha does actually have a dependency chain from loads to dependent stores - through addresses, through data, _and_ through control. It's documented, but equally importantly, it's just basically impossible to not have an architecture that does that. Even if you do some really fancy things like actual value prediction (which the alpha architecture _allows_, although no alpha ever did, afaik), that cannot break the load->store dependency. 
Because even if the load value was predicted (or any control flow after the load was predicted), the store cannot be committed and made visible to other CPU's until after that prediction has been verified. So there may be bogus values in a store buffer, but those values will have to be killed with the store instructions that caused them, if any prediction failed. They won't be visible to other CPU's. And I say that not because the architecture manual says so (although it does), but because because such a CPU wouldn't *work*. It wouldn't be a CPU, it would at most be a fuzzy neural network that generates cool results that may be interesting, but they won't be dependable or reliable in the sense that the Linux kernel depends on. >> That is not a control dependency. If it was, then PPC and ARM would >> need to make it a real barrier too. I really don't see why you have >> singled out alpha as the victim of your "let's just randomly change >> the rules". > > Just trying to make this all work on Alpha as well as the other > architectures... But if the actual Alpha hardware that is currently > running recent Linux kernels is more strict than the architecture .. really. This is specified in the architecture manual. The fact that you have found some random posting by some random person that says "you need barriers everywhere" is immaterial. That support blog may well have been simply a "I don't know what I am talking about, and I don't know what problem you have, but I do know
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote: > > - raw_spin_unlock_wait(&task->pi_lock); > > - smp_mb(); > > + smp_cond_acquire(!raw_spin_is_locked(&task->pi_lock)); > > Unfortunately this doesn't look exactly right... > > spin_unlock_wait() is not equal to "while (locked) relax", the latter > is live-lockable or at least sub-optimal: we do not really need to spin > until we observe !spin_is_locked(), we only need to synchronize with the > current owner of this lock. Once it drops the lock we can proceed, we > do not care if another thread takes the same lock right after that. Ah indeed. And while every use of spin_unlock_wait() has 'interesting' barriers associated, they all seem different. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On 11/02, Peter Zijlstra wrote: > > +#define smp_cond_acquire(cond) do {\ > + while (!(cond)) \ > + cpu_relax();\ > + smp_read_barrier_depends(); /* ctrl */ \ > + smp_rmb(); /* ctrl + rmb := acquire */ \ > +} while (0) ... > --- a/kernel/task_work.c > +++ b/kernel/task_work.c > @@ -102,13 +102,13 @@ void task_work_run(void) > > if (!work) > break; > + > /* >* Synchronize with task_work_cancel(). It can't remove >* the first entry == work, cmpxchg(task_works) should >* fail, but it can play with *work and other entries. >*/ > - raw_spin_unlock_wait(&task->pi_lock); > - smp_mb(); > + smp_cond_acquire(!raw_spin_is_locked(&task->pi_lock)); Unfortunately this doesn't look exactly right... spin_unlock_wait() is not equal to "while (locked) relax", the latter is live-lockable or at least sub-optimal: we do not really need to spin until we observe !spin_is_locked(), we only need to synchronize with the current owner of this lock. Once it drops the lock we can proceed, we do not care if another thread takes the same lock right after that. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
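For reference, a userspace C11 analogue of the quoted macro might look like the sketch below (cpu_relax() replaced by an empty spin, and the ctrl + rmb pair by an acquire fence; the names are invented):

#include <stdatomic.h>

#define model_smp_cond_acquire(cond) do {			\
	while (!(cond))						\
		;	/* cpu_relax() */			\
	atomic_thread_fence(memory_order_acquire);		\
} while (0)

static atomic_int flag;

/* wait until another thread has done a release store to flag */
static void wait_for_flag(void)
{
	model_smp_cond_acquire(atomic_load_explicit(&flag, memory_order_relaxed));
	/* data published before the release store to flag is visible here */
}

Oleg's objection still applies to any caller that passes !raw_spin_is_locked() as the condition: spinning until the lock is observed free is not the same as synchronizing with just the current owner.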
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 02, 2015 at 11:17:17AM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2015 at 10:37 AM, Will Deacon wrote: > > On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote: > >> On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra > >> wrote: > >> > +#define smp_cond_acquire(cond) do {\ > >> > + while (!(cond)) \ > >> > + cpu_relax();\ > >> > + smp_read_barrier_depends(); /* ctrl */ \ > >> > + smp_rmb(); /* ctrl + rmb := acquire */ \ > >> > +} while (0) > >> > >> This code makes absolutely no sense. > >> > >> smp_read_barrier_depends() is about a memory barrier where there is a > >> data dependency between two accesses. The "depends" is very much about > >> the data dependency, and very much about *nothing* else. > > > > Paul wasn't so sure, which I think is why smp_read_barrier_depends() > > is already used in, for example, READ_ONCE_CTRL: > > > > http://lkml.kernel.org/r/20151007154003.gj3...@linux.vnet.ibm.com > > Quoting the alpha architecture manual is kind of pointless, when NO > OTHER ARCHITECTURE OUT THERE guararantees that whole "read + > conditional orders wrt subsequent writes" _either_. > > (Again, with the exception of x86, which has the sane "we honor causality") > > Alpha isn't special. And smp_read_barrier_depends() hasn't magically > become something new. The existing control dependencies (READ_ONCE_CTRL() and friends) only guarantee ordering against later stores, and not against later loads. Of the weakly ordered architectures, only Alpha fails to respect load-to-store control dependencies. > If people think that control dependency needs a memory barrier on > alpha, then it damn well needs on on all other weakly ordered > architectuers too, afaik. For load-to-load control dependencies, agreed, from what I can see, all the weakly ordered architectures do need an explicit barrier. > Either that "you cannot finalize a write unless all conditionals it > depends on are finalized" is true or it is not. That argument has > *never* been about some architecture-specific memory ordering model, > as far as I know. > > As to READ_ONCE_CTRL - two wrongs don't make a right. > > That smp_read_barrier_depends() there doesn't make any sense either. > > And finally, the alpha architecture manual actually does have the > notion of "Dependence Constraint" (5.6.1.7) that talks about writes > that depend on previous reads (where "depends" is explicitly spelled > out to be about conditionals, write data _or_ write address). They are > actually constrained on alpha too. I am in India and my Alpha Architecture Manual is in the USA. Google Books has a PDF, but it conveniently omits that particular section. I do remember quoting that section at the Alpha architects back in the late 1990s and being told that it didn't mean what I thought it meant. And they did post a clarification on the web: http://h71000.www7.hp.com/wizard/wiz_2637.html For instance, your producer must issue a "memory barrier" instruction after writing the data to shared memory and before inserting it on the queue; likewise, your consumer must issue a memory barrier instruction after removing an item from the queue and before reading from its memory. Otherwise, you risk seeing stale data, since, while the Alpha processor does provide coherent memory, it does not provide implicit ordering of reads and writes. (That is, the write of the producer's data might reach memory after the write of the queue, such that the consumer might read the new item from the queue but get the previous values from the item's memory. 
So 5.6.1.7 apparently does not sanction data dependency ordering. > Note that a "Dependence Constraint" is not a memory barrier, because > it only affects that particular chain of dependencies. So it doesn't > order other things in *general*, but it does order a particular read > with a particular sef of subsequent write. Which is all we guarantee > on anything else too wrt the whole control dependencies. > > The smp_read_barrier_depends() is a *READ BARRIER*. It's about a > *read* that depends on a previous read. Now, it so happens that alpha > doesn't have a "read-to-read" barrier, and it's implemented as a full > barrier, but it really should be read as a "smp_rmb(). > > Yes, yes, alpha has the worst memory ordering ever, and the worst > barriers for that insane memory ordering, but that still does not make > alpha "magically able to reorder more than physically or logically > possible". We don't add barriers that don't make sense just because > the alpha memory orderings are insane. > > Paul? I really disagree with how you have totally tried to re-purpose > smp_read_barrier_depends() in ways that are insane and make no sense. I did consider adding a new name, but apparently was lazy that day. I would of course be happy to
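The Alpha note quoted above is describing the usual publish/consume pattern; a minimal C11 sketch of it (invented names, with fences standing in for the producer's and consumer's barriers) would be:

#include <stdatomic.h>
#include <stddef.h>

struct item { int payload; };

static struct item slot;			/* backing store for a one-entry "queue" */
static _Atomic(struct item *) queue_head;	/* NULL means empty */

/* producer: write the data, barrier, then publish the item */
static void produce(int v)
{
	slot.payload = v;
	atomic_thread_fence(memory_order_release);	/* barrier after writing the data */
	atomic_store_explicit(&queue_head, &slot, memory_order_relaxed);
}

/* consumer: take the item, barrier, then read its memory */
static int consume(void)
{
	struct item *it;

	while (!(it = atomic_load_explicit(&queue_head, memory_order_relaxed)))
		;					/* wait for an item */
	atomic_thread_fence(memory_order_acquire);	/* barrier before reading the item */
	return it->payload;				/* not stale, per the note above */
}

On most weakly ordered architectures the consumer's address dependency alone would avoid the stale read; the point of the note, and the role smp_read_barrier_depends() plays in the kernel, is that on Alpha it does not.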
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 2, 2015 at 5:14 PM, Paul E. McKenney wrote:
>
> You would ask that question when I am many thousands of miles from my
> copy of the Alpha reference manual! ;-)

I don't think I've touched a paper manual in years. Too much effort to index and search. Look here

   http://download.majix.org/dec/alpha_arch_ref.pdf

> There is explicit wording in that manual that says that no multi-variable
> ordering is implied without explicit memory-barrier instructions.

Right. But the whole "read -> conditional -> write" ends up being a dependency chain to *every* single write that happens after the conditional. So there may be no "multi-variable ordering" implied in any individual access, but despite that a simple "read + conditional" orders every single store that comes after it. Even if it orders them "individually".

                 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 02, 2015 at 05:43:48PM +, Will Deacon wrote: > On Mon, Nov 02, 2015 at 02:57:26PM +0100, Peter Zijlstra wrote: > > On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote: > > > > > Note that while smp_cond_acquire() has an explicit > > > smp_read_barrier_depends() for Alpha, neither sites it gets used in > > > were actually buggy on Alpha for their lack of it. The first uses > > > smp_rmb(), which on Alpha is a full barrier too and therefore serves > > > its purpose. The second had an explicit full barrier. > > > > > +/** > > > + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering > > > + * @cond: boolean expression to wait for > > > + * > > > + * Equivalent to using smp_load_acquire() on the condition variable but > > > employs > > > + * the control dependency of the wait to reduce the barrier on many > > > platforms. > > > + * > > > + * The control dependency provides a LOAD->STORE order, the additional > > > RMB > > > + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} > > > order, > > > + * aka. ACQUIRE. > > > + */ > > > +#define smp_cond_acquire(cond) do {\ > > > + while (!(cond)) \ > > > + cpu_relax();\ > > > + smp_read_barrier_depends(); /* ctrl */ \ > > > + smp_rmb(); /* ctrl + rmb := acquire */ \ > > > +} while (0) > > > > So per the above argument we could leave out the > > smp_read_barrier_depends() for Alpha, although that would break > > consistency with all the other control dependency primitives we have. It > > would avoid issuing a double barrier. > > > > Thoughts? > > Do we even know that Alpha needs a barrier for control-dependencies in > the first place? You would ask that question when I am many thousands of miles from my copy of the Alpha reference manual! ;-) There is explicit wording in that manual that says that no multi-variable ordering is implied without explicit memory-barrier instructions. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 02, 2015 at 11:17:17AM -0800, Linus Torvalds wrote: > As to READ_ONCE_CTRL - two wrongs don't make a right. > > That smp_read_barrier_depends() there doesn't make any sense either. > > And finally, the alpha architecture manual actually does have the > notion of "Dependence Constraint" (5.6.1.7) that talks about writes > that depend on previous reads (where "depends" is explicitly spelled > out to be about conditionals, write data _or_ write address). They are > actually constrained on alpha too. > > Note that a "Dependence Constraint" is not a memory barrier, because > it only affects that particular chain of dependencies. So it doesn't > order other things in *general*, but it does order a particular read > with a particular sef of subsequent write. Which is all we guarantee > on anything else too wrt the whole control dependencies. Something like so then, Paul? --- Subject: locking: Alpha honours control dependencies too The alpha architecture manual actually does have the notion of "Dependence Constraint" (5.6.1.7) that talks about writes that depend on previous reads (where "depends" is explicitly spelled out to be about conditionals, write data _or_ write address). They are actually constrained on alpha too. Which means we can remove the smp_read_barrier_depends() abuse from the various control dependency primitives. Retain the primitives, as they are a useful documentation aid. Maybe-Signed-off-by: Peter Zijlstra (Intel) --- include/linux/atomic.h | 2 -- include/linux/compiler.h | 1 - 2 files changed, 3 deletions(-) diff --git a/include/linux/atomic.h b/include/linux/atomic.h index 27e580d232ca..f16b1dedd909 100644 --- a/include/linux/atomic.h +++ b/include/linux/atomic.h @@ -8,7 +8,6 @@ static inline int atomic_read_ctrl(const atomic_t *v) { int val = atomic_read(v); - smp_read_barrier_depends(); /* Enforce control dependency. */ return val; } #endif @@ -565,7 +564,6 @@ static inline int atomic_dec_if_positive(atomic_t *v) static inline long long atomic64_read_ctrl(const atomic64_t *v) { long long val = atomic64_read(v); - smp_read_barrier_depends(); /* Enforce control dependency. */ return val; } #endif diff --git a/include/linux/compiler.h b/include/linux/compiler.h index 3d7810341b57..1d1f5902189d 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -311,7 +311,6 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s #define READ_ONCE_CTRL(x) \ ({ \ typeof(x) __val = READ_ONCE(x); \ - smp_read_barrier_depends(); /* Enforce control dependency. */ \ __val; \ }) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 2, 2015 at 12:36 PM, David Howells wrote: > Linus Torvalds wrote: > >> > + smp_read_barrier_depends(); /* ctrl */ \ >> > + smp_rmb(); /* ctrl + rmb := acquire */ \ > > Doesn't smp_rmb() imply an smp_read_barrier_depends() anyway? Yes, it does. But that "smp_read_barrier_depends()" is actually mis-used as a "barrier against subsequent dependent writes, thanks to the control flow". It's not protecting against subsequent reads - which is what the smp_rmb() is about. Which is completely bogus, but that's what the comment implies. Of course, on alpha (which is where smp_read_barrier_depends() makes a difference), both that and smp_rmb() are just full memory barriers, because alpha is some crazy sh*t. So yes, a "smp_rmb()" is sufficient everywhere, but that is actually not where the confusion comes from in the first place. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 02, 2015 at 08:36:49PM +, David Howells wrote: > Linus Torvalds wrote: > > > + smp_read_barrier_depends(); /* ctrl */ \ > > > + smp_rmb(); /* ctrl + rmb := acquire */ \ > > Doesn't smp_rmb() imply an smp_read_barrier_depends() anyway? In > memory-barriers.txt, it says: > > Read memory barriers imply data dependency barriers, and so can > substitute for them. Yes, I noted that in a follow-up email; Alpha implements both as MB. So just the smp_rmb() is sufficient. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
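For reference, the reason the substitution is safe on Alpha is that both primitives expand to a full memory barrier there, so smp_rmb() subsumes smp_read_barrier_depends(). Roughly (paraphrased, not an exact quote of arch/alpha/include/asm/barrier.h):

    /* Alpha, roughly: every read-side barrier is the full "mb" instruction. */
    #define mb()                   __asm__ __volatile__("mb" : : : "memory")
    #define rmb()                  mb()
    #define read_barrier_depends() mb()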
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Linus Torvalds wrote: > > + smp_read_barrier_depends(); /* ctrl */ \ > > + smp_rmb(); /* ctrl + rmb := acquire */ \ Doesn't smp_rmb() imply an smp_read_barrier_depends() anyway? In memory-barriers.txt, it says: Read memory barriers imply data dependency barriers, and so can substitute for them. David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 02, 2015 at 07:57:09PM +, Will Deacon wrote: > > >> smp_read_barrier_depends() is about a memory barrier where there is a > > >> data dependency between two accesses. The "depends" is very much about > > >> the data dependency, and very much about *nothing* else. > > > > > > Paul wasn't so sure, which I think is why smp_read_barrier_depends() > > > is already used in, for example, READ_ONCE_CTRL: > > > > > > http://lkml.kernel.org/r/20151007154003.gj3...@linux.vnet.ibm.com > > > > Quoting the alpha architecture manual is kind of pointless, when NO > > OTHER ARCHITECTURE OUT THERE guararantees that whole "read + > > conditional orders wrt subsequent writes" _either_. > > > > (Again, with the exception of x86, which has the sane "we honor causality") > > > > Alpha isn't special. And smp_read_barrier_depends() hasn't magically > > become something new. > > > > If people think that control dependency needs a memory barrier on > > alpha, then it damn well needs on on all other weakly ordered > > architectuers too, afaik. > > > > Either that "you cannot finalize a write unless all conditionals it > > depends on are finalized" is true or it is not. That argument has > > *never* been about some architecture-specific memory ordering model, > > as far as I know. > > You can imagine a (horribly broken) value-speculating architecture that > would permit this re-ordering, but then you'd have speculative stores > and thin-air values, which Alpha doesn't have. I've tried arguing this point with Paul on a number of occasions, and I think he at one point constructed some elaborate scheme that depended on the split cache where this might require the full barrier. > > As to READ_ONCE_CTRL - two wrongs don't make a right. > > Sure, I just wanted to point out the precedence and related discussion. I purely followed 'convention' here. > > That smp_read_barrier_depends() there doesn't make any sense either. > > > > And finally, the alpha architecture manual actually does have the > > notion of "Dependence Constraint" (5.6.1.7) that talks about writes > > that depend on previous reads (where "depends" is explicitly spelled > > out to be about conditionals, write data _or_ write address). They are > > actually constrained on alpha too. > > In which case, it looks like we can remove the smp_read_barrier_depends > instances from all of the control-dependency macros, but I'll defer to > Paul in case he has some further insight. I assume you're ok with this > patch if the smp_read_barrier_depends() is removed? That would be good indeed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
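The hazard the value-speculation remark alludes to can be spelled out as a sketch (purely hypothetical hardware, invented variables):

    /* CPU 0 */
    if (READ_ONCE(x))           /* a value-speculating machine guesses x == 1   */
            WRITE_ONCE(y, 1);   /* ...and, if badly broken, lets this store be  */
                                /* observed before the guess is confirmed       */

    /* CPU 1 could then see y == 1 and, as a consequence, store x = 1 --
     * retroactively "justifying" the guess.  That is a store out of thin air.
     * Real machines (Alpha included) do not make speculative stores visible,
     * which is why the control dependency is argued to hold without an extra
     * barrier. */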
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 02, 2015 at 11:17:17AM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2015 at 10:37 AM, Will Deacon wrote: > > On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote: > >> On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra > >> wrote: > >> > +#define smp_cond_acquire(cond) do {\ > >> > + while (!(cond)) \ > >> > + cpu_relax();\ > >> > + smp_read_barrier_depends(); /* ctrl */ \ > >> > + smp_rmb(); /* ctrl + rmb := acquire */ \ > >> > +} while (0) > >> > >> This code makes absolutely no sense. > >> > >> smp_read_barrier_depends() is about a memory barrier where there is a > >> data dependency between two accesses. The "depends" is very much about > >> the data dependency, and very much about *nothing* else. > > > > Paul wasn't so sure, which I think is why smp_read_barrier_depends() > > is already used in, for example, READ_ONCE_CTRL: > > > > http://lkml.kernel.org/r/20151007154003.gj3...@linux.vnet.ibm.com > > Quoting the alpha architecture manual is kind of pointless, when NO > OTHER ARCHITECTURE OUT THERE guararantees that whole "read + > conditional orders wrt subsequent writes" _either_. > > (Again, with the exception of x86, which has the sane "we honor causality") > > Alpha isn't special. And smp_read_barrier_depends() hasn't magically > become something new. > > If people think that control dependency needs a memory barrier on > alpha, then it damn well needs on on all other weakly ordered > architectuers too, afaik. > > Either that "you cannot finalize a write unless all conditionals it > depends on are finalized" is true or it is not. That argument has > *never* been about some architecture-specific memory ordering model, > as far as I know. You can imagine a (horribly broken) value-speculating architecture that would permit this re-ordering, but then you'd have speculative stores and thin-air values, which Alpha doesn't have. > As to READ_ONCE_CTRL - two wrongs don't make a right. Sure, I just wanted to point out the precedence and related discussion. > That smp_read_barrier_depends() there doesn't make any sense either. > > And finally, the alpha architecture manual actually does have the > notion of "Dependence Constraint" (5.6.1.7) that talks about writes > that depend on previous reads (where "depends" is explicitly spelled > out to be about conditionals, write data _or_ write address). They are > actually constrained on alpha too. In which case, it looks like we can remove the smp_read_barrier_depends instances from all of the control-dependency macros, but I'll defer to Paul in case he has some further insight. I assume you're ok with this patch if the smp_read_barrier_depends() is removed? > > In this case, control dependencies are only referring to READ -> WRITE > > ordering, so they are honoured by ARM and PowerPC. > > Do ARM and PPC actually guarantee the generic "previous reads always > order before subsequent writes"? Only to *dependent* subsequent writes, but Peter's patch makes all subsequent writes dependent on cond, so I think that's the right way to achieve the ordering we want. Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
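Putting these two points together, the waiter in smp_cond_acquire() relies on two separate orderings. A sketch of roughly what the primitive expands to on a non-Alpha architecture, with invented variables:

    while (!READ_ONCE(flag))        /* reads of flag                             */
            cpu_relax();
    smp_rmb();                      /* orders the read of flag against LATER     */
                                    /* reads (LOAD -> LOAD)                      */

    WRITE_ONCE(done, 1);            /* ordered after the read of flag by the     */
                                    /* control dependency alone (LOAD -> STORE), */
                                    /* honoured by ARM and PowerPC               */
    r = READ_ONCE(data);            /* ordered after the read of flag by the     */
                                    /* smp_rmb() above                           */

    /* LOAD -> STORE plus LOAD -> LOAD is LOAD -> {LOAD,STORE}, i.e. ACQUIRE. */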
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 2, 2015 at 10:37 AM, Will Deacon wrote: > On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote: >> On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra wrote: >> > +#define smp_cond_acquire(cond) do {\ >> > + while (!(cond)) \ >> > + cpu_relax();\ >> > + smp_read_barrier_depends(); /* ctrl */ \ >> > + smp_rmb(); /* ctrl + rmb := acquire */ \ >> > +} while (0) >> >> This code makes absolutely no sense. >> >> smp_read_barrier_depends() is about a memory barrier where there is a >> data dependency between two accesses. The "depends" is very much about >> the data dependency, and very much about *nothing* else. > > Paul wasn't so sure, which I think is why smp_read_barrier_depends() > is already used in, for example, READ_ONCE_CTRL: > > http://lkml.kernel.org/r/20151007154003.gj3...@linux.vnet.ibm.com Quoting the alpha architecture manual is kind of pointless, when NO OTHER ARCHITECTURE OUT THERE guarantees that whole "read + conditional orders wrt subsequent writes" _either_. (Again, with the exception of x86, which has the sane "we honor causality") Alpha isn't special. And smp_read_barrier_depends() hasn't magically become something new. If people think that control dependency needs a memory barrier on alpha, then it damn well needs one on all other weakly ordered architectures too, afaik. Either that "you cannot finalize a write unless all conditionals it depends on are finalized" is true or it is not. That argument has *never* been about some architecture-specific memory ordering model, as far as I know. As to READ_ONCE_CTRL - two wrongs don't make a right. That smp_read_barrier_depends() there doesn't make any sense either. And finally, the alpha architecture manual actually does have the notion of "Dependence Constraint" (5.6.1.7) that talks about writes that depend on previous reads (where "depends" is explicitly spelled out to be about conditionals, write data _or_ write address). They are actually constrained on alpha too. Note that a "Dependence Constraint" is not a memory barrier, because it only affects that particular chain of dependencies. So it doesn't order other things in *general*, but it does order a particular read with a particular set of subsequent writes. Which is all we guarantee on anything else too wrt the whole control dependencies. The smp_read_barrier_depends() is a *READ BARRIER*. It's about a *read* that depends on a previous read. Now, it so happens that alpha doesn't have a "read-to-read" barrier, and it's implemented as a full barrier, but it really should be read as a "smp_rmb()". Yes, yes, alpha has the worst memory ordering ever, and the worst barriers for that insane memory ordering, but that still does not make alpha "magically able to reorder more than physically or logically possible". We don't add barriers that don't make sense just because the alpha memory orderings are insane. Paul? I really disagree with how you have totally tried to re-purpose smp_read_barrier_depends() in ways that are insane and make no sense. That is not a control dependency. If it was, then PPC and ARM would need to make it a real barrier too. I really don't see why you have singled out alpha as the victim of your "let's just randomly change the rules". > In this case, control dependencies are only referring to READ -> WRITE > ordering, so they are honoured by ARM and PowerPC. Do ARM and PPC actually guarantee the generic "previous reads always order before subsequent writes"? 
Because if that's the issue, then we should perhaps add a new ordering primitive that expresses exactly that (i.e. "smp_read_to_write_barrier()"). That could be a useful thing in general, where we currently use "smp_mb()" to order an earlier read with a later write. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
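No such primitive exists in the kernel at this point; purely as a hypothetical sketch of what it might look like (the per-architecture mapping below is an assumption for illustration, not a reviewed definition):

    /* Hypothetical: order earlier reads against later writes. */
    #ifdef CONFIG_X86
    /* TSO already forbids reordering a write ahead of an older read, so a
     * compiler barrier would suffice. */
    #define smp_read_to_write_barrier()     barrier()
    #else
    /* Conservative fallback for weakly ordered architectures. */
    #define smp_read_to_write_barrier()     smp_mb()
    #endif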
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote: > On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra wrote: > > +#define smp_cond_acquire(cond) do {\ > > + while (!(cond)) \ > > + cpu_relax();\ > > + smp_read_barrier_depends(); /* ctrl */ \ > > + smp_rmb(); /* ctrl + rmb := acquire */ \ > > +} while (0) > > This code makes absolutely no sense. > > smp_read_barrier_depends() is about a memory barrier where there is a > data dependency between two accesses. The "depends" is very much about > the data dependency, and very much about *nothing* else. Paul wasn't so sure, which I think is why smp_read_barrier_depends() is already used in, for example, READ_ONCE_CTRL: http://lkml.kernel.org/r/20151007154003.gj3...@linux.vnet.ibm.com although I agree that this would pave the way for speculative stores on Alpha and that seems like a heavy accusation to make. > Your comment talks about control dependencies, but > smp_read_barrier_depends() has absolutely nothing to do with a control > dependency. In fact, it is explicitly a no-op on architectures like > ARM and PowerPC that violate control dependencies. In this case, control dependencies are only referring to READ -> WRITE ordering, so they are honoured by ARM and PowerPC. Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra wrote: > +#define smp_cond_acquire(cond) do {\ > + while (!(cond)) \ > + cpu_relax();\ > + smp_read_barrier_depends(); /* ctrl */ \ > + smp_rmb(); /* ctrl + rmb := acquire */ \ > +} while (0) This code makes absolutely no sense. smp_read_barrier_depends() is about a memory barrier where there is a data dependency between two accesses. The "depends" is very much about the data dependency, and very much about *nothing* else. Your comment talks about control dependencies, but smp_read_barrier_depends() has absolutely nothing to do with a control dependency. In fact, it is explicitly a no-op on architectures like ARM and PowerPC that violate control dependencies. So the code may end up *working*, but the comments in it are misleading, insane, and nonsensical. And I think the code is actively wrong (although in the sense that it has a barrier too *many*, not lacking one). Because smp_read_barrier_depends() definitely has *nothing* to do with control barriers in any way, shape or form. The comment is actively and entirely wrong. As far as I know, even on alpha that 'smp_read_barrier_depends()' is unnecessary. Not because of any barrier semantics (whether smp_read_barrier depends or any other barrier), but simply because a store cannot finalize before the conditional has been resolved, which means that the read must have been done. So for alpha, you end up depending on the exact same logic that you depend on for every other architecture, since there is no "architectural" memory ordering guarantee of it anywhere else either (ok, so x86 has the explicit "memory ordering is causal" guarantee, so x86 does kind of make that conditional ordering explicit). Adding that "smp_read_barrier_depends()" doesn't add anything to the whole ctrl argument. So the code looks insane to me. On everything but alpha, "smp_read_barrier_depends()" is a no-op. And on alpha, a combination of "smp_read_barrier_depends()" and "smp_rmb()" is nonsensical. So in no case can that code make sense, as far as I can tell. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 02, 2015 at 02:57:26PM +0100, Peter Zijlstra wrote: > On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote: > > > Note that while smp_cond_acquire() has an explicit > > smp_read_barrier_depends() for Alpha, neither sites it gets used in > > were actually buggy on Alpha for their lack of it. The first uses > > smp_rmb(), which on Alpha is a full barrier too and therefore serves > > its purpose. The second had an explicit full barrier. > > > +/** > > + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering > > + * @cond: boolean expression to wait for > > + * > > + * Equivalent to using smp_load_acquire() on the condition variable but > > employs > > + * the control dependency of the wait to reduce the barrier on many > > platforms. > > + * > > + * The control dependency provides a LOAD->STORE order, the additional RMB > > + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} > > order, > > + * aka. ACQUIRE. > > + */ > > +#define smp_cond_acquire(cond) do {\ > > + while (!(cond)) \ > > + cpu_relax();\ > > + smp_read_barrier_depends(); /* ctrl */ \ > > + smp_rmb(); /* ctrl + rmb := acquire */ \ > > +} while (0) > > So per the above argument we could leave out the > smp_read_barrier_depends() for Alpha, although that would break > consistency with all the other control dependency primitives we have. It > would avoid issuing a double barrier. > > Thoughts? Do we even know that Alpha needs a barrier for control-dependencies in the first place? Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Hi Peter, On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote: > Introduce smp_cond_acquire() which combines a control dependency and a > read barrier to form acquire semantics. > > This primitive has two benefits: > - it documents control dependencies, > - its typically cheaper than using smp_load_acquire() in a loop. I'm not sure that's necessarily true on arm64, where we have a native load-acquire instruction, but not a READ -> READ barrier (smp_rmb() orders prior loads against subsequent loads and stores for us). Perhaps we could allow architectures to provide their own definition of smp_cond_acquire in case they can implement it more efficiently? > Note that while smp_cond_acquire() has an explicit > smp_read_barrier_depends() for Alpha, neither sites it gets used in > were actually buggy on Alpha for their lack of it. The first uses > smp_rmb(), which on Alpha is a full barrier too and therefore serves > its purpose. The second had an explicit full barrier. > > Signed-off-by: Peter Zijlstra (Intel) > --- > include/linux/compiler.h | 18 ++ > kernel/sched/core.c |8 +--- > kernel/task_work.c |4 ++-- > 3 files changed, 21 insertions(+), 9 deletions(-) > > --- a/include/linux/compiler.h > +++ b/include/linux/compiler.h > @@ -275,6 +275,24 @@ static __always_inline void __write_once > __val; \ > }) > > +/** > + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering > + * @cond: boolean expression to wait for > + * > + * Equivalent to using smp_load_acquire() on the condition variable but > employs > + * the control dependency of the wait to reduce the barrier on many > platforms. > + * > + * The control dependency provides a LOAD->STORE order, the additional RMB > + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order, > + * aka. ACQUIRE. > + */ > +#define smp_cond_acquire(cond) do {\ I think the previous version that you posted/discussed had the actual address of the variable being loaded passed in here too? That would be useful for arm64, where we can wait-until-memory-location-has-changed to save us re-evaluating cond prematurely. > + while (!(cond)) \ > + cpu_relax();\ > + smp_read_barrier_depends(); /* ctrl */ \ > + smp_rmb(); /* ctrl + rmb := acquire */ \ It's actually stronger than acquire, I think, because accesses before the smp_cond_acquire cannot be moved across it. > +} while (0) > + > #endif /* __KERNEL__ */ > > #endif /* __ASSEMBLY__ */ > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -2111,19 +2111,13 @@ try_to_wake_up(struct task_struct *p, un > /* >* If the owning (remote) cpu is still in the middle of schedule() with >* this task as prev, wait until its done referencing the task. > - */ > - while (p->on_cpu) > - cpu_relax(); > - /* > - * Combined with the control dependency above, we have an effective > - * smp_load_acquire() without the need for full barriers. >* >* Pairs with the smp_store_release() in finish_lock_switch(). >* >* This ensures that tasks getting woken will be fully ordered against >* their previous state and preserve Program Order. >*/ > - smp_rmb(); > + smp_cond_acquire(!p->on_cpu); > > p->sched_contributes_to_load = !!task_contributes_to_load(p); > p->state = TASK_WAKING; > --- a/kernel/task_work.c > +++ b/kernel/task_work.c > @@ -102,13 +102,13 @@ void task_work_run(void) > > if (!work) > break; > + > /* >* Synchronize with task_work_cancel(). It can't remove >* the first entry == work, cmpxchg(task_works) should >* fail, but it can play with *work and other entries. 
>*/ > - raw_spin_unlock_wait(&task->pi_lock); > - smp_mb(); > + smp_cond_acquire(!raw_spin_is_locked(&task->pi_lock)); Hmm, there's some sort of release equivalent in kernel/exit.c, but I couldn't easily figure out whether we could do anything there. If we could, we could kill raw_spin_unlock_wait :) Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
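A sketch of the shape being asked for here, with the pointer passed explicitly so that an architecture could override the implementation (for example with a load-acquire in the loop, or a wait-for-event on the monitored cacheline). The name and details are invented for illustration and are not part of the posted patch; cond_expr is assumed to refer to VAL, the most recently loaded value:

    #ifndef smp_cond_load_acquire
    #define smp_cond_load_acquire(ptr, cond_expr) ({                \
            typeof(*(ptr)) VAL;                                     \
            for (;;) {                                              \
                    VAL = READ_ONCE(*(ptr));                        \
                    if (cond_expr)                                  \
                            break;                                  \
                    cpu_relax();                                    \
            }                                                       \
            smp_rmb();      /* ctrl + rmb := acquire */             \
            VAL;                                                    \
    })
    #endif

    /* e.g. the scheduler hunk above would then read:
     *      smp_cond_load_acquire(&p->on_cpu, !VAL);
     */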
Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote: > Note that while smp_cond_acquire() has an explicit > smp_read_barrier_depends() for Alpha, neither sites it gets used in > were actually buggy on Alpha for their lack of it. The first uses > smp_rmb(), which on Alpha is a full barrier too and therefore serves > its purpose. The second had an explicit full barrier. > +/** > + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering > + * @cond: boolean expression to wait for > + * > + * Equivalent to using smp_load_acquire() on the condition variable but > employs > + * the control dependency of the wait to reduce the barrier on many > platforms. > + * > + * The control dependency provides a LOAD->STORE order, the additional RMB > + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order, > + * aka. ACQUIRE. > + */ > +#define smp_cond_acquire(cond) do {\ > + while (!(cond)) \ > + cpu_relax();\ > + smp_read_barrier_depends(); /* ctrl */ \ > + smp_rmb(); /* ctrl + rmb := acquire */ \ > +} while (0) So per the above argument we could leave out the smp_read_barrier_depends() for Alpha, although that would break consistency with all the other control dependency primitives we have. It would avoid issuing a double barrier. Thoughts? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
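If the smp_read_barrier_depends() were dropped as suggested (on the argument that smp_rmb() is already a full barrier on Alpha), the primitive would reduce to something like:

    #define smp_cond_acquire(cond)  do {            \
            while (!(cond))                         \
                    cpu_relax();                    \
            smp_rmb(); /* ctrl + rmb := acquire */  \
    } while (0)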