Re: sem_lock() vs qspinlocks
On Mon, May 23, 2016 at 10:52:09AM -0700, Linus Torvalds wrote:
> On Mon, May 23, 2016 at 5:25 AM, Peter Zijlstra wrote:
> >
> > Paul has smp_mb__after_unlock_lock() for the RCpc 'upgrade'. How about
> > something like:
> >
> >     smp_mb__after_lock()
>
> I'd much rather make the naming be higher level. It's not necessarily
> going to be a "mb", and while the problem is about smp, the primitives
> it is synchronizing aren't actually smp-specific (ie you're
> synchronizing a lock that is relevant on UP too).
>
> So I'd just call it something like
>
>     spin_lock_sync_after_lock();
>
> because different locks might have different levels of serialization
> (ie maybe a spinlock needs one thing, and a mutex needs another - if
> we start worrying about ordering between spin_lock and
> mutex_is_locked(), for example, or between mutex_lock() and
> spin_is_locked()).
>
> Hmm?

Speaking of higher level, I realize the problem here is similar to the
problem we discussed last year:

http://lkml.kernel.org/r/20151112070915.gc6...@fixme-laptop.cn.ibm.com

The problem here is about synchronization between two spinlocks, and
that problem is about synchronization between a spinlock and ordinary
variables.

(One result of this similarity is that qspinlock on x86 may also be
broken in the do_exit() code, as the spinlocks on AARCH64 and PPC are,
because a variable LOAD inside a qspinlock critical section could be
reordered before the STORE part of a qspinlock acquisition.)

For the problem we found last year, the current solution for AARCH64 and
PPC is to have a somewhat heavyweight spin_unlock_wait() to pair with
spin_lock():

AARCH64: http://lkml.kernel.org/r/1448624646-15863-1-git-send-email-will.dea...@arm.com
PPC: http://lkml.kernel.org/r/1461130033-70898-1-git-send-email-boqun.f...@gmail.com
(not merged yet)

Another solution that works on PPC is what Paul McKenney suggested,
using smp_mb__after_unlock_lock():

http://lkml.kernel.org/r/20151112144004.gu3...@linux.vnet.ibm.com

, which is pretty much the same as the spinlock synchronization
primitive we are discussing here.

So I'm thinking: if we are going to introduce some primitives for
synchronizing two spinlocks (or even a spinlock and a mutex) anyway,
could we go a little higher level, and reuse/invent primitives that also
solve the synchronization problem we have between spinlocks
(spin_unlock_wait()) and normal variables?

One benefit of this is that we could drop the complex implementations of
spin_unlock_wait() on AARCH64 and PPC.

Thoughts?

Regards,
Boqun
Re: sem_lock() vs qspinlocks
On Sat, May 21, 2016 at 03:49:20PM +0200, Manfred Spraul wrote:
> > I'm tempted to put that trailing smp_rmb() in spin_unlock_wait() too;
> > because I suspect the netfilter code is broken without it.
> >
> > And it seems intuitive to assume that if we return from unlock_wait() we
> > can indeed observe the critical section we waited on.
>
> Then !spin_is_locked() and spin_unlock_wait() would be different with
> regards to memory barriers.
> Would that really help?

We could fix that I think; something horrible like:

	static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
	{
		int locked;

		smp_mb();
		locked = atomic_read(&lock->val) & _Q_LOCKED_MASK;
		smp_acquire__after_ctrl_dep();

		return locked;
	}

Which if used in a conditional like:

	spin_lock(A);
	if (spin_is_locked(B)) {
		spin_unlock(A);
		spin_lock(B);
		...
	}

would still provide the ACQUIRE semantics required. The only difference
is that it would provide it to _both_ branches, which might be a little
more expensive.

> My old plan was to document the rules, and define a generic
> smp_acquire__after_spin_is_unlocked.
> https://lkml.org/lkml/2015/3/1/153

Yeah; I more or less forgot all that.

Now, I too think having the thing documented is good; _however_ I also
think having primitives that actually do what you assume them to is a
good thing. spin_unlock_wait() not actually serializing against the
spin_unlock() is really surprising and subtle.
Re: sem_lock() vs qspinlocks
On Mon, May 23, 2016 at 5:25 AM, Peter Zijlstra wrote:
>
> Paul has smp_mb__after_unlock_lock() for the RCpc 'upgrade'. How about
> something like:
>
>     smp_mb__after_lock()

I'd much rather make the naming be higher level. It's not necessarily
going to be a "mb", and while the problem is about smp, the primitives
it is synchronizing aren't actually smp-specific (ie you're
synchronizing a lock that is relevant on UP too).

So I'd just call it something like

	spin_lock_sync_after_lock();

because different locks might have different levels of serialization
(ie maybe a spinlock needs one thing, and a mutex needs another - if
we start worrying about ordering between spin_lock and
mutex_is_locked(), for example, or between mutex_lock() and
spin_is_locked()).

Hmm?

               Linus
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 10:00:45AM -0700, Linus Torvalds wrote:
> So I do wonder if we should make that smp_mb() be something the
> *caller* has to do, and document rules for it. IOW, introduce a new
> spinlock primitive called "spin_lock_synchronize()", and then spinlock
> implementations that have this non-atomic behavior with an unordered
> store would do something like
>
>	static inline void queued_spin_lock_synchronize(struct qspinlock *a,
>							struct qspinlock *b)
>	{
>		smp_mb();
>	}
>
> and then we'd document that *if* you need ordering guarantees between
>
>	spin_lock(a);
>	.. spin_is_locked/spin_unlock_wait(b) ..
>
> you have to have a
>
>	spin_lock_synchronize(a, b);
>
> in between.

So I think I favour the explicit barrier.

But my 'problem' is that we now have _two_ different scenarios in which
we need to order two different spinlocks.

The first is the RCpc vs RCsc spinlock situation (currently only on
PowerPC). Where the spin_unlock() spin_lock() 'barrier' is not
transitive.

And the second is this 'new' situation, where the store is unordered and
is not observable until a release. Which is fundamentally so on PPC and
ARM64, but also possible due to lock implementation choices like with
our qspinlock, which makes it manifest even on x86.

Now, ideally we'd be able to use one barrier construct for both; but
given that, while there is overlap, they're not the same. And I'd be
somewhat reluctant to issue superfluous smp_mb()s just because; it is an
expensive instruction.

Paul has smp_mb__after_unlock_lock() for the RCpc 'upgrade'. How about
something like:

	smp_mb__after_lock()

?

OTOH; even if we document this, it is something that is easy to forget
or miss. It is not like Documentation/memory-barriers.txt is in want of
more complexity.
Re: sem_lock() vs qspinlocks
On Sun, May 22, 2016 at 10:43:08AM +0200, Manfred Spraul wrote:
> How would we handle mixed spin_lock()/mutex_lock() code?
>
> For the IPC code, I would like to replace the outer lock with a mutex.
> The code only uses spinlocks, because at the time it was written, the
> mutex code didn't contain a busy wait. With a mutex, the code would
> become simpler (all the lock/unlock/kmalloc/relock parts could be
> removed).
>
> The result would be something like:
>
>	mutex_lock(A)			spin_lock(B)
>	spin_unlock_wait(B)		if (!mutex_is_locked(A))
>	do_something()			    do_something()

Should work similarly, but we'll have to audit mutex for these same
issues. I'll put it on todo.
Re: sem_lock() vs qspinlocks
Hi Peter,

On 05/20/2016 06:04 PM, Peter Zijlstra wrote:
> On Fri, May 20, 2016 at 05:21:49PM +0200, Peter Zijlstra wrote:
> > Let me write a patch..
>
> OK, something like the below then.. lemme go build that and verify that
> too fixes things.
>
> ---
> Subject: locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()
>
> Similar to commits:
>
>   51d7d5205d33 ("powerpc: Add smp_mb() to arch_spin_is_locked()")
>   d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against
>                  concurrent lockers")
>
> qspinlock suffers from the fact that the _Q_LOCKED_VAL store is
> unordered inside the ACQUIRE of the lock.
>
> And while this is not a problem for the regular mutual exclusive
> critical section usage of spinlocks, it breaks creative locking like:
>
>	spin_lock(A)			spin_lock(B)
>	spin_unlock_wait(B)		if (!spin_is_locked(A))
>	do_something()			    do_something()
>
> In that both CPUs can end up running do_something at the same time,
> because our _Q_LOCKED_VAL store can drop past the spin_unlock_wait()
> spin_is_locked() loads (even on x86!!).

How would we handle mixed spin_lock()/mutex_lock() code?

For the IPC code, I would like to replace the outer lock with a mutex.
The code only uses spinlocks, because at the time it was written, the
mutex code didn't contain a busy wait. With a mutex, the code would
become simpler (all the lock/unlock/kmalloc/relock parts could be
removed).

The result would be something like:

	mutex_lock(A)			spin_lock(B)
	spin_unlock_wait(B)		if (!mutex_is_locked(A))
	do_something()			    do_something()

--
	Manfred
Re: sem_lock() vs qspinlocks
On Sat, 21 May 2016, Peter Zijlstra wrote:

> On Fri, May 20, 2016 at 05:48:39PM -0700, Davidlohr Bueso wrote:
> > On Fri, 20 May 2016, Linus Torvalds wrote:
> > >
> > > Oh, I definitely agree on the stable part, and yes, the "split things
> > > up" model should come later if people agree that it's a good thing.
> >
> > The backporting part is quite nice, yes, but ultimately I think I
> > prefer Linus' suggestion making things explicit, as opposed to
> > consulting the spinlock implying barriers. I also hate to have an
> > smp_mb() (particularly for spin_is_locked) given that we are not
> > optimizing for the common case (regular mutual excl).
>
> I'm confused; we _are_ optimizing for the common case. spin_is_locked()
> is very unlikely to be used. And arguably should be used less in favour
> of lockdep_assert_held().

Indeed we are. But by 'common case' I was really thinking about
spin_is_locked() vs spin_unlock_wait(). The former being the more common
of the two, and the one which mostly will _not_ be used for lock
correctness purposes, hence it doesn't need that new smp_mb. Hence
allowing users to explicitly set the ordering needs (ie
spin_lock_synchronize()) seems like the better long term alternative.

otoh, with your approach all such bugs are automatically fixed :)

Thanks,
Davidlohr
Re: sem_lock() vs qspinlocks
On 05/21/2016 09:37 AM, Peter Zijlstra wrote:
> On Fri, May 20, 2016 at 05:48:39PM -0700, Davidlohr Bueso wrote:
> > As opposed to spin_is_locked(), spin_unlock_wait() is perhaps more
> > tempting to use for locking correctness. For example, taking a look at
> > nf_conntrack_all_lock(), it too likes to get smart with
> > spin_unlock_wait() -- also for finer graining purposes. While not
> > identical to sems, it goes like:
> >
> >	nf_conntrack_all_lock():	nf_conntrack_lock():
> >	spin_lock(B);			spin_lock(A);
> >
> >					if (bar) { // false
> >	bar = 1;			...
> >					}
> >	[loop ctrl-barrier]
> >	spin_unlock_wait(A);
> >	foo();				foo();
> >
> > If the spin_unlock_wait() doesn't yet see the store that makes A
> > visibly locked, we could end up with both threads in foo(), no?.
> > (Although I'm unsure about that ctrl barrier and archs could fall
> > into it. The point was to see in-tree examples of creative thinking
> > with locking).
>
> I'm tempted to put that trailing smp_rmb() in spin_unlock_wait() too;
> because I suspect the netfilter code is broken without it.
>
> And it seems intuitive to assume that if we return from unlock_wait() we
> can indeed observe the critical section we waited on.

Then !spin_is_locked() and spin_unlock_wait() would be different with
regards to memory barriers.
Would that really help?

My old plan was to document the rules, and define a generic
smp_acquire__after_spin_is_unlocked.
https://lkml.org/lkml/2015/3/1/153

No one supported it, so it ended up as
ipc_smp_acquire__after_spin_is_unlocked().

Should we move it to linux/spinlock.h? Who needs it?

- ipc/sem.c (but please start from the version from linux-next as
  reference, it is far less convoluted compared to the current code)
  https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/ipc/sem.c
- nf_conntrack
- task_rq_lock() perhaps needs smp_acquire__after_ctrl_dep (I didn't
  figure out yet what happened to the proposed patch)
  https://lkml.org/lkml/2015/2/17/129

--
	Manfred
Re: sem_lock() vs qspinlocks
On Sat, May 21, 2016 at 12:01:00AM -0400, Waiman Long wrote:
> On 05/20/2016 08:59 PM, Davidlohr Bueso wrote:
> > On Fri, 20 May 2016, Peter Zijlstra wrote:
> >
> > > On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:
> > >
> > > > > Similarly, and I know you hate it, but afaict, then semantically
> > > > > queued_spin_is_contended() ought to be:
> > > > >
> > > > > -	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> > > > > +	return atomic_read(&lock->val);
> > >
> > > > Looking for a contended lock, you need to consider the lock waiters
> > > > also. So looking at the whole word is right.
> > >
> > > No, you _only_ need to look at the lock waiters.
> >
> > Is there any way to do this in a single atomic_read? My thought is
> > that otherwise we could further expand the race window

It's inherently racy; arrival of a contender is subject to timing. No
point in trying to fix what can't be fixed.

> The existing code is doing that, but I would argue that including the
> locked, but uncontended case isn't a bad idea.

It _IS_ a bad idea, you get unconditional lock-breaks. It's the same as:

	#define spin_is_contended(l)	(true)

Because the only reason you're using spin_is_contended() is if you're
holding it.
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 05:48:39PM -0700, Davidlohr Bueso wrote:
> On Fri, 20 May 2016, Linus Torvalds wrote:
> >
> > Oh, I definitely agree on the stable part, and yes, the "split things
> > up" model should come later if people agree that it's a good thing.
>
> The backporting part is quite nice, yes, but ultimately I think I prefer
> Linus' suggestion making things explicit, as opposed to consulting the
> spinlock implying barriers. I also hate to have an smp_mb() (particularly
> for spin_is_locked) given that we are not optimizing for the common case
> (regular mutual excl).

I'm confused; we _are_ optimizing for the common case. spin_is_locked()
is very unlikely to be used. And arguably should be used less in favour
of lockdep_assert_held().

> As opposed to spin_is_locked(), spin_unlock_wait() is perhaps more
> tempting to use for locking correctness. For example, taking a look at
> nf_conntrack_all_lock(), it too likes to get smart with
> spin_unlock_wait() -- also for finer graining purposes. While not
> identical to sems, it goes like:
>
>	nf_conntrack_all_lock():	nf_conntrack_lock():
>	spin_lock(B);			spin_lock(A);
>
>					if (bar) { // false
>	bar = 1;			...
>					}
>	[loop ctrl-barrier]
>	spin_unlock_wait(A);
>	foo();				foo();
>
> If the spin_unlock_wait() doesn't yet see the store that makes A visibly
> locked, we could end up with both threads in foo(), no?. (Although I'm
> unsure about that ctrl barrier and archs could fall into it. The point
> was to see in-tree examples of creative thinking with locking).

I'm tempted to put that trailing smp_rmb() in spin_unlock_wait() too;
because I suspect the netfilter code is broken without it.

And it seems intuitive to assume that if we return from unlock_wait() we
can indeed observe the critical section we waited on.

Something a little like so; but then for _all_ implementations.
---
 include/asm-generic/qspinlock.h |  6 ++++++
 include/linux/compiler.h        | 13 ++++++++-----
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 6bd05700d8c9..2f2eddd3e1f9 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -135,6 +135,12 @@ static inline void queued_spin_unlock_wait(struct qspinlock *lock)
 	smp_mb();
 	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
 		cpu_relax();
+
+	/*
+	 * Match the RELEASE of the spin_unlock() we just observed. Thereby
+	 * ensuring we observe the whole critical section that ended.
+	 */
+	smp_acquire__after_ctrl_dep();
 }
 
 #ifndef virt_spin_lock
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 793c0829e3a3..3c4bc8160947 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -304,21 +304,24 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 	__u.__val;					\
 })
 
+/*
+ * A control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. ACQUIRE.
+ */
+#define smp_acquire__after_ctrl_dep()	smp_rmb()
+
 /**
  * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
  * @cond: boolean expression to wait for
  *
  * Equivalent to using smp_load_acquire() on the condition variable but employs
  * the control dependency of the wait to reduce the barrier on many platforms.
- *
- * The control dependency provides a LOAD->STORE order, the additional RMB
- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
- * aka. ACQUIRE.
  */
 #define smp_cond_acquire(cond)	do {		\
 	while (!(cond))				\
 		cpu_relax();			\
-	smp_rmb(); /* ctrl + rmb := acquire */	\
+	smp_acquire__after_ctrl_dep();		\
 } while (0)
 
 #endif /* __KERNEL__ */
Re: sem_lock() vs qspinlocks
On 05/20/2016 08:59 PM, Davidlohr Bueso wrote:
> On Fri, 20 May 2016, Peter Zijlstra wrote:
> > On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:
> >
> > > > Similarly, and I know you hate it, but afaict, then semantically
> > > > queued_spin_is_contended() ought to be:
> > > >
> > > > -	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> > > > +	return atomic_read(&lock->val);
> >
> > > Looking for a contended lock, you need to consider the lock waiters
> > > also. So looking at the whole word is right.
> >
> > No, you _only_ need to look at the lock waiters.
>
> Is there any way to do this in a single atomic_read? My thought is that
> otherwise we could further expand the race window of when the lock is
> and isn't contended (as returned to the user). Ie avoiding crap like:
>
>	atomic_read(&lock->val) && atomic_read(&lock->val) != _Q_LOCKED_VAL
>
> In any case, falsely returning for the 'locked, uncontended' case, vs
> completely ignoring waiters is probably the lesser evil :).
>
> Thanks,
> Davidlohr

The existing code is doing that, but I would argue that including the
locked, but uncontended case isn't a bad idea.

Cheers,
Longman
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 5:48 PM, Davidlohr Bueso wrote:
>
> I can verify that this patch fixes the issue.

Ok, I've applied it to my tree.

               Linus
Re: sem_lock() vs qspinlocks
On Fri, 20 May 2016, Peter Zijlstra wrote:

> On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:
>
> > > Similarly, and I know you hate it, but afaict, then semantically
> > > queued_spin_is_contended() ought to be:
> > >
> > > -	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> > > +	return atomic_read(&lock->val);
>
> > Looking for a contended lock, you need to consider the lock waiters
> > also. So looking at the whole word is right.
>
> No, you _only_ need to look at the lock waiters.

Is there any way to do this in a single atomic_read? My thought is that
otherwise we could further expand the race window of when the lock is
and isn't contended (as returned to the user). Ie avoiding crap like:

	atomic_read(&lock->val) && atomic_read(&lock->val) != _Q_LOCKED_VAL

In any case, falsely returning for the 'locked, uncontended' case, vs
completely ignoring waiters is probably the lesser evil :).

Thanks,
Davidlohr
Re: sem_lock() vs qspinlocks
On Fri, 20 May 2016, Linus Torvalds wrote:

> Oh, I definitely agree on the stable part, and yes, the "split things
> up" model should come later if people agree that it's a good thing.

The backporting part is quite nice, yes, but ultimately I think I prefer
Linus' suggestion making things explicit, as opposed to consulting the
spinlock implying barriers. I also hate to have an smp_mb() (particularly
for spin_is_locked) given that we are not optimizing for the common case
(regular mutual excl).

As opposed to spin_is_locked(), spin_unlock_wait() is perhaps more
tempting to use for locking correctness. For example, taking a look at
nf_conntrack_all_lock(), it too likes to get smart with
spin_unlock_wait() -- also for finer graining purposes. While not
identical to sems, it goes like:

	nf_conntrack_all_lock():	nf_conntrack_lock():
	spin_lock(B);			spin_lock(A);

					if (bar) { // false
	bar = 1;			...
					}
	[loop ctrl-barrier]
	spin_unlock_wait(A);
	foo();				foo();

If the spin_unlock_wait() doesn't yet see the store that makes A visibly
locked, we could end up with both threads in foo(), no?. (Although I'm
unsure about that ctrl barrier and archs could fall into it. The point
was to see in-tree examples of creative thinking with locking).

> Should I take the patch as-is, or should I just wait for a pull request
> from the locking tree? Either is ok by me.

I can verify that this patch fixes the issue.

Thanks,
Davidlohr
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 2:06 PM, Peter Zijlstra wrote:
> >
> > See for example "ipc_smp_acquire__after_spin_is_unlocked()", which has
> > a big comment atop of it that now becomes nonsensical with this patch.
>
> Not quite; we still need that I think.

I think so too, but it's the *comment* that is nonsensical. The comment
says that "spin_unlock_wait() and !spin_is_locked() are not memory
barriers", and clearly now those instructions *are* memory barriers with
your patch.

However, the semaphore code wants a memory barrier after the _read_ in
the spin_unlock_wait(), which it doesn't get.

So that is part of why I don't like the "hide memory barriers inside the
implementation". Because once the operations aren't atomic (exactly like
the spinlock is now no longer atomic on x86: it's a separate
read-with-acquire followed by an unordered store for the queued case),
the barrier semantics within such an operation get very screwy. There
may be barriers, but they aren't barriers to *everything*, they are just
barriers to part of the non-atomic operation.

If we were to make the synchronization explicit, we'd still have to deal
with all the subtle semantics, but now the subtle semantics would at
least be *explicit*. And it would make it much easier to explain the
barriers in that ipc semaphore code.

> > Now, I'd take Peter's patch as-is, because I don't think any of this
> > matters from a *performance* standpoint, and Peter's patch is much
> > smaller and simpler.
>
> I would suggest you do this and also mark it for stable v4.2 and later.

Oh, I definitely agree on the stable part, and yes, the "split things
up" model should come later if people agree that it's a good thing.

Should I take the patch as-is, or should I just wait for a pull request
from the locking tree? Either is ok by me.

               Linus
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 10:00:45AM -0700, Linus Torvalds wrote:
> On Fri, May 20, 2016 at 9:04 AM, Peter Zijlstra wrote:
> > +	 * queued_spin_lock_slowpath() can ACQUIRE the lock before
> > +	 * issuing the unordered store that sets _Q_LOCKED_VAL.
>
> Ugh. This was my least favorite part of the queued locks, and I really
> liked the completely unambiguous semantics of the x86 atomics-based
> versions we used to have.
>
> But I guess we're stuck with it.

Yeah, I wasn't too happy either when I realized it today. We _could_
make these atomic ops, but then we're just making things slower for this
one weird case :/

> That said, it strikes me that almost all of the users of
> "spin_is_locked()" are using it for verification purposes (not locking
> correctness),

Right; although I feel those people should be using
lockdep_assert_held() for this instead. That not only compiles out when
!LOCKDEP but also asserts the current task/cpu is the lock owner, not
someone else.

> and that the people who are playing games with locking
> correctness are few and already have to play *other* games anyway.
>
> See for example "ipc_smp_acquire__after_spin_is_unlocked()", which has
> a big comment atop of it that now becomes nonsensical with this patch.

Not quite; we still need that I think.

We now have:

	spin_lock(A);
	smp_mb();
	while (!spin_is_locked(B))
		cpu_relax();
	smp_rmb();

And that control dependency together with the rmb form a load-acquire on
the unlocked B, which matches the release of the spin_unlock(B) and
ensures we observe the whole previous critical section we waited for.

The new smp_mb() doesn't help with that.

> Now, I'd take Peter's patch as-is, because I don't think any of this
> matters from a *performance* standpoint, and Peter's patch is much
> smaller and simpler.

I would suggest you do this and also mark it for stable v4.2 and later.

> But the reason I think it might be a good thing to introduce those
> spin_lock_synchronize() and spin_lock_acquire_after_unlock() concepts
> would be to make it very very clear what those subtle implementations
> in mutexes and the multi-level locks in the ipc layer are doing and
> what they rely on.

We can always do the fancy stuff on top, but that isn't going to need
backporting to all stable trees, this is.

I'll think a little more on the explicit document vs simple thing.
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 04:44:19PM -0400, Waiman Long wrote:
> On 05/20/2016 07:58 AM, Peter Zijlstra wrote:
> > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > As such, the following restores the behavior of the ticket locks and
> > > 'fixes' (or hides?) the bug in sems. Naturally incorrect approach:
> > >
> > > @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
> > >
> > > 	for (i = 0; i < sma->sem_nsems; i++) {
> > > 		sem = sma->sem_base + i;
> > > -		spin_unlock_wait(&sem->lock);
> > > +		while (atomic_read(&sem->lock))
> > > +			cpu_relax();
> > > 	}
> > > 	ipc_smp_acquire__after_spin_is_unlocked();
> > > }
> >
> > The actual bug is clear_pending_set_locked() not having acquire
> > semantics. And the above 'fixes' things because it will observe the old
> > pending bit or the locked bit, so it doesn't matter if the store
> > flipping them is delayed.
>
> The clear_pending_set_locked() is not the only place where the lock is
> set. If there are more than one waiter, the queuing path will be used
> instead. The set_locked(), which is also an unordered store, will then
> be used to set the lock.

Ah yes. I didn't get that far. One case was enough :-)
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 04:47:43PM -0400, Waiman Long wrote:
> > Similarly, and I know you hate it, but afaict, then semantically
> > queued_spin_is_contended() ought to be:
> >
> > -	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> > +	return atomic_read(&lock->val);
>
> Looking for a contended lock, you need to consider the lock waiters
> also. So looking at the whole word is right.

No, you _only_ need to look at the lock waiters.
Re: sem_lock() vs qspinlocks
On 05/20/2016 11:00 AM, Davidlohr Bueso wrote:
> On Fri, 20 May 2016, Peter Zijlstra wrote:
> > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > In addition, this makes me wonder if queued_spin_is_locked() should
> > > then be:
> > >
> > > -	return atomic_read(&lock->val);
> > > +	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
> > >
> > > And avoid considering pending waiters as locked.
> >
> > Probably
>
> Similarly, and I know you hate it, but afaict, then semantically
> queued_spin_is_contended() ought to be:
>
> -	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> +	return atomic_read(&lock->val);
>
> Thanks,
> Davidlohr

Looking for a contended lock, you need to consider the lock waiters
also. So looking at the whole word is right.

Cheers,
Longman
Re: sem_lock() vs qspinlocks
On 05/20/2016 07:58 AM, Peter Zijlstra wrote:
> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > As such, the following restores the behavior of the ticket locks and
> > 'fixes' (or hides?) the bug in sems. Naturally incorrect approach:
> >
> > @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
> >
> > 	for (i = 0; i < sma->sem_nsems; i++) {
> > 		sem = sma->sem_base + i;
> > -		spin_unlock_wait(&sem->lock);
> > +		while (atomic_read(&sem->lock))
> > +			cpu_relax();
> > 	}
> > 	ipc_smp_acquire__after_spin_is_unlocked();
> > }
>
> The actual bug is clear_pending_set_locked() not having acquire
> semantics. And the above 'fixes' things because it will observe the old
> pending bit or the locked bit, so it doesn't matter if the store
> flipping them is delayed.

The clear_pending_set_locked() is not the only place where the lock is
set. If there are more than one waiter, the queuing path will be used
instead. The set_locked(), which is also an unordered store, will then
be used to set the lock.

Cheers,
Longman
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 9:04 AM, Peter Zijlstra wrote:
> +	 * queued_spin_lock_slowpath() can ACQUIRE the lock before
> +	 * issuing the unordered store that sets _Q_LOCKED_VAL.

Ugh. This was my least favorite part of the queued locks, and I really
liked the completely unambiguous semantics of the x86 atomics-based
versions we used to have.

But I guess we're stuck with it.

That said, it strikes me that almost all of the users of
"spin_is_locked()" are using it for verification purposes (not locking
correctness), and that the people who are playing games with locking
correctness are few and already have to play *other* games anyway.

See for example "ipc_smp_acquire__after_spin_is_unlocked()", which has
a big comment atop of it that now becomes nonsensical with this patch.

So I do wonder if we should make that smp_mb() be something the
*caller* has to do, and document rules for it. IOW, introduce a new
spinlock primitive called "spin_lock_synchronize()", and then spinlock
implementations that have this non-atomic behavior with an unordered
store would do something like

	static inline void queued_spin_lock_synchronize(struct qspinlock *a,
							struct qspinlock *b)
	{
		smp_mb();
	}

and then we'd document that *if* you need ordering guarantees between

	spin_lock(a);
	.. spin_is_locked/spin_unlock_wait(b) ..

you have to have a

	spin_lock_synchronize(a, b);

in between. A spin-lock implementation with the old x86 atomic
semantics would make it a no-op.

We should also introduce something like that
"spin_lock_acquire_after_unlock()" so that the ipc/sem.c behavior would
be documented too. Partly because some spinlock implementations might
have stronger unlock ordering and that could be a no-op for them, but
mostly for documentation of the rules.

Now, I'd take Peter's patch as-is, because I don't think any of this
matters from a *performance* standpoint, and Peter's patch is much
smaller and simpler.

But the reason I think it might be a good thing to introduce those
spin_lock_synchronize() and spin_lock_acquire_after_unlock() concepts
would be to make it very very clear what those subtle implementations
in mutexes and the multi-level locks in the ipc layer are doing and
what they rely on.

Comments? There are arguments for Peter's simple approach too ("robust
and simpler interface" vs "make our requirements explicit").

               Linus
Re: sem_lock() vs qspinlocks
On Fri, 20 May 2016, Peter Zijlstra wrote:

> The problem is that the clear_pending_set_locked() is an unordered
> store, therefore this store can be delayed until no later than
> spin_unlock() (which orders against it due to the address dependency).
>
> This opens numerous races; for example:
>
>	ipc_lock_object(&sma->sem_perm);
>	sem_wait_array(sma);
>
>	false -> spin_is_locked(&sma->sem_perm.lock)
>
> is entirely possible, because sem_wait_array() consists of pure reads,
> so the store can pass all that, even on x86.

I had pondered at the unordered stores in clear_pending_set_locked() for
arm, for example, but I _certainly_ missed this for x86 inside the
ACQUIRE region.

Thanks,
Davidlohr
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 05:21:49PM +0200, Peter Zijlstra wrote:
> Let me write a patch..

OK, something like the below then.. lemme go build that and verify that
too fixes things.

---
Subject: locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()

Similar to commits:

  51d7d5205d33 ("powerpc: Add smp_mb() to arch_spin_is_locked()")
  d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against
                 concurrent lockers")

qspinlock suffers from the fact that the _Q_LOCKED_VAL store is
unordered inside the ACQUIRE of the lock.

And while this is not a problem for the regular mutual exclusive
critical section usage of spinlocks, it breaks creative locking like:

	spin_lock(A)			spin_lock(B)
	spin_unlock_wait(B)		if (!spin_is_locked(A))
	do_something()			    do_something()

In that both CPUs can end up running do_something at the same time,
because our _Q_LOCKED_VAL store can drop past the spin_unlock_wait()
spin_is_locked() loads (even on x86!!).

To avoid making the normal case slower, add the smp_mb()s to the less
used spin_unlock_wait() / spin_is_locked() side of things.

Reported-by: Davidlohr Bueso
Reported-by: Giovanni Gherdovich
Signed-off-by: Peter Zijlstra (Intel)
---
 include/asm-generic/qspinlock.h | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 35a52a880b2f..6bd05700d8c9 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -28,7 +28,30 @@
  */
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-	return atomic_read(&lock->val);
+	/*
+	 * queued_spin_lock_slowpath() can ACQUIRE the lock before
+	 * issuing the unordered store that sets _Q_LOCKED_VAL.
+	 *
+	 * See both smp_cond_acquire() sites for more detail.
+	 *
+	 * This however means that in code like:
+	 *
+	 *   spin_lock(A)		spin_lock(B)
+	 *   spin_unlock_wait(B)	spin_is_locked(A)
+	 *   do_something()		do_something()
+	 *
+	 * Both CPUs can end up running do_something() because the store
+	 * setting _Q_LOCKED_VAL will pass through the loads in
+	 * spin_unlock_wait() and/or spin_is_locked().
+	 *
+	 * Avoid this by issuing a full memory barrier between the spin_lock()
+	 * and the loads in spin_unlock_wait() and spin_is_locked().
+	 *
+	 * Note that regular mutual exclusion doesn't care about this
+	 * delayed store.
+	 */
+	smp_mb();
+	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
 }
 
 /**
@@ -108,6 +131,8 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
  */
 static inline void queued_spin_unlock_wait(struct qspinlock *lock)
 {
+	/* See queued_spin_is_locked() */
+	smp_mb();
 	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
 		cpu_relax();
 }
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 05:05:05PM +0200, Peter Zijlstra wrote:
> On Fri, May 20, 2016 at 08:00:49AM -0700, Davidlohr Bueso wrote:
> > On Fri, 20 May 2016, Peter Zijlstra wrote:
> >
> > > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > > In addition, this makes me wonder if queued_spin_is_locked() should then be:
> > > >
> > > > -	return atomic_read(&lock->val);
> > > > +	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
> > > >
> > > > And avoid considering pending waiters as locked.
> > >
> > > Probably
> >
> > Similarly, and I know you hate it, but afaict, then semantically
> > queued_spin_is_contended() ought to be:
> >
> > -	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> > +	return atomic_read(&lock->val);
>
> Nah, that would make it return true for (0,0,1), ie. uncontended locked.

FWIW, the only usage of spin_is_contended() should be for lock breaking,
see spin_needbreak(). This also means that:

	#define spin_is_contended(l)	(false)

is a valid implementation, where the only down-side is worse latency.

This is done (together with GENERIC_LOCKBREAK) to allow trivial
test-and-set spinlock implementations, as these cannot tell if the lock
is contended.
Re: sem_lock() vs qspinlocks
On Fri, 20 May 2016, Peter Zijlstra wrote:
> On Fri, May 20, 2016 at 08:00:49AM -0700, Davidlohr Bueso wrote:
> > On Fri, 20 May 2016, Peter Zijlstra wrote:
> >
> > > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > > In addition, this makes me wonder if queued_spin_is_locked() should then be:
> > > >
> > > > -	return atomic_read(&lock->val);
> > > > +	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
> > > >
> > > > And avoid considering pending waiters as locked.
> > >
> > > Probably
> >
> > Similarly, and I know you hate it, but afaict, then semantically
> > queued_spin_is_contended() ought to be:
> >
> > -	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> > +	return atomic_read(&lock->val);
>
> Nah, that would make it return true for (0,0,1), ie. uncontended locked.

Right, and we want:

	(*, 1, 1)
	(*, 1, 0)
	(n, 0, 0)

I may be missing some combinations, it's still early.
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 10:05:33PM +0800, Boqun Feng wrote:
> On Fri, May 20, 2016 at 01:58:19PM +0200, Peter Zijlstra wrote:
> > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > As such, the following restores the behavior of the ticket locks and 'fixes'
> > > (or hides?) the bug in sems. Naturally incorrect approach:
> > >
> > > @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
> > >
> > >  	for (i = 0; i < sma->sem_nsems; i++) {
> > >  		sem = sma->sem_base + i;
> > > -		spin_unlock_wait(&sem->lock);
> > > +		while (atomic_read(&sem->lock))
> > > +			cpu_relax();
> > >  	}
> > >  	ipc_smp_acquire__after_spin_is_unlocked();
> > > }
> >
> > The actual bug is clear_pending_set_locked() not having acquire
> > semantics. And the above 'fixes' things because it will observe the old
> > pending bit or the locked bit, so it doesn't matter if the store
> > flipping them is delayed.
> >
> > The comment in queued_spin_lock_slowpath() above the smp_cond_acquire()
> > states that that acquire is sufficient, but this is incorrect in the
> > face of spin_is_locked()/spin_unlock_wait() usage only looking at the
> > lock byte.
> >
> > The problem is that the clear_pending_set_locked() is an unordered
> > store, therefore this store can be delayed until no later than
> > spin_unlock() (which orders against it due to the address dependency).
> >
> > This opens numerous races; for example:
> >
> > 	ipc_lock_object(&sma->sem_perm);
> > 	sem_wait_array(sma);
> >
> > 				false ->
> > 				spin_is_locked(&sma->sem_perm.lock)
> >
> > is entirely possible, because sem_wait_array() consists of pure reads,
> > so the store can pass all that, even on x86.
> >
> > The below 'hack' seems to solve the problem.
> >
> > _However_ this also means the atomic_cmpxchg_relaxed() in the locked:
> > branch is equally wrong -- although not visible on x86. And note that
> > atomic_cmpxchg_acquire() would not in fact be sufficient either, since
> > the acquire is on the LOAD not the STORE of the LL/SC.
> >
> > I need a break of sorts, because after twisting my head around the sem
> > code and then the qspinlock code I'm wrecked. I'll try and make a proper
> > patch if people can indeed confirm my thinking here.
>
> I think your analysis is right, however, the problem only exists if we
> have the following use pattern, right?
>
> 	CPU 0			CPU 1
> 	=====			=====
> 	spin_lock(A);		spin_lock(B);
> 	spin_unlock_wait(B);	spin_unlock_wait(A);
> 	do_something();		do_something();

More or less yes. The semaphore code is like:

	spin_lock(A)		spin_lock(B)
	spin_unlock_wait(B)	spin_is_locked(A)

which shows that both spin_is_locked() and spin_unlock_wait() are in the
same class.

> , which ends up CPU 0 and 1 both running do_something(). And actually
> this can be simply fixed by adding smp_mb() between spin_lock() and
> spin_unlock_wait() on both CPUs, or adding an smp_mb() in
> spin_unlock_wait() as PPC does in 51d7d5205d338 ("powerpc: Add smp_mb()
> to arch_spin_is_locked()").

Right, and arm64 does in d86b8da04dfa. Curiously you only fixed
spin_is_locked() and Will only fixed spin_unlock_wait(), while AFAIU we
need to have _BOTH_ fixed.

Now looking at the PPC code, spin_unlock_wait() as per
arch/powerpc/lib/locks.c actually does include the extra smp_mb().

> So if relaxed/acquire atomics and clear_pending_set_locked() work fine
> in other situations, a proper fix would be fixing the
> spin_is_locked()/spin_unlock_wait() or their users?

Right; the relaxed stores work fine for the 'regular' mutual exclusive
critical section usage of locks. And yes, I think only the case you
outlined can care about it.

Let me write a patch..
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 08:00:49AM -0700, Davidlohr Bueso wrote:
> On Fri, 20 May 2016, Peter Zijlstra wrote:
>
> > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > In addition, this makes me wonder if queued_spin_is_locked() should then be:
> > >
> > > -	return atomic_read(&lock->val);
> > > +	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
> > >
> > > And avoid considering pending waiters as locked.
> >
> > Probably
>
> Similarly, and I know you hate it, but afaict, then semantically
> queued_spin_is_contended() ought to be:
>
> -	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
> +	return atomic_read(&lock->val);

Nah, that would make it return true for (0,0,1), ie. uncontended locked.
Re: sem_lock() vs qspinlocks
On Fri, 20 May 2016, Peter Zijlstra wrote:

> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > In addition, this makes me wonder if queued_spin_is_locked() should then be:
> >
> > -	return atomic_read(&lock->val);
> > +	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
> >
> > And avoid considering pending waiters as locked.
>
> Probably

Similarly, and I know you hate it, but afaict, then semantically
queued_spin_is_contended() ought to be:

-	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+	return atomic_read(&lock->val);

Thanks,
Davidlohr
Re: sem_lock() vs qspinlocks
Hi Peter,

On Fri, May 20, 2016 at 01:58:19PM +0200, Peter Zijlstra wrote:
> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > As such, the following restores the behavior of the ticket locks and 'fixes'
> > (or hides?) the bug in sems. Naturally incorrect approach:
> >
> > @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
> >
> >  	for (i = 0; i < sma->sem_nsems; i++) {
> >  		sem = sma->sem_base + i;
> > -		spin_unlock_wait(&sem->lock);
> > +		while (atomic_read(&sem->lock))
> > +			cpu_relax();
> >  	}
> >  	ipc_smp_acquire__after_spin_is_unlocked();
> > }
>
> The actual bug is clear_pending_set_locked() not having acquire
> semantics. And the above 'fixes' things because it will observe the old
> pending bit or the locked bit, so it doesn't matter if the store
> flipping them is delayed.
>
> The comment in queued_spin_lock_slowpath() above the smp_cond_acquire()
> states that that acquire is sufficient, but this is incorrect in the
> face of spin_is_locked()/spin_unlock_wait() usage only looking at the
> lock byte.
>
> The problem is that the clear_pending_set_locked() is an unordered
> store, therefore this store can be delayed until no later than
> spin_unlock() (which orders against it due to the address dependency).
>
> This opens numerous races; for example:
>
> 	ipc_lock_object(&sma->sem_perm);
> 	sem_wait_array(sma);
>
> 				false ->
> 				spin_is_locked(&sma->sem_perm.lock)
>
> is entirely possible, because sem_wait_array() consists of pure reads,
> so the store can pass all that, even on x86.
>
> The below 'hack' seems to solve the problem.
>
> _However_ this also means the atomic_cmpxchg_relaxed() in the locked:
> branch is equally wrong -- although not visible on x86. And note that
> atomic_cmpxchg_acquire() would not in fact be sufficient either, since
> the acquire is on the LOAD not the STORE of the LL/SC.
>
> I need a break of sorts, because after twisting my head around the sem
> code and then the qspinlock code I'm wrecked.
> I'll try and make a proper patch if people can indeed confirm my
> thinking here.

I think your analysis is right, however, the problem only exists if we
have the following use pattern, right?

	CPU 0			CPU 1
	=====			=====
	spin_lock(A);		spin_lock(B);
	spin_unlock_wait(B);	spin_unlock_wait(A);
	do_something();		do_something();

, which ends up CPU 0 and 1 both running do_something(). And actually
this can be simply fixed by adding smp_mb() between spin_lock() and
spin_unlock_wait() on both CPUs, or adding an smp_mb() in
spin_unlock_wait() as PPC does in 51d7d5205d338 ("powerpc: Add smp_mb()
to arch_spin_is_locked()").

So if relaxed/acquire atomics and clear_pending_set_locked() work fine
in other situations, a proper fix would be fixing the
spin_is_locked()/spin_unlock_wait() or their users?

Regards,
Boqun

> ---
>  kernel/locking/qspinlock.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index ce2f75e32ae1..348e172e774f 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -366,6 +366,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	 * *,1,0 -> *,0,1
>  	 */
>  	clear_pending_set_locked(lock);
> +	smp_mb();
>  	return;
>
>  	/*

signature.asc
Description: PGP signature
Re: sem_lock() vs qspinlocks
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> As such, the following restores the behavior of the ticket locks and 'fixes'
> (or hides?) the bug in sems. Naturally incorrect approach:
>
> @@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
>
>  	for (i = 0; i < sma->sem_nsems; i++) {
>  		sem = sma->sem_base + i;
> -		spin_unlock_wait(&sem->lock);
> +		while (atomic_read(&sem->lock))
> +			cpu_relax();
>  	}
>  	ipc_smp_acquire__after_spin_is_unlocked();
> }

The actual bug is clear_pending_set_locked() not having acquire
semantics. And the above 'fixes' things because it will observe the old
pending bit or the locked bit, so it doesn't matter if the store
flipping them is delayed.

The comment in queued_spin_lock_slowpath() above the smp_cond_acquire()
states that that acquire is sufficient, but this is incorrect in the
face of spin_is_locked()/spin_unlock_wait() usage only looking at the
lock byte.

The problem is that the clear_pending_set_locked() is an unordered
store, therefore this store can be delayed until no later than
spin_unlock() (which orders against it due to the address dependency).

This opens numerous races; for example:

	ipc_lock_object(&sma->sem_perm);
	sem_wait_array(sma);

				false ->
				spin_is_locked(&sma->sem_perm.lock)

is entirely possible, because sem_wait_array() consists of pure reads,
so the store can pass all that, even on x86.

The below 'hack' seems to solve the problem.

_However_ this also means the atomic_cmpxchg_relaxed() in the locked:
branch is equally wrong -- although not visible on x86. And note that
atomic_cmpxchg_acquire() would not in fact be sufficient either, since
the acquire is on the LOAD not the STORE of the LL/SC.

I need a break of sorts, because after twisting my head around the sem
code and then the qspinlock code I'm wrecked. I'll try and make a proper
patch if people can indeed confirm my thinking here.
---
 kernel/locking/qspinlock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ce2f75e32ae1..348e172e774f 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -366,6 +366,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * *,1,0 -> *,0,1
 	 */
 	clear_pending_set_locked(lock);
+	smp_mb();
 	return;

 	/*
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 12:09:01PM +0200, Ingo Molnar wrote:
> * Peter Zijlstra wrote:
>
> > On Fri, May 20, 2016 at 10:30:08AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > > Specifically for the 'cascade_cond' and 'cascade_flock' programs,
> > > > which exhibit hangs in libc's semop() blocked waiting for zero.
> > >
> > > OK; so I've been running:
> > >
> > > 	while :; do
> > > 		bin/cascade_cond -E -C 200 -L -S -W -T 200 -I 200 ;
> > > 		bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 500 ;
> > > 	done
> > >
> > > for a few minutes now and its not stuck and my machine didn't splat.
> > >
> > > Am I not doing it right?
> >
> > Hooray, it went *bang*..
>
> I suspect a required step was to post about failure to reproduce!

It is known that the bug is intermittent and that not all machines can
reproduce the problem. If it fails to reproduce, it's not necessarily a
methodology error; it can simply be a function of luck.

-- 
Mel Gorman
SUSE Labs
Re: sem_lock() vs qspinlocks
* Peter Zijlstra wrote:

> On Fri, May 20, 2016 at 10:30:08AM +0200, Peter Zijlstra wrote:
> > On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > > Specifically for the 'cascade_cond' and 'cascade_flock' programs,
> > > which exhibit hangs in libc's semop() blocked waiting for zero.
> >
> > OK; so I've been running:
> >
> > 	while :; do
> > 		bin/cascade_cond -E -C 200 -L -S -W -T 200 -I 200 ;
> > 		bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 500 ;
> > 	done
> >
> > for a few minutes now and its not stuck and my machine didn't splat.
> >
> > Am I not doing it right?
>
> Hooray, it went *bang*..

I suspect a required step was to post about failure to reproduce!

	Ingo
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 11:07:49AM +0200, Giovanni Gherdovich wrote:
> --- run_cascade.sh ---
> #!/bin/bash
>
> TESTCASE=$1
> CASCADE_PATH="libmicro-1-installed/bin-x86_64"
>
> case $TESTCASE in
> 	c_flock_200)
> 		BINNAME="cascade_flock"
> 		COMMAND="$CASCADE_PATH/cascade_flock -E -D 6 -L -S -W \
> 			-N c_flock_200 \
> 			-P 200 -I 500"
> 		# c_flock_200 is supposed to last 60 seconds.
> 		SLEEPTIME=70
> 		;;
> 	c_cond_10)
> 		BINNAME="cascade_cond"
> 		COMMAND="$CASCADE_PATH/cascade_cond -E -C 2000 -L -S -W \
> 			-N c_cond_10 \
> 			-T 10 -I 3000"
> 		# c_cond_10 terminates in less than 1 second.
> 		SLEEPTIME=5
> 		;;
> 	*)
> 		echo "Unknown test case" >&2
> 		exit 1
> 		;;
> esac
>
> ERRORS=0
> uname -a
> for i in {1..10} ; do
> 	{
> 		eval $COMMAND &
> 	} >/dev/null 2>&1
> 	sleep $SLEEPTIME
> 	if pidof $BINNAME >/dev/null ; then
> 		echo Run \#$i: $TESTCASE hangs
> 		for PID in $(pidof $BINNAME) ; do
> 			head -1 /proc/$PID/stack
> 		done | sort | uniq -c
> 		ERRORS=$((ERRORS+1))
> 		killall $BINNAME
> 	else
> 		echo Run \#$i: $TESTCASE exits successfully
> 	fi
> done
> echo $TESTCASE hanged $ERRORS times.

Thanks, that's a much nicer script than mine ;-)
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 10:30:08AM +0200, Peter Zijlstra wrote:
> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> > Specifically for the 'cascade_cond' and 'cascade_flock' programs,
> > which exhibit hangs in libc's semop() blocked waiting for zero.
>
> OK; so I've been running:
>
> 	while :; do
> 		bin/cascade_cond -E -C 200 -L -S -W -T 200 -I 200 ;
> 		bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 500 ;
> 	done
>
> for a few minutes now and its not stuck and my machine didn't splat.
>
> Am I not doing it right?

Hooray, it went *bang*..

OK, lemme have a harder look at this semaphore code.
Re: sem_lock() vs qspinlocks
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> Specifically for the 'cascade_cond' and 'cascade_flock' programs, which
> exhibit hangs in libc's semop() blocked waiting for zero.

OK; so I've been running:

	while :; do
		bin/cascade_cond -E -C 200 -L -S -W -T 200 -I 200 ;
		bin/cascade_flock -E -C 200 -L -S -W -P 200 -I 500 ;
	done

for a few minutes now and it's not stuck and my machine didn't splat.

Am I not doing it right?
Re: sem_lock() vs qspinlocks
On Fri, May 20, 2016 at 10:13:15AM +0200, Peter Zijlstra wrote:
> On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
>
> > [1] https://hg.java.net/hg/libmicro~hg-repo
>
> So far I've managed to install mercurial and clone this thing, but it
> doesn't actually build :/
>
> I'll try harder..

The stuff needs this..

---
diff -r 7dd95b416c3c Makefile.com
--- a/Makefile.com	Thu Jul 26 12:56:00 2012 -0700
+++ b/Makefile.com	Fri May 20 10:18:08 2016 +0200
@@ -107,7 +107,7 @@
 	echo "char compiler_version[] = \""`$(COMPILER_VERSION_CMD)`"\";" > tattle.h
 	echo "char CC[] = \""$(CC)"\";" >> tattle.h
 	echo "char extra_compiler_flags[] = \""$(extra_CFLAGS)"\";" >> tattle.h
-	$(CC) -o tattle $(CFLAGS) -I. ../tattle.c libmicro.a -lrt -lm
+	$(CC) -o tattle $(CFLAGS) -I. ../tattle.c libmicro.a -lrt -lm -lpthread

 $(ELIDED_BENCHMARKS): ../elided.c
 	$(CC) -o $(@) ../elided.c
Re: sem_lock() vs qspinlocks
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote: > [1] https://hg.java.net/hg/libmicro~hg-repo So far I've managed to install mercurial and clone this thing, but it doesn't actually build :/ I'll try harder..
Re: sem_lock() vs qspinlocks
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> However, this is semantically different to what was previously done
> with ticket locks in that spin_unlock_wait() will always observe all
> waiters by adding itself to the tail.

static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
	__ticket_t head = READ_ONCE(lock->tickets.head);

	for (;;) {
		struct __raw_tickets tmp = READ_ONCE(lock->tickets);

		/*
		 * We need to check "unlocked" in a loop, tmp.head == head
		 * can be false positive because of overflow.
		 */
		if (__tickets_equal(tmp.head, tmp.tail) ||
		    !__tickets_equal(tmp.head, head))
			break;

		cpu_relax();
	}
}

I'm not seeing that (although I think I agreed yesterday on IRC).

Note how we observe the head and then loop until either the lock is
unlocked (head == tail) or simply head isn't what it used to be. And
head is the lock holder end of the queue; see arch_spin_unlock()
incrementing it.

So the ticket lock too should only wait for the current lock holder to
go away, not any longer.
Re: sem_lock() vs qspinlocks
On Thu, May 19, 2016 at 10:39:26PM -0700, Davidlohr Bueso wrote:
> In addition, this makes me wonder if queued_spin_is_locked() should then be:
>
> -	return atomic_read(&lock->val);
> +	return atomic_read(&lock->val) & _Q_LOCKED_MASK;
>
> And avoid considering pending waiters as locked.

Probably
sem_lock() vs qspinlocks
Hi, Giovanni ran into a pretty reproducible situation in which the libmicro benchmark[1] shows a functional regression in sysv semaphores, on upstream kernels. Specifically for the 'cascade_cond' and 'cascade_flock' programs, which exhibit hangs in libc's semop() blocked waiting for zero. Alternatively, the following splats may appear: [ 692.991258] BUG: unable to handle kernel NULL pointer dereference (null) [ 692.992062] IP: [] unmerge_queues+0x2f/0x70 [ 692.992062] PGD 862fab067 PUD 858bbc067 PMD 0 [ 692.992062] Oops: [#1] SMP [ 692.992062] Modules linked in: ... [ 692.992062] CPU: 18 PID: 7398 Comm: cascade_flock Tainted: GE 4.6.0-juancho2-default+ #18 [ 692.992062] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0030.R03.1405061547 05/06/2014 [ 692.992062] task: 88084a7e9640 ti: 880854748000 task.ti: 880854748000 [ 692.992062] RIP: 0010:[] [] unmerge_queues+0x2f/0x70 [ 692.992062] RSP: 0018:88085474bce8 EFLAGS: 00010216 [ 692.992062] RAX: RBX: RCX: 88086cc3d0d0 [ 692.992062] RDX: 88086cc3d0d0 RSI: 88086cc3d0d0 RDI: 88086cc3d040 [ 692.992062] RBP: 88085474bce8 R08: 0007 R09: 88086cc3d088 [ 692.992062] R10: R11: 00a1597ea64c R12: 88085474bd90 [ 692.992062] R13: 88086cc3d040 R14: R15: [ 692.992062] FS: 7faa46a2d700() GS:88086e50() knlGS: [ 692.992062] CS: 0010 DS: ES: CR0: 80050033 [ 692.992062] CR2: CR3: 000862faa000 CR4: 001406e0 [ 692.992062] Stack: [ 692.992062] 88085474bf38 812a2ac3 810c3995 88084a7e9640 [ 692.992062] 81cb1f48 0002 880400038000 [ 692.992062] 88084a7e9640 88085474bd40 fffc 88085474bd40 [ 692.992062] Call Trace: [ 692.992062] [] SYSC_semtimedop+0x833/0xc00 [ 692.992062] [] ? __wake_up_common+0x55/0x90 [ 692.992062] [] ? kmem_cache_alloc+0x1e0/0x200 [ 692.992062] [] ? locks_alloc_lock+0x1b/0x70 [ 692.992062] [] ? locks_insert_lock_ctx+0x93/0xa0 [ 692.992062] [] ? flock_lock_inode+0xf4/0x220 [ 692.992062] [] ? locks_lock_inode_wait+0x47/0x160 [ 692.992062] [] ? kmem_cache_alloc+0x1e0/0x200 [ 692.992062] [] ? 
locks_alloc_lock+0x1b/0x70 [ 692.992062] [] ? locks_free_lock+0x4f/0x60 [ 692.992062] [] SyS_semop+0x10/0x20 [ 692.992062] [] entry_SYSCALL_64_fastpath+0x1a/0xa4 [ 692.992062] Code: 00 55 8b 47 7c 48 89 e5 85 c0 75 53 48 8b 4f 48 4c 8d 4f 48 4c 39 c9 48 8b 11 48 89 ce 75 08 eb 36 48 89 d1 48 89 c2 48 8b 41 28 <0f> b7 00 48 c1 e0 06 48 03 47 40 4c 8b 40 18 48 89 70 18 48 83 [ 692.992062] RIP [] unmerge_queues+0x2f/0x70 [ 692.992062] RSP [ 692.992062] CR2: [ 693.882179] ---[ end trace 5605f108ab79cdb2 ]--- Or, [ 463.567641] BUG: unable to handle kernel paging request at fffa [ 463.576246] IP: [] perform_atomic_semop.isra.5+0xcf/0x170 [ 463.584553] PGD 1c0d067 PUD 1c0f067 PMD 0 [ 463.590071] Oops: [#1] SMP [ 463.594667] Modules linked in: ... [ 463.664710] Supported: Yes [ 463.668682] CPU: 6 PID: 2912 Comm: cascade_cond Not tainted 4.4.3-29-default #1 [ 463.677230] Hardware name: SGI.COM C2112-4GP3/X10DRT-P, BIOS 1.0b 04/07/2015 [ 463.685588] task: 88105dba0b40 ti: 8808fc7e task.ti: 8808fc7e [ 463.694366] RIP: 0010:[] [] perform_atomic_semop.isra.5+0xcf/0x170 [ 463.705084] RSP: 0018:8808fc7e3c60 EFLAGS: 00010217 [ 463.711610] RAX: RBX: 88085d22dae0 RCX: 5732f1e7 [ 463.719952] RDX: fffa RSI: 88085d22dad0 RDI: 88085d22da80 [ 463.728270] RBP: R08: fff7 R09: [ 463.736561] R10: R11: 0206 R12: 88085d22da88 [ 463.745001] R13: 88085d22dad0 R14: ffc0 R15: 88085d22da40 [ 463.753450] FS: 7f30fd9e5700() GS:88085fac() knlGS: [ 463.762684] CS: 0010 DS: ES: CR0: 80050033 [ 463.769731] CR2: fffa CR3: 00017bc09000 CR4: 001406e0 [ 463.778039] Stack: [ 463.781130] 8808fc7e3d50 88085d22dad0 8126dfe1 4d226800 [ 463.789704] 88085d22da80 0001 88085d22da88 0001 [ 463.798254] 0001 88085d22da40 8808fc7e3d50 8808fc7e3da0 [ 463.806758] Call Trace: [ 463.810305] [] update_queue+0xa1/0x180 [ 463.816706] [] do_smart_update+0x45/0xf0 [ 463.823276] [] SYSC_semtimedop+0x3d1/0xb00 [ 463.830035] [] entry_SYSCALL_64_fastpath+0x12/0x71 [ 463.838608] DWARF2 unwinder stuck at 
entry_SYSCALL_64_fastpath+0x12/0x71 [ 463.846331] [ 463.848747] Leftover inexact backtrace: [ 463.848747] [ 463.855853] Code: 80 00 00 81 f9 ff ff 00 00 0f 87 98 00 00 00 66 45 89 0b 48 83 c2 06 44 89 00 48 39