Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
On 12/03/2013 03:52 AM, Ingo Molnar wrote:
> I think we try that, to make mutexes safer in general.
>
> Can you see a way to do that fairly cheaply?
>
> I can see two approaches, both rather radical:
>
> 1) Eliminate mutex->count and (ab-)use mutex->wait_lock as 'the' mutex
>    lock: with TICKET_SLOWPATH_FLAG used to denote waiters or so and
>    care taken to not use it as a 'real' spinlock but use the raw
>    accessors.
>
>    This would still allow a good mutex_lock() fastpath, as it would
>    essentially become spin_trylock() with an asm goto slow path helper
>    perhaps.
>
>    Doing this would have various advantages:
>
>     - we'd eliminate much (all?) of per arch mutex code
>     - we'd share spinlock and mutex low level implementations
>     - we'd reduce struct mutex size by 4 bytes
>
>    It's still early in the morning so I might be missing something
>    trivial though - this sounds suspiciously too easy ;-) Having a
>    proper mutex slowpath might not be so easy without changing the
>    spinlock code.
>
> 2) Another method would be to do the opposite: eliminate
>    mutex->wait_lock [for the non-debug case] and do everything via
>    mutex->count and mutex->owner.
>
>    This saves even more space and potentially enables a tighter
>    slowpath.
>
>    It probably won't hurt the massively parallel case, as we already
>    do smart MCS locking via mutex->spin_mlock.
>
> So I'd argue for #2. (Assuming it addresses the problem)
>
> Thanks,
>
> 	Ingo

I also think that #2 is safer, as messing with the spinlock code can be
risky. However, #2 probably won't work for architectures that use the
generic mutex-xchg.h fastpath. Currently the following architectures use
mutex-xchg.h: unicore32, arc, arm and hexagon. Is there a reason why they
cannot be converted to use mutex-dec.h instead?

-Longman
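For readers who have not looked at the generic fastpath headers, the
distinction Waiman is drawing looks roughly like the sketch below
(simplified, not verbatim kernel code; the xchg variant's extra handling
for flagging contention is omitted). mutex-dec.h encodes
1/0/negative in mutex->count with an atomic decrement, while
mutex-xchg.h exists for architectures whose cheap atomic primitive is an
exchange, which makes the "negative means waiters" encoding that
approach #2 leans on much harder to keep:

/* Sketch of the mutex-dec.h flavour (simplified):
 * count is 1 = unlocked, 0 = locked, negative = locked with waiters. */
static inline void
__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
{
	if (unlikely(atomic_dec_return(count) < 0))
		fail_fn(count);		/* fall into the slowpath */
}

/* Sketch of the mutex-xchg.h flavour (simplified): xchg can only swap
 * in a fixed value, so the count cannot cheaply go negative in the
 * fastpath to record that waiters exist. */
static inline void
__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
{
	if (unlikely(atomic_xchg(count, 0) != 1))
		fail_fn(count);		/* fall into the slowpath */
}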
Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
On Wed, Dec 04, 2013 at 01:14:56PM -0800, Linus Torvalds wrote:

> The lock we're moving up isn't the lock that actually protects the
> whole allocation logic (it's the lock that then protects the pipe
> contents when a pipe is *used*). So it's a useless lock, and moving it
> up is a good idea regardless (because it makes the locks only protect
> the parts they are actually *supposed* to protect).
>
> And while an extraneous lock wouldn't normally hurt, the sleeping locks
> (both mutexes and semaphores) aren't actually safe wrt de-allocation -
> they protect anything *inside* the lock, but the lock data structure
> itself is accessed racily wrt other lockers (in a way that still
> leaves the locked region protected, but not the lock itself). If you
> care about details, you can walk through my example.

Yes, this makes sense now. It was spin_unlock_mutex() on the pipe lock
that itself had already been freed and poisoned by another CPU.

This explicit poison check also fires:

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..ae425d0 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -159,6 +159,7 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 			__ticket_unlock_slowpath(lock, prev);
 	} else
 		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
+	WARN_ON(*(unsigned int *)&lock->tickets.head == 0x6b6b6b6c);
 }
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)

It warns only as often as the poison checking already did, with a stack
of warn_*, __mutex_unlock_slowpath(), mutex_unlock(), pipe_release().

Trying to prove a negative, of course, but I tested with your first fix
overnight and got no errors. Current git (with b0d8d2292160bb63de) also
looks good. I will leave it running for a few days.

Thanks for getting stuck on this one. It was educational, at least!

Simon-
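The constant in the WARN_ON() above is not arbitrary; it falls straight
out of SLUB poisoning (assuming TICKET_LOCK_INC is 1 in this
configuration):

/*
 * SLUB fills freed objects with POISON_FREE (0x6b) bytes.  If
 * arch_spin_unlock() is run against a wait_lock that has already been
 * freed, the "head += TICKET_LOCK_INC" above turns the poisoned word
 * 0x6b6b6b6b into 0x6b6b6b6c on little-endian -- the same single
 * incremented poison byte the slub_debug reports were showing.
 */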
Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
On Tue, Dec 03, 2013 at 09:52:33AM +0100, Ingo Molnar wrote:

> Indeed: this comes from mutex->count being separate from
> mutex->wait_lock, and this should affect every architecture that has a
> mutex->count fast-path implemented (essentially every architecture
> that matters).
>
> Such bugs should also magically go away with mutex debugging enabled.

Confirmed: I ran the reproducer with CONFIG_DEBUG_MUTEXES for a few
hours, and never got a single poison overwritten notice.

> I'd expect such bugs to be more prominent with unlucky object
> size/alignment: if mutex->count lies on a separate cache line from
> mutex->wait_lock.
>
> Side note: this might be a valid lightweight debugging technique: we
> could add padding between the two fields to force them into separate
> cache lines, without slowing it down.
>
> Simon, would you be willing to try the fairly trivial patch below?
> Please enable CONFIG_DEBUG_MUTEX_FASTPATH=y. Does your kernel fail
> faster that way?

I didn't see much of a change, other than that the incremented poison
byte is now further in due to the padding, and it shows up in
kmalloc-256.

I also tried with Linus' udelay() suggestion, below. With this, there
were many occurrences per second.

Simon-

diff --git a/kernel/mutex.c b/kernel/mutex.c
index d24105b..f65e735 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -25,6 +25,7 @@
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
 #include <linux/debug_locks.h>
+#include <linux/delay.h>
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -740,6 +741,11 @@ __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
 		wake_up_process(waiter->task);
 	}
 
+	/* udelay a bit if the spinlock isn't contended */
+	if (lock->wait_lock.rlock.raw_lock.tickets.head + 1 ==
+	    lock->wait_lock.rlock.raw_lock.tickets.tail)
+		udelay(1);
+
 	spin_unlock_mutex(&lock->wait_lock, flags);
 }
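One way to read the condition in the second hunk, assuming
TICKET_LOCK_INC is 1 and no paravirt slowpath flag is in play:

/*
 * While wait_lock is held, tickets.head is the holder's ticket and
 * tickets.tail is the next ticket to be handed out, so
 * head + 1 == tail means no other CPU is queued on the spinlock.
 * Delaying only in that case stretches the window between releasing
 * mutex->count and the final write to wait_lock -- the window the
 * fastpath race needs -- without stalling CPUs that are already
 * spinning on wait_lock.
 */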
Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
On Wed, Dec 4, 2013 at 1:19 AM, Simon Kirby wrote:
>
> Meanwhile, I still don't understand how moving the unlock _up_ to cover
> less of the code can solve the race, but I will stare at your long
> explanation more tomorrow.

The lock we're moving up isn't the lock that actually protects the
whole allocation logic (it's the lock that then protects the pipe
contents when a pipe is *used*). So it's a useless lock, and moving it
up is a good idea regardless (because it makes the locks only protect
the parts they are actually *supposed* to protect).

And while an extraneous lock wouldn't normally hurt, the sleeping locks
(both mutexes and semaphores) aren't actually safe wrt de-allocation -
they protect anything *inside* the lock, but the lock data structure
itself is accessed racily wrt other lockers (in a way that still
leaves the locked region protected, but not the lock itself). If you
care about details, you can walk through my example.

              Linus
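A condensed sketch of the unsafe pattern being described (hypothetical
"obj" structure and function names, not the actual pipe code):

struct obj {
	struct mutex	lock;
	int		refcount;	/* protected by obj->lock */
};

void put_obj(struct obj *o)
{
	mutex_lock(&o->lock);
	if (--o->refcount == 0) {
		mutex_unlock(&o->lock);
		kfree(o);		/* last reference: free the object */
		return;
	}
	mutex_unlock(&o->lock);
}

/*
 * Race with two CPUs and refcount initially 2:
 *
 *   CPU1  put_obj(): drops refcount to 1, calls mutex_unlock(); the
 *         unlock releases lock->count *before* it is finished
 *         touching lock->wait_lock in the slowpath.
 *   CPU2  put_obj(): its mutex_lock() fastpath succeeds as soon as
 *         lock->count reads "unlocked", drops refcount to 0, unlocks
 *         and kfree()s the object.
 *   CPU1  ... is still in its unlock slowpath and now writes to
 *         lock->wait_lock inside freed memory.
 */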
Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
On Tue, Dec 03, 2013 at 10:10:29AM -0800, Linus Torvalds wrote:

> On Tue, Dec 3, 2013 at 12:52 AM, Ingo Molnar wrote:
> >
> > I'd expect such bugs to be more prominent with unlucky object
> > size/alignment: if mutex->count lies on a separate cache line from
> > mutex->wait_lock.
>
> I doubt that makes much of a difference. It's still just "CPU cycles"
> away, and the window will be tiny unless you have multi-socket
> machines and/or are just very unlucky.
>
> For stress-testing, it would be much better to use some hack like
>
>     /* udelay a bit if the spinlock isn't contended */
>     if (mutex->wait_lock.ticket.head+1 == mutex->wait_lock.ticket.tail)
>         udelay(1);
>
> in __mutex_unlock_common() just before the spin_unlock(). Make the
> window really *big*.

I haven't had a chance yet to do much testing of the proposed race fix
and race-enlarging patches, but I did write a tool to reproduce the
race. I started it running and left for dinner, and sure enough, it
actually seems to work on plain 3.12 on a Dell PowerEdge r410 w/dual
E5520, reproducing "Poison overwritten" at a rate of about once every
15 minutes (running 6 in parallel, booted with "slub_debug").

I have no idea if actually relying on TSC alignment between cores and
sockets is a reasonable idea these days, but it seems to work. I first
used a read() on a pipe close()d by the other process to synchronize
them, but this didn't seem to work as well as busy-waiting until the
timestamp counters pass a previously-decided-upon start time.

Meanwhile, I still don't understand how moving the unlock _up_ to cover
less of the code can solve the race, but I will stare at your long
explanation more tomorrow.

Simon-

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>

#define MAX_PIPES	450
#define FORK_CYCLES	200
#define CLOSE_CYCLES	10
#define STAT_SHIFT	6

static inline unsigned long readtsc(void)
{
	unsigned int low, high;

	asm volatile("rdtsc" : "=a" (low), "=d" (high));

	return low | ((unsigned long)(high) << 32);
}

static int pipefd[MAX_PIPES][2];

int main(int argc, char *argv[])
{
	unsigned long loop, race_start, miss;
	unsigned long misses = 0, races = 0;
	int i;
	pid_t pid;
	struct rlimit rlim = {
		.rlim_cur = MAX_PIPES * 2 + 96,
		.rlim_max = MAX_PIPES * 2 + 96,
	};

	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
		perror("setrlimit(RLIMIT_NOFILE)");

	for (loop = 1;; loop++) {
		/*
		 * Make a bunch of pipes
		 */
		for (i = 0; i < MAX_PIPES; i++) {
			if (pipe(pipefd[i]) == -1) {
				perror("pipe()");
				exit(EXIT_FAILURE);
			}
		}

		race_start = readtsc() + FORK_CYCLES;
		asm("":::"memory");

		pid = fork();
		if (pid == -1) {
			perror("fork()");
			exit(EXIT_FAILURE);
		}
		pid = !!pid;

		/*
		 * Close one pipe descriptor per process
		 */
		for (i = 0; i < MAX_PIPES; i++)
			close(pipefd[i][!pid]);

		for (i = 0; i < MAX_PIPES; i++) {
			/*
			 * Line up and try to close at the same time
			 */
			miss = 1;
			while (readtsc() < race_start)
				miss = 0;
			close(pipefd[i][pid]);
			misses += miss;
			race_start += CLOSE_CYCLES;
		}
		races += MAX_PIPES;

		if (!(loop & (~(~0UL << STAT_SHIFT))))
			fprintf(stderr, "%c %lu (%.2f%% false starts)\n",
				"CP"[pid], readtsc(),
				misses * 100. / races);

		if (pid)
			wait(NULL);	/* Parent */
		else
			exit(0);	/* Child */
	}
}
Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
On Tue, Dec 3, 2013 at 12:52 AM, Ingo Molnar wrote:
>
> I'd expect such bugs to be more prominent with unlucky object
> size/alignment: if mutex->count lies on a separate cache line from
> mutex->wait_lock.

I doubt that makes much of a difference. It's still just "CPU cycles"
away, and the window will be tiny unless you have multi-socket
machines and/or are just very unlucky.

For stress-testing, it would be much better to use some hack like

    /* udelay a bit if the spinlock isn't contended */
    if (mutex->wait_lock.ticket.head+1 == mutex->wait_lock.ticket.tail)
        udelay(1);

in __mutex_unlock_common() just before the spin_unlock(). Make the
window really *big*.

That said:

>> So having a non-atomic refcount protected inside a sleeping lock is
>> a bug, and that's really the bug that ended up biting us for pipes.
>>
>> Now, the question is: do we have other such cases? How do we
>> document this? [...]
>
> I'd not be surprised if we had such cases, especially where
> spinlock->mutex conversions were done

So I actually don't think we do.

Why? Having a sleeping lock inside the object that protects the
refcount is actually hard to do. You need to increment the refcount
when finding the object, but that in turn tends to imply holding the
lock *before* you find it. Or you have to find it locklessly, which in
turn implies RCU, which in turn implies non-sleeping lock.

And quite frankly, anybody who uses SRCU and a sleeping lock is just
broken to begin with. That's just an abomination. If the RT codepaths
do something like that, don't even tell me. It's broken and stupid.

So protecting a refcount with a mutex in the same object is really
quite hard. Even the pipe code didn't actually do that, it just
happened to nest the real lock inside the sleeping lock (but that's
also why it was so easy to fix - the sleeping lock wasn't actually
necessary or helpful in the refcounting path).

So my *gut* feel is that nobody does this. But people have been known
to do insane things just because they use too many sleeping locks and
think it's "better".

              Linus
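For contrast, a generic sketch of the shape Linus calls safe (this is
not the actual pipe fix): the final put uses an atomic refcount and
never touches a lock embedded in the object, so no CPU can still be
inside the object's lock structure when it is freed.

struct obj {
	atomic_t	refcount;
	struct mutex	lock;	/* protects the payload, never the refcount */
};

static void put_obj(struct obj *o)
{
	/*
	 * Each user holds a reference for as long as it touches *o,
	 * and drops it only after mutex_unlock(&o->lock) has returned,
	 * so by the time the count hits zero no CPU can still be
	 * inside o->lock's unlock path.
	 */
	if (atomic_dec_and_test(&o->refcount))
		kfree(o);
}

The other safe shape is the one mentioned above: keep the lock that
guards lookup and refcounting *outside* the object (in the containing
table or parent), so the object's own lock is never what serializes its
final teardown.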
[PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
* Linus Torvalds wrote:

> - but in the meantime, CPU1 is busy still unlocking:
>
>         if (!list_empty(&lock->wait_list)) {
>                 /* get the first entry from the wait-list: */
>                 struct mutex_waiter *waiter =
>                                 list_entry(lock->wait_list.next,
>                                            struct mutex_waiter, list);
>
>                 debug_mutex_wake_waiter(lock, waiter);
>
>                 wake_up_process(waiter->task);
>         }
>         spin_unlock_mutex(&lock->wait_lock, flags);
>
> and in particular notice how CPU1 wrote to the lock *after* CPU2 had
> gotten the lock and free'd the data structure. In practice, it's
> only really that "spin_unlock_mutex(&lock->wait_lock, flags);" that
> happens (and this is how we got the single byte being incremented).
>
> In other words, it's unsafe to protect reference counts inside
> objects with anything but spinlocks and/or atomic refcounts. Or you
> have to have the lock *outside* the object you're protecting (which
> is often what you want for other reasons anyway, notably lookup).

Indeed: this comes from mutex->count being separate from
mutex->wait_lock, and this should affect every architecture that has a
mutex->count fast-path implemented (essentially every architecture
that matters).

Such bugs should also magically go away with mutex debugging enabled.

I'd expect such bugs to be more prominent with unlucky object
size/alignment: if mutex->count lies on a separate cache line from
mutex->wait_lock.

Side note: this might be a valid lightweight debugging technique: we
could add padding between the two fields to force them into separate
cache lines, without slowing it down.

Simon, would you be willing to try the fairly trivial patch below?
Please enable CONFIG_DEBUG_MUTEX_FASTPATH=y. Does your kernel fail
faster that way?

> So having a non-atomic refcount protected inside a sleeping lock is
> a bug, and that's really the bug that ended up biting us for pipes.
>
> Now, the question is: do we have other such cases? How do we
> document this? [...]

I'd not be surprised if we had such cases, especially where
spinlock->mutex conversions were done. About 20% of the kernel's
critical sections use mutexes, 60% use spinlocks. (The remaining 20%
is shared between semaphores, rwlocks and rwsems, in that order of
frequency.)

> [...] Do we try to make mutexes and other locks safe to use for
> things like this?

I think we try that, to make mutexes safer in general.

Can you see a way to do that fairly cheaply?

I can see two approaches, both rather radical:

1) Eliminate mutex->count and (ab-)use mutex->wait_lock as 'the' mutex
   lock: with TICKET_SLOWPATH_FLAG used to denote waiters or so and
   care taken to not use it as a 'real' spinlock but use the raw
   accessors.

   This would still allow a good mutex_lock() fastpath, as it would
   essentially become spin_trylock() with an asm goto slow path helper
   perhaps.

   Doing this would have various advantages:

    - we'd eliminate much (all?) of per arch mutex code
    - we'd share spinlock and mutex low level implementations
    - we'd reduce struct mutex size by 4 bytes

   It's still early in the morning so I might be missing something
   trivial though - this sounds suspiciously too easy ;-) Having a
   proper mutex slowpath might not be so easy without changing the
   spinlock code.

2) Another method would be to do the opposite: eliminate
   mutex->wait_lock [for the non-debug case] and do everything via
   mutex->count and mutex->owner.

   This saves even more space and potentially enables a tighter
   slowpath.

   It probably won't hurt the massively parallel case, as we already
   do smart MCS locking via mutex->spin_mlock.

So I'd argue for #2. (Assuming it addresses the problem)

Thanks,

	Ingo

===>
Subject: mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
From: Ingo Molnar

Help debug SMP ordering issues by forcing mutex->count on a separate
cache line with mutex->wait_lock. This will make certain classes of
races more prominent, without slowing down other parts of the mutex
code.

Signed-off-by: Ingo Molnar

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index d318193..6952fc4 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -49,6 +49,15 @@ struct mutex {
 	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
 	atomic_t		count;
+#ifdef CONFIG_DEBUG_MUTEX_FASTPATH
+	/*
+	 * Debug SMP ordering issues by forcing mutex->count on a separate
+	 * cache line with mutex->wait_lock. This will make certain classes
+	 * of races more prominent, without slowing down other parts of the
+	 * mutex code.
+	 */
+	u8			__cache_padding[64];
+#endif
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;