Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races

2013-12-11 Thread Waiman Long

On 12/03/2013 03:52 AM, Ingo Molnar wrote:

I think we should try that, to make mutexes safer in general.

Can you see a way to do that fairly cheaply?

I can see two approaches, both rather radical:

1)

Eliminate mutex->count and (ab-)use mutex->wait_lock as 'the' mutex
lock, with TICKET_SLOWPATH_FLAG (or something like it) used to denote
waiters, and care taken not to use it as a 'real' spinlock but via the
raw accessors.

This would still allow a good mutex_lock() fastpath, as it would
essentially become spin_trylock() with an asm goto slow path helper
perhaps.

Doing this would have various advantages:

  - we'd eliminate much (all?) of per arch mutex code
  - we'd share spinlock and mutex low level implementations
  - we'd reduce struct mutex size by 4 bytes

It's still early in the morning so I might be missing something
trivial though - this sounds suspiciously too easy ;-) Having a proper
mutex slowpath might not be so easy without changing the spinlock
code.

2)

Another method would be to do the opposite: eliminate mutex->wait_lock
[for the non-debug case] and do everything via mutex->count and
mutex->owner.

This saves even more space and potentially enables a tighter slowpath.

It probably won't hurt the massively parallel case, as we already do
smart MCS locking via mutex->spin_mlock.

So I'd argue for #2. (Assuming it addresses the problem)

Thanks,

Ingo




I also think that #2 is safer as messing with spinlock code can be 
risky. However, #2 probably won't work for architectures that use the 
generic mutex-xchg.h fastpath. Currently the following architectures use 
mutex-xchg.h - unicore32, arc, arm and hexagon. Is there a reason why 
they cannot be converted to use mutex-dec.h instead?
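
For reference, the two generic lock fastpaths look roughly like this
(paraphrased from asm-generic/mutex-dec.h and asm-generic/mutex-xchg.h
circa 3.12, not verbatim; both helpers are really called
__mutex_fastpath_lock, renamed here so they can be shown side by side):

/* mutex-dec.h flavour: 1 = unlocked, 0 = locked, negative = waiters */
static inline void
__mutex_fastpath_lock_dec(atomic_t *count, void (*fail_fn)(atomic_t *))
{
	if (unlikely(atomic_dec_return(count) < 0))
		fail_fn(count);
}

/*
 * mutex-xchg.h flavour: only xchg is available, so a failed acquisition
 * has to re-mark the lock as contended by hand.
 */
static inline void
__mutex_fastpath_lock_xchg(atomic_t *count, void (*fail_fn)(atomic_t *))
{
	if (unlikely(atomic_xchg(count, 0) != 1))
		if (likely(atomic_xchg(count, -1) != 1))
			fail_fn(count);
}

The xchg variant overwrites whatever waiter state was encoded in the
counter, which is presumably why a scheme that leans entirely on
mutex->count and mutex->owner is harder to map onto those architectures.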


-Longman




Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races

2013-12-05 Thread Simon Kirby
On Wed, Dec 04, 2013 at 01:14:56PM -0800, Linus Torvalds wrote:

> The lock we're moving up isn't the lock that actually protects the
> whole allocation logic (it's the lock that then protects the pipe
> contents when a pipe is *used*). So it's a useless lock, and moving it
> up is a good idea regardless (because it makes the locks only protect
> the parts they are actually *supposed* to protect).
> 
> And while an extraneous lock wouldn't normally hurt, the sleeping locks
> (both mutexes and semaphores) aren't actually safe wrt de-allocation -
> they protect anything *inside* the lock, but the lock data structure
> itself is accessed racily wrt other lockers (in a way that still
> leaves the locked region protected, but not the lock itself). If you
> care about details, you can walk through my example.

Yes, this makes sense now. It was spin_unlock_mutex() on the pipe lock
that itself was already freed and poisoned by another CPU. This
explicit poison check also fires:

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..ae425d0 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -159,6 +159,7 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 			__ticket_unlock_slowpath(lock, prev);
 	} else
 		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
+	WARN_ON(*(unsigned int *)&lock->tickets.head == 0x6b6b6b6c);
 }
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)

It warns only as often as the poison checking already did, with a stack
of warn_*, __mutex_unlock_slowpath(), mutex_unlock(), pipe_release().

Trying to prove a negative, of course, but I tested with your first fix
overnight and got no errors. Current git (with b0d8d2292160bb63de) also
looks good. I will leave it running for a few days.

Thanks for getting stuck on this one. It was educational, at least!

Simon-


Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races

2013-12-04 Thread Simon Kirby
On Tue, Dec 03, 2013 at 09:52:33AM +0100, Ingo Molnar wrote:

> Indeed: this comes from mutex->count being separate from 
> mutex->wait_lock, and this should affect every architecture that has a 
> mutex->count fast-path implemented (essentially every architecture 
> that matters).
> 
> Such bugs should also magically go away with mutex debugging enabled.

Confirmed: I ran the reproducer with CONFIG_DEBUG_MUTEXES for a few
hours, and never got a single poison overwritten notice.

> I'd expect such bugs to be more prominent with unlucky object 
> size/alignment: if mutex->count lies on a separate cache line from 
> mutex->wait_lock.
> 
> Side note: this might be a valid lightweight debugging technique: we
> could add padding between the two fields to force them into separate
> cache lines, without slowing it down.
> 
> Simon, would you be willing to try the fairly trivial patch below? 
> Please enable CONFIG_DEBUG_MUTEX_FASTPATH=y. Does your kernel fail 
> faster that way?

I didn't see much of a change, other than that the incremented poison
byte is now further in due to the padding and it shows up in kmalloc-256.

I also tried with Linus' udelay() suggestion, below. With this, there
were many occurrences per second.

Simon-

diff --git a/kernel/mutex.c b/kernel/mutex.c
index d24105b..f65e735 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -25,6 +25,7 @@
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
 #include <linux/debug_locks.h>
+#include <linux/delay.h>
 
 /*
  * In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -740,6 +741,11 @@ __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
 		wake_up_process(waiter->task);
 	}
 
+	/* udelay a bit if the spinlock isn't contended */
+	if (lock->wait_lock.rlock.raw_lock.tickets.head + 1 ==
+	    lock->wait_lock.rlock.raw_lock.tickets.tail)
+		udelay(1);
+
 	spin_unlock_mutex(&lock->wait_lock, flags);
 }
 


Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races

2013-12-04 Thread Linus Torvalds
On Wed, Dec 4, 2013 at 1:19 AM, Simon Kirby  wrote:
>
> Meanwhile, I still don't understand how moving the unlock _up_ to cover
> less of the code can solve the race, but I will stare at your long
> explanation more tomorrow.

The lock we're moving up isn't the lock that actually protects the
whole allocation logic (it's the lock that then protects the pipe
contents when a pipe is *used*). So it's a useless lock, and moving it
up is a good idea regardless (because it makes the locks only protect
the parts they are actually *supposed* to protect).

And while an extraneous lock wouldn't normally hurt, the sleeping locks
(both mutexes and semaphores) aren't actually safe wrt de-allocation -
they protect anything *inside* the lock, but the lock data structure
itself is accessed racily wrt other lockers (in a way that still
leaves the locked region protected, but not the lock itself). If you
care about details, you can walk through my example.

 Linus


Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races

2013-12-04 Thread Simon Kirby
On Tue, Dec 03, 2013 at 10:10:29AM -0800, Linus Torvalds wrote:

> On Tue, Dec 3, 2013 at 12:52 AM, Ingo Molnar  wrote:
> >
> > I'd expect such bugs to be more prominent with unlucky object
> > size/alignment: if mutex->count lies on a separate cache line from
> > mutex->wait_lock.
> 
> I doubt that makes much of a difference. It's still just "CPU cycles"
> away, and the window will be tiny unless you have multi-socket
> machines and/or are just very unlucky.
> 
> For stress-testing, it would be much better to use some hack like
> 
> /* udelay a bit if the spinlock isn't contended */
> if (mutex->wait_lock.ticket.head+1 == mutex->wait_lock.ticket.tail)
> udelay(1);
> 
> in __mutex_unlock_common() just before the spin_unlock(). Make the
> window really *big*.

I haven't had a chance yet to do much testing of the proposed race fix
and race enlarging patches, but I did write a tool to reproduce the race.
I started it running and left for dinner, and sure enough, it actually
seems to work on plain 3.12 on a Dell PowerEdge r410 w/dual E5520,
reproducing "Poison overwritten" at a rate of about once every 15 minutes
(running 6 in parallel, booted with "slub_debug").

I have no idea if actually relying on tsc alignment between cores and
sockets is a reasonable idea these days, but it seems to work. I first
used a read() on a pipe close()d by the other process to synchronize
them, but this didn't seem to work as well as busy-waiting until the
timestamp counters pass a previously-decided-upon start time.

Meanwhile, I still don't understand how moving the unlock _up_ to cover
less of the code can solve the race, but I will stare at your long
explanation more tomorrow.

Simon-
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>

#define MAX_PIPES 450
#define FORK_CYCLES 200
#define CLOSE_CYCLES 10
#define STAT_SHIFT 6

static inline unsigned long readtsc()
{
	unsigned int low, high;
	asm volatile("rdtsc" : "=a" (low), "=d" (high));
	return low | ((unsigned long)(high) << 32);
}

static int pipefd[MAX_PIPES][2];

int main(int argc, char *argv[]) {
	unsigned long loop, race_start, miss;
	unsigned long misses = 0, races = 0;
	int i;
	pid_t pid;
	struct rlimit rlim = {
		.rlim_cur = MAX_PIPES * 2 + 96,
		.rlim_max = MAX_PIPES * 2 + 96,
	};

	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
		perror("setrlimit(RLIMIT_NOFILE)");

	for (loop = 1;;loop++) {
		/*
		 * Make a bunch of pipes
		 */
		for (i = 0;i < MAX_PIPES;i++) {
			if (pipe(pipefd[i]) == -1) {
perror("pipe()");
exit(EXIT_FAILURE);
			}
		}

		race_start = readtsc() + FORK_CYCLES;

		asm("":::"memory");

		pid = fork();
		if (pid == -1) {
			perror("fork()");
			exit(EXIT_FAILURE);
		}
		pid = !!pid;

		/*
		 * Close one pipe descriptor per process
		 */
		for (i = 0;i < MAX_PIPES;i++)
			close(pipefd[i][!pid]);

		for (i = 0;i < MAX_PIPES;i++) {
			/*
			 * Line up and try to close at the same time
			 */
			miss = 1;
			while (readtsc() < race_start)
				miss = 0;

			close(pipefd[i][pid]);

			misses+= miss;
			race_start+= CLOSE_CYCLES;
		}
		races+= MAX_PIPES;

		if (!(loop & (~(~0UL << STAT_SHIFT))))
			fprintf(stderr, "%c %lu (%.2f%% false starts)\n",
				"CP"[pid], readtsc(), misses * 100. / races);

		if (pid)
			wait(NULL);	/* Parent */
		else
			exit(0);	/* Child */
	}
}



Re: [PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races

2013-12-03 Thread Linus Torvalds
On Tue, Dec 3, 2013 at 12:52 AM, Ingo Molnar  wrote:
>
> I'd expect such bugs to be more prominent with unlucky object
> size/alignment: if mutex->count lies on a separate cache line from
> mutex->wait_lock.

I doubt that makes much of a difference. It's still just "CPU cycles"
away, and the window will be tiny unless you have multi-socket
machines and/or are just very unlucky.

For stress-testing, it would be much better to use some hack like

/* udelay a bit if the spinlock isn't contended */
if (mutex->wait_lock.ticket.head+1 == mutex->wait_lock.ticket.tail)
udelay(1);

in __mutex_unlock_common() just before the spin_unlock(). Make the
window really *big*.

That said:

>> So having a non-atomic refcount protected inside a sleeping lock is
>> a bug, and that's really the bug that ended up biting us for pipes.
>>
>> Now, the question is: do we have other such cases? How do we
>> document this? [...]
>
> I'd not be surprised if we had such cases, especially where
> spinlock->mutex conversions were done

So I actually don't think we do. Why? Having a sleeping lock inside
the object that protects the refcount is actually hard to do.

You need to increment the refcount when finding the object, but that
in turn tends to imply holding the lock *before* you find it. Or you
have to find it locklessly, which in turn implies RCU, which in turn
implies non-sleeping lock.
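
A minimal sketch of that pattern (made-up names, not the pipe code): the
object is found locklessly under RCU, so the refcount must be atomic and
the sleeping lock can only guard the contents:

struct obj {
	atomic_t	refcount;	/* lifetime: atomic, not protected by the mutex */
	struct mutex	lock;		/* guards the payload only */
	struct rcu_head	rcu;
	/* ... payload ... */
};

static RADIX_TREE(obj_table, GFP_KERNEL);	/* illustrative lookup structure */

static struct obj *obj_get(unsigned long id)
{
	struct obj *o;

	rcu_read_lock();			/* non-sleeping: the lookup is lockless */
	o = radix_tree_lookup(&obj_table, id);
	if (o && !atomic_inc_not_zero(&o->refcount))
		o = NULL;			/* lost the race against the final put */
	rcu_read_unlock();

	return o;
}

static void obj_put(struct obj *o)
{
	if (atomic_dec_and_test(&o->refcount))
		kfree_rcu(o, rcu);
}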

And quite frankly, anybody who uses SRCU and a sleeping lock is just
broken to begin with. That's just an abomination.  If the RT codepaths
do something like that, don't even tell me. It's broken and stupid.

So protecting a refcount with a mutex in the same object is really
quite hard. Even the pipe code didn't actually do that, it just
happened to nest the real lock inside the sleeping lock (but that's
also why it was so easy to fix - the sleeping lock wasn't actually
necessary or helpful in the refcounting path).

So my *gut* feel is that nobody does this. But people have been known
to do insane things just because they use too many sleeping locks and
think it's "better".

Linus


[PATCH] mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races

2013-12-03 Thread Ingo Molnar

* Linus Torvalds  wrote:

>  - but in the meantime, CPU1 is busy still unlocking:
> 
> if (!list_empty(&lock->wait_list)) {
> /* get the first entry from the wait-list: */
> struct mutex_waiter *waiter =
> list_entry(lock->wait_list.next,
>struct mutex_waiter, list);
> 
> debug_mutex_wake_waiter(lock, waiter);
> 
> wake_up_process(waiter->task);
> }
> spin_unlock_mutex(&lock->wait_lock, flags);
> 
> and in particular notice how CPU1 wrote to the lock *after* CPU2 had 
> gotten the lock and free'd the data structure. In practice, it's 
> only really that "spin_unlock_mutex(&lock->wait_lock, flags);" that 
> happens (and this is how we got the single byte being incremented).
> 
> In other words, it's unsafe to protect reference counts inside 
> objects with anything but spinlocks and/or atomic refcounts. Or you 
> have to have the lock *outside* the object you're protecting (which 
> is often what you want for other reasons anyway, notably lookup).

Indeed: this comes from mutex->count being separate from 
mutex->wait_lock, and this should affect every architecture that has a 
mutex->count fast-path implemented (essentially every architecture 
that matters).
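
Spelled out as a timeline (simplified from the description above, not a
literal trace):

/*
 *   CPU1 (__mutex_unlock_common_slowpath)    CPU2 (holds the last reference)
 *   -------------------------------------    -------------------------------
 *   spin_lock(&lock->wait_lock);
 *   atomic_set(&lock->count, 1);
 *                                            mutex_lock() fastpath:
 *                                              atomic_dec_return() == 0,
 *                                              lock acquired
 *                                            ... sees the refcount hit zero,
 *                                            mutex_unlock(), kfree()s the
 *                                            object embedding the mutex
 *   wake_up_process(waiter->task);
 *   spin_unlock(&lock->wait_lock);    <-- writes into freed, poisoned memory
 */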

Such bugs should also magically go away with mutex debugging enabled.

I'd expect such bugs to be more prominent with unlucky object 
size/alignment: if mutex->count lies on a separate cache line from 
mutex->wait_lock.

Side note: this might be a valid lightweight debugging technique: we
could add padding between the two fields to force them into separate
cache lines, without slowing it down.

Simon, would you be willing to try the fairly trivial patch below? 
Please enable CONFIG_DEBUG_MUTEX_FASTPATH=y. Does your kernel fail 
faster that way?

> So having a non-atomic refcount protected inside a sleeping lock is 
> a bug, and that's really the bug that ended up biting us for pipes.
> 
> Now, the question is: do we have other such cases? How do we 
> document this? [...]

I'd not be surprised if we had such cases, especially where 
spinlock->mutex conversions were done. About 20% of the kernel's 
critical sections use mutexes, 60% use spinlocks. (The remaining 20%
is shared between semaphores, rwlocks and rwsems, in that order of
frequency.)

> [...] Do we try to make mutexes and other locks safe to use for 
> things like this?

I think we should try that, to make mutexes safer in general.

Can you see a way to do that fairly cheaply?

I can see two approaches, both rather radical:

1)

Eliminate mutex->count and (ab-)use mutex->wait_lock as 'the' mutex
lock, with TICKET_SLOWPATH_FLAG (or something like it) used to denote
waiters, and care taken not to use it as a 'real' spinlock but via the
raw accessors.

This would still allow a good mutex_lock() fastpath, as it would 
essentially become spin_trylock() with an asm goto slow path helper 
perhaps.
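
Purely as a sketch of what that fastpath could look like (hypothetical
code; __mutex_lock_slowpath_wait_lock() is a made-up name):

static inline void mutex_lock_sketch(struct mutex *lock)
{
	/* uncontended case: a plain trylock on the raw ticket lock word */
	if (likely(arch_spin_trylock(&lock->wait_lock.rlock.raw_lock)))
		return;

	/*
	 * contended case: queue on wait_list, with something like
	 * TICKET_SLOWPATH_FLAG marking that waiters exist so the unlock
	 * path knows it has to hand the lock over / wake someone.
	 */
	__mutex_lock_slowpath_wait_lock(lock);
}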

Doing this would have various advantages:

 - we'd eliminate much (all?) of per arch mutex code
 - we'd share spinlock and mutex low level implementations
 - we'd reduce struct mutex size by 4 bytes

It's still early in the morning so I might be missing something 
trivial though - this sounds suspiciously too easy ;-) Having a proper 
mutex slowpath might not be so easy without changing the spinlock 
code.

2)

Another method would be to do the opposite: eliminate mutex->wait_lock 
[for the non-debug case] and do everything via mutex->count and 
mutex->owner.

This saves even more space and potentially enables a tighter slowpath.

It probably won't hurt the massively parallel case, as we already do 
smart MCS locking via mutex->spin_mlock.
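
For reference, struct mutex around v3.12 looks roughly like this (config
#ifdefs and debug fields elided); #1 would drop 'count' and reuse
'wait_lock' as the lock word, #2 would drop 'wait_lock' in the non-debug
case:

struct mutex {
	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
	atomic_t		count;		/* gone under #1 */
	spinlock_t		wait_lock;	/* gone (non-debug) under #2 */
	struct list_head	wait_list;
	struct task_struct	*owner;		/* SMP / debug builds only */
	void			*spin_mlock;	/* CONFIG_MUTEX_SPIN_ON_OWNER: MCS queue of spinners */
	/* debug / lockdep fields omitted */
};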

So I'd argue for #2. (Assuming it addresses the problem)

Thanks,

Ingo

===>
Subject: mutexes: Add CONFIG_DEBUG_MUTEX_FASTPATH=y debug variant to debug SMP races
From: Ingo Molnar 

Help debug SMP ordering issues by forcing mutex->count onto a separate
cache line from mutex->wait_lock. This will make certain classes of races
more prominent, without slowing down other parts of the mutex code.

Signed-off-by: Ingo Molnar 

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index d318193..6952fc4 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -49,6 +49,15 @@
 struct mutex {
 	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
 	atomic_t		count;
+#ifdef CONFIG_DEBUG_MUTEX_FASTPATH
+	/*
+	 * Debug SMP ordering issues by forcing mutex->count onto a separate
+	 * cache line from mutex->wait_lock. This will make certain classes
+	 * of races more prominent, without slowing down other parts of the
+	 * mutex code.
+	 */
+	u8			__cache_padding[64];
+#endif
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;