Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-14 Thread Waiman Long

On 06/14/2013 09:26 PM, Benjamin Herrenschmidt wrote:

On Fri, 2013-06-14 at 14:17 -0400, Waiman Long wrote:

With some minor changes, the current patch can be modified to support
debug locks on 32-bit systems. For 64-bit systems, we can apply a
similar concept to debug locks by using cmpxchg_double. However,
architectures without cmpxchg_double support will be out of luck, and we
probably couldn't support the same feature in debug mode there; they
would have to fall back to taking the lock.

That means only x86_64 and s390 would benefit from it ... I'm sure we can do 
better :-)

Cheers,
Ben.


On second thought, using cmpxchg_double may not be such a good idea 
after all, as it requires 16-byte alignment, at least on x86-64. 
Another possible alternative is to integrate the reference count 
directly into the spinlock_t data structure, immediately after 
arch_spinlock_t, for this special case. If CONFIG_GENERIC_LOCKBREAK is 
not defined, there will be a 4-byte hole that can be used. Otherwise, 
the spinlock_t structure will grow by 8 bytes. I suppose others won't 
be too upset about an 8-byte size increase when spinlock debugging is 
turned on.
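
A minimal, self-contained sketch of the layout idea above -- the names are
made up for illustration and this is not the real spinlock_t definition;
the point is only that a 32-bit count placed right after a 32-bit lock
word gives an aligned 64-bit pair that can be updated as one quantity:

/*
 * Illustrative only -- not the real include/linux/spinlock_types.h.
 */
#include <stdint.h>

struct debug_lock_count {
	union {
		struct {
			uint32_t lock;     /* stand-in for arch_spinlock_t */
			uint32_t refcount; /* fills the 4-byte hole        */
		} parts;
		uint64_t combined;         /* the pair as one 64-bit word  */
	} u;
};

With this layout, the pair can be read or compare-and-swapped via
u.combined while the individual fields remain addressable.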


Regards,
Longman


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-14 Thread Benjamin Herrenschmidt
On Fri, 2013-06-14 at 14:17 -0400, Waiman Long wrote:
> 
> With some minor changes, the current patch can be modified to support
> debug locks on 32-bit systems. For 64-bit systems, we can apply a
> similar concept to debug locks by using cmpxchg_double. However,
> architectures without cmpxchg_double support will be out of luck, and
> we probably couldn't support the same feature in debug mode there; they
> would have to fall back to taking the lock.

That means only x86_64 and s390 would benefit from it ... I'm sure we can do 
better :-)

Cheers,
Ben.




Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-14 Thread Waiman Long

On 06/14/2013 11:37 AM, Linus Torvalds wrote:

On Fri, Jun 14, 2013 at 8:00 AM, Waiman Long  wrote:

On 06/12/2013 08:59 PM, Linus Torvalds wrote:

Ho humm.. interesting. I was talking about wanting to mix atomics and
spinlocks earlier in this thread due to space constraints, and it
strikes me that that would actually help this case a lot. Having the
dentry count mix d_lock and the count in one word would allow for
atomic ops like "increment if not locked", and we'd avoid this whole
race entirely..

Something like "low bit of count is the lock bit" would end up being
lovely for this case. Of course, that's not how our spinlocks work ..

Linus


I have created another patch to do exactly the "increment if not locked"
operation as suggested. It did help a lot. See the patch below for more
information. Any additional comment will be appreciated.

Hmm. This is interesting and proves the concept, and the numbers look
very promising.

The patch is not mergeable, though, since it clearly depends on the
spinlock/d_count pair fitting in a u64, which is normally true but not
in the case of debugging locks etc., so we'd need to generalize and fix
the whole concept of "refcount+lock".


With some minor changes, the current patch can be modified to support 
debug locks on 32-bit systems. For 64-bit systems, we can apply a 
similar concept to debug locks by using cmpxchg_double. However, 
architectures without cmpxchg_double support will be out of luck, and we 
probably couldn't support the same feature in debug mode there; they 
would have to fall back to taking the lock.


I was thinking about generalizing the fix, but one issue I was aware 
of is that the d_lock member of dentry has more than 300 references 
throughout the filesystem code. A general fix would require d_lock to 
be accessed in a different way, so it would be a pretty massive patch 
touching quite a lot of files, even though the changes would be pretty 
straightforward in most cases.



Generalizing it might be a good idea anyway, since there are other
cases of "atomic_dec_and_lock()" etc behaviours where we might want to
have these kinds of extended lock+count shenanigans.


The patch can certainly be generalized. I will see what I can do in a 
week or two.



I also do wonder if we could perhaps fit both in 32-bits, and just not
use the "real" spinlocks at all, but use a bitlock in the low (or
high) bit of the refcount. We do that in some other places - we'd
potentially lose lockdep etc, and we'd lose some of the other good
parts of spinlocks (fairness yadda yadda), but *if* we can reduce
contention enough that it works out, maybe it would be worth it.


As the dentry is such an important data structure for the filesystem 
layer, losing the fairness attribute and the ability to debug may be too 
high a price to pay. For other niche cases, such a combined data type 
can certainly be used.



So this doesn't look like 3.11 material, but the numbers certainly
make it look very promising, so with some more work on it ...

 Linus


What will be the deadline for a stable patch to be considered for 
merging into 3.11?


Regards,
Longman


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-14 Thread Linus Torvalds
On Fri, Jun 14, 2013 at 8:00 AM, Waiman Long  wrote:
> On 06/12/2013 08:59 PM, Linus Torvalds wrote:
>>
>> Ho humm.. interesting. I was talking about wanting to mix atomics and
>> spinlocks earlier in this thread due to space constraints, and it
>> strikes me that that would actually help this case a lot. Having the
>> dentry count mix d_lock and the count in one word would allow for
>> atomic ops like "increment if not locked", and we'd avoid this whole
>> race entirely..
>>
>> Something like "low bit of count is the lock bit" would end up being
>> lovely for this case. Of course, that's not how our spinlocks work ..
>>
>>Linus
>
>
> I have created another patch to do exactly the "increment if not locked"
> operation as suggested. It did help a lot. See the patch below for more
> information. Any additional comment will be appreciated.

Hmm. This is interesting and proves the concept, and the numbers look
very promising.

The patch is not mergeable, though, since it clearly depends on the
spinlock/d_count pair fitting in a u64, which is normally true but not
in the case of debugging locks etc., so we'd need to generalize and fix
the whole concept of "refcount+lock".

Generalizing it might be a good idea anyway, since there are other
cases of "atomic_dec_and_lock()" etc behaviours where we might want to
have these kinds of extended lock+count shenanigans.

I also do wonder if we could perhaps fit both in 32-bits, and just not
use the "real" spinlocks at all, but use a bitlock in the low (or
high) bit of the refcount. We do that in some other places - we'd
potentially lose lockdep etc, and we'd lose some of the other good
parts of spinlocks (fairness yadda yadda), but *if* we can reduce
contention enough that it works out, maybe it would be worth it.

So this doesn't look like 3.11 material, but the numbers certainly
make it look very promising, so with some more work on it ...

Linus
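
A minimal, self-contained illustration of the "bitlock in the low bit of
the refcount" idea above, using C11 atomics rather than the kernel's
primitives. As noted, it gives up fairness and lockdep coverage, and all
of the names here are invented for the sketch:

#include <stdatomic.h>
#include <stdbool.h>

/* Combined value: bit 0 is the lock bit, bits 1..31 hold the refcount. */
typedef atomic_uint lockcount_t;

#define LC_LOCK_BIT	1u
#define LC_COUNT_ONE	2u

/* Take a reference only if the lock bit is clear; otherwise report
 * failure so the caller can fall back to a locked slow path. */
static bool lockcount_get_unless_locked(lockcount_t *lc)
{
	unsigned int old = atomic_load(lc);

	while (!(old & LC_LOCK_BIT)) {
		if (atomic_compare_exchange_weak(lc, &old, old + LC_COUNT_ONE))
			return true;	/* count bumped, lock bit untouched */
	}
	return false;
}

/* Spin until the lock bit can be set -- no fairness, unlike a ticket lock. */
static void lockcount_lock(lockcount_t *lc)
{
	unsigned int old = atomic_load(lc);

	for (;;) {
		while (old & LC_LOCK_BIT)
			old = atomic_load(lc);
		if (atomic_compare_exchange_weak(lc, &old, old | LC_LOCK_BIT))
			return;
	}
}

static void lockcount_unlock(lockcount_t *lc)
{
	atomic_fetch_and(lc, ~LC_LOCK_BIT);
}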


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-14 Thread Waiman Long

On 06/12/2013 08:59 PM, Linus Torvalds wrote:

On Wed, Jun 12, 2013 at 5:49 PM, Al Viro  wrote:

On Wed, Jun 12, 2013 at 05:38:13PM -0700, Linus Torvalds wrote:

For the particular case of dget_parent() maybe dget_parent() should
just double-check the original dentry->d_parent pointer after getting
the refcount on it (and if the parent has changed, drop the refcount
again and go to the locked version). That might be a good idea anyway,
and should fix the possible race (which would be with another cpu
having to first rename the child to some other parent, and the
d_invalidate() the original parent)

Yes, but...  Then we'd need to dput() that sucker if we decide we shouldn't
have grabbed that reference, after all, which would make dget_parent()
potentially blocking.

Ho humm.. interesting. I was talking about wanting to mix atomics and
spinlocks earlier in this thread due to space constraints, and it
strikes me that that would actually help this case a lot. Having the
dentry count mix d_lock and the count in one word would allow for
atomic ops like "increment if not locked", and we'd avoid this whole
race entirely..

Something like "low bit of count is the lock bit" would end up being
lovely for this case. Of course, that's not how our spinlocks work ..

   Linus


I have created another patch to do exactly the "increment if not locked" 
operation as suggested. It did help a lot. See the patch below for more 
information. Any additional comment will be appreciated.


Regards,
Longman

---
The current code takes the dentry's d_lock lock whenever the d_count
reference count is being updated. In reality, nothing significant
happens until d_count goes to 0 in dput(), so it is not necessary to
take the lock if the reference count won't go to 0. On the other hand,
there are cases where d_count should not be updated, or is not expected
to be updated, while d_lock is held by other functions.

To allow lockless updates of d_count while d_lock is not held by
others, the 32-bit d_count and the 32-bit d_lock (when no debugging
code is turned on) can be combined into a single 64-bit word that is
updated atomically whenever the following conditions hold:

1. The lock is not taken, i.e. spin_can_lock() returns true; and
2. for an increment, the original d_count must be > 0; or
3. for a decrement, the original d_count must be > 1.

To maximize the chance of doing a lockless update, the new code calls
spin_unlock_wait() before trying the update.

The new code also attempts the lockless atomic update twice before
falling back to the old code path of taking the lock before doing the
update, because a fair amount of contention remains when only one
attempt is made.

The offsets of the d_count/d_lock pair are at bytes 72 and 88 for
32-bit and 64-bit systems respectively. In both cases, they are
8-byte aligned, and combining them into a single 8-byte word does not
introduce a hole that increases the size of the dentry structure.
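
A self-contained sketch of the increment side of the update rule
described above, using C11 atomics. For simplicity it treats a zero lock
word as "not taken", whereas the real patch checks the ticket lock with
spin_can_lock() and operates on the actual d_lock/d_count pair inside
struct dentry:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_MASK	0xffffffffull		/* low 32 bits: lock word  */
#define COUNT_SHIFT	32			/* high 32 bits: the count */
#define COUNT_ONE	(1ull << COUNT_SHIFT)

/* Try (twice, as described above) to bump the count without the lock. */
static bool count_inc_unless_locked(_Atomic uint64_t *lockcount)
{
	int attempt;

	for (attempt = 0; attempt < 2; attempt++) {
		uint64_t old = atomic_load(lockcount);

		/* Lock taken, or count is 0: the 0 -> 1 transition must
		 * go through the locked path. */
		if ((old & LOCK_MASK) != 0 || (old >> COUNT_SHIFT) == 0)
			continue;
		if (atomic_compare_exchange_strong(lockcount, &old,
						   old + COUNT_ONE))
			return true;
	}
	return false;	/* fall back: take d_lock and update under it */
}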

This patch has a particularly big impact on the short workload of the
AIM7 benchmark with a ramdisk filesystem. The table below shows the
performance improvement in JPM (jobs per minute) throughput due to
this patch on an 8-socket, 80-core x86-64 system with a 3.10-rc4
kernel in 1/2/4/8-node configurations, using numactl to restrict the
execution of the workload to certain nodes.

+-----------------+----------------+-----------------+----------+
|  Configuration  |    Mean JPM    |    Mean JPM     | % Change |
|                 | Rate w/o patch | Rate with patch |          |
+-----------------+----------------+-----------------+----------+
|                 |             User Range 10 - 100             |
+-----------------+----------------+-----------------+----------+
| 8 nodes, HT off |        1596798 |         4748981 | +197.4%  |
| 4 nodes, HT off |        1653817 |         4845590 | +193.0%  |
| 2 nodes, HT off |        3802258 |         3832017 |   +0.8%  |
| 1 node , HT off |        2403297 |         2386401 |   -0.7%  |
+-----------------+----------------+-----------------+----------+
|                 |            User Range 200 - 1000            |
+-----------------+----------------+-----------------+----------+
| 8 nodes, HT off |        1070992 |         6060457 | +465.9%  |
| 4 nodes, HT off |        1367668 |         6799978 | +397.2%  |
| 2 nodes, HT off |        4554370 |         4609893 |   +1.2%  |
| 1 node , HT off |        2534826 |         2526519 |   -0.3%  |
+-----------------+----------------+-----------------+----------+
|                 |           User Range 1100 - 2000            |
+-----------------+----------------+-----------------+----------+
| 8 nodes, HT off |        1061322 |         6435537 | +506.3%  |
| 4 nodes, HT off |        1365111 |         6589983 | +382.7%  |
| 2 nodes, HT off |        4583947 |         4648464 |   +1.4%  |
+-----------------+----------------+-----------------+----------+

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Linus Torvalds
On Wed, Jun 12, 2013 at 5:49 PM, Al Viro  wrote:
> On Wed, Jun 12, 2013 at 05:38:13PM -0700, Linus Torvalds wrote:
>>
>> For the particular case of dget_parent() maybe dget_parent() should
>> just double-check the original dentry->d_parent pointer after getting
>> the refcount on it (and if the parent has changed, drop the refcount
>> again and go to the locked version). That might be a good idea anyway,
>> and should fix the possible race (which would be with another cpu
>> having to first rename the child to some other parent, and the
>> d_invalidate() the original parent)
>
> Yes, but...  Then we'd need to dput() that sucker if we decide we shouldn't
> have grabbed that reference, after all, which would make dget_parent()
> potentially blocking.

Ho humm.. interesting. I was talking about wanting to mix atomics and
spinlocks earlier in this thread due to space constraints, and it
strikes me that that would actually help this case a lot. Having the
dentry count mix d_lock and the count in one word would allow for
atomic ops like "increment if not locked", and we'd avoid this whole
race entirely..

Something like "low bit of count is the lock bit" would end up being
lovely for this case. Of course, that's not how our spinlocks work ..

  Linus


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Al Viro
On Wed, Jun 12, 2013 at 05:38:13PM -0700, Linus Torvalds wrote:
> On Wed, Jun 12, 2013 at 5:20 PM, Al Viro  wrote:
> >
> > Actually, dget_parent() change might be broken; the thing is, the 
> > assumptions
> > are more subtle than "zero -> non-zero only happens under ->d_lock".  It's
> > actually "new references are grabbed by somebody who's either already 
> > holding
> > one on the same dentry _or_ holding ->d_lock".  That's what d_invalidate()
> > check for ->d_count needs for correctness - caller holds one reference, so
> > comparing ->d_count with 2 under ->d_lock means checking that there's no 
> > other
> > holders _and_ there won't be any new ones appearing.
> 
> For the particular case of dget_parent() maybe dget_parent() should
> just double-check the original dentry->d_parent pointer after getting
> the refcount on it (and if the parent has changed, drop the refcount
> again and go to the locked version). That might be a good idea anyway,
> and should fix the possible race (which would be with another cpu
> having to first rename the child to some other parent, and the
> d_invalidate() the original parent)

Yes, but...  Then we'd need to dput() that sucker if we decide we shouldn't
have grabbed that reference, after all, which would make dget_parent()
potentially blocking.

> That said, the case we'd really want to fix isn't dget_parent(), but
> just the normal RCU lookup finishing touches (the __d_rcu_to_refcount()
> case you already mentioned). *If* we could do that without ever
> taking the d_lock on the target, that would be lovely. But it would
> seem to have the exact same issue. Although maybe the
> dentry_rcuwalk_barrier() thing ends up solving it (ie if we had a
> lookup at a bad time, we know it will fail the sequence count test, so
> we're ok).

Maybe, but that would require dentry_rcuwalk_barrier() between any such
check and corresponding grabbing of ->d_lock done for it, so it's not
just d_invalidate().

> Subtle, subtle.

Yes ;-/  The current variant is using ->d_lock as a brute-force mechanism
for avoiding all that fun, and I'm not sure that getting rid of it would
buy us enough to make it worth the trouble.  I'm absolutely sure that if
we go for that, we _MUST_ document the entire scheme as explicitly as
possible, or we'll end up with the shitload of recurring bugs in that
area.  Preferably with the formal proof of correctness spelled out somewhere...


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Linus Torvalds
On Wed, Jun 12, 2013 at 5:20 PM, Al Viro  wrote:
>
> Actually, dget_parent() change might be broken; the thing is, the assumptions
> are more subtle than "zero -> non-zero only happens under ->d_lock".  It's
> actually "new references are grabbed by somebody who's either already holding
> one on the same dentry _or_ holding ->d_lock".  That's what d_invalidate()
> check for ->d_count needs for correctness - caller holds one reference, so
> comparing ->d_count with 2 under ->d_lock means checking that there's no other
> holders _and_ there won't be any new ones appearing.

For the particular case of dget_parent() maybe dget_parent() should
just double-check the original dentry->d_parent pointer after getting
the refcount on it (and if the parent has changed, drop the refcount
again and go to the locked version). That might be a good idea anyway,
and should fix the possible race (which would be with another cpu
having to first rename the child to some other parent, and the
d_invalidate() the original parent)

That said, the case we'd really want to fix isn't dget_parent(), but
just the normal RCU lookup finishing touches (the __d_rcu_to_refcount()
case you already mentioned). *If* we could do that without ever
taking the d_lock on the target, that would be lovely. But it would
seem to have the exact same issue. Although maybe the
dentry_rcuwalk_barrier() thing ends up solving it (ie if we had a
lookup at a bad time, we know it will fail the sequence count test, so
we're ok).

Subtle, subtle.

   Linus


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Al Viro
On Wed, Jun 12, 2013 at 05:01:19PM -0700, Linus Torvalds wrote:
> I'd actually suggest we do *not* remove any existing d_lock usage
> outside of the particular special cases we want to optimize, which at
> least from Davidlohr's profile is just dput() (which has shown up a
> lot before) and dget_parent() (which I'm not sure why it happens so
> much on his load, but it really seems trivially safe to optimistically
> do under just the RCU lock).

Actually, dget_parent() change might be broken; the thing is, the assumptions
are more subtle than "zero -> non-zero only happens under ->d_lock".  It's
actually "new references are grabbed by somebody who's either already holding
one on the same dentry _or_ holding ->d_lock".  That's what d_invalidate()
check for ->d_count needs for correctness - caller holds one reference, so
comparing ->d_count with 2 under ->d_lock means checking that there's no other
holders _and_ there won't be any new ones appearing.

Consider the following situation:
X is dentry of a/b
Y is dentry of a/b/c
Z is dentry of d/e

A holds a reference to Y and enters dget_parent(Y)
B holds a reference to X and enters d_invalidate(X)
A picks the value of Y->d_parent (== X)
C moves Y to Z
B grabs ->d_lock on X
B checks X->d_count; it's 1, we deduce that no other references exist or
are going to appear
A does atomic_inc_not_zero(&X->d_count).  And since it's not zero (it's 1,
actually), we've just grabbed an extra reference on X that was not going
to appear according to B...

> That said, I do wonder if we could do something like
> "atomic_inc_not_zero()" on the d_count, and only if it is zero (which
> won't be horribly unusual, since for leaf dentries that nobody else is
> using) we'd do the whole locking sequence.

Same correctness issue as above, I'm afraid...

> End result: I think it would be interesting to try this all out, and
> it could be a noticeable win under some cases, but it *definitely*
> needs a lot of timing and testing to see which ways it goes..

*nod*

What's more, we need the underlying assumptions documented very clearly for
any such change; it's _not_ as simple as "protect transitions from zero to
non-zero and we are done" ;-/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Linus Torvalds
On Wed, Jun 12, 2013 at 4:32 PM, Al Viro  wrote:
> On Wed, Jun 12, 2013 at 01:26:25PM -0700, Linus Torvalds wrote:
>>
>> However, one optimization missing from your patch is obvious in the
>> profile. "dget_parent()" also needs to be optimized - you still have
>> that as 99% of the spin-lock case. I think we could do something like
>>
>>rcu_read_lock();
>>parent = ACCESS_ONCE(dentry->d_parent);
>>if (atomic_inc_nonzero(&parent->d_count))
>>   return parent;
>>.. get d_lock and do it the slow way ...
>>rcu_read_unlock();
>>
>> to locklessly get the parent pointer. We know "parent" isn't going
>> away (dentries are rcu-free'd and we hold the rcu read lock), and I
>> think that we can optimistically take *any* parent dentry that
>> happened to be valid at one point. As long as the refcount didn't go
>> down to zero. Al?
>
> What will you do with __d_rcu_to_refcount()?  Any such scheme has to
> hold d_lock from zero->non-zero d_count changes, or atomic_dec_and_lock
> in dput() won't help at all.

I'd actually suggest we do *not* remove any existing d_lock usage
outside of the particular special cases we want to optimize, which at
least from Davidlohr's profile is just dput() (which has shown up a
lot before) and dget_parent() (which I'm not sure why it happens so
much on his load, but it really seems trivially safe to optimistically
do under just the RCU lock).

>  As it is, both complete_walk() and unlazy_walk()
> are grabbing ->d_lock on the dentry we'd reached, so they can call that
> sucker.  And that'll give you ->d_lock contention when a bunch of threads
> are hitting the same file; I don't see how atomics would avoid that
> one...

I'd love to get rid of complete_walk() using the dcache lock too, but
if we really can't get rid of it, I won't cry.

That said, I do wonder if we could do something like
"atomic_inc_not_zero()" on the d_count, and only if it is zero (which
won't be horribly unusual, since for leaf dentries that nobody else is
using) we'd do the whole locking sequence.

But my first reaction is to not even bother until it shows up on some
profile. Of course, maybe it immediately does.

There's a real downside to making d_count an "atomic_t", and there
might be loads where it actually bites us. But even in the absense of
contention, atomics tend to be sufficiently faster than spinlocks that
even if we end up adding two or even three atomics for each d_lock
lock we get rid of, we should be ok even for single-thread. On the
contention case, we'll obviously win almost regardless of how many
atomics we add.

Of course, that assumes we get rid of any locking at all for the
normal case. with dput() having to take the lock when the refcount
goes to zero, and most single-threaded file opens using the final path
component with a dentry that isn't otherwise used, doing an atomic
d_count might hurt the single-thread case without getting any spinlock
advantage at all. dget_parent() would be the only case that helps even
there, and that should normally only happen for "..", I think.

So who knows.. Are we willing to take a hit on the single-thread case
(*if* that is the case) if it helps scalability a lot? If it was a
common scalability issue, sure. But just for AIM7 looking up
/etc/passwd? Maybe that's not a good idea.

Of course, for the mostly non-contended case, it's quite possible that
the extra atomic isn't even noticeable. As long as the dentry is dirty
in the cache (which it would be, normally), the atomic cost is just in
the 10-20 cycles.

End result: I think it would be interesting to try this all out, and
it could be a noticeable win under some cases, but it *definitely*
needs a lot of timing and testing to see which ways it goes..

Linus


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Al Viro
On Wed, Jun 12, 2013 at 01:26:25PM -0700, Linus Torvalds wrote:

> For similar reasons, I think you need to still maintain the d_lock in
> d_prune_aliases etc. That's a slow-path, so the fact that we add an
> atomic sequence there doesn't much matter.
> 
> However, one optimization missing from your patch is obvious in the
> profile. "dget_parent()" also needs to be optimized - you still have
> that as 99% of the spin-lock case. I think we could do something like
> 
>rcu_read_lock();
>parent = ACCESS_ONCE(dentry->d_parent);
>if (atomic_inc_nonzero(&parent->d_count))
>   return parent;
>.. get d_lock and do it the slow way ...
>rcu_read_unlock();
> 
> to locklessly get the parent pointer. We know "parent" isn't going
> away (dentries are rcu-free'd and we hold the rcu read lock), and I
> think that we can optimistically take *any* parent dentry that
> happened to be valid at one point. As long as the refcount didn't go
> down to zero. Al?

What will you do with __d_rcu_to_refcount()?  Any such scheme has to
hold d_lock from zero->non-zero d_count changes, or atomic_dec_and_lock
in dput() won't help at all.  As it is, both complete_walk() and unlazy_walk()
are grabbing ->d_lock on the dentry we'd reached, so they can call that
sucker.  And that'll give you ->d_lock contention when a bunch of threads
are hitting the same file; I don't see how atomics would avoid that
one...


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Raymond Jennings
On Wed, 2013-06-12 at 13:26 -0700, Linus Torvalds wrote:
> On Wed, Jun 12, 2013 at 1:03 PM, Davidlohr Bueso  
> wrote:
> >
> > According to him:
> >
> > "the short workload calls security functions like getpwnam(),
> > getpwuid(), getgrgid() a couple of times. These functions open
> > the /etc/passwd or /etc/group files, read their content and close the
> > files.
> 
> Ahh, ok. So yeah, it's multiple threads all hitting the same file

If that's the case and it's a bunch of reads, shouldn't they act
concurrently anyway?

I mean it's not like dentries are being changed or added or removed in
this case.

> I guess that /etc/passwd case is historically interesting, but I'm not
> sure we really want to care too deeply..
> 
> > I did a quick attempt at this (patch attached).
> 
> Yeah, that's wrong, although it probably approximates the dget() case
> (but incorrectly).
> 
> One of the points behind using an atomic d_count is that then dput() should do
> 
>if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_count))
>   return;
> 
> at the very top of the function. It can avoid taking the lock entirely
> if the count doesn't go down to zero, which would be a common case if
> you have lots of users opening the same file. While still protecting
> d_count from ever going to zero while the lock is held.
> 
> Your
> 
> +   if (atomic_read(&dentry->d_count) > 1) {
> +   atomic_dec(&dentry->d_count);
> +   return;
> +   }
> +   spin_lock(&dentry->d_lock);
> 
> pattern is fundamentally racy, but it's what "atomic_dec_and_lock()"
> should do race-free.
> 
> For similar reasons, I think you need to still maintain the d_lock in
> d_prune_aliases etc. That's a slow-path, so the fact that we add an
> atomic sequence there doesn't much matter.
> 
> However, one optimization missing from your patch is obvious in the
> profile. "dget_parent()" also needs to be optimized - you still have
> that as 99% of the spin-lock case. I think we could do something like
> 
>rcu_read_lock();
>parent = ACCESS_ONCE(dentry->d_parent);
>if (atomic_inc_nonzero(&parent->d_count))
>   return parent;
>.. get d_lock and do it the slow way ...
>rcu_read_unlock();
> 
> to locklessly get the parent pointer. We know "parent" isn't going
> away (dentries are rcu-free'd and we hold the rcu read lock), and I
> think that we can optimistically take *any* parent dentry that
> happened to be valid at one point. As long as the refcount didn't go
> down to zero. Al?
> 
> With dput and dget_parent() both being lockless for the common case,
> you might get rid of the d_lock contention entirely for that load. I
> dunno. And I should really think more about that dget_parent() thing a
> bit more, but I cannot imagine how it could not be right (because even
> with the current d_lock model, the lock is gotten *within*
> dget_parent(), so the caller can never know if it gets a new or an old
> parent, so there is no higher-level serialization going on - and we
> might as well return *either* the new or the old as such).
> 
> I really want Al to double-check me if we decide to try going down
> this hole. But the above two fixes to your patch should at least
> approximate the d_lock changes, even if I'd have to look more closely
> at the other details of your patch..
> 
> Linus




Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Davidlohr Bueso
On Wed, 2013-06-12 at 13:26 -0700, Linus Torvalds wrote:
> On Wed, Jun 12, 2013 at 1:03 PM, Davidlohr Bueso  
> wrote:
> >
> > According to him:
> >
> > "the short workload calls security functions like getpwnam(),
> > getpwuid(), getgrgid() a couple of times. These functions open
> > the /etc/passwd or /etc/group files, read their content and close the
> > files.
> 
> Ahh, ok. So yeah, it's multiple threads all hitting the same file.
> 
> I guess that /etc/passwd case is historically interesting, but I'm not
> sure we really want to care too deeply..
> 
> > I did a quick attempt at this (patch attached).
> 
> Yeah, that's wrong, although it probably approximates the dget() case
> (but incorrectly).

Indeed, it was only a proof of concept.

> One of the points behind using an atomic d_count is that then dput() should do
> 
>if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_count))
>   return;

noted.

> at the very top of the function. It can avoid taking the lock entirely
> if the count doesn't go down to zero, which would be a common case if
> you have lots of users opening the same file. While still protecting
> d_count from ever going to zero while the lock is held.
> 
> Your
> 
> +   if (atomic_read(&dentry->d_count) > 1) {
> +   atomic_dec(&dentry->d_count);
> +   return;
> +   }
> +   spin_lock(&dentry->d_lock);
> 
> pattern is fundamentally racy, but it's what "atomic_dec_and_lock()"
> should do race-free.
> For similar reasons, I think you need to still maintain the d_lock in
> d_prune_aliases etc. That's a slow-path, so the fact that we add an
> atomic sequence there doesn't much matter.
> 
> However, one optimization missing from your patch is obvious in the
> profile. "dget_parent()" also needs to be optimized - you still have
> that as 99% of the spin-lock case. I think we could do something like
> 
>rcu_read_lock();
>parent = ACCESS_ONCE(dentry->d_parent);
>if (atomic_inc_nonzero(&parent->d_count))
>   return parent;
>.. get d_lock and do it the slow way ...
>rcu_read_unlock();
> 
> to locklessly get the parent pointer. We know "parent" isn't going
> away (dentries are rcu-free'd and we hold the rcu read lock), and I
> think that we can optimistically take *any* parent dentry that
> happened to be valid at one point. As long as the refcount didn't go
> down to zero. Al?
> 
> With dput and dget_parent() both being lockless for the common case,
> you might get rid of the d_lock contention entirely for that load. I
> dunno. And I should really think more about that dget_parent() thing a
> bit more, but I cannot imagine how it could not be right (because even
> with the current d_lock model, the lock is gotten *within*
> dget_parent(), so the caller can never know if it gets a new or an old
> parent, so there is no higher-level serialization going on - and we
> might as well return *either* the new or the old as such).
> 
> I really want Al to double-check me if we decide to try going down
> this hole. But the above two fixes to your patch should at least
> approximate the d_lock changes, even if I'd have to look more closely
> at the other details of your patch..

Ok, I'll try to rerun and send a more conscious patch. Thanks for the
tips.

Davidlohr




Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Linus Torvalds
On Wed, Jun 12, 2013 at 1:03 PM, Davidlohr Bueso  wrote:
>
> Waiman's dcache patchset was actually an attempt to address these exact
> issues: http://lkml.org/lkml/2013/5/22/716

Ok, looking at that patch-set, I think it has the same race with not
atomically getting the d_lock spinlock and d_count going down to zero
in dput(). And Waiman clearly didn't know about
"atomic_inc_not_zero()" or "atomic_dec_and_lock()" that are designed
for exactly the "increment if already nonzero" and "decrement without
taking the lock if we're not going down to zero" cases.

As outlined, I'm also not at all sure that the whole seqrw-lock thing
that Waiman did is really necessary - I think the optimistic
dget_parent() might be sufficient.

 Linus


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Linus Torvalds
On Wed, Jun 12, 2013 at 1:03 PM, Davidlohr Bueso  wrote:
>
> According to him:
>
> "the short workload calls security functions like getpwnam(),
> getpwuid(), getgrgid() a couple of times. These functions open
> the /etc/passwd or /etc/group files, read their content and close the
> files.

Ahh, ok. So yeah, it's multiple threads all hitting the same file.

I guess that /etc/passwd case is historically interesting, but I'm not
sure we really want to care too deeply..

> I did a quick attempt at this (patch attached).

Yeah, that's wrong, although it probably approximates the dget() case
(but incorrectly).

One of the points behind using an atomic d_count is that then dput() should do

    if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock))
  return;

at the very top of the function. It can avoid taking the lock entirely
if the count doesn't go down to zero, which would be a common case if
you have lots of users opening the same file. While still protecting
d_count from ever going to zero while the lock is held.

Your

+   if (atomic_read(&dentry->d_count) > 1) {
+   atomic_dec(&dentry->d_count);
+   return;
+   }
+   spin_lock(&dentry->d_lock);

pattern is fundamentally racy, but it's what "atomic_dec_and_lock()"
should do race-free.

For similar reasons, I think you need to still maintain the d_lock in
d_prune_aliases etc. That's a slow-path, so the fact that we add an
atomic sequence there doesn't much matter.

However, one optimization missing from your patch is obvious in the
profile. "dget_parent()" also needs to be optimized - you still have
that as 99% of the spin-lock case. I think we could do something like

   rcu_read_lock();
   parent = ACCESS_ONCE(dentry->d_parent);
    if (atomic_inc_not_zero(&parent->d_count))
  return parent;
   .. get d_lock and do it the slow way ...
   rcu_read_unlock();

to locklessly get the parent pointer. We know "parent" isn't going
away (dentries are rcu-free'd and we hold the rcu read lock), and I
think that we can optimistically take *any* parent dentry that
happened to be valid at one point. As long as the refcount didn't go
down to zero. Al?

With dput and dget_parent() both being lockless for the common case,
you might get rid of the d_lock contention entirely for that load. I
dunno. And I should really think more about that dget_parent() thing a
bit more, but I cannot imagine how it could not be right (because even
with the current d_lock model, the lock is gotten *within*
dget_parent(), so the caller can never know if it gets a new or an old
parent, so there is no higher-level serialization going on - and we
might as well return *either* the new or the old as such).

I really want Al to double-check me if we decide to try going down
this hole. But the above two fixes to your patch should at least
approximate the d_lock changes, even if I'd have to look more closely
at the other details of your patch..

Linus
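
Putting the two fragments above together, a hedged sketch of what the
fast paths might look like if d_count were converted to an atomic_t.
This is a kernel-flavored pseudo-implementation, not the actual
fs/dcache.c code: it omits the real slow paths, and the correctness
concerns about the optimistic dget_parent() raised elsewhere in this
thread still apply:

/*
 * Hedged sketch only: assumes dentry->d_count has been converted to an
 * atomic_t and leaves out the real slow paths (unhashing, dentry_kill(),
 * retry loops, and so on).
 */
#include <linux/dcache.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>

static void dput_sketch(struct dentry *dentry)
{
	if (!dentry)
		return;

	/*
	 * Fast path: drop a reference without touching d_lock unless the
	 * count hits zero; when it does, d_lock is held on return.
	 */
	if (!atomic_dec_and_lock(&dentry->d_count, &dentry->d_lock))
		return;

	/* Count reached zero with d_lock held: do the existing dput() tail. */
	spin_unlock(&dentry->d_lock);
}

static struct dentry *dget_parent_sketch(struct dentry *dentry)
{
	struct dentry *parent;

	rcu_read_lock();
	parent = ACCESS_ONCE(dentry->d_parent);
	if (atomic_inc_not_zero(&parent->d_count)) {
		/*
		 * Elsewhere in the thread it is suggested to re-check
		 * dentry->d_parent here and drop the reference (going to
		 * the slow path) if the parent changed under us.
		 */
		rcu_read_unlock();
		return parent;
	}
	rcu_read_unlock();

	/* Fall back to the existing locked dget_parent() slow path. */
	return dget_parent(dentry);
}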


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Davidlohr Bueso
On Wed, 2013-06-12 at 11:15 -0700, Linus Torvalds wrote:
> On Wed, Jun 12, 2013 at 10:50 AM, Davidlohr Bueso
>  wrote:
> >
> > * short: is the big winner for this patch, +69% throughput improvement
> > with 100-2000 users. This makes a lot of sense since the workload spends
> > a ridiculous amount of time trying to acquire the d_lock:
> >
> >   84.86%1569902reaim  [kernel.kallsyms][k] 
> > _raw_spin_lock
> >   |
> >   --- _raw_spin_lock
> >  |
> >  |--49.96%-- dget_parent
> >  |  __fsnotify_parent
> >  |--49.71%-- dput
> 
> Ugh. Do you have any idea what the heck that thing actually does?

Waiman's dcache patchset was actually an attempt to address these exact
issues: http://lkml.org/lkml/2013/5/22/716

According to him:

"the short workload calls security functions like getpwnam(),
getpwuid(), getgrgid() a couple of times. These functions open
the /etc/passwd or /etc/group files, read their content and close the
files. 
It is the intensive open/read/close sequence from multiple threads that 
is causing 80%+ contention in the d_lock on a system with large number 
of cores. The MIT's MOSBench paper also outlined dentry reference 
counting as a scalability roadblock for its Apache and Exim tests."

> 
> Normally, we shouldn't see lots of dget contention, since the dcache
> these days does everything but the last path component locklessly.
> 
> But there's a few exceptions, like symlinks (act as "last component"
> in the middle). And obviously, if some crazy threaded program opens
> the *same* file concurrently over and over again, then that "last
> component" will hammer on the dentry lock of that particular path. But
> that "open the same file concurrently" seems totally unrealistic -
> although maybe that's what AIM does..
> 
> Anybody know the AIM subtests?
> 
> Also, we *may* actually be able to optimize this by making
> dentry->d_count atomic, which will allow us to often do dget_parent
> and put() without taking the dcache lock at all. That's what it used
> to be, but the RCU patches actually made it be protected by the
> d_lock. It made sense at the time, as a step in the sequence, and many
> of the dentry d_count accesses are under the lock, but now that the
> remaining hot-paths are dget_parent and dput and many of the dentry
> d_count increments are gone from the hot-paths, we might want to
> re-visit that decision.  It could go either way.

I did a quick attempt at this (patch attached). For the short workload,
we now have:

   76.90% 928688reaim  [kernel.kallsyms][k] 
_raw_spin_lock
  |
  --- _raw_spin_lock
 |  
 |--99.69%-- dget_parent
 |  __fsnotify_parent
 |  |  
 |  |--20.23%-- fsnotify_access
 |  |  vfs_read
 |  |--20.13%-- __fput
 |  |  fput
 |  |  task_work_run
 |  |--20.07%-- security_file_permission
 |  |  rw_verify_area
 |  |  vfs_read
 |  |--19.97%-- do_sys_open
 |  |  SyS_open
 |   --19.60%-- security_file_open
 | do_dentry_open

Still 76%!!! Throughput wise we do have a very nice boost when compared
to the vanilla kernel:

10-100 users: +47%
100-1000 users: +76%
1000-2000 users: +76%

Thanks,
Davidlohr

diff --git a/fs/autofs4/expire.c b/fs/autofs4/expire.c
index 13ddec9..0127d69 100644
--- a/fs/autofs4/expire.c
+++ b/fs/autofs4/expire.c
@@ -109,7 +109,7 @@ cont:
 
 	spin_lock_nested(&q->d_lock, DENTRY_D_LOCK_NESTED);
 	/* Already gone or negative dentry (under construction) - try next */
-	if (q->d_count == 0 || !simple_positive(q)) {
+	if (atomic_read(&q->d_count) == 0 || !simple_positive(q)) {
 		spin_unlock(&q->d_lock);
 		next = q->d_u.d_child.next;
 		goto cont;
@@ -267,7 +267,7 @@ static int autofs4_tree_busy(struct vfsmount *mnt,
 			else
 ino_count++;
 
-			if (p->d_count > ino_count) {
+			if (atomic_read(&p->d_count) > ino_count) {
 top_ino->last_used = jiffies;
 dput(p);
 return 1;
@@ -409,7 +409,7 @@ struct dentry *autofs4_expire_indirect(struct super_block *sb,
 		if (!exp_leaves) {
 			/* Path walk currently on this dentry? */
 			ino_count = atomic_read(&ino->count) + 1;
-			if (dentry->d_count > ino_count)
+			if (atomic_read(&dentry->d_count) > ino_count)
 goto next;
 
 			if (!autofs4_tree_busy(mnt, dentry, timeout, do_no

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Steven Rostedt
On Wed, 2013-06-12 at 10:50 -0700, Davidlohr Bueso wrote:
> On Tue, 2013-06-11 at 14:10 -0400, Steven Rostedt wrote:
> > Perhaps short work loads have a cold cache, and the impact on cache is
> > not as drastic?
> > 
> > It would be interesting to see what perf reports on these runs.
> 
> After running the aim7 workloads on Paul's v3 patch (same 80 core, 8
> socket box - HT off) the results are quite similar to the v1. One
> difference is that the five_sec workload benefited with +15% throughput
> after 500 users.

Thanks,

> 
> Taking a further look at each workload:
> 
> * five_sec: spends a large amount of time in the newish mcs style lock
> at the spin on owner for the inode->i_mutex:
> 
>   24.13% 315655reaim  [kernel.kallsyms][k] 
> mspin_lock 
>   |
>   --- mspin_lock
>  |  
>  |--99.76%-- __mutex_lock_killable_slowpath
>  |  mutex_lock_killable
>  |  vfs_readdir
>  |  SyS_getdents
>  |  system_call_fastpath
>  |  __getdents64
> 
> With this patch:
> 23.56% 310531reaim  [kernel.kallsyms][k] 
> mspin_lock   
>   |
>   --- mspin_lock
>  |  
>  |--99.78%-- __mutex_lock_killable_slowpath
>  |  mutex_lock_killable
>  |  vfs_readdir
>  |  SyS_getdents
>  |  system_call
>  |  __getdents64

Note, the mspin_lock is not interesting, as it's not affected by this
patch.

>  
> * custom: Got a -33% throughput regression with this patch with 10-100
> users and -46% with 100 users and up. It spends most kernel space time
> dealing trying to take the inode->i_mutex and the ext4 ->s_orphan_lock
> (note that all runs are performed on ramdisks with ext4):
> 
>3.12% 137131reaim  [kernel.kallsyms][k] 
> mspin_lock 
>   |
>   --- mspin_lock
>  |  
>  |--82.98%-- __mutex_lock_killable_slowpath
>  |  mutex_lock_killable
>  |  vfs_readdir
>  |  SyS_getdents
>  |  system_call_fastpath
>  |  __getdents64
>  |  
>  |--16.97%-- __mutex_lock_slowpath
>  |  mutex_lock
>  |  |  
>  |  |--47.65%-- ext4_orphan_del
>  |  |--45.01%-- ext4_orphan_add
> 
> With this patch:
> 2.14% 109982reaim  [kernel.kallsyms][k] 
> mspin_lock   

Less time in the mspin_lock as it's probably now in the real spin lock
somewhere.

>   |
>   --- mspin_lock
>  |  
>  |--68.67%-- __mutex_lock_killable_slowpath
>  |  mutex_lock_killable
>  |  vfs_readdir
>  |  SyS_getdents
>  |  system_call
>  |  __getdents64
>  |  
>  |--31.24%-- __mutex_lock_slowpath
>  |  mutex_lock
>  |  |  
>  |  |--40.36%-- ext4_orphan_del
> 
> 
> * short: is the big winner for this patch, +69% throughput improvement
> with 100-2000 users. This makes a lot of sense since the workload spends
> a ridiculous amount of time trying to acquire the d_lock:
> 
>   84.86%1569902reaim  [kernel.kallsyms][k] 
> _raw_spin_lock
>   |
>   --- _raw_spin_lock
>  |
>  |--49.96%-- dget_parent
>  |  __fsnotify_parent
>  |--49.71%-- dput
>  |  |
>  |  |--99.98%-- __fsnotify_parent
> 
> With this patch:
>70.65% 467422   reaim  [kernel.kallsyms] [k] tkt_q_do_spin
>  |
>  --- tkt_q_do_spin
> |
> |--100.00%-- tkt_spin_pass
>  

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Linus Torvalds
On Wed, Jun 12, 2013 at 10:50 AM, Davidlohr Bueso
 wrote:
>
> * short: is the big winner for this patch, +69% throughput improvement
> with 100-2000 users. This makes a lot of sense since the workload spends
> a ridiculous amount of time trying to acquire the d_lock:
>
>   84.86%1569902reaim  [kernel.kallsyms][k] 
> _raw_spin_lock
>   |
>   --- _raw_spin_lock
>  |
>  |--49.96%-- dget_parent
>  |  __fsnotify_parent
>  |--49.71%-- dput

Ugh. Do you have any idea what the heck that thing actually does?

Normally, we shouldn't see lots of dget contention, since the dcache
these days does everything but the last path component locklessly.

But there's a few exceptions, like symlinks (act as "last component"
in the middle). And obviously, if some crazy threaded program opens
the *same* file concurrently over and over again, then that "last
component" will hammer on the dentry lock of that particular path. But
that "open the same file concurrently" seems totally unrealistic -
although maybe that's what AIM does..

Anybody know the AIM subtests?

Also, we *may* actually be able to optimize this by making
dentry->d_count atomic, which will allow us to often do dget_parent
and put() without taking the dcache lock at all. That's what it used
to be, but the RCU patches actually made it be protected by the
d_lock. It made sense at the time, as a step in the sequence, and many
of the dentry d_count accesses are under the lock, but now that the
remaining hot-paths are dget_parent and dput and many of the dentry
d_count increments are gone from the hot-paths, we might want to
re-visit that decision.  It could go either way.

Al, comments?

 Linus


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Davidlohr Bueso
On Tue, 2013-06-11 at 14:10 -0400, Steven Rostedt wrote:
> Perhaps short work loads have a cold cache, and the impact on cache is
> not as drastic?
> 
> It would be interesting to see what perf reports on these runs.

After running the aim7 workloads on Paul's v3 patch (same 80 core, 8
socket box - HT off) the results are quite similar to the v1. One
difference is that the five_sec workload benefited with +15% throughput
after 500 users.

Taking a further look at each workload:

* five_sec: spends a large amount of time in the newish mcs style lock
at the spin on owner for the inode->i_mutex:

  24.13% 315655reaim  [kernel.kallsyms][k] 
mspin_lock 
  |
  --- mspin_lock
 |  
 |--99.76%-- __mutex_lock_killable_slowpath
 |  mutex_lock_killable
 |  vfs_readdir
 |  SyS_getdents
 |  system_call_fastpath
 |  __getdents64

With this patch:
23.56% 310531reaim  [kernel.kallsyms][k] 
mspin_lock   
  |
  --- mspin_lock
 |  
 |--99.78%-- __mutex_lock_killable_slowpath
 |  mutex_lock_killable
 |  vfs_readdir
 |  SyS_getdents
 |  system_call
 |  __getdents64
 
* custom: Got a -33% throughput regression with this patch with 10-100
users and -46% with 100 users and up. It spends most kernel space time
dealing trying to take the inode->i_mutex and the ext4 ->s_orphan_lock
(note that all runs are performed on ramdisks with ext4):

   3.12% 137131reaim  [kernel.kallsyms][k] 
mspin_lock 
  |
  --- mspin_lock
 |  
 |--82.98%-- __mutex_lock_killable_slowpath
 |  mutex_lock_killable
 |  vfs_readdir
 |  SyS_getdents
 |  system_call_fastpath
 |  __getdents64
 |  
 |--16.97%-- __mutex_lock_slowpath
 |  mutex_lock
 |  |  
 |  |--47.65%-- ext4_orphan_del
 |  |--45.01%-- ext4_orphan_add

With this patch:
2.14% 109982reaim  [kernel.kallsyms][k] 
mspin_lock   
  |
  --- mspin_lock
 |  
 |--68.67%-- __mutex_lock_killable_slowpath
 |  mutex_lock_killable
 |  vfs_readdir
 |  SyS_getdents
 |  system_call
 |  __getdents64
 |  
 |--31.24%-- __mutex_lock_slowpath
 |  mutex_lock
 |  |  
 |  |--40.36%-- ext4_orphan_del


* short: is the big winner for this patch, +69% throughput improvement
with 100-2000 users. This makes a lot of sense since the workload spends
a ridiculous amount of time trying to acquire the d_lock:

  84.86%1569902reaim  [kernel.kallsyms][k] 
_raw_spin_lock
  |
  --- _raw_spin_lock
 |
 |--49.96%-- dget_parent
 |  __fsnotify_parent
 |--49.71%-- dput
 |  |
 |  |--99.98%-- __fsnotify_parent

With this patch:
   70.65% 467422   reaim  [kernel.kallsyms] [k] tkt_q_do_spin
 |
 --- tkt_q_do_spin
|
|--100.00%-- tkt_spin_pass
|  |
|  |--100.00%-- _raw_spin_lock
|  |  |
|  |  |--50.07%-- dget_parent
|  |  |  __fsnotify_parent
|  |  |--49.93%-- dput
|  |  |  __fsnotify_parent
 

* disk: Thi

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Paul E. McKenney
On Wed, Jun 12, 2013 at 10:15:49PM +0800, Lai Jiangshan wrote:
> Hi, Paul
> 
> I have some question about smp_mb().(searching smp_mb() can find all
> my question)
> 
> Thanks,
> Lai
> 
> On Wed, Jun 12, 2013 at 3:49 AM, Paul E. McKenney
>  wrote:
> > On Tue, Jun 11, 2013 at 02:41:59PM -0400, Waiman Long wrote:
> >> On 06/11/2013 12:36 PM, Paul E. McKenney wrote:
> >> >
> >> >>I am a bit concern about the size of the head queue table itself.
> >> >>RHEL6, for example, had defined CONFIG_NR_CPUS to be 4096, which means
> >> >>a table size of 256. Maybe it is better to dynamically allocate the
> >> >>table at init time depending on the actual number of CPUs in the
> >> >>system.
> >> >But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
> >> >is way down in the noise.  Systems that care about that small an amount
> >> >of memory probably have a small enough number of CPUs that they can just
> >> >turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?
> >>
> >> My concern is more about the latency on the table scan than the
> >> actual memory that was used.
> >>
> >> >>>+/*
> >> >>>+ * Return a pointer to the queue header associated with the specified 
> >> >>>lock,
> >> >>>+ * or return NULL if there is no queue for the lock or if the lock's 
> >> >>>queue
> >> >>>+ * is in transition.
> >> >>>+ */
> >> >>>+static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
> >> >>>+{
> >> >>>+  int i;
> >> >>>+  int start;
> >> >>>+
> >> >>>+  start = i = tkt_q_hash(asp);
> >> >>>+  do
> >> >>>+  if (tkt_q_heads[i].ref == asp)
> >> >>>+  return&tkt_q_heads[i];
> >> >>>+  while ((i = tkt_q_next_slot(i)) != start);
> >> >>>+  return NULL;
> >> >>>+}
> >> >>With a table size of 256 and you have to scan the whole table to
> >> >>find the right head queue. This can be a significant overhead. I
> >> >>will suggest setting a limiting of how many entries it scans before
> >> >>it aborts rather than checking the whole table.
> >> >But it will scan 256 entries only if there are 256 other locks in queued
> >> >mode, which is -very- unlikely, even given 4096 CPUs.  That said, if you
> >> >show me that this results in a real latency problem on a real system,
> >> >I would be happy to provide a way to limit the search.
> >>
> >> Looking at the code more carefully, the chance of actually scanning
> >> 256 entries is very small. However, I now have some concern on the
> >> way you set up the initial queue.
> >>
> >> +/*
> >> + * Start handling a period of high contention by finding a queue to 
> >> associate
> >> + * with this lock.  Returns true if successful (in which case we hold the
> >> + * lock) and false otherwise (in which case we do -not- hold the lock).
> >> + */
> >> +bool tkt_q_start_contend(arch_spinlock_t *asp, struct __raw_tickets inc)
> >> +{
> >> + int i;
> >> + int start;
> >> +
> >> + /* Hash the lock address to find a starting point. */
> >> + start = i = tkt_q_hash(asp);
> >> +
> >> + /*
> >> +  * Each pass through the following loop attempts to associate
> >> +  * the lock with the corresponding queue.
> >> +  */
> >> + do {
> >> + /*
> >> +  * Use 0x1 to mark the queue in use, but also avoiding
> >> +  * any spinners trying to use it before we get it all
> >> +  * initialized.
> >> +  */
> >> + if (cmpxchg(&tkt_q_heads[i].ref,
> >> + NULL,
> >> + (arch_spinlock_t *)0x1) == NULL) {
> >> +
> >> + /* Succeeded, now go initialize it. */
> >> + return tkt_q_init_contend(i, asp, inc);
> >> + }
> >> +
> >> + /* If someone beat us to it, go spin on their queue. */
> >> + if (ACCESS_ONCE(asp->tickets.head)&  0x1)
> >> + return tkt_q_do_spin(asp, inc);
> >> + } while ((i = tkt_q_next_slot(i)) != start);
> >> +
> >> + /* All the queues are in use, revert to spinning on the ticket lock. 
> >> */
> >> + return false;
> >> +}
> >> +
> >>
> >> Unconditional cmpxchg() can be a source of high contention by
> >> itself. Considering that 16 threads may be doing cmpxchg() more or
> >> less simultaneously on the same cache line, it can cause a lot of
> >> contention. It will be better if you check to see if tkt_q_heads[i]
> >> is NULL first before doing cmpxchg.
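
The suggested change would presumably look something like the following
inside the quoted tkt_q_start_contend() loop (a sketch reusing the names
from the quoted code, not an actual tested patch):

		/*
		 * Sketch: peek at the slot and only attempt the cmpxchg()
		 * when it looks free, so contending CPUs do not all issue
		 * read-modify-write operations on the same cache line.
		 */
		if (ACCESS_ONCE(tkt_q_heads[i].ref) == NULL &&
		    cmpxchg(&tkt_q_heads[i].ref,
			    NULL, (arch_spinlock_t *)0x1) == NULL) {
			/* Succeeded, now go initialize it. */
			return tkt_q_init_contend(i, asp, inc);
		}
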
> >>
> >> Another point is that the 16 threads maybe setting up the queues in
> >> consecutive slots in the head table. This is both a source of
> >> contention and a waste of effort. One possible solution is to add
> >> one more field (set to cpuid + 1, for example) to indicate that that
> >> setup is being done with asp set to the target lock address
> >> immediately. We will need to use cmpxchg128() for 64-bit machine,
> >> though. Another solution is to have only that thread with ticket
> >> number that is a fixed distance from head (e.g. 16*2

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Paul E. McKenney
On Wed, Jun 12, 2013 at 07:06:53PM +0800, Lai Jiangshan wrote:
> On Wed, Jun 12, 2013 at 9:58 AM, Steven Rostedt  wrote:
> > On Wed, 2013-06-12 at 09:19 +0800, Lai Jiangshan wrote:
> >
> >> > +
> >> > +/*
> >> > + * Hand the lock off to the first CPU on the queue.
> >> > + */
> >> > +void tkt_q_do_wake(arch_spinlock_t *lock)
> >> > +{
> >> > +   struct tkt_q_head *tqhp;
> >> > +   struct tkt_q *tqp;
> >> > +
> >> > +   /* If the queue is still being set up, wait for it. */
> >> > +   while ((tqhp = tkt_q_find_head(lock)) == NULL)
> >> > +   cpu_relax();
> >> > +
> >> > +   for (;;) {
> >> > +
> >> > +   /* Find the first queue element. */
> >> > +   tqp = ACCESS_ONCE(tqhp->spin);
> >> > +   if (tqp != NULL)
> >> > +   break;  /* Element exists, hand off lock. */
> >> > +   if (tkt_q_try_unqueue(lock, tqhp))
> >> > +   return; /* No element, successfully removed 
> >> > queue. */
> >> > +   cpu_relax();
> >> > +   }
> >> > +   if (ACCESS_ONCE(tqhp->head_tkt) != -1)
> >> > +   ACCESS_ONCE(tqhp->head_tkt) = -1;
> >> > +   smp_mb(); /* Order pointer fetch and assignment against handoff. 
> >> > */
> >> > +   ACCESS_ONCE(tqp->cpu) = -1;
> >> > +}
> >> > +EXPORT_SYMBOL(tkt_q_do_wake);
> >> > +
> >> > +/*
> >> > + * Given a lock that already has a queue associated with it, spin on
> >> > + * that queue.  Return false if there was no queue (which means we do 
> >> > not
> >> > + * hold the lock) and true otherwise (meaning we -do- hold the lock).
> >> > + */
> >> > +bool tkt_q_do_spin(arch_spinlock_t *lock, struct __raw_tickets inc)
> >> > +{
> >> > +   struct tkt_q **oldtail;
> >> > +   struct tkt_q tq;
> >> > +   struct tkt_q_head *tqhp;
> >> > +
> >> > +   /*
> >> > +* Ensure that accesses to queue header happen after sensing
> >> > +* the lock's have-queue bit.
> >> > +*/
> >> > +   smp_mb();  /* See above block comment. */
> >> > +
> >> > +   /* If there no longer is a queue, leave. */
> >> > +   tqhp = tkt_q_find_head(lock);
> >> > +   if (tqhp == NULL)
> >> > +   return false;
> >> > +
> >> > +   /* Initialize our queue element. */
> >> > +   tq.cpu = raw_smp_processor_id();
> >> > +   tq.tail = inc.tail;
> >> > +   tq.next = NULL;
> >>
> >> I guess a mb() is needed here for between read tqhp->ref and read
> >> tqhp->head_tkt.
> >> you can move the above mb() to here.
> >
> > Do we?
> >
> > The only way to get into here is if you either set up the queue
> > yourself, or you saw the LSB set in head.
> >
> > If you were the one to set it up yourself, then there's nothing to worry
> > about because you are also the one that set head_tkt.
> >
> > If you didn't set up the queue, then someone else set the LSB in head,
> > which is done with a cmpxchg() which is also a full mb. This would make
> > head_tkt visible as well because it's set before cmpxchg is called.
> >
> > Thus, to come into this function you must have seen head & 1 set, and
> > the smp_mb() above will also make head_tkt visible.

Agreed, after looking again.

> > The only thing I can see now is that it might not find tqhp because ref
> > may not be set yet. If that's the case, then it will fall out back to
> > the main loop. But if it finds ref, then I don't see how it can't see
> > head_tkt up to date as well.
> >
> > Maybe I'm missing something.
> 
> No, you are right.
> 
> Lying in bed last night I was thinking about V1, and this morning I
> wrongly assumed that V2 had the same problem, without thinking it
> through carefully enough.
> 
> V2 does not have that problem.  Sorry for the noise.

Not a problem -- you did cause me to spot a missing ACCESS_ONCE() in
tkt_q_find_head(), which I have now added.  I also added a comment
to tkt_q_do_wake() noting that the caller's xadd() provides the needed
memory ordering.
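
Presumably the added ACCESS_ONCE() goes on the ->ref load in the scan loop,
along these lines (a guess at the v3 change, reconstructed from the v2 code
quoted elsewhere in this thread):

static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
{
	int i;
	int start;

	start = i = tkt_q_hash(asp);
	do
		if (ACCESS_ONCE(tkt_q_heads[i].ref) == asp)
			return &tkt_q_heads[i];
	while ((i = tkt_q_next_slot(i)) != start);
	return NULL;
}

Without the ACCESS_ONCE(), the compiler is allowed to cache or refetch ->ref
across iterations while another CPU is concurrently installing or tearing
down the queue.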

Thank you both for looking this over!

Thanx, Paul

> Thanks,
> Lai
> 
> >
> > -- Steve
> >
> >
> >>
> >> > +
> >> > +   /* Check to see if we already hold the lock. */
> >> > +   if (ACCESS_ONCE(tqhp->head_tkt) == inc.tail) {
> >> > +   /* The last holder left before queue formed, we hold 
> >> > lock. */
> >> > +   tqhp->head_tkt = -1;
> >> > +   return true;
> >> > +   }
> >> > +
> >> > +   /*
> >> > +* Add our element to the tail of the queue.  Note that if the
> >> > +* queue is empty, the ->spin_tail pointer will reference
> >> > +* the queue's head pointer, namely ->spin.
> >> > +*/
> >> > +   oldtail = xchg(&tqhp->spin_tail, &tq.next);
> >> > +   ACCESS_ONCE(*oldtail) = &tq;
> >> > +
> >> > +   /* Spin until handoff. */
> >> > +   while (ACCESS_ONCE(tq.cpu) != -1)
> >> > +   cpu_relax();
> >> > +
> >> > +   /*
> >> > +* Remov

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Lai Jiangshan
Hi, Paul

I have some question about smp_mb().(searching smp_mb() can find all
my question)

Thanks,
Lai

On Wed, Jun 12, 2013 at 3:49 AM, Paul E. McKenney
 wrote:
> On Tue, Jun 11, 2013 at 02:41:59PM -0400, Waiman Long wrote:
>> On 06/11/2013 12:36 PM, Paul E. McKenney wrote:
>> >
>> >>I am a bit concern about the size of the head queue table itself.
>> >>RHEL6, for example, had defined CONFIG_NR_CPUS to be 4096 which mean
>> >>a table size of 256. Maybe it is better to dynamically allocate the
>> >>table at init time depending on the actual number of CPUs in the
>> >>system.
>> >But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
>> >is way down in the noise.  Systems that care about that small an amount
>> >of memory probably have a small enough number of CPUs that they can just
>> >turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?
>>
>> My concern is more about the latency on the table scan than the
>> actual memory that was used.
>>
>> >>>+/*
>> >>>+ * Return a pointer to the queue header associated with the specified 
>> >>>lock,
>> >>>+ * or return NULL if there is no queue for the lock or if the lock's 
>> >>>queue
>> >>>+ * is in transition.
>> >>>+ */
>> >>>+static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
>> >>>+{
>> >>>+  int i;
>> >>>+  int start;
>> >>>+
>> >>>+  start = i = tkt_q_hash(asp);
>> >>>+  do
>> >>>+  if (tkt_q_heads[i].ref == asp)
>> >>>+  return &tkt_q_heads[i];
>> >>>+  while ((i = tkt_q_next_slot(i)) != start);
>> >>>+  return NULL;
>> >>>+}
>> >>With a table size of 256 and you have to scan the whole table to
>> >>find the right head queue. This can be a significant overhead. I
>> >>will suggest setting a limiting of how many entries it scans before
>> >>it aborts rather than checking the whole table.
>> >But it will scan 256 entries only if there are 256 other locks in queued
>> >mode, which is -very- unlikely, even given 4096 CPUs.  That said, if you
>> >show me that this results in a real latency problem on a real system,
>> >I would be happy to provide a way to limit the search.
>>
>> Looking at the code more carefully, the chance of actually scanning
>> 256 entries is very small. However, I now have some concern on the
>> way you set up the initial queue.
>>
>> +/*
>> + * Start handling a period of high contention by finding a queue to 
>> associate
>> + * with this lock.  Returns true if successful (in which case we hold the
>> + * lock) and false otherwise (in which case we do -not- hold the lock).
>> + */
>> +bool tkt_q_start_contend(arch_spinlock_t *asp, struct __raw_tickets inc)
>> +{
>> + int i;
>> + int start;
>> +
>> + /* Hash the lock address to find a starting point. */
>> + start = i = tkt_q_hash(asp);
>> +
>> + /*
>> +  * Each pass through the following loop attempts to associate
>> +  * the lock with the corresponding queue.
>> +  */
>> + do {
>> + /*
>> +  * Use 0x1 to mark the queue in use, but also avoiding
>> +  * any spinners trying to use it before we get it all
>> +  * initialized.
>> +  */
>> + if (cmpxchg(&tkt_q_heads[i].ref,
>> + NULL,
>> + (arch_spinlock_t *)0x1) == NULL) {
>> +
>> + /* Succeeded, now go initialize it. */
>> + return tkt_q_init_contend(i, asp, inc);
>> + }
>> +
>> + /* If someone beat us to it, go spin on their queue. */
>> + if (ACCESS_ONCE(asp->tickets.head) & 0x1)
>> + return tkt_q_do_spin(asp, inc);
>> + } while ((i = tkt_q_next_slot(i)) != start);
>> +
>> + /* All the queues are in use, revert to spinning on the ticket lock. */
>> + return false;
>> +}
>> +
>>
>> Unconditional cmpxchg() can be a source of high contention by
>> itself. Considering that 16 threads may be doing cmpxchg() more or
>> less simultaneously on the same cache line, it can cause a lot of
>> contention. It will be better if you check to see if tkt_q_heads[i]
>> is NULL first before doing cmpxchg.
>>
>> Another point is that the 16 threads may be setting up the queues in
>> consecutive slots in the head table. This is both a source of
>> contention and a waste of effort. One possible solution is to add
>> one more field (set to cpuid + 1, for example) to indicate that that
>> setup is being done with asp set to the target lock address
>> immediately. We will need to use cmpxchg128() for 64-bit machine,
>> though. Another solution is to have only that thread with ticket
>> number that is a fixed distance from head (e.g. 16*2) to do the
>> queue setup while the rest wait until the setup is done before
>> spinning on the queue.
>>
>> As my colleague Davidlohr reported, there are more regressions than
>> performance improvements in the AIM7 benchmark. I believe that queue
>> setup contention is likely a source of performance regression.

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Lai Jiangshan
On Wed, Jun 12, 2013 at 9:58 AM, Steven Rostedt  wrote:
> On Wed, 2013-06-12 at 09:19 +0800, Lai Jiangshan wrote:
>
>> > +
>> > +/*
>> > + * Hand the lock off to the first CPU on the queue.
>> > + */
>> > +void tkt_q_do_wake(arch_spinlock_t *lock)
>> > +{
>> > +   struct tkt_q_head *tqhp;
>> > +   struct tkt_q *tqp;
>> > +
>> > +   /* If the queue is still being set up, wait for it. */
>> > +   while ((tqhp = tkt_q_find_head(lock)) == NULL)
>> > +   cpu_relax();
>> > +
>> > +   for (;;) {
>> > +
>> > +   /* Find the first queue element. */
>> > +   tqp = ACCESS_ONCE(tqhp->spin);
>> > +   if (tqp != NULL)
>> > +   break;  /* Element exists, hand off lock. */
>> > +   if (tkt_q_try_unqueue(lock, tqhp))
>> > +   return; /* No element, successfully removed queue. 
>> > */
>> > +   cpu_relax();
>> > +   }
>> > +   if (ACCESS_ONCE(tqhp->head_tkt) != -1)
>> > +   ACCESS_ONCE(tqhp->head_tkt) = -1;
>> > +   smp_mb(); /* Order pointer fetch and assignment against handoff. */
>> > +   ACCESS_ONCE(tqp->cpu) = -1;
>> > +}
>> > +EXPORT_SYMBOL(tkt_q_do_wake);
>> > +
>> > +/*
>> > + * Given a lock that already has a queue associated with it, spin on
>> > + * that queue.  Return false if there was no queue (which means we do not
>> > + * hold the lock) and true otherwise (meaning we -do- hold the lock).
>> > + */
>> > +bool tkt_q_do_spin(arch_spinlock_t *lock, struct __raw_tickets inc)
>> > +{
>> > +   struct tkt_q **oldtail;
>> > +   struct tkt_q tq;
>> > +   struct tkt_q_head *tqhp;
>> > +
>> > +   /*
>> > +* Ensure that accesses to queue header happen after sensing
>> > +* the lock's have-queue bit.
>> > +*/
>> > +   smp_mb();  /* See above block comment. */
>> > +
>> > +   /* If there no longer is a queue, leave. */
>> > +   tqhp = tkt_q_find_head(lock);
>> > +   if (tqhp == NULL)
>> > +   return false;
>> > +
>> > +   /* Initialize our queue element. */
>> > +   tq.cpu = raw_smp_processor_id();
>> > +   tq.tail = inc.tail;
>> > +   tq.next = NULL;
>>
>> I guess a mb() is needed here for between read tqhp->ref and read
>> tqhp->head_tkt.
>> you can move the above mb() to here.
>
> Do we?
>
> The only way to get into here is if you either set up the queue
> yourself, or you saw the LSB set in head.
>
> If you were the one to set it up yourself, then there's nothing to worry
> about because you are also the one that set head_tkt.
>
> If you didn't set up the queue, then someone else set the LSB in head,
> which is done with a cmpxchg() which is also a full mb. This would make
> head_tkt visible as well because it's set before cmpxchg is called.
>
> Thus, to come into this function you must have seen head & 1 set, and
> the smp_mb() above will also make head_tkt visible.
>
> The only thing I can see now is that it might not find tqhp because ref
> may not be set yet. If that's the case, then it will fall out back to
> the main loop. But if it finds ref, then I don't see how it can't see
> head_tkt up to date as well.
>
> Maybe I'm missing something.

No, you are right.

Lying in bed last night I was thinking about V1, and this morning I
wrongly assumed that V2 had the same problem, without thinking it
through carefully enough.

V2 does not have that problem.  Sorry for the noise.

Thanks,
Lai

>
> -- Steve
>
>
>>
>> > +
>> > +   /* Check to see if we already hold the lock. */
>> > +   if (ACCESS_ONCE(tqhp->head_tkt) == inc.tail) {
>> > +   /* The last holder left before queue formed, we hold lock. 
>> > */
>> > +   tqhp->head_tkt = -1;
>> > +   return true;
>> > +   }
>> > +
>> > +   /*
>> > +* Add our element to the tail of the queue.  Note that if the
>> > +* queue is empty, the ->spin_tail pointer will reference
>> > +* the queue's head pointer, namely ->spin.
>> > +*/
>> > +   oldtail = xchg(&tqhp->spin_tail, &tq.next);
>> > +   ACCESS_ONCE(*oldtail) = &tq;
>> > +
>> > +   /* Spin until handoff. */
>> > +   while (ACCESS_ONCE(tq.cpu) != -1)
>> > +   cpu_relax();
>> > +
>> > +   /*
>> > +* Remove our element from the queue.  If the queue is now empty,
>> > +* update carefully so that the next acquisition will enqueue 
>> > itself
>> > +* at the head of the list.  Of course, the next enqueue operation
>> > +* might be happening concurrently, and this code needs to handle 
>> > all
>> > +* of the possible combinations, keeping in mind that the enqueue
>> > +* operation happens in two stages: (1) update the tail pointer and
>> > +* (2) update the predecessor's ->next pointer.  With this in mind,
>> > +* the following code needs to deal with three scenarios:
>> > +*
>> > +* 1.   tq is the last entry.  In

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-12 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 09:58:08PM -0400, Steven Rostedt wrote:
> On Wed, 2013-06-12 at 09:19 +0800, Lai Jiangshan wrote:
> 
> > > +
> > > +/*
> > > + * Hand the lock off to the first CPU on the queue.
> > > + */
> > > +void tkt_q_do_wake(arch_spinlock_t *lock)
> > > +{
> > > +   struct tkt_q_head *tqhp;
> > > +   struct tkt_q *tqp;
> > > +
> > > +   /* If the queue is still being set up, wait for it. */
> > > +   while ((tqhp = tkt_q_find_head(lock)) == NULL)
> > > +   cpu_relax();
> > > +
> > > +   for (;;) {
> > > +
> > > +   /* Find the first queue element. */
> > > +   tqp = ACCESS_ONCE(tqhp->spin);
> > > +   if (tqp != NULL)
> > > +   break;  /* Element exists, hand off lock. */
> > > +   if (tkt_q_try_unqueue(lock, tqhp))
> > > +   return; /* No element, successfully removed 
> > > queue. */
> > > +   cpu_relax();
> > > +   }
> > > +   if (ACCESS_ONCE(tqhp->head_tkt) != -1)
> > > +   ACCESS_ONCE(tqhp->head_tkt) = -1;
> > > +   smp_mb(); /* Order pointer fetch and assignment against handoff. 
> > > */
> > > +   ACCESS_ONCE(tqp->cpu) = -1;
> > > +}
> > > +EXPORT_SYMBOL(tkt_q_do_wake);
> > > +
> > > +/*
> > > + * Given a lock that already has a queue associated with it, spin on
> > > + * that queue.  Return false if there was no queue (which means we do not
> > > + * hold the lock) and true otherwise (meaning we -do- hold the lock).
> > > + */
> > > +bool tkt_q_do_spin(arch_spinlock_t *lock, struct __raw_tickets inc)
> > > +{
> > > +   struct tkt_q **oldtail;
> > > +   struct tkt_q tq;
> > > +   struct tkt_q_head *tqhp;
> > > +
> > > +   /*
> > > +* Ensure that accesses to queue header happen after sensing
> > > +* the lock's have-queue bit.
> > > +*/
> > > +   smp_mb();  /* See above block comment. */
> > > +
> > > +   /* If there no longer is a queue, leave. */
> > > +   tqhp = tkt_q_find_head(lock);
> > > +   if (tqhp == NULL)
> > > +   return false;
> > > +
> > > +   /* Initialize our queue element. */
> > > +   tq.cpu = raw_smp_processor_id();
> > > +   tq.tail = inc.tail;
> > > +   tq.next = NULL;
> > 
> > I guess a mb() is needed here for between read tqhp->ref and read
> > tqhp->head_tkt.
> > you can move the above mb() to here.
> 
> Do we?
> 
> The only way to get into here is if you either set up the queue
> yourself, or you saw the LSB set in head.
> 
> If you were the one to set it up yourself, then there's nothing to worry
> about because you are also the one that set head_tkt.
> 
> If you didn't set up the queue, then someone else set the LSB in head,
> which is done with a cmpxchg() which is also a full mb. This would make
> head_tkt visible as well because it's set before cmpxchg is called.
> 
> Thus, to come into this function you must have seen head & 1 set, and
> the smp_mb() above will also make head_tkt visible.
> 
> The only thing I can see now is that it might not find tqhp because ref
> may not be set yet. If that's the case, then it will fall out back to
> the main loop. But if it finds ref, then I don't see how it can't see
> head_tkt up to date as well.
> 
> Maybe I'm missing something.

Hmmm...  I need to look at this more carefully.  Lai might well be right
because if we are relying on the cmpxchg() for ordering, there needs
to be a memory barrier on the read side to pair with the cmpxchg().
You are of course quite correct in the case where the CPU reading the
->head_tkt is the one that set it up.

Something to think about at the gym, I guess.  ;-)
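
For reference, the pairing in question has the usual publish/consume shape.
Schematically, with field names borrowed from the patch but the call sites
simplified (so treat this as a sketch, not the actual code paths):

	/* CPU 0: queue setup */
	tqhp->head_tkt = h;				/* A: publish head ticket */
	cmpxchg(&lock->tickets.head, h, h | 0x1);	/* B: full barrier, set LSB */

	/* CPU 1: later spinner */
	if (ACCESS_ONCE(lock->tickets.head) & 0x1) {	/* C: observes B's store */
		smp_mb();		/* pairs with the cmpxchg() at B */
		t = ACCESS_ONCE(tqhp->head_tkt);	/* D: must now see A */
	}

If C sees B's store, the smp_mb() at C paired with the full barrier implied
by the successful cmpxchg() at B guarantees that D sees A.  Without the
read-side barrier, a weakly ordered CPU could in principle observe a stale
->head_tkt at D even after seeing the flag at C, which is the case Lai is
worried about.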

Thanx, Paul

> -- Steve
> 
> 
> > 
> > > +
> > > +   /* Check to see if we already hold the lock. */
> > > +   if (ACCESS_ONCE(tqhp->head_tkt) == inc.tail) {
> > > +   /* The last holder left before queue formed, we hold 
> > > lock. */
> > > +   tqhp->head_tkt = -1;
> > > +   return true;
> > > +   }
> > > +
> > > +   /*
> > > +* Add our element to the tail of the queue.  Note that if the
> > > +* queue is empty, the ->spin_tail pointer will reference
> > > +* the queue's head pointer, namely ->spin.
> > > +*/
> > > +   oldtail = xchg(&tqhp->spin_tail, &tq.next);
> > > +   ACCESS_ONCE(*oldtail) = &tq;
> > > +
> > > +   /* Spin until handoff. */
> > > +   while (ACCESS_ONCE(tq.cpu) != -1)
> > > +   cpu_relax();
> > > +
> > > +   /*
> > > +* Remove our element from the queue.  If the queue is now empty,
> > > +* update carefully so that the next acquisition will enqueue 
> > > itself
> > > +* at the head of the list.  Of course, the next enqueue operation
> > > +* might be happening concurrently, and this code needs to handle 
> > > all
> > > +* of the possible combinations, keeping

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Steven Rostedt
On Wed, 2013-06-12 at 09:19 +0800, Lai Jiangshan wrote:

> > +
> > +/*
> > + * Hand the lock off to the first CPU on the queue.
> > + */
> > +void tkt_q_do_wake(arch_spinlock_t *lock)
> > +{
> > +   struct tkt_q_head *tqhp;
> > +   struct tkt_q *tqp;
> > +
> > +   /* If the queue is still being set up, wait for it. */
> > +   while ((tqhp = tkt_q_find_head(lock)) == NULL)
> > +   cpu_relax();
> > +
> > +   for (;;) {
> > +
> > +   /* Find the first queue element. */
> > +   tqp = ACCESS_ONCE(tqhp->spin);
> > +   if (tqp != NULL)
> > +   break;  /* Element exists, hand off lock. */
> > +   if (tkt_q_try_unqueue(lock, tqhp))
> > +   return; /* No element, successfully removed queue. 
> > */
> > +   cpu_relax();
> > +   }
> > +   if (ACCESS_ONCE(tqhp->head_tkt) != -1)
> > +   ACCESS_ONCE(tqhp->head_tkt) = -1;
> > +   smp_mb(); /* Order pointer fetch and assignment against handoff. */
> > +   ACCESS_ONCE(tqp->cpu) = -1;
> > +}
> > +EXPORT_SYMBOL(tkt_q_do_wake);
> > +
> > +/*
> > + * Given a lock that already has a queue associated with it, spin on
> > + * that queue.  Return false if there was no queue (which means we do not
> > + * hold the lock) and true otherwise (meaning we -do- hold the lock).
> > + */
> > +bool tkt_q_do_spin(arch_spinlock_t *lock, struct __raw_tickets inc)
> > +{
> > +   struct tkt_q **oldtail;
> > +   struct tkt_q tq;
> > +   struct tkt_q_head *tqhp;
> > +
> > +   /*
> > +* Ensure that accesses to queue header happen after sensing
> > +* the lock's have-queue bit.
> > +*/
> > +   smp_mb();  /* See above block comment. */
> > +
> > +   /* If there no longer is a queue, leave. */
> > +   tqhp = tkt_q_find_head(lock);
> > +   if (tqhp == NULL)
> > +   return false;
> > +
> > +   /* Initialize our queue element. */
> > +   tq.cpu = raw_smp_processor_id();
> > +   tq.tail = inc.tail;
> > +   tq.next = NULL;
> 
> I guess a mb() is needed here for between read tqhp->ref and read
> tqhp->head_tkt.
> you can move the above mb() to here.

Do we?

The only way to get into here is if you either set up the queue
yourself, or you saw the LSB set in head.

If you were the one to set it up yourself, then there's nothing to worry
about because you are also the one that set head_tkt.

If you didn't set up the queue, then someone else set the LSB in head,
which is done with a cmpxchg() which is also a full mb. This would make
head_tkt visible as well because it's set before cmpxchg is called.

Thus, to come into this function you must have seen head & 1 set, and
the smp_mb() above will also make head_tkt visible.

The only thing I can see now is that it might not find tqhp because ref
may not be set yet. If that's the case, then it will fall out back to
the main loop. But if it finds ref, then I don't see how it can't see
head_tkt up to date as well.

Maybe I'm missing something.

-- Steve


> 
> > +
> > +   /* Check to see if we already hold the lock. */
> > +   if (ACCESS_ONCE(tqhp->head_tkt) == inc.tail) {
> > +   /* The last holder left before queue formed, we hold lock. 
> > */
> > +   tqhp->head_tkt = -1;
> > +   return true;
> > +   }
> > +
> > +   /*
> > +* Add our element to the tail of the queue.  Note that if the
> > +* queue is empty, the ->spin_tail pointer will reference
> > +* the queue's head pointer, namely ->spin.
> > +*/
> > +   oldtail = xchg(&tqhp->spin_tail, &tq.next);
> > +   ACCESS_ONCE(*oldtail) = &tq;
> > +
> > +   /* Spin until handoff. */
> > +   while (ACCESS_ONCE(tq.cpu) != -1)
> > +   cpu_relax();
> > +
> > +   /*
> > +* Remove our element from the queue.  If the queue is now empty,
> > +* update carefully so that the next acquisition will enqueue itself
> > +* at the head of the list.  Of course, the next enqueue operation
> > +* might be happening concurrently, and this code needs to handle 
> > all
> > +* of the possible combinations, keeping in mind that the enqueue
> > +* operation happens in two stages: (1) update the tail pointer and
> > +* (2) update the predecessor's ->next pointer.  With this in mind,
> > +* the following code needs to deal with three scenarios:
> > +*
> > +* 1.   tq is the last entry.  In this case, we use cmpxchg to
> > +*  point the list tail back to the list head (->spin).  If
> > +*  the cmpxchg fails, that indicates that we are instead
> > +*  in scenario 2 below.  If the cmpxchg succeeds, the next
> > +*  enqueue operation's tail-pointer exchange will enqueue
> > +*  the next element at the queue head, because the ->spin_tail
> > +*  point

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Lai Jiangshan
On Wed, Jun 12, 2013 at 3:49 AM, Paul E. McKenney
 wrote:
> On Tue, Jun 11, 2013 at 02:41:59PM -0400, Waiman Long wrote:
>> On 06/11/2013 12:36 PM, Paul E. McKenney wrote:
>> >
>> >>I am a bit concern about the size of the head queue table itself.
>> >>RHEL6, for example, had defined CONFIG_NR_CPUS to be 4096 which mean
>> >>a table size of 256. Maybe it is better to dynamically allocate the
>> >>table at init time depending on the actual number of CPUs in the
>> >>system.
>> >But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
>> >is way down in the noise.  Systems that care about that small an amount
>> >of memory probably have a small enough number of CPUs that they can just
>> >turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?
>>
>> My concern is more about the latency on the table scan than the
>> actual memory that was used.
>>
>> >>>+/*
>> >>>+ * Return a pointer to the queue header associated with the specified 
>> >>>lock,
>> >>>+ * or return NULL if there is no queue for the lock or if the lock's 
>> >>>queue
>> >>>+ * is in transition.
>> >>>+ */
>> >>>+static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
>> >>>+{
>> >>>+  int i;
>> >>>+  int start;
>> >>>+
>> >>>+  start = i = tkt_q_hash(asp);
>> >>>+  do
>> >>>+  if (tkt_q_heads[i].ref == asp)
>> >>>+  return &tkt_q_heads[i];
>> >>>+  while ((i = tkt_q_next_slot(i)) != start);
>> >>>+  return NULL;
>> >>>+}
>> >>With a table size of 256 and you have to scan the whole table to
>> >>find the right head queue. This can be a significant overhead. I
>> >>will suggest setting a limiting of how many entries it scans before
>> >>it aborts rather than checking the whole table.
>> >But it will scan 256 entries only if there are 256 other locks in queued
>> >mode, which is -very- unlikely, even given 4096 CPUs.  That said, if you
>> >show me that this results in a real latency problem on a real system,
>> >I would be happy to provide a way to limit the search.
>>
>> Looking at the code more carefully, the chance of actually scanning
>> 256 entries is very small. However, I now have some concern on the
>> way you set up the initial queue.
>>
>> +/*
>> + * Start handling a period of high contention by finding a queue to 
>> associate
>> + * with this lock.  Returns true if successful (in which case we hold the
>> + * lock) and false otherwise (in which case we do -not- hold the lock).
>> + */
>> +bool tkt_q_start_contend(arch_spinlock_t *asp, struct __raw_tickets inc)
>> +{
>> + int i;
>> + int start;
>> +
>> + /* Hash the lock address to find a starting point. */
>> + start = i = tkt_q_hash(asp);
>> +
>> + /*
>> +  * Each pass through the following loop attempts to associate
>> +  * the lock with the corresponding queue.
>> +  */
>> + do {
>> + /*
>> +  * Use 0x1 to mark the queue in use, but also avoiding
>> +  * any spinners trying to use it before we get it all
>> +  * initialized.
>> +  */
>> + if (cmpxchg(&tkt_q_heads[i].ref,
>> + NULL,
>> + (arch_spinlock_t *)0x1) == NULL) {
>> +
>> + /* Succeeded, now go initialize it. */
>> + return tkt_q_init_contend(i, asp, inc);
>> + }
>> +
>> + /* If someone beat us to it, go spin on their queue. */
>> + if (ACCESS_ONCE(asp->tickets.head) & 0x1)
>> + return tkt_q_do_spin(asp, inc);
>> + } while ((i = tkt_q_next_slot(i)) != start);
>> +
>> + /* All the queues are in use, revert to spinning on the ticket lock. */
>> + return false;
>> +}
>> +
>>
>> Unconditional cmpxchg() can be a source of high contention by
>> itself. Considering that 16 threads may be doing cmpxchg() more or
>> less simultaneously on the same cache line, it can cause a lot of
>> contention. It will be better if you check to see if tkt_q_heads[i]
>> is NULL first before doing cmpxchg.
>>
>> Another point is that the 16 threads may be setting up the queues in
>> consecutive slots in the head table. This is both a source of
>> contention and a waste of effort. One possible solution is to add
>> one more field (set to cpuid + 1, for example) to indicate that that
>> setup is being done with asp set to the target lock address
>> immediately. We will need to use cmpxchg128() for 64-bit machine,
>> though. Another solution is to have only that thread with ticket
>> number that is a fixed distance from head (e.g. 16*2) to do the
>> queue setup while the rest wait until the setup is done before
>> spinning on the queue.
>>
>> As my colleague Davidlohr reported, there are more regressions than
>> performance improvements in the AIM7 benchmark. I believe that queue
>> setup contention is likely a source of performance regression.
>
> Please see below for a v3 patch that:
>
> 1.  Fixes cpu_relax().
>
> 2.  Tests before doing cmpxchg().

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 04:56:50PM -0400, Steven Rostedt wrote:
> On Tue, 2013-06-11 at 12:49 -0700, Paul E. McKenney wrote:
> 
> > +config TICKET_LOCK_QUEUED
> > +   bool "Dynamically switch between ticket and queued locking"
> > +   depends on SMP
> > +   default n
> > +   ---help---
> > + Enable dynamic switching between ticketlock and queued locking
> > + on a per-lock basis.  This option will slow down low-contention
> > + acquisition and release very slightly (additional conditional
> > + in release path), but will provide more efficient operation at
> > + high levels of lock contention.  High-contention operation will
> > + not be quite as efficient as would be a pure queued lock, but
> > + this dynamic approach consumes less memory than queued locks
> > + and also runs faster at low levels of contention.
> > +
> > + Say "Y" if you are running on a large system with a workload
> > + that is likely to result in high levels of contention.
> > +
> > + Say "N" if you are unsure.
> > +
> > +config TICKET_LOCK_QUEUED_SWITCH
> > +   int "When to switch from ticket to queued locking"
> > +   depends on TICKET_LOCK_QUEUED
> > +   default 8
> > +   range 3 32
> > +   ---help---
> > + Specify how many tasks should be spinning on the lock before
> > + switching to queued mode.  Systems with low-latency memory/cache
> > + interconnects will prefer larger numbers, while extreme low-latency
> > + and real-time workloads will prefer a smaller number.  Of course,
> > + extreme real-time workloads would be even happier if contention
> > + on the locks were reduced to the point that there was never any
> > + need for queued locking in the first place.
> 
> Are you sure real-time wants low numbers? I would think that real-time
> would want this off. This is just a way to help prevent cache ping
> ponging, but it adds to non-deterministic behavior. As I mentioned
> before, even though you fixed the thundering herd on setup, once the
> queue is set, then we will get a thundering herd of tasks trying to
> queue itself, and the task that was spinning the longest could very well
> become the one at the end of the FIFO.

Me?  I think that real-time just wants contention to remain low, so that
this sort of thing isn't needed in the first place.  And now that you
mention it, I suppose that is one of the few things that real-time and
real-fast workloads have in common.

But if you had some mixed workload on a large system that was mostly
real-fast, but had a real-time component, and if the real-fast portion
needed TICKET_LOCK_QUEUED=y, then I would guess that the real-time
portion would want a relatively low number for TICKET_LOCK_QUEUED_SWITCH.

Thanx, Paul

> -- Steve
> 
> 
> 
> > +
> > + Take the default if you are unsure.
> > diff --git a/kernel/Makefile b/kernel/Makefile
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Steven Rostedt
On Tue, 2013-06-11 at 12:49 -0700, Paul E. McKenney wrote:

> +config TICKET_LOCK_QUEUED
> + bool "Dynamically switch between ticket and queued locking"
> + depends on SMP
> + default n
> + ---help---
> +   Enable dynamic switching between ticketlock and queued locking
> +   on a per-lock basis.  This option will slow down low-contention
> +   acquisition and release very slightly (additional conditional
> +   in release path), but will provide more efficient operation at
> +   high levels of lock contention.  High-contention operation will
> +   not be quite as efficient as would be a pure queued lock, but
> +   this dynamic approach consumes less memory than queued locks
> +   and also runs faster at low levels of contention.
> +
> +   Say "Y" if you are running on a large system with a workload
> +   that is likely to result in high levels of contention.
> +
> +   Say "N" if you are unsure.
> +
> +config TICKET_LOCK_QUEUED_SWITCH
> + int "When to switch from ticket to queued locking"
> + depends on TICKET_LOCK_QUEUED
> + default 8
> + range 3 32
> + ---help---
> +   Specify how many tasks should be spinning on the lock before
> +   switching to queued mode.  Systems with low-latency memory/cache
> +   interconnects will prefer larger numbers, while extreme low-latency
> +   and real-time workloads will prefer a smaller number.  Of course,
> +   extreme real-time workloads would be even happier if contention
> +   on the locks were reduced to the point that there was never any
> +   need for queued locking in the first place.

Are you sure real-time wants low numbers? I would think that real-time
would want this off. This is just a way to help prevent cache ping
ponging, but it adds to non-deterministic behavior. As I mentioned
before, even though you fixed the thundering herd on setup, once the
queue is set, then we will get a thundering herd of tasks trying to
queue itself, and the task that was spinning the longest could very well
become the one at the end of the FIFO.

-- Steve



> +
> +   Take the default if you are unsure.
> diff --git a/kernel/Makefile b/kernel/Makefile


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Steven Rostedt
On Tue, 2013-06-11 at 13:32 -0700, Paul E. McKenney wrote:

>   /*
>* This lock has lots of spinners, but no queue.  Go create
>* a queue to spin on.
>*
>* In the common case, only the single task that
>* sees the head and tail tickets being different by
>* exactly TKT_Q_SWITCH will come here to set up the queue,
>* which prevents a "thundering herd" of queue setups.
>* Although it is still possible for an unfortunate series
>* of lock handoffs and newly arrived tasks to result
>* in more than one task performing a queue setup, this
>* is unlikely.  Of course, this situation must still be
>* handled correctly, which is the job of the cmpxchg()
>* in tkt_q_start_contend().
>*/
>   if (tkt_q_start_contend(ap, inc))
>   return true;
> 
> Does that help?

Yes, very good.

> 
> > As TKT_Q_SWITCH doesn't have a type, I'm not sure how C will evaluate
> > this. I always screw type conversions up, and just add in the type casts
> > to be safe.
> > 
> > You could also give TKT_Q_SWITCH a type too.
> 
> This is an excellent point as well -- things might well get confused.
> My solution was to take your last suggestion and given TKT_Q_SWITCH the
> same type as inc.tail and inc.head, and also apply type-safety paranoia
> to TKT_Q_NQUEUES:
> 
> /*
>  * TKT_Q_SWITCH is twice the number of CPUs that must be spinning on a
>  * given ticket lock to motivate switching to spinning on a queue.
>  * The reason that it is twice the number is because the bottom bit of
>  * the ticket is reserved for the bit that indicates that a queue is
>  * associated with the lock.
>  */
> #define TKT_Q_SWITCH  ((__ticket_t)(CONFIG_TICKET_LOCK_QUEUED_SWITCH * 2))
> 
> /*
>  * TKT_Q_NQUEUES is the number of queues to maintain.  Large systems
>  * might have multiple highly contended locks, so provide more queues for
>  * systems with larger numbers of CPUs.
>  */
> #define TKT_Q_NQUEUES (2 * DIV_ROUND_UP(NR_CPUS + ((int)TKT_Q_SWITCH) - 1, \
>   (int)TKT_Q_SWITCH))
> 
> Does that look OK?  (The limits on the value of TKT_Q_SWITCH should avoid
> signed integer overflow.)
> 

Looks fine.
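
To make the factor of two concrete: since bit 0 of the ticket is reserved for
the have-queue flag, tickets presumably advance in steps of two, so "N tasks
spinning" shows up as a head/tail distance of 2*N.  A worked example with the
default CONFIG_TICKET_LOCK_QUEUED_SWITCH=8 (the numbers are illustrative):

	/*
	 * TKT_Q_SWITCH = 8 * 2 = 16.
	 *
	 * holder's ticket:                  head = 0x0064
	 * seven earlier waiters hold        0x0066, 0x0068, ..., 0x0072
	 * the next arrival's xadd() returns tail = 0x0074
	 *
	 * 0x0074 - 16 == 0x0064 == head, so this arrival, the eighth
	 * waiter, is the one that attempts to set up the queue; everyone
	 * before it saw a smaller difference and kept spinning on the
	 * ticket itself.
	 */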

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 01:25:15PM -0700, Jason Low wrote:
> On Tue, Jun 11, 2013 at 12:49 PM, Paul E. McKenney
>  wrote:
> > On Tue, Jun 11, 2013 at 02:41:59PM -0400, Waiman Long wrote:
> >> On 06/11/2013 12:36 PM, Paul E. McKenney wrote:
> >> >
> >> >>I am a bit concern about the size of the head queue table itself.
> >> >>RHEL6, for example, had defined CONFIG_NR_CPUS to be 4096 which mean
> >> >>a table size of 256. Maybe it is better to dynamically allocate the
> >> >>table at init time depending on the actual number of CPUs in the
> >> >>system.
> >> >But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
> >> >is way down in the noise.  Systems that care about that small an amount
> >> >of memory probably have a small enough number of CPUs that they can just
> >> >turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?
> >>
> >> My concern is more about the latency on the table scan than the
> >> actual memory that was used.
> >>
> >> >>>+/*
> >> >>>+ * Return a pointer to the queue header associated with the specified 
> >> >>>lock,
> >> >>>+ * or return NULL if there is no queue for the lock or if the lock's 
> >> >>>queue
> >> >>>+ * is in transition.
> >> >>>+ */
> >> >>>+static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
> >> >>>+{
> >> >>>+  int i;
> >> >>>+  int start;
> >> >>>+
> >> >>>+  start = i = tkt_q_hash(asp);
> >> >>>+  do
> >> >>>+  if (tkt_q_heads[i].ref == asp)
> >> >>>+  return &tkt_q_heads[i];
> >> >>>+  while ((i = tkt_q_next_slot(i)) != start);
> >> >>>+  return NULL;
> >> >>>+}
> >> >>With a table size of 256 and you have to scan the whole table to
> >> >>find the right head queue. This can be a significant overhead. I
> >> >>will suggest setting a limiting of how many entries it scans before
> >> >>it aborts rather than checking the whole table.
> >> >But it will scan 256 entries only if there are 256 other locks in queued
> >> >mode, which is -very- unlikely, even given 4096 CPUs.  That said, if you
> >> >show me that this results in a real latency problem on a real system,
> >> >I would be happy to provide a way to limit the search.
> >>
> >> Looking at the code more carefully, the chance of actually scanning
> >> 256 entries is very small. However, I now have some concern on the
> >> way you set up the initial queue.
> >>
> >> +/*
> >> + * Start handling a period of high contention by finding a queue to 
> >> associate
> >> + * with this lock.  Returns true if successful (in which case we hold the
> >> + * lock) and false otherwise (in which case we do -not- hold the lock).
> >> + */
> >> +bool tkt_q_start_contend(arch_spinlock_t *asp, struct __raw_tickets inc)
> >> +{
> >> + int i;
> >> + int start;
> >> +
> >> + /* Hash the lock address to find a starting point. */
> >> + start = i = tkt_q_hash(asp);
> >> +
> >> + /*
> >> +  * Each pass through the following loop attempts to associate
> >> +  * the lock with the corresponding queue.
> >> +  */
> >> + do {
> >> + /*
> >> +  * Use 0x1 to mark the queue in use, but also avoiding
> >> +  * any spinners trying to use it before we get it all
> >> +  * initialized.
> >> +  */
> >> + if (cmpxchg(&tkt_q_heads[i].ref,
> >> + NULL,
> >> + (arch_spinlock_t *)0x1) == NULL) {
> >> +
> >> + /* Succeeded, now go initialize it. */
> >> + return tkt_q_init_contend(i, asp, inc);
> >> + }
> >> +
> >> + /* If someone beat us to it, go spin on their queue. */
> >> + if (ACCESS_ONCE(asp->tickets.head) & 0x1)
> >> + return tkt_q_do_spin(asp, inc);
> >> + } while ((i = tkt_q_next_slot(i)) != start);
> >> +
> >> + /* All the queues are in use, revert to spinning on the ticket lock. 
> >> */
> >> + return false;
> >> +}
> >> +
> >>
> >> Unconditional cmpxchg() can be a source of high contention by
> >> itself. Considering that 16 threads may be doing cmpxchg() more or
> >> less simultaneously on the same cache line, it can cause a lot of
> >> contention. It will be better if you check to see if tkt_q_heads[i]
> >> is NULL first before doing cmpxchg.
> >>
> >> Another point is that the 16 threads may be setting up the queues in
> >> consecutive slots in the head table. This is both a source of
> >> contention and a waste of effort. One possible solution is to add
> >> one more field (set to cpuid + 1, for example) to indicate that that
> >> setup is being done with asp set to the target lock address
> >> immediately. We will need to use cmpxchg128() for 64-bit machine,
> >> though. Another solution is to have only that thread with ticket
> >> number that is a fixed distance from head (e.g. 16*2) to do the
> >> queue setup while the rest wait until the setup is done before
> >> spinning on the queue.
> >>
> >> As my colleague Davidlohr reported, there are more regressions than
> >> performance improvements in the AIM7 benchmark. I believe that queue
> >> setup contention is likely a source of performance regression.

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 04:09:56PM -0400, Steven Rostedt wrote:
> On Tue, 2013-06-11 at 12:49 -0700, Paul E. McKenney wrote:
> 
> > +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc)
> > +{
> > +   if (unlikely(inc.head & 0x1)) {
> > +
> > +   /* This lock has a queue, so go spin on the queue. */
> > +   if (tkt_q_do_spin(ap, inc))
> > +   return true;
> > +
> > +   /* Get here if the queue is in transition: Retry next time. */
> > +
> 
> This looks better, but please add a comment, something to the likes of:
> 
>   /*
>* Only the TKT_Q_SWITCH waiter will set up the queue to prevent
>* a thundering herd of setups to occur. It is still possible for
>* more than one task to perform a setup if the lock is released
>* after this check, a waiter coming in may also match this test. But
>* that's covered by the cmpxchg() setup in tkt_q_start_contend.
>*/
> 
> > +   } else if (inc.tail - TKT_Q_SWITCH == inc.head) {
> 
> Also shouldn't this be:
> 
>   } else if ((__ticket_t)(inc.tail - TKT_Q_SWITCH) == inc.head) {

Good points on the comment, here is what I currently have:

} else if (inc.tail - TKT_Q_SWITCH == inc.head) {

/*
 * This lock has lots of spinners, but no queue.  Go create
 * a queue to spin on.
 *
 * In the common case, only the single task that
 * sees the head and tail tickets being different by
 * exactly TKT_Q_SWITCH will come here to set up the queue,
 * which prevents a "thundering herd" of queue setups.
 * Although it is still possible for an unfortunate series
 * of lock handoffs and newly arrived tasks to result
 * in more than one task performing a queue setup, this
 * is unlikely.  Of course, this situation must still be
 * handled correctly, which is the job of the cmpxchg()
 * in tkt_q_start_contend().
 */
if (tkt_q_start_contend(ap, inc))
return true;

Does that help?

> As TKT_Q_SWITCH doesn't have a type, I'm not sure how C will evaluate
> this. I always screw type conversions up, and just add in the type casts
> to be safe.
> 
> You could also give TKT_Q_SWITCH a type too.

This is an excellent point as well -- things might well get confused.
My solution was to take your last suggestion and given TKT_Q_SWITCH the
same type as inc.tail and inc.head, and also apply type-safety paranoia
to TKT_Q_NQUEUES:

/*
 * TKT_Q_SWITCH is twice the number of CPUs that must be spinning on a
 * given ticket lock to motivate switching to spinning on a queue.
 * The reason that it is twice the number is because the bottom bit of
 * the ticket is reserved for the bit that indicates that a queue is
 * associated with the lock.
 */
#define TKT_Q_SWITCH  ((__ticket_t)(CONFIG_TICKET_LOCK_QUEUED_SWITCH * 2))

/*
 * TKT_Q_NQUEUES is the number of queues to maintain.  Large systems
 * might have multiple highly contended locks, so provide more queues for
 * systems with larger numbers of CPUs.
 */
#define TKT_Q_NQUEUES (2 * DIV_ROUND_UP(NR_CPUS + ((int)TKT_Q_SWITCH) - 1, \
(int)TKT_Q_SWITCH))

Does that look OK?  (The limits on the value of TKT_Q_SWITCH should avoid
signed integer overflow.)
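
A standalone illustration of why the truncation matters once the ticket
counter wraps (hypothetical values, 16-bit tickets assumed):

	typedef u16 __ticket_t;			/* e.g. 16-bit tickets */
	#define TKT_Q_SWITCH ((__ticket_t)16)

	__ticket_t head = 0xfff4;	/* holder took its ticket long ago   */
	__ticket_t tail = 0x0004;	/* newest ticket has already wrapped */

	/*
	 * The operands are promoted to int, so tail - TKT_Q_SWITCH is
	 * 4 - 16 = -12, which never compares equal to 0xfff4.
	 */
	bool miss = (tail - TKT_Q_SWITCH == head);		  /* false */

	/*
	 * Truncating the difference back to __ticket_t restores the
	 * modulo-2^16 arithmetic: (__ticket_t)(4 - 16) == 0xfff4.
	 */
	bool hit = ((__ticket_t)(tail - TKT_Q_SWITCH) == head);  /* true  */

Note that giving TKT_Q_SWITCH the __ticket_t type does not by itself avoid
the promotion, since the usual arithmetic conversions still widen both sides
to int; it is the cast applied to the difference, as in Steve's suggested
form, that keeps the comparison correct across the wrap.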

Thanx, Paul

> -- Steve
> 
> > +
> > +   /*
> > +* This lock has lots of spinners, but no queue.
> > +* Go create a queue to spin on.
> > +*/
> > +   if (tkt_q_start_contend(ap, inc))
> > +   return true;
> > +
> > +   /* Get here if the queue is in transition: Retry next time. */
> > +   }
> > +
> > +   /* Either no need for a queue or the queue is in transition.  Spin. */
> > +   return false;
> > +}
> > +EXPORT_SYMBOL(tkt_spin_pass);
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Jason Low
On Tue, Jun 11, 2013 at 12:49 PM, Paul E. McKenney
 wrote:
> On Tue, Jun 11, 2013 at 02:41:59PM -0400, Waiman Long wrote:
>> On 06/11/2013 12:36 PM, Paul E. McKenney wrote:
>> >
>> >>I am a bit concern about the size of the head queue table itself.
>> >>RHEL6, for example, had defined CONFIG_NR_CPUS to be 4096 which mean
>> >>a table size of 256. Maybe it is better to dynamically allocate the
>> >>table at init time depending on the actual number of CPUs in the
>> >>system.
>> >But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
>> >is way down in the noise.  Systems that care about that small an amount
>> >of memory probably have a small enough number of CPUs that they can just
>> >turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?
>>
>> My concern is more about the latency on the table scan than the
>> actual memory that was used.
>>
>> >>>+/*
>> >>>+ * Return a pointer to the queue header associated with the specified 
>> >>>lock,
>> >>>+ * or return NULL if there is no queue for the lock or if the lock's 
>> >>>queue
>> >>>+ * is in transition.
>> >>>+ */
>> >>>+static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
>> >>>+{
>> >>>+  int i;
>> >>>+  int start;
>> >>>+
>> >>>+  start = i = tkt_q_hash(asp);
>> >>>+  do
>> >>>+  if (tkt_q_heads[i].ref == asp)
>> >>>+  return &tkt_q_heads[i];
>> >>>+  while ((i = tkt_q_next_slot(i)) != start);
>> >>>+  return NULL;
>> >>>+}
>> >>With a table size of 256 and you have to scan the whole table to
>> >>find the right head queue. This can be a significant overhead. I
>> >>will suggest setting a limiting of how many entries it scans before
>> >>it aborts rather than checking the whole table.
>> >But it will scan 256 entries only if there are 256 other locks in queued
>> >mode, which is -very- unlikely, even given 4096 CPUs.  That said, if you
>> >show me that this results in a real latency problem on a real system,
>> >I would be happy to provide a way to limit the search.
>>
>> Looking at the code more carefully, the chance of actually scanning
>> 256 entries is very small. However, I now have some concern on the
>> way you set up the initial queue.
>>
>> +/*
>> + * Start handling a period of high contention by finding a queue to 
>> associate
>> + * with this lock.  Returns true if successful (in which case we hold the
>> + * lock) and false otherwise (in which case we do -not- hold the lock).
>> + */
>> +bool tkt_q_start_contend(arch_spinlock_t *asp, struct __raw_tickets inc)
>> +{
>> + int i;
>> + int start;
>> +
>> + /* Hash the lock address to find a starting point. */
>> + start = i = tkt_q_hash(asp);
>> +
>> + /*
>> +  * Each pass through the following loop attempts to associate
>> +  * the lock with the corresponding queue.
>> +  */
>> + do {
>> + /*
>> +  * Use 0x1 to mark the queue in use, but also avoiding
>> +  * any spinners trying to use it before we get it all
>> +  * initialized.
>> +  */
>> + if (cmpxchg(&tkt_q_heads[i].ref,
>> + NULL,
>> + (arch_spinlock_t *)0x1) == NULL) {
>> +
>> + /* Succeeded, now go initialize it. */
>> + return tkt_q_init_contend(i, asp, inc);
>> + }
>> +
>> + /* If someone beat us to it, go spin on their queue. */
>> + if (ACCESS_ONCE(asp->tickets.head) & 0x1)
>> + return tkt_q_do_spin(asp, inc);
>> + } while ((i = tkt_q_next_slot(i)) != start);
>> +
>> + /* All the queues are in use, revert to spinning on the ticket lock. */
>> + return false;
>> +}
>> +
>>
>> Unconditional cmpxchg() can be a source of high contention by
>> itself. Considering that 16 threads may be doing cmpxchg() more or
>> less simultaneously on the same cache line, it can cause a lot of
>> contention. It will be better if you check to see if tkt_q_heads[i]
>> is NULL first before doing cmpxchg.
>>
>> Another point is that the 16 threads may be setting up the queues in
>> consecutive slots in the head table. This is both a source of
>> contention and a waste of effort. One possible solution is to add
>> one more field (set to cpuid + 1, for example) to indicate that that
>> setup is being done with asp set to the target lock address
>> immediately. We will need to use cmpxchg128() for 64-bit machine,
>> though. Another solution is to have only that thread with ticket
>> number that is a fixed distance from head (e.g. 16*2) to do the
>> queue setup while the rest wait until the setup is done before
>> spinning on the queue.
>>
> >> As my colleague Davidlohr reported, there are more regressions than
> >> performance improvements in the AIM7 benchmark. I believe that queue
> >> setup contention is likely a source of performance regression.
>
> Please see below for a v3 patch that:
>
> 1.  Fixes cpu_relax().
>
> 2.  Tests before doing cmpxchg().

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Steven Rostedt
On Tue, 2013-06-11 at 12:49 -0700, Paul E. McKenney wrote:

> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc)
> +{
> + if (unlikely(inc.head & 0x1)) {
> +
> + /* This lock has a queue, so go spin on the queue. */
> + if (tkt_q_do_spin(ap, inc))
> + return true;
> +
> + /* Get here if the queue is in transition: Retry next time. */
> +

This looks better, but please add a comment, something to the likes of:

/*
 * Only the TKT_Q_SWITCH waiter will set up the queue to prevent
 * a thundering herd of setups to occur. It is still possible for
 * more than one task to perform a setup if the lock is released
 * after this check, a waiter coming in may also match this test. But
 * that's covered by the cmpxchg() setup in tkt_q_start_contend.
 */


> + } else if (inc.tail - TKT_Q_SWITCH == inc.head) {

Also shouldn't this be:

} else if ((__ticket_t)(inc.tail - TKT_Q_SWITCH) == inc.head) {

As TKT_Q_SWITCH doesn't have a type, I'm not sure how C will evaluate
this. I always screw type conversions up, and just add in the type casts
to be safe.

You could also give TKT_Q_SWITCH a type too.

-- Steve

> +
> + /*
> +  * This lock has lots of spinners, but no queue.
> +  * Go create a queue to spin on.
> +  */
> + if (tkt_q_start_contend(ap, inc))
> + return true;
> +
> + /* Get here if the queue is in transition: Retry next time. */
> + }
> +
> + /* Either no need for a queue or the queue is in transition.  Spin. */
> + return false;
> +}
> +EXPORT_SYMBOL(tkt_spin_pass);
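
For context, tkt_spin_pass() is only reached from the ticket-lock slow path.
A much-simplified sketch of what that call site could look like follows; the
exact increment, loop structure, and placement of cpu_relax() are assumptions
here, not a copy of the actual arch/x86 diff:

static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
{
	struct __raw_tickets inc = { .tail = 2 };	/* tickets step by 2 */

	inc = xadd(&lock->tickets, inc);
	for (;;) {
		if (inc.head == inc.tail)
			break;			/* uncontended: lock acquired */
		if (tkt_spin_pass(lock, inc))
			break;			/* queued hand-off acquired it */
		cpu_relax();
		inc.head = ACCESS_ONCE(lock->tickets.head);
	}
	barrier();		/* keep the critical section after the acquire */
}

Even when no queue is ever created, every contended acquisition goes through
the out-of-line call, which is where the extra conditional and the
instruction-cache footprint mentioned elsewhere in the thread come from.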


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 02:41:59PM -0400, Waiman Long wrote:
> On 06/11/2013 12:36 PM, Paul E. McKenney wrote:
> >
> >>I am a bit concern about the size of the head queue table itself.
> >>RHEL6, for example, had defined CONFIG_NR_CPUS to be 4096 which mean
> >>a table size of 256. Maybe it is better to dynamically allocate the
> >>table at init time depending on the actual number of CPUs in the
> >>system.
> >But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
> >is way down in the noise.  Systems that care about that small an amount
> >of memory probably have a small enough number of CPUs that they can just
> >turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?
> 
> My concern is more about the latency on the table scan than the
> actual memory that was used.
> 
> >>>+/*
> >>>+ * Return a pointer to the queue header associated with the specified 
> >>>lock,
> >>>+ * or return NULL if there is no queue for the lock or if the lock's queue
> >>>+ * is in transition.
> >>>+ */
> >>>+static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
> >>>+{
> >>>+  int i;
> >>>+  int start;
> >>>+
> >>>+  start = i = tkt_q_hash(asp);
> >>>+  do
> >>>+  if (tkt_q_heads[i].ref == asp)
> >>>+  return &tkt_q_heads[i];
> >>>+  while ((i = tkt_q_next_slot(i)) != start);
> >>>+  return NULL;
> >>>+}
> >>With a table size of 256 and you have to scan the whole table to
> >>find the right head queue. This can be a significant overhead. I
> >>will suggest setting a limiting of how many entries it scans before
> >>it aborts rather than checking the whole table.
> >But it will scan 256 entries only if there are 256 other locks in queued
> >mode, which is -very- unlikely, even given 4096 CPUs.  That said, if you
> >show me that this results in a real latency problem on a real system,
> >I would be happy to provide a way to limit the search.
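
If a bound on the search ever did prove necessary, the scan could simply be
capped; a sketch of that suggestion, with TKT_Q_MAX_SCAN being a made-up
limit rather than anything from the patch:

#define TKT_Q_MAX_SCAN 16	/* arbitrary illustrative cap */

static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
{
	int i = tkt_q_hash(asp);
	int n;

	for (n = 0; n < TKT_Q_MAX_SCAN; n++) {
		if (ACCESS_ONCE(tkt_q_heads[i].ref) == asp)
			return &tkt_q_heads[i];
		i = tkt_q_next_slot(i);
	}
	return NULL;
}

The catch is that tkt_q_start_contend() would have to honor the same cap when
claiming a slot, because callers such as tkt_q_do_wake() spin until
tkt_q_find_head() succeeds and would never terminate if a queue had been
installed beyond the scan horizon.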
> 
> Looking at the code more carefully, the chance of actually scanning
> 256 entries is very small. However, I now have some concern on the
> way you set up the initial queue.
> 
> +/*
> + * Start handling a period of high contention by finding a queue to associate
> + * with this lock.  Returns true if successful (in which case we hold the
> + * lock) and false otherwise (in which case we do -not- hold the lock).
> + */
> +bool tkt_q_start_contend(arch_spinlock_t *asp, struct __raw_tickets inc)
> +{
> + int i;
> + int start;
> +
> + /* Hash the lock address to find a starting point. */
> + start = i = tkt_q_hash(asp);
> +
> + /*
> +  * Each pass through the following loop attempts to associate
> +  * the lock with the corresponding queue.
> +  */
> + do {
> + /*
> +  * Use 0x1 to mark the queue in use, but also avoiding
> +  * any spinners trying to use it before we get it all
> +  * initialized.
> +  */
> + if (cmpxchg(&tkt_q_heads[i].ref,
> + NULL,
> + (arch_spinlock_t *)0x1) == NULL) {
> +
> + /* Succeeded, now go initialize it. */
> + return tkt_q_init_contend(i, asp, inc);
> + }
> +
> + /* If someone beat us to it, go spin on their queue. */
> + if (ACCESS_ONCE(asp->tickets.head) & 0x1)
> + return tkt_q_do_spin(asp, inc);
> + } while ((i = tkt_q_next_slot(i)) != start);
> +
> + /* All the queues are in use, revert to spinning on the ticket lock. */
> + return false;
> +}
> +
> 
> Unconditional cmpxchg() can be a source of high contention by
> itself. Considering that 16 threads may be doing cmpxchg() more or
> less simultaneously on the same cache line, it can cause a lot of
> contention. It will be better if you check to see if tkt_q_heads[i]
> is NULL first before doing cmpxchg.
> 
> Another point is that the 16 threads may be setting up the queues in
> consecutive slots in the head table. This is both a source of
> contention and a waste of effort. One possible solution is to add
> one more field (set to cpuid + 1, for example) to indicate that that
> setup is being done with asp set to the target lock address
> immediately. We will need to use cmpxchg128() for 64-bit machine,
> though. Another solution is to have only that thread with ticket
> number that is a fixed distance from head (e.g. 16*2) to do the
> queue setup while the rest wait until the setup is done before
> spinning on the queue.
> 
> As my colleague Davidlohr reported, there are more regressions than
> performance improvements in the AIM7 benchmark. I believe that queue
> setup contention is likely a source of performance regression.

Please see below for a v3 patch that:

1.  Fixes cpu_relax().

2.  Tests before doing cmpxchg().

3.  Reduces the number of CPUs attempting to set up the queue,
in the common case, to a single CPU.  (Multiple CPUs can
still be 

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Davidlohr Bueso
On Tue, 2013-06-11 at 14:41 -0400, Waiman Long wrote:
> On 06/11/2013 12:36 PM, Paul E. McKenney wrote:
> >
> >> I am a bit concern about the size of the head queue table itself.
> >> RHEL6, for example, had defined CONFIG_NR_CPUS to be 4096 which mean
> >> a table size of 256. Maybe it is better to dynamically allocate the
> >> table at init time depending on the actual number of CPUs in the
> >> system.
> > But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
> > is way down in the noise.  Systems that care about that small an amount
> > of memory probably have a small enough number of CPUs that they can just
> > turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?
> 
> My concern is more about the latency on the table scan than the actual 
> memory that was used.
> 
> >
> >>> +/*
> >>> + * Return a pointer to the queue header associated with the specified 
> >>> lock,
> >>> + * or return NULL if there is no queue for the lock or if the lock's 
> >>> queue
> >>> + * is in transition.
> >>> + */
> >>> +static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
> >>> +{
> >>> + int i;
> >>> + int start;
> >>> +
> >>> + start = i = tkt_q_hash(asp);
> >>> + do
> >>> + if (tkt_q_heads[i].ref == asp)
> >>> + return &tkt_q_heads[i];
> >>> + while ((i = tkt_q_next_slot(i)) != start);
> >>> + return NULL;
> >>> +}
> >> With a table size of 256 and you have to scan the whole table to
> >> find the right head queue. This can be a significant overhead. I
> >> will suggest setting a limiting of how many entries it scans before
> >> it aborts rather than checking the whole table.
> > But it will scan 256 entries only if there are 256 other locks in queued
> > mode, which is -very- unlikely, even given 4096 CPUs.  That said, if you
> > show me that this results in a real latency problem on a real system,
> > I would be happy to provide a way to limit the search.
> 
> Looking at the code more carefully, the chance of actually scanning 256 
> entries is very small. However, I now have some concern on the way you 
> set up the initial queue.
> 
> +/*
> + * Start handling a period of high contention by finding a queue to associate
> + * with this lock.  Returns true if successful (in which case we hold the
> + * lock) and false otherwise (in which case we do -not- hold the lock).
> + */
> +bool tkt_q_start_contend(arch_spinlock_t *asp, struct __raw_tickets inc)
> +{
> + int i;
> + int start;
> +
> + /* Hash the lock address to find a starting point. */
> + start = i = tkt_q_hash(asp);
> +
> + /*
> +  * Each pass through the following loop attempts to associate
> +  * the lock with the corresponding queue.
> +  */
> + do {
> + /*
> +  * Use 0x1 to mark the queue in use, but also avoiding
> +  * any spinners trying to use it before we get it all
> +  * initialized.
> +  */
> + if (cmpxchg(&tkt_q_heads[i].ref,
> + NULL,
> + (arch_spinlock_t *)0x1) == NULL) {
> +
> + /* Succeeded, now go initialize it. */
> + return tkt_q_init_contend(i, asp, inc);
> + }
> +
> + /* If someone beat us to it, go spin on their queue. */
> + if (ACCESS_ONCE(asp->tickets.head)&  0x1)
> + return tkt_q_do_spin(asp, inc);
> + } while ((i = tkt_q_next_slot(i)) != start);
> +
> + /* All the queues are in use, revert to spinning on the ticket lock. */
> + return false;
> +}
> +
> 
> Unconditional cmpxchg() can be a source of high contention by itself. 
> Considering that 16 threads may be doing cmpxchg() more or less 
> simultaneously on the same cache line, it can cause a lot of contention. 
> It will be better if you check to see if tkt_q_heads[i] is NULL first 
> before doing cmpxchg.

Good point, we already noticed good benefits in mutexes and rwsems when
using test and cmpxchg techniques.

Thanks,
davidlohr



Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 02:10:31PM -0400, Steven Rostedt wrote:
> On Tue, 2013-06-11 at 10:53 -0700, Davidlohr Bueso wrote:
> 
> > I hate to be the bearer of bad news but I got some pretty bad aim7
> > performance numbers with this patch on an 8-socket (80 core) 256 Gb
> > memory DL980 box against a vanilla 3.10-rc4 kernel:
> 
> This doesn't surprise me as the spin lock now contains a function call
> on any contention. Not to mention the added i$ pressure on the embedded
> spinlock code having to setup a function call.
> 
> Even if the queues are not used, it adds a slight overhead to all
> spinlocks, due to the code size increase as well as a function call on
> all contention, which will also have an impact on i$ and branch
> prediction.

Was this system hyperthreaded?  If so, it might be suffering from the
misplaced cpu_relax(), which would mean that hardware threads spinning
on the lock would fail to inform the CPU that it was not doing anything
useful.

Thanx, Paul

> > * shared workload: 
> > 10-100 users is in the noise area.
> > 100-2000 users: -13% throughput.
> > 
> > * high_systime workload: 
> > 10-700 users is in the noise area.
> > 700-2000 users: -55% throughput.
> > 
> > * disk:
> > 10-100 users -57% throughput.
> > 100-1000 users: -25% throughput
> > 1000-2000 users: +8% throughput (this patch only benefits when we have a
> 
> Perhaps this actually started using the queues?
> 
> > lot of concurrency).
> > 
> > * custom:
> > 10-100 users: -33% throughput.
> > 100-2000 users: -46% throughput.
> > 
> > * alltests:
> > 10-1000 users is in the noise area.
> > 1000-2000 users: -10% throughput.
> > 
> > One notable exception is the short workload where we actually see
> > positive numbers:
> > 10-100 users: +40% throughput.
> > 100-2000 users: +69% throughput.
> 
> Perhaps short work loads have a cold cache, and the impact on cache is
> not as drastic?
> 
> It would be interesting to see what perf reports on these runs.
> 
> -- Steve
> 
> 



Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Waiman Long

On 06/11/2013 12:36 PM, Paul E. McKenney wrote:



I am a bit concerned about the size of the head queue table itself.
RHEL6, for example, defined CONFIG_NR_CPUS to be 4096, which means
a table size of 256. Maybe it is better to dynamically allocate the
table at init time depending on the actual number of CPUs in the
system.

But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
is way down in the noise.  Systems that care about that small an amount
of memory probably have a small enough number of CPUs that they can just
turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?


My concern is more about the latency of the table scan than about the
actual memory used.





+/*
+ * Return a pointer to the queue header associated with the specified lock,
+ * or return NULL if there is no queue for the lock or if the lock's queue
+ * is in transition.
+ */
+static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
+{
+   int i;
+   int start;
+
+   start = i = tkt_q_hash(asp);
+   do
+   if (tkt_q_heads[i].ref == asp)
+   return &tkt_q_heads[i];
+   while ((i = tkt_q_next_slot(i)) != start);
+   return NULL;
+}

With a table size of 256, you may have to scan the whole table to
find the right head queue. This can be a significant overhead. I
would suggest setting a limit on how many entries it scans before
it aborts rather than checking the whole table.

But it will scan 256 entries only if there are 256 other locks in queued
mode, which is -very- unlikely, even given 4096 CPUs.  That said, if you
show me that this results in a real latency problem on a real system,
I would be happy to provide a way to limit the search.


Looking at the code more carefully, the chance of actually scanning 256 
entries is very small. However, I now have some concerns about the way 
you set up the initial queue.


+/*
+ * Start handling a period of high contention by finding a queue to associate
+ * with this lock.  Returns true if successful (in which case we hold the
+ * lock) and false otherwise (in which case we do -not- hold the lock).
+ */
+bool tkt_q_start_contend(arch_spinlock_t *asp, struct __raw_tickets inc)
+{
+   int i;
+   int start;
+
+   /* Hash the lock address to find a starting point. */
+   start = i = tkt_q_hash(asp);
+
+   /*
+* Each pass through the following loop attempts to associate
+* the lock with the corresponding queue.
+*/
+   do {
+   /*
+* Use 0x1 to mark the queue in use, but also avoiding
+* any spinners trying to use it before we get it all
+* initialized.
+*/
+   if (cmpxchg(&tkt_q_heads[i].ref,
+   NULL,
+   (arch_spinlock_t *)0x1) == NULL) {
+
+   /* Succeeded, now go initialize it. */
+   return tkt_q_init_contend(i, asp, inc);
+   }
+
+   /* If someone beat us to it, go spin on their queue. */
+   if (ACCESS_ONCE(asp->tickets.head) & 0x1)
+   return tkt_q_do_spin(asp, inc);
+   } while ((i = tkt_q_next_slot(i)) != start);
+
+   /* All the queues are in use, revert to spinning on the ticket lock. */
+   return false;
+}
+

Unconditional cmpxchg() can be a source of high contention by itself. 
Considering that 16 threads may be doing cmpxchg() more or less 
simultaneously on the same cache line, it can cause a lot of contention. 
It would be better to check whether tkt_q_heads[i] is NULL before doing 
the cmpxchg.
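
As an illustration, a minimal sketch of the suggested test-before-cmpxchg 
change, reusing the loop and the names from the patch quoted above (only 
the ACCESS_ONCE() pre-check is new):

	do {
		/*
		 * Cheap read first: skip slots that are visibly in use and
		 * attempt the atomic claim only when the slot looks free, so
		 * that busy slots are not hammered with cmpxchg() traffic.
		 */
		if (ACCESS_ONCE(tkt_q_heads[i].ref) == NULL &&
		    cmpxchg(&tkt_q_heads[i].ref,
			    NULL,
			    (arch_spinlock_t *)0x1) == NULL) {

			/* Succeeded, now go initialize it. */
			return tkt_q_init_contend(i, asp, inc);
		}

		/* If someone beat us to it, go spin on their queue. */
		if (ACCESS_ONCE(asp->tickets.head) & 0x1)
			return tkt_q_do_spin(asp, inc);
	} while ((i = tkt_q_next_slot(i)) != start);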


Another point is that the 16 threads may be setting up the queues in 
consecutive slots in the head table. This is both a source of contention 
and a waste of effort. One possible solution is to add one more field 
(set to cpuid + 1, for example) to indicate that the setup is being done, 
with asp set to the target lock address immediately. We would need to use 
cmpxchg128() for 64-bit machines, though. Another solution is to have 
only the thread whose ticket number is a fixed distance from the head 
(e.g. 16*2) do the queue setup, while the rest wait until the setup is 
done before spinning on the queue.
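
A rough sketch of that second idea, purely for illustration: a variant of 
tkt_spin_pass() in which a single CPU, the one whose own ticket sits 
exactly TKT_Q_SWITCH ahead of the head it last observed, does the setup. 
The election rule here is an assumption for the sketch, not something 
from the posted patch:

	/* Called from the __ticket_spin_lock() spin loop. */
	static bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc)
	{
		if (unlikely(inc.head & 0x1))
			return tkt_q_do_spin(ap, inc);	/* Queue exists, use it. */
		if (inc.tail == (__ticket_t)(inc.head + TKT_Q_SWITCH))
			return tkt_q_start_contend(ap, inc); /* Designated CPU. */
		return false;	/* Everyone else keeps ticket-spinning for now. */
	}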


As my colleague Davidlohr reported, there are more regressions than 
performance improvements in the AIM7 benchmark. I believe that queue 
setup contention is a likely source of the performance regression.


Regards,
Longman


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Davidlohr Bueso
On Tue, 2013-06-11 at 14:10 -0400, Steven Rostedt wrote:
> On Tue, 2013-06-11 at 10:53 -0700, Davidlohr Bueso wrote:
> 
> > I hate to be the bearer of bad news but I got some pretty bad aim7
> > performance numbers with this patch on an 8-socket (80 core) 256 Gb
> > memory DL980 box against a vanilla 3.10-rc4 kernel:
> 
> This doesn't surprise me as the spin lock now contains a function call
> on any contention. Not to mention the added i$ pressure on the embedded
> spinlock code having to setup a function call.
> 
> Even if the queues are not used, it adds a slight overhead to all
> spinlocks, due to the code size increase as well as a function call on
> all contention, which will also have an impact on i$ and branch
> prediction.

Agreed.

> > 
> > * shared workload: 
> > 10-100 users is in the noise area.
> > 100-2000 users: -13% throughput.
> > 
> > * high_systime workload: 
> > 10-700 users is in the noise area.
> > 700-2000 users: -55% throughput.
> > 
> > * disk:
> > 10-100 users -57% throughput.
> > 100-1000 users: -25% throughput
> > 1000-2000 users: +8% throughput (this patch only benefits when we have a
> 
> Perhaps this actually started using the queues?
> 
> > lot of concurrency).
> > 
> > * custom:
> > 10-100 users: -33% throughput.
> > 100-2000 users: -46% throughput.
> > 
> > * alltests:
> > 10-1000 users is in the noise area.
> > 1000-2000 users: -10% throughput.
> > 
> > One notable exception is the short workload where we actually see
> > positive numbers:
> > 10-100 users: +40% throughput.
> > 100-2000 users: +69% throughput.
> 
> Perhaps short work loads have a cold cache, and the impact on cache is
> not as drastic?
> 
> It would be interesting to see what perf reports on these runs.

I didn't actually collect perf traces in this run but I will rerun it
with that in mind. I'm also running some OLTP and data mining Oracle
based benchmarks where I'm already collecting perf reports.

Will post when I have everything.

Thanks,
Davidlohr




Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Steven Rostedt
On Tue, 2013-06-11 at 10:53 -0700, Davidlohr Bueso wrote:

> I hate to be the bearer of bad news but I got some pretty bad aim7
> performance numbers with this patch on an 8-socket (80 core) 256 Gb
> memory DL980 box against a vanilla 3.10-rc4 kernel:

This doesn't surprise me, as the spin lock now contains a function call
on any contention. Not to mention the added i$ pressure from the embedded
spinlock code having to set up a function call.

Even if the queues are not used, it adds a slight overhead to all
spinlocks, due to the code size increase as well as a function call on
all contention, which will also have an impact on i$ and branch
prediction.


> 
> * shared workload: 
> 10-100 users is in the noise area.
> 100-2000 users: -13% throughput.
> 
> * high_systime workload: 
> 10-700 users is in the noise area.
> 700-2000 users: -55% throughput.
> 
> * disk:
> 10-100 users -57% throughput.
> 100-1000 users: -25% throughput
> 1000-2000 users: +8% throughput (this patch only benefits when we have a

Perhaps this actually started using the queues?

> lot of concurrency).
> 
> * custom:
> 10-100 users: -33% throughput.
> 100-2000 users: -46% throughput.
> 
> * alltests:
> 10-1000 users is in the noise area.
> 1000-2000 users: -10% throughput.
> 
> One notable exception is the short workload where we actually see
> positive numbers:
> 10-100 users: +40% throughput.
> 100-2000 users: +69% throughput.

Perhaps short workloads have a cold cache, and the impact on the cache is
not as drastic?

It would be interesting to see what perf reports on these runs.

-- Steve




Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 10:53:06AM -0700, Davidlohr Bueso wrote:
> On Mon, 2013-06-10 at 17:51 -0700, Linus Torvalds wrote:
> > On Mon, Jun 10, 2013 at 5:44 PM, Steven Rostedt  wrote:
> > >
> > > OK, I haven't found a issue here yet, but youss are beiing trickssy! We
> > > don't like trickssy, and we must find preiouss!!!
> > 
> > .. and I personally have my usual reservations. I absolutely hate
> > papering over scalability issues, and historically whenever people
> > have ever thought that we want complex spinlocks, the problem has
> > always been that the locking sucks.
> > 
> > So reinforced by previous events, I really feel that code that needs
> > this kind of spinlock is broken and needs to be fixed, rather than
> > actually introduce tricky spinlocks.
> > 
> > So in order to merge something like this, I want (a) numbers for real
> > loads and (b) explanations for why the spinlock users cannot be fixed.
> 
> I hate to be the bearer of bad news but I got some pretty bad aim7
> performance numbers with this patch on an 8-socket (80 core) 256 Gb
> memory DL980 box against a vanilla 3.10-rc4 kernel:

Looks pretty ugly, sorry that it doesn't help in many of your situations.

Any info on what bottlenecks you are encountering?

Thanx, Paul

> * shared workload: 
> 10-100 users is in the noise area.
> 100-2000 users: -13% throughput.
> 
> * high_systime workload: 
> 10-700 users is in the noise area.
> 700-2000 users: -55% throughput.
> 
> * disk:
> 10-100 users -57% throughput.
> 100-1000 users: -25% throughput
> 1000-2000 users: +8% throughput (this patch only benefits when we have a
> lot of concurrency).
> 
> * custom:
> 10-100 users: -33% throughput.
> 100-2000 users: -46% throughput.
> 
> * alltests:
> 10-1000 users is in the noise area.
> 1000-2000 users: -10% throughput.
> 
> One notable exception is the short workload where we actually see
> positive numbers:
> 10-100 users: +40% throughput.
> 100-2000 users: +69% throughput.
> 
> Thanks,
> Davidlohr
> 



Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Davidlohr Bueso
On Mon, 2013-06-10 at 17:51 -0700, Linus Torvalds wrote:
> On Mon, Jun 10, 2013 at 5:44 PM, Steven Rostedt  wrote:
> >
> > OK, I haven't found a issue here yet, but youss are beiing trickssy! We
> > don't like trickssy, and we must find preiouss!!!
> 
> .. and I personally have my usual reservations. I absolutely hate
> papering over scalability issues, and historically whenever people
> have ever thought that we want complex spinlocks, the problem has
> always been that the locking sucks.
> 
> So reinforced by previous events, I really feel that code that needs
> this kind of spinlock is broken and needs to be fixed, rather than
> actually introduce tricky spinlocks.
> 
> So in order to merge something like this, I want (a) numbers for real
> loads and (b) explanations for why the spinlock users cannot be fixed.

I hate to be the bearer of bad news but I got some pretty bad aim7
performance numbers with this patch on an 8-socket (80 core) 256 Gb
memory DL980 box against a vanilla 3.10-rc4 kernel:

* shared workload: 
10-100 users is in the noise area.
100-2000 users: -13% throughput.

* high_systime workload: 
10-700 users is in the noise area.
700-2000 users: -55% throughput.

* disk:
10-100 users -57% throughput.
100-1000 users: -25% throughput
1000-2000 users: +8% throughput (this patch only benefits when we have a
lot of concurrency).

* custom:
10-100 users: -33% throughput.
100-2000 users: -46% throughput.

* alltests:
10-1000 users is in the noise area.
1000-2000 users: -10% throughput.

One notable exception is the short workload where we actually see
positive numbers:
10-100 users: +40% throughput.
100-2000 users: +69% throughput.

Thanks,
Davidlohr



Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 01:13:53PM -0400, Steven Rostedt wrote:
> On Tue, 2013-06-11 at 09:43 -0700, Paul E. McKenney wrote:
> 
> > > > I am a bit concern about the size of the head queue table itself. 
> > > > RHEL6, 
> > > > for example, had defined CONFIG_NR_CPUS to be 4096 which mean a table 
> > > > size of 256. Maybe it is better to dynamically allocate the table at 
> > > > init time depending on the actual number of CPUs in the system.
> > > 
> > > Yeah, it can be allocated dynamically at boot.
> > 
> > But let's first demonstrate the need.  Keep in mind that an early-boot
> > deadlock would exercise this code.
> 
> I think an early-boot deadlock has more problems than this :-)
> 
> Now if we allocate this before other CPUs are enabled, there's no need
> to worry about accessing it before they are used. They can only be used
> on contention, and there would be no contention when we are only running
> on one CPU.
> 
> >   Yes, it is just a check for NULL,
> > but on the other hand I didn't get the impression that you thought that
> > this code was too simple.  ;-)
> 
> I wouldn't change the code that uses it. It should never be hit, and if
> it is triggered by an early boot deadlock, then I think this would
> actually be a plus. An early boot deadlock would cause the system to
> hang with no feedback whats so ever, causing the developer hours of
> crying for mommy and pulling out their hair because the system just
> stops doing anything except to show the developer a blinking cursor that
> blinks "haha, haha, haha".
> 
> But if an early boot deadlock were to cause this code to be triggered
> and do a NULL pointer dereference, then the system crashes. It would
> most likely produce a backtrace that will give a lot more information to
> the developer to see what is happening here. Sure, it may confuse them
> at first, but then they can say: "why is this code triggering before we
> have other CPUS? Oh, I have a deadlock here" and go fix the code in a
> matter of minutes instead of hours.
> 
> Note, I don't even see this triggering with an early boot deadlock. The
> only way that can happen is if the task tries to take a spinlock it
> already owns, or an interrupt goes off and grabs a spinlock that the
> task currently has but didn't disable interrupts. The ticket counter
> would be just 2, which is far below the threshold that triggers the
> queuing.

Fair point.  I suppose that the hapless kernel hacker could set
the threshold really low for that case.

But we are talking about only 8192 bytes of memory here.  Is that really
enough to worry about on current systems?

Thanx, Paul



Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Waiman Long

On 06/11/2013 12:20 PM, Steven Rostedt wrote:

diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index ad0ad07..cdaefdd 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -7,12 +7,18 @@

 #include <linux/types.h>

-#if (CONFIG_NR_CPUS < 256)
+#if (CONFIG_NR_CPUS < 128)
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
-#else
+#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
+#elif (CONFIG_NR_CPUS < 32768)
 typedef u16 __ticket_t;
 typedef u32 __ticketpair_t;
+#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2 >= (unsigned short)((a) - (b)))
+#else
+typedef u32 __ticket_t;
+typedef u64 __ticketpair_t;
+#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2 >= (unsigned int)((a) - (b)))
 #endif


It is theoretically possible that a large number of CPUs (say, 64 or
more with CONFIG_NR_CPUS < 128) can acquire a ticket from the lock
before the check for TICKET_T_CMP_GE() in tkt_spin_pass(). So the check
can fail even when there is a large number of CPUs contending for the
lock. The chance of this happening is, of course, extremely rare. This
is not an error, as the lock still works as it should without your
change.

Can you explain this more. How can you acquire the ticket and update at
the same time? If a queue has been set, then you can't acquire the
ticket as the head has a 1 appended to it.


I am sorry if I confused you. What I meant is queuing up at the tail of 
the ticket lock by incrementing the tail number, not actually getting the lock.





+/*
+ * Return a pointer to the queue header associated with the specified lock,
+ * or return NULL if there is no queue for the lock or if the lock's queue
+ * is in transition.
+ */
+static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
+{
+   int i;
+   int start;
+
+   start = i = tkt_q_hash(asp);
+   do
+   if (tkt_q_heads[i].ref == asp)
+   return &tkt_q_heads[i];
+   while ((i = tkt_q_next_slot(i)) != start);
+   return NULL;
+}

With a table size of 256, you may have to scan the whole table to find
the right head queue. This can be a significant overhead. I would suggest
setting a limit on how many entries it scans before it aborts rather
than checking the whole table.

We could add a limit, but in practice I'm not sure that would be an
issue. I thought the same thing when I first saw this, but hitting most
of the list would require a large collision in the hash algorithm,
which could probably be fixed with a better hash.


The current code will scan the whole table until either it gets a match 
or the whole table scan is completed. I first thought that hitting a NULL 
entry could stop the search, but that is not true. It is entirely possible 
that an entry was in use when a queue was created but became empty 
immediately after that. So we have to scan the whole table to be sure, 
unless we impose a limit on how many entries we scan.
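
For illustration, a bounded lookup might look like the sketch below, where 
TKT_Q_SCAN_MAX is an assumed new constant. For this to be safe, 
tkt_q_start_contend() would also have to confine its slot search to the 
same window, because tkt_q_do_wake() busy-waits on tkt_q_find_head() and 
must never miss a queue that actually exists:

	#define TKT_Q_SCAN_MAX 8	/* Assumed; must not exceed TKT_Q_NQUEUES. */

	static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
	{
		int i = tkt_q_hash(asp);
		int n;

		for (n = 0; n < TKT_Q_SCAN_MAX; n++) {
			if (tkt_q_heads[i].ref == asp)
				return &tkt_q_heads[i];
			i = tkt_q_next_slot(i);
		}
		return NULL;	/* Not found within the scan limit. */
	}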


Regards,
Longman


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 10:17:52AM -0700, Linus Torvalds wrote:
> On Tue, Jun 11, 2013 at 9:48 AM, Paul E. McKenney
>  wrote:
> >
> > Another approach is to permanently associate queues with each lock,
> > but that increases the size of the lock -- something that has raised
> > concerns in the past.  But if adding 32 bytes to each ticketlock was OK,
> > this simplifies things quite a bit.
> 
> Yeah, no. The spinlocks need to be small.  We have them in
> size-conscious data structures like "struct dentry" and "struct page",
> and they really must not be bigger than an "int" in the non-debug
> case.
> 
> In fact, I've occasionally thought about combining a spinlock with a
> refcounter if that could make things fit in 32 bits on smaller
> machines, because we also have ops like "atomic_dec_and_lock()" that
> could possibly be optimized if they fit in one word. That is probably
> not worth it, but spinlocks do need to remain small.

I was afraid of that.  On the other hand, I guess that this means that
I sent out the correct patch of the two that I prepared.  ;-)

Thanx, Paul



Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 01:01:55PM -0400, Steven Rostedt wrote:
> On Tue, 2013-06-11 at 09:36 -0700, Paul E. McKenney wrote:
> 
> > > I am a bit concern about the size of the head queue table itself.
> > > RHEL6, for example, had defined CONFIG_NR_CPUS to be 4096 which mean
> > > a table size of 256. Maybe it is better to dynamically allocate the
> > > table at init time depending on the actual number of CPUs in the
> > > system.
> > 
> > But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
> > is way down in the noise.  Systems that care about that small an amount
> > of memory probably have a small enough number of CPUs that they can just
> > turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?
> 
> If this turns out to work for large machines, that means that distros
> will enable it, and distros tend to up the NR_CPUS, which is defined at
> compile time and is set regardless of if you are running with 2 CPUs or
> a 1000 CPUs.
> 
> For now it's fine to use NR_CPUS, but I always try to avoid it. Working
> in the ARM and POWER environment you are use to lots of kernels compiled
> specifically for the target. But in the x86 world, it is basically one
> kernel for all environments, where NR_CPUS does make a big difference.

Fair point.  Something to worry about should this ever be in danger of
actually going upstream.  ;-)

Thanx, Paul



Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Linus Torvalds
On Tue, Jun 11, 2013 at 9:48 AM, Paul E. McKenney
 wrote:
>
> Another approach is to permanently associate queues with each lock,
> but that increases the size of the lock -- something that has raised
> concerns in the past.  But if adding 32 bytes to each ticketlock was OK,
> this simplifies things quite a bit.

Yeah, no. The spinlocks need to be small.  We have them in
size-conscious data structures like "struct dentry" and "struct page",
and they really must not be bigger than an "int" in the non-debug
case.

In fact, I've occasionally thought about combining a spinlock with a
refcounter if that could make things fit in 32 bits on smaller
machines, because we also have ops like "atomic_dec_and_lock()" that
could possibly be optimized if they fit in one word. That is probably
not worth it, but spinlocks do need to remain small.
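
As an aside, a minimal sketch of that combination, with the low bit of a 
32-bit word acting as the lock and the remaining bits as the count; the 
lockcount_t type and helper below are illustrative only, not an existing 
kernel API:

	typedef struct {
		u32 val;		/* Bit 0: lock bit, bits 1..31: refcount. */
	} lockcount_t;

	#define LC_LOCKED	0x1U
	#define LC_COUNT_ONE	0x2U

	/* Take a reference without touching the lock, unless the lock is held. */
	static inline bool lockcount_inc_unless_locked(lockcount_t *lc)
	{
		u32 old, new;

		do {
			old = ACCESS_ONCE(lc->val);
			if (old & LC_LOCKED)
				return false;	/* Caller falls back to locking. */
			new = old + LC_COUNT_ONE;
		} while (cmpxchg(&lc->val, old, new) != old);
		return true;
	}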

Linus


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Steven Rostedt
On Tue, 2013-06-11 at 09:43 -0700, Paul E. McKenney wrote:

> > > I am a bit concern about the size of the head queue table itself. RHEL6, 
> > > for example, had defined CONFIG_NR_CPUS to be 4096 which mean a table 
> > > size of 256. Maybe it is better to dynamically allocate the table at 
> > > init time depending on the actual number of CPUs in the system.
> > 
> > Yeah, it can be allocated dynamically at boot.
> 
> But let's first demonstrate the need.  Keep in mind that an early-boot
> deadlock would exercise this code.

I think an early-boot deadlock has more problems than this :-)

Now if we allocate this before other CPUs are enabled, there's no need
to worry about accessing it before they are used. They can only be used
on contention, and there would be no contention when we are only running
on one CPU.
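
A minimal sketch of that approach; the initcall level and the use of 
num_possible_cpus() are assumptions here, not something from the posted 
patch:

	static struct tkt_q_head *tkt_q_heads;
	static int tkt_q_nqueues;

	static int __init tkt_q_init(void)
	{
		/* Size from the boot-time CPU count rather than compile-time NR_CPUS. */
		tkt_q_nqueues = DIV_ROUND_UP(num_possible_cpus() + TKT_Q_SWITCH - 1,
					     TKT_Q_SWITCH) * 2;
		tkt_q_heads = kcalloc(tkt_q_nqueues, sizeof(*tkt_q_heads),
				      GFP_KERNEL);
		BUG_ON(!tkt_q_heads);
		return 0;
	}
	early_initcall(tkt_q_init);	/* Runs before the other CPUs are brought up. */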


>   Yes, it is just a check for NULL,
> but on the other hand I didn't get the impression that you thought that
> this code was too simple.  ;-)

I wouldn't change the code that uses it. It should never be hit, and if
it is triggered by an early boot deadlock, then I think this would
actually be a plus. An early boot deadlock would cause the system to
hang with no feedback whatsoever, causing the developer hours of
crying for mommy and pulling out their hair because the system just
stops doing anything except to show the developer a blinking cursor that
blinks "haha, haha, haha".

But if an early boot deadlock were to cause this code to be triggered
and do a NULL pointer dereference, then the system crashes. It would
most likely produce a backtrace that will give a lot more information to
the developer to see what is happening here. Sure, it may confuse them
at first, but then they can say: "why is this code triggering before we
have other CPUS? Oh, I have a deadlock here" and go fix the code in a
matter of minutes instead of hours.

Note, I don't even see this triggering with an early boot deadlock. The
only way that can happen is if the task tries to take a spinlock it
already owns, or an interrupt goes off and grabs a spinlock that the
task currently has but didn't disable interrupts. The ticket counter
would be just 2, which is far below the threshold that triggers the
queuing.

-- Steve




Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Steven Rostedt
On Tue, 2013-06-11 at 09:36 -0700, Paul E. McKenney wrote:

> > I am a bit concern about the size of the head queue table itself.
> > RHEL6, for example, had defined CONFIG_NR_CPUS to be 4096 which mean
> > a table size of 256. Maybe it is better to dynamically allocate the
> > table at init time depending on the actual number of CPUs in the
> > system.
> 
> But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
> is way down in the noise.  Systems that care about that small an amount
> of memory probably have a small enough number of CPUs that they can just
> turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?

If this turns out to work for large machines, that means that distros
will enable it, and distros tend to up the NR_CPUS, which is defined at
compile time and is set regardless of whether you are running with 2 CPUs
or 1000 CPUs.

For now it's fine to use NR_CPUS, but I always try to avoid it. Working
in the ARM and POWER environment you are used to lots of kernels compiled
specifically for the target. But in the x86 world, it is basically one
kernel for all environments, where NR_CPUS does make a big difference.

-- Steve




Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 11:10:30PM +0800, Lai Jiangshan wrote:
> On Tue, Jun 11, 2013 at 10:48 PM, Lai Jiangshan  wrote:
> > On Mon, Jun 10, 2013 at 3:36 AM, Paul E. McKenney
> >  wrote:
> >> Breaking up locks is better than implementing high-contention locks, but
> >> if we must have high-contention locks, why not make them automatically
> >> switch between light-weight ticket locks at low contention and queued
> >> locks at high contention?
> >>
> >> This commit therefore allows ticket locks to automatically switch between
> >> pure ticketlock and queued-lock operation as needed.  If too many CPUs
> >> are spinning on a given ticket lock, a queue structure will be allocated
> >> and the lock will switch to queued-lock operation.  When the lock becomes
> >> free, it will switch back into ticketlock operation.  The low-order bit
> >> of the head counter is used to indicate that the lock is in queued mode,
> >> which forces an unconditional mismatch between the head and tail counters.
> >> This approach means that the common-case code path under conditions of
> >> low contention is very nearly that of a plain ticket lock.
> >>
> >> A fixed number of queueing structures is statically allocated in an
> >> array.  The ticket-lock address is used to hash into an initial element,
> >> but if that element is already in use, it moves to the next element.  If
> >> the entire array is already in use, continue to spin in ticket mode.
> >>
> >> This has been only lightly tested in the kernel, though a userspace
> >> implementation has survived substantial testing.
> >>
> >> Signed-off-by: Paul E. McKenney 
> >>
> >> diff --git a/arch/x86/include/asm/spinlock.h 
> >> b/arch/x86/include/asm/spinlock.h
> >> index 33692ea..b4a91b0 100644
> >> --- a/arch/x86/include/asm/spinlock.h
> >> +++ b/arch/x86/include/asm/spinlock.h
> >> @@ -34,6 +34,8 @@
> >>  # define UNLOCK_LOCK_PREFIX
> >>  #endif
> >>
> >> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> >> +
> >>  /*
> >>   * Ticket locks are conceptually two parts, one indicating the current 
> >> head of
> >>   * the queue, and the other indicating the current tail. The lock is 
> >> acquired
> >> @@ -62,6 +64,25 @@ static __always_inline void 
> >> __ticket_spin_lock(arch_spinlock_t *lock)
> >> barrier();  /* make sure nothing creeps before the 
> >> lock is taken */
> >>  }
> >>
> >> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> >> +
> >> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> >> +
> >> +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> >> +{
> >> +   register struct __raw_tickets inc = { .tail = 2 };
> >> +
> >> +   inc = xadd(&lock->tickets, inc);
> >> +   for (;;) {
> >> +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> >> +   break;
> >> +   inc.head = ACCESS_ONCE(lock->tickets.head);
> >> +   }
> >> +   barrier(); /* smp_mb() on Power or ARM. */
> >> +}
> >> +
> >> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> >> +
> >>  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
> >>  {
> >> arch_spinlock_t old, new;
> >> @@ -70,17 +91,37 @@ static __always_inline int 
> >> __ticket_spin_trylock(arch_spinlock_t *lock)
> >> if (old.tickets.head != old.tickets.tail)
> >> return 0;
> >>
> >> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> >> new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
> >> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> >> +   new.head_tail = old.head_tail + (2 << TICKET_SHIFT);
> >> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> >>
> >> /* cmpxchg is a full barrier, so nothing can move before it */
> >> return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
> >> old.head_tail;
> >>  }
> >>
> >> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> >> +
> >>  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> >>  {
> >> __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
> >>  }
> >>
> >> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> >> +
> >> +extern void tkt_q_do_wake(arch_spinlock_t *asp);
> >> +
> >> +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> >> +{
> >> +   __ticket_t head = 2;
> >> +
> >> +   head = xadd(&lock->tickets.head, 2);
> >> +   if (head & 0x1)
> >> +   tkt_q_do_wake(lock);
> >> +}
> >> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> >> +
> >>  static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
> >>  {
> >> struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
> >> diff --git a/arch/x86/include/asm/spinlock_types.h 
> >> b/arch/x86/include/asm/spinlock_types.h
> >> index ad0ad07..cdaefdd 100644
> >> --- a/arch/x86/include/asm/spinlock_types.h
> >> +++ b/arch/x86/include/asm/spinlock_types.h
> >> @@ -7,12 +7,18 @@
> >>
> >>  #include 
> >>
> >> -#if (CONFIG_NR_CPUS < 256)
> >> +#if (CONFIG_NR_CPUS < 128)

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 11:22:45AM -0400, Steven Rostedt wrote:
> On Tue, 2013-06-11 at 03:14 -0700, Paul E. McKenney wrote:
> >  
> > > Off-topic: although I have been in this community for several years,
> > > I am not exactly clear on this problem.
> > > 
> > > 1) In the general case, which lock is the most contended in the kernel,
> > > and what does it protect?
> > > 2) In particular special cases, which lock is the most contended in the
> > > kernel, and what does it protect?
> > > 3) In the general case, which list is the hottest?
> > > 4) In particular special cases, which list is the hottest?
> > 
> > Others would know better than I, but mmap_sem has been called out as a
> 
> If the contention is with mmap_sem, then I doubt this is going to help
> much, as that's a sleeping rw semaphore. Now, rw semaphores are
> implemented with raw spinlocks, but I doubt that would be the main point
> of contention, compared to the sleeping part.

If I remember correctly, someone actually hit this earlier this year,
which prompted use of a special-purpose queued lock to guard the
semaphore data.  I don't recall whether it was mmap_sem or not, so
cannot say whether it was a straight mutex or an rw semaphore.

Thanx, Paul

> -- Steve
> 
> > prime offender for some workloads.  There is of course some debate as
> > to whether the fault lies mmap_sem or with the workloads.  There have
> > been some efforts to solve this one on LKML, plus some in academia have
> > worked on this as well:
> > 
> 
> 



Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 11:57:14AM -0400, Waiman Long wrote:
> On 06/09/2013 03:36 PM, Paul E. McKenney wrote:
> >Breaking up locks is better than implementing high-contention locks, but
> >if we must have high-contention locks, why not make them automatically
> >switch between light-weight ticket locks at low contention and queued
> >locks at high contention?
> >
> >This commit therefore allows ticket locks to automatically switch between
> >pure ticketlock and queued-lock operation as needed.  If too many CPUs
> >are spinning on a given ticket lock, a queue structure will be allocated
> >and the lock will switch to queued-lock operation.  When the lock becomes
> >free, it will switch back into ticketlock operation.  The low-order bit
> >of the head counter is used to indicate that the lock is in queued mode,
> >which forces an unconditional mismatch between the head and tail counters.
> >This approach means that the common-case code path under conditions of
> >low contention is very nearly that of a plain ticket lock.
> >
> >A fixed number of queueing structures is statically allocated in an
> >array.  The ticket-lock address is used to hash into an initial element,
> >but if that element is already in use, it moves to the next element.  If
> >the entire array is already in use, continue to spin in ticket mode.
> >
> >This has been only lightly tested in the kernel, though a userspace
> >implementation has survived substantial testing.
> >
> >Signed-off-by: Paul E. McKenney
> 
> This is an interesting patch and I think it is useful for workloads
> that run on systems with a large number of CPUs.
> 
> >diff --git a/arch/x86/include/asm/spinlock_types.h 
> >b/arch/x86/include/asm/spinlock_types.h
> >index ad0ad07..cdaefdd 100644
> >--- a/arch/x86/include/asm/spinlock_types.h
> >+++ b/arch/x86/include/asm/spinlock_types.h
> >@@ -7,12 +7,18 @@
> >
> >  #include
> >
> >-#if (CONFIG_NR_CPUS<  256)
> >+#if (CONFIG_NR_CPUS<  128)
> >  typedef u8  __ticket_t;
> >  typedef u16 __ticketpair_t;
> >-#else
> >+#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2>= (unsigned char)((a) - (b)))
> >+#elif (CONFIG_NR_CPUS<  32768)
> >  typedef u16 __ticket_t;
> >  typedef u32 __ticketpair_t;
> >+#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2>= (unsigned short)((a) - (b)))
> >+#else
> >+typedef u32 __ticket_t;
> >+typedef u64 __ticketpair_t;
> >+#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2>= (unsigned int)((a) - (b)))
> >  #endif
> 
> It is theoretically possible that a large number of CPUs (says 64 or
> more with CONFIG_NR_CPUS < 128) can acquire a ticket from the lock
> before the check for TICKET_T_CMP_GE() in tkt_spin_pass(). So the
> check will fail even when there is a large number of CPUs contending
> for the lock. The chance of this happening is, of course, extremely
> rare. This is not an error as the lock is still working as it should
> be without your change.

Good point, I need to change the limits from 128 and 32768 to 64 and 16384
in order to guarantee that the comparison will work correctly.  Done.

> >+/*
> >+ * TKT_Q_SWITCH is twice the number of CPUs that must be spinning on a
> >+ * given ticket lock to motivate switching to spinning on a queue.
> >+ * The reason that it is twice the number is because the bottom bit of
> >+ * the ticket is reserved for the bit that indicates that a queue is
> >+ * associated with the lock.
> >+ */
> >+#define TKT_Q_SWITCH  (16 * 2)
> >+
> >+/*
> >+ * TKT_Q_NQUEUES is the number of queues to maintain.  Large systems
> >+ * might have multiple highly contended locks, so provide more queues for
> >+ * systems with larger numbers of CPUs.
> >+ */
> >+#define TKT_Q_NQUEUES (DIV_ROUND_UP(NR_CPUS + TKT_Q_SWITCH - 1, 
> >TKT_Q_SWITCH) * 2)
> >+
> >+/* The queues themselves. */
> >+struct tkt_q_head tkt_q_heads[TKT_Q_NQUEUES];
> 
> I am a bit concern about the size of the head queue table itself.
> RHEL6, for example, had defined CONFIG_NR_CPUS to be 4096 which mean
> a table size of 256. Maybe it is better to dynamically allocate the
> table at init time depending on the actual number of CPUs in the
> system.

But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
is way down in the noise.  Systems that care about that small an amount
of memory probably have a small enough number of CPUs that they can just
turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?

> >+/*
> >+ * Return a pointer to the queue header associated with the specified lock,
> >+ * or return NULL if there is no queue for the lock or if the lock's queue
> >+ * is in transition.
> >+ */
> >+static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
> >+{
> >+int i;
> >+int start;
> >+
> >+start = i = tkt_q_hash(asp);
> >+do
> >+if (tkt_q_heads[i].ref == asp)
> >+return&tkt_q_heads[i];
> >+while ((i = tkt_q_next_slot(i)) != start);
> >+return NULL;
> >+}
> 
> With a table size of 256 and you have to scan the whole table to

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 12:20:32PM -0400, Steven Rostedt wrote:
> On Tue, 2013-06-11 at 11:57 -0400, Waiman Long wrote:
> 
> > This is an interesting patch and I think it is useful for workloads that 
> > run on systems with a large number of CPUs.
> 
> I would say it is definitely a fun academic patch, now if it is
> something for a production environment remains to be seen.

At the moment, it should be considered an intellectual exercise.  Might be
useful at some point, but I would personally rather that the offending
lock be broken up to reduce its contention.

> > > diff --git a/arch/x86/include/asm/spinlock_types.h 
> > > b/arch/x86/include/asm/spinlock_types.h
> > > index ad0ad07..cdaefdd 100644
> > > --- a/arch/x86/include/asm/spinlock_types.h
> > > +++ b/arch/x86/include/asm/spinlock_types.h
> > > @@ -7,12 +7,18 @@
> > >
> > >   #include
> > >
> > > -#if (CONFIG_NR_CPUS<  256)
> > > +#if (CONFIG_NR_CPUS<  128)
> > >   typedef u8  __ticket_t;
> > >   typedef u16 __ticketpair_t;
> > > -#else
> > > +#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2>= (unsigned char)((a) - 
> > > (b)))
> > > +#elif (CONFIG_NR_CPUS<  32768)
> > >   typedef u16 __ticket_t;
> > >   typedef u32 __ticketpair_t;
> > > +#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2>= (unsigned short)((a) - 
> > > (b)))
> > > +#else
> > > +typedef u32 __ticket_t;
> > > +typedef u64 __ticketpair_t;
> > > +#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2>= (unsigned int)((a) - (b)))
> > >   #endif
> > 
> > It is theoretically possible that a large number of CPUs (says 64 or 
> > more with CONFIG_NR_CPUS < 128) can acquire a ticket from the lock 
> > before the check for TICKET_T_CMP_GE() in tkt_spin_pass(). So the check 
> > will fail even when there is a large number of CPUs contending for the 
> > lock. The chance of this happening is, of course, extremely rare. This 
> > is not an error as the lock is still working as it should be without 
> > your change.
> 
> Can you explain this more. How can you acquire the ticket and update at
> the same time? If a queue has been set, then you can't acquire the
> ticket as the head has a 1 appended to it.

Suppose that CONFIG_NR_CPUS=127, and suppose that 65 CPUs atomically
increment ->tail before ...

Ah, good point.  If TKT_Q_SWITCH is less than 64, then at least one CPU
will see the need to switch to queued mode, and will do so regardless of
what the other CPUs think.  The key point is that each CPU will get its
ticket from the xadd(), and these will be issued in order.  I therefore
backed out my change of the limits.
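
For reference, a short worked example of the wraparound-safe comparison in 
the u8 case, using the TICKET_T_CMP_GE() definition quoted earlier in this 
thread:

	/*
	 * TICKET_T_CMP_GE(a, b) is (UCHAR_MAX / 2 >= (unsigned char)((a) - (b))),
	 * i.e. "a is at b, or ahead of it by at most half the counter space".
	 *
	 *   a = 0x04, b = 0xfc: (u8)(0x04 - 0xfc) = 0x08,   8 <= 127 -> true
	 *       (the counter wrapped, but a is only 8 ahead of b, which is
	 *        4 tickets given the increment of 2 used in queued mode)
	 *   a = 0xfc, b = 0x04: (u8)(0xfc - 0x04) = 0xf8, 248 >  127 -> false
	 */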

> > > +/*
> > > + * TKT_Q_SWITCH is twice the number of CPUs that must be spinning on a
> > > + * given ticket lock to motivate switching to spinning on a queue.
> > > + * The reason that it is twice the number is because the bottom bit of
> > > + * the ticket is reserved for the bit that indicates that a queue is
> > > + * associated with the lock.
> > > + */
> > > +#define TKT_Q_SWITCH  (16 * 2)
> > > +
> > > +/*
> > > + * TKT_Q_NQUEUES is the number of queues to maintain.  Large systems
> > > + * might have multiple highly contended locks, so provide more queues for
> > > + * systems with larger numbers of CPUs.
> > > + */
> > > +#define TKT_Q_NQUEUES (DIV_ROUND_UP(NR_CPUS + TKT_Q_SWITCH - 1, 
> > > TKT_Q_SWITCH) * 2)
> > > +
> > > +/* The queues themselves. */
> > > +struct tkt_q_head tkt_q_heads[TKT_Q_NQUEUES];
> > 
> > I am a bit concern about the size of the head queue table itself. RHEL6, 
> > for example, had defined CONFIG_NR_CPUS to be 4096 which mean a table 
> > size of 256. Maybe it is better to dynamically allocate the table at 
> > init time depending on the actual number of CPUs in the system.
> 
> Yeah, it can be allocated dynamically at boot.

But let's first demonstrate the need.  Keep in mind that an early-boot
deadlock would exercise this code.  Yes, it is just a check for NULL,
but on the other hand I didn't get the impression that you thought that
this code was too simple.  ;-)

> > > +/*
> > > + * Return a pointer to the queue header associated with the specified 
> > > lock,
> > > + * or return NULL if there is no queue for the lock or if the lock's 
> > > queue
> > > + * is in transition.
> > > + */
> > > +static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
> > > +{
> > > + int i;
> > > + int start;
> > > +
> > > + start = i = tkt_q_hash(asp);
> > > + do
> > > + if (tkt_q_heads[i].ref == asp)
> > > + return&tkt_q_heads[i];
> > > + while ((i = tkt_q_next_slot(i)) != start);
> > > + return NULL;
> > > +}
> > 
> > With a table size of 256 and you have to scan the whole table to find 
> > the right head queue. This can be a significant overhead. I will suggest 
> > setting a limiting of how many entries it scans before it aborts rather 
> > than checking the whole table.
> 
> We could add a limit, but in practice I'm not sure that would have any
> issue. I thought the same thing when I first saw this, but to hit most
> of the list, would r

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 10:48:17PM +0800, Lai Jiangshan wrote:
> On Mon, Jun 10, 2013 at 3:36 AM, Paul E. McKenney
>  wrote:
> > Breaking up locks is better than implementing high-contention locks, but
> > if we must have high-contention locks, why not make them automatically
> > switch between light-weight ticket locks at low contention and queued
> > locks at high contention?
> >
> > This commit therefore allows ticket locks to automatically switch between
> > pure ticketlock and queued-lock operation as needed.  If too many CPUs
> > are spinning on a given ticket lock, a queue structure will be allocated
> > and the lock will switch to queued-lock operation.  When the lock becomes
> > free, it will switch back into ticketlock operation.  The low-order bit
> > of the head counter is used to indicate that the lock is in queued mode,
> > which forces an unconditional mismatch between the head and tail counters.
> > This approach means that the common-case code path under conditions of
> > low contention is very nearly that of a plain ticket lock.
> >
> > A fixed number of queueing structures is statically allocated in an
> > array.  The ticket-lock address is used to hash into an initial element,
> > but if that element is already in use, it moves to the next element.  If
> > the entire array is already in use, continue to spin in ticket mode.
> >
> > This has been only lightly tested in the kernel, though a userspace
> > implementation has survived substantial testing.
> >
> > Signed-off-by: Paul E. McKenney 
> >
> > diff --git a/arch/x86/include/asm/spinlock.h 
> > b/arch/x86/include/asm/spinlock.h
> > index 33692ea..b4a91b0 100644
> > --- a/arch/x86/include/asm/spinlock.h
> > +++ b/arch/x86/include/asm/spinlock.h
> > @@ -34,6 +34,8 @@
> >  # define UNLOCK_LOCK_PREFIX
> >  #endif
> >
> > +#ifndef CONFIG_TICKET_LOCK_QUEUED
> > +
> >  /*
> >   * Ticket locks are conceptually two parts, one indicating the current 
> > head of
> >   * the queue, and the other indicating the current tail. The lock is 
> > acquired
> > @@ -62,6 +64,25 @@ static __always_inline void 
> > __ticket_spin_lock(arch_spinlock_t *lock)
> > barrier();  /* make sure nothing creeps before the lock 
> > is taken */
> >  }
> >
> > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> > +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> > +
> > +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> > +{
> > +   register struct __raw_tickets inc = { .tail = 2 };
> > +
> > +   inc = xadd(&lock->tickets, inc);
> > +   for (;;) {
> > +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> > +   break;
> > +   inc.head = ACCESS_ONCE(lock->tickets.head);
> > +   }
> > +   barrier(); /* smp_mb() on Power or ARM. */
> > +}
> > +
> > +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> >  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
> >  {
> > arch_spinlock_t old, new;
> > @@ -70,17 +91,37 @@ static __always_inline int 
> > __ticket_spin_trylock(arch_spinlock_t *lock)
> > if (old.tickets.head != old.tickets.tail)
> > return 0;
> >
> > +#ifndef CONFIG_TICKET_LOCK_QUEUED
> > new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
> > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +   new.head_tail = old.head_tail + (2 << TICKET_SHIFT);
> > +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> >
> > /* cmpxchg is a full barrier, so nothing can move before it */
> > return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
> > old.head_tail;
> >  }
> >
> > +#ifndef CONFIG_TICKET_LOCK_QUEUED
> > +
> >  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> >  {
> > __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
> >  }
> >
> > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> > +extern void tkt_q_do_wake(arch_spinlock_t *asp);
> > +
> > +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> > +{
> > +   __ticket_t head = 2;
> > +
> > +   head = xadd(&lock->tickets.head, 2);
> > +   if (head & 0x1)
> > +   tkt_q_do_wake(lock);
> > +}
> > +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> >  static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
> >  {
> > struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
> > diff --git a/arch/x86/include/asm/spinlock_types.h 
> > b/arch/x86/include/asm/spinlock_types.h
> > index ad0ad07..cdaefdd 100644
> > --- a/arch/x86/include/asm/spinlock_types.h
> > +++ b/arch/x86/include/asm/spinlock_types.h
> > @@ -7,12 +7,18 @@
> >
> >  #include 
> >
> > -#if (CONFIG_NR_CPUS < 256)
> > +#if (CONFIG_NR_CPUS < 128)
> >  typedef u8  __ticket_t;
> >  typedef u16 __ticketpair_t;
> > -#else
> > +#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
> > +#elif (CONFIG_NR_CPUS 

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Steven Rostedt
On Tue, 2013-06-11 at 11:57 -0400, Waiman Long wrote:

> This is an interesting patch and I think it is useful for workloads that 
> run on systems with a large number of CPUs.

I would say it is definitely a fun academic patch; whether it is
something for a production environment remains to be seen.
 
> 
> > diff --git a/arch/x86/include/asm/spinlock_types.h 
> > b/arch/x86/include/asm/spinlock_types.h
> > index ad0ad07..cdaefdd 100644
> > --- a/arch/x86/include/asm/spinlock_types.h
> > +++ b/arch/x86/include/asm/spinlock_types.h
> > @@ -7,12 +7,18 @@
> >
> >   #include
> >
> > -#if (CONFIG_NR_CPUS<  256)
> > +#if (CONFIG_NR_CPUS<  128)
> >   typedef u8  __ticket_t;
> >   typedef u16 __ticketpair_t;
> > -#else
> > +#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2>= (unsigned char)((a) - (b)))
> > +#elif (CONFIG_NR_CPUS<  32768)
> >   typedef u16 __ticket_t;
> >   typedef u32 __ticketpair_t;
> > +#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2>= (unsigned short)((a) - (b)))
> > +#else
> > +typedef u32 __ticket_t;
> > +typedef u64 __ticketpair_t;
> > +#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2>= (unsigned int)((a) - (b)))
> >   #endif
> 
> It is theoretically possible that a large number of CPUs (says 64 or 
> more with CONFIG_NR_CPUS < 128) can acquire a ticket from the lock 
> before the check for TICKET_T_CMP_GE() in tkt_spin_pass(). So the check 
> will fail even when there is a large number of CPUs contending for the 
> lock. The chance of this happening is, of course, extremely rare. This 
> is not an error as the lock is still working as it should be without 
> your change.

Can you explain this more. How can you acquire the ticket and update at
the same time? If a queue has been set, then you can't acquire the
ticket as the head has a 1 appended to it.


> >
> > +/*
> > + * TKT_Q_SWITCH is twice the number of CPUs that must be spinning on a
> > + * given ticket lock to motivate switching to spinning on a queue.
> > + * The reason that it is twice the number is because the bottom bit of
> > + * the ticket is reserved for the bit that indicates that a queue is
> > + * associated with the lock.
> > + */
> > +#define TKT_Q_SWITCH  (16 * 2)
> > +
> > +/*
> > + * TKT_Q_NQUEUES is the number of queues to maintain.  Large systems
> > + * might have multiple highly contended locks, so provide more queues for
> > + * systems with larger numbers of CPUs.
> > + */
> > +#define TKT_Q_NQUEUES (DIV_ROUND_UP(NR_CPUS + TKT_Q_SWITCH - 1, 
> > TKT_Q_SWITCH) * 2)
> > +
> > +/* The queues themselves. */
> > +struct tkt_q_head tkt_q_heads[TKT_Q_NQUEUES];
> 
> I am a bit concern about the size of the head queue table itself. RHEL6, 
> for example, had defined CONFIG_NR_CPUS to be 4096 which mean a table 
> size of 256. Maybe it is better to dynamically allocate the table at 
> init time depending on the actual number of CPUs in the system.

Yeah, it can be allocated dynamically at boot.

> 
> > +/*
> > + * Return a pointer to the queue header associated with the specified lock,
> > + * or return NULL if there is no queue for the lock or if the lock's queue
> > + * is in transition.
> > + */
> > +static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
> > +{
> > +   int i;
> > +   int start;
> > +
> > +   start = i = tkt_q_hash(asp);
> > +   do
> > +   if (tkt_q_heads[i].ref == asp)
> > +   return &tkt_q_heads[i];
> > +   while ((i = tkt_q_next_slot(i)) != start);
> > +   return NULL;
> > +}
> 
> With a table size of 256, you have to scan the whole table to find 
> the right head queue. This can be a significant overhead. I would suggest 
> setting a limit on how many entries it scans before it aborts, rather 
> than checking the whole table.

We could add a limit, but in practice I'm not sure it would be an
issue. I thought the same thing when I first saw this, but hitting most
of the list would require a large collision in the hash algorithm,
which could probably be fixed with a better hash.
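
For example, something like the following, using the generic pointer
hash.  Sketch only -- TKT_Q_BITS is a made-up constant here, chosen so
that (1 << TKT_Q_BITS) >= TKT_Q_NQUEUES:

#include <linux/hash.h>

/*
 * Hypothetical replacement hash: mix the lock address with hash_ptr()
 * rather than using the raw pointer bits, so that locks allocated
 * close together do not all collide into neighboring slots.
 */
static int tkt_q_hash(arch_spinlock_t *lock)
{
	return hash_ptr(lock, TKT_Q_BITS) % TKT_Q_NQUEUES;
}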

> 
> > +/*
> > + * Hand the lock off to the first CPU on the queue.
> > + */
> > +void tkt_q_do_wake(arch_spinlock_t *asp)
> > +{
> > +   struct tkt_q_head *tqhp;
> > +   struct tkt_q *tqp;
> > +
> > +   /* If the queue is still being set up, wait for it. */
> > +   while ((tqhp = tkt_q_find_head(asp)) == NULL)
> > +   cpu_relax();
> > +
> > +   for (;;) {
> > +
> > +   /* Find the first queue element. */
> > +   tqp = ACCESS_ONCE(tqhp->spin);
> > +   if (tqp != NULL)
> > +   break;  /* Element exists, hand off lock. */
> > +   if (tkt_q_try_unqueue(asp, tqhp))
> > +   return; /* No element, successfully removed queue. */
> > +   cpu_relax();
> > +   }
> > +   if (ACCESS_ONCE(tqhp->head_tkt) != -1)
> > +   ACCESS_ONCE(tqhp->head_tkt) = -1;
> 
> In case NR_CPUS is 32768 or higher, the ticket will be of type u32 and 
> tqhp->head_tkt is s32. So -1 will be a valid ticket number. You may have 
> to conditionally defin

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Waiman Long

On 06/09/2013 03:36 PM, Paul E. McKenney wrote:

Breaking up locks is better than implementing high-contention locks, but
if we must have high-contention locks, why not make them automatically
switch between light-weight ticket locks at low contention and queued
locks at high contention?

This commit therefore allows ticket locks to automatically switch between
pure ticketlock and queued-lock operation as needed.  If too many CPUs
are spinning on a given ticket lock, a queue structure will be allocated
and the lock will switch to queued-lock operation.  When the lock becomes
free, it will switch back into ticketlock operation.  The low-order bit
of the head counter is used to indicate that the lock is in queued mode,
which forces an unconditional mismatch between the head and tail counters.
This approach means that the common-case code path under conditions of
low contention is very nearly that of a plain ticket lock.

A fixed number of queueing structures is statically allocated in an
array.  The ticket-lock address is used to hash into an initial element,
but if that element is already in use, it moves to the next element.  If
the entire array is already in use, continue to spin in ticket mode.

This has been only lightly tested in the kernel, though a userspace
implementation has survived substantial testing.

Signed-off-by: Paul E. McKenney


This is an interesting patch and I think it is useful for workloads that 
run on systems with a large number of CPUs.



diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index ad0ad07..cdaefdd 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -7,12 +7,18 @@

  #include

-#if (CONFIG_NR_CPUS < 256)
+#if (CONFIG_NR_CPUS < 128)
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
-#else
+#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
+#elif (CONFIG_NR_CPUS < 32768)
 typedef u16 __ticket_t;
 typedef u32 __ticketpair_t;
+#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2 >= (unsigned short)((a) - (b)))
+#else
+typedef u32 __ticket_t;
+typedef u64 __ticketpair_t;
+#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2 >= (unsigned int)((a) - (b)))
 #endif


It is theoretically possible that a large number of CPUs (say 64 or 
more with CONFIG_NR_CPUS < 128) can acquire a ticket from the lock 
before the check for TICKET_T_CMP_GE() in tkt_spin_pass(). So the check 
will fail even when there is a large number of CPUs contending for the 
lock. The chance of this happening is, of course, extremely rare. This 
is not an error, as the lock still works as it should without 
your change.
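
To make the scenario concrete (numbers purely for illustration): with
u8 tickets each waiter bumps the tail by 2, so 64 CPUs taking tickets
while the head stays put leaves the two counters 128 apart, and the
half-range comparison then reads that huge gap the wrong way:

	TICKET_T_CMP_GE(a, b)	/* with (a - b) == 128 (mod 256) */
	    == (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
	    == (127 >= 128)
	    == false

So the switch to queued mode is simply skipped until the head catches
up -- a missed optimization rather than a correctness problem, as said
above.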




+/*
+ * TKT_Q_SWITCH is twice the number of CPUs that must be spinning on a
+ * given ticket lock to motivate switching to spinning on a queue.
+ * The reason that it is twice the number is because the bottom bit of
+ * the ticket is reserved for the bit that indicates that a queue is
+ * associated with the lock.
+ */
+#define TKT_Q_SWITCH  (16 * 2)
+
+/*
+ * TKT_Q_NQUEUES is the number of queues to maintain.  Large systems
+ * might have multiple highly contended locks, so provide more queues for
+ * systems with larger numbers of CPUs.
+ */
+#define TKT_Q_NQUEUES (DIV_ROUND_UP(NR_CPUS + TKT_Q_SWITCH - 1, TKT_Q_SWITCH) 
* 2)
+
+/* The queues themselves. */
+struct tkt_q_head tkt_q_heads[TKT_Q_NQUEUES];


I am a bit concerned about the size of the head queue table itself. RHEL6, 
for example, defines CONFIG_NR_CPUS to be 4096, which means a table 
size of 256. Maybe it would be better to dynamically allocate the table at 
init time depending on the actual number of CPUs in the system.



+/*
+ * Return a pointer to the queue header associated with the specified lock,
+ * or return NULL if there is no queue for the lock or if the lock's queue
+ * is in transition.
+ */
+static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
+{
+   int i;
+   int start;
+
+   start = i = tkt_q_hash(asp);
+   do
+   if (tkt_q_heads[i].ref == asp)
+   return &tkt_q_heads[i];
+   while ((i = tkt_q_next_slot(i)) != start);
+   return NULL;
+}


With a table size of 256, you have to scan the whole table to find 
the right head queue. This can be a significant overhead. I would suggest 
setting a limit on how many entries it scans before it aborts, rather 
than checking the whole table.
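
A minimal sketch of the kind of bound I have in mind -- TKT_Q_MAX_SCAN
is a made-up knob, not something in the patch:

/*
 * Sketch only: probe at most TKT_Q_MAX_SCAN slots.  This is safe only
 * if the enqueue side that installs ->ref honors the same bound, so
 * that a lock's queue can never live more than TKT_Q_MAX_SCAN probes
 * from its home slot; otherwise tkt_q_do_wake() could miss an existing
 * queue and spin forever waiting for it to appear.
 */
#define TKT_Q_MAX_SCAN	8

static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *lock)
{
	int i = tkt_q_hash(lock);
	int n;

	for (n = 0; n < TKT_Q_MAX_SCAN; n++) {
		if (tkt_q_heads[i].ref == lock)
			return &tkt_q_heads[i];
		i = tkt_q_next_slot(i);
	}
	return NULL;
}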



+/*
+ * Hand the lock off to the first CPU on the queue.
+ */
+void tkt_q_do_wake(arch_spinlock_t *asp)
+{
+   struct tkt_q_head *tqhp;
+   struct tkt_q *tqp;
+
+   /* If the queue is still being set up, wait for it. */
+   while ((tqhp = tkt_q_find_head(asp)) == NULL)
+   cpu_relax();
+
+   for (;;) {
+
+   /* Find the first queue element. */
+   tqp = ACCESS_ONCE(tqhp->spin);
+   if (tqp != NULL)
+   break;  /* Element exists, hand off lock. */
+   if (tkt

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Steven Rostedt
On Tue, 2013-06-11 at 03:14 -0700, Paul E. McKenney wrote:
>  
> > Off-topic: although I have been in this community for several years,
> > I am not exactly clear on this problem.
> > 
> > 1) In the general case, which lock is the most contended in the kernel? What 
> > does it protect?
> > 2) In which special case, which lock is the most contended in the kernel? 
> > What does it protect?
> > 3) In the general case, which list is the hottest list?
> > 4) In which special case, which list is the hottest list?
> 
> Others would know better than I, but mmap_sem has been called out as a

If the contention is with mmap_sem, then I doubt this is going to help
much, as that's a sleeping rw semaphore. Now, rw semaphores are
implemented with raw spinlocks, but I doubt that would be the main point
of contention, compared to the sleeping part.

-- Steve

> prime offender for some workloads.  There is of course some debate as
> to whether the fault lies with mmap_sem or with the workloads.  There have
> been some efforts to solve this one on LKML, plus some in academia have
> worked on this as well:
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 02:56:55AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 10, 2013 at 08:44:40PM -0400, Steven Rostedt wrote:
> > On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:

[ . . . ]

> > > +bool tkt_q_do_spin(arch_spinlock_t *asp, struct __raw_tickets inc)
> > > +{
> > > + struct tkt_q **oldtail;
> > > + struct tkt_q tq;
> > > + struct tkt_q_head *tqhp;
> > > +
> > > + /*
> > > +  * Ensure that accesses to queue header happen after sensing
> > > +  * the lock's have-queue bit.
> > > +  */
> > > + smp_mb();  /* See above block comment. */
> > > +
> > > + /* If there no longer is a queue, leave. */
> > > + tqhp = tkt_q_find_head(asp);
> > > + if (tqhp == NULL)
> > > + return false;
> > > +
> > > + /* Initialize our queue element. */
> > > + tq.cpu = raw_smp_processor_id();
> > > + tq.tail = inc.tail;
> > > + tq.next = NULL;
> > > +
> > > + /* Check to see if we already hold the lock. */
> > > + if (ACCESS_ONCE(tqhp->head_tkt) == inc.tail) {
> > > + /* The last holder left before queue formed, we hold lock. */
> > > + tqhp->head_tkt = -1;
> > > + return true;
> > > + }
> > > +
> > > + /* Add our element to the tail of the queue. */
> > > + oldtail = xchg(&tqhp->spin_tail, &tq.next);
> > 
> > Boy this is tricky code! I thought I found a race window here, but as I
> > went to write my email saying "Gotcha!" I found that it wasn't a race
> > after all. But as I went through the effort of writing this, I figured I
> > would send this out as documentation for others to see. Hmm, I wonder if
> > we can use this email to add more comments. Anyway, here's what I
> > thought was wrong ;-)
> 
> If you didn't know any better, you might even think that I had done
> something like this before.  ;-)
> 
> > OK, I originally thought there was a race window here. Let's say that an
> > NMI hit right here, and it happens to be a big one, where lots of things
> > can happen on other CPUs right now.
> > 
> > The scenario is that there's just one item on the queue, which is
> > waiting for the lock to be released, and is spinning below in the:
> > 
> > while (ACCESS_ONCE(tq.cpu) != -1)
> > cpu_relax();
> > 
> > And then the lock is released, where in tkt_q_do_wake() the following is
> > called:
> > 
> > ACCESS_ONCE(tqp->cpu) = -1;
> > 
> > Now the old queued task is released. But its tq->next hasn't been set
> > yet, and is still NULL. It leaves by doing:
> > 
> > ACCESS_ONCE(tqhp->spin) = tq.next;
> > return true;
> > 
> > All before this task gets to set *oldtail to &tq. But, I then looked
> > below...
> > 
> > 
> > > + ACCESS_ONCE(*oldtail) = &tq;
> > > +
> > > + /* Spin until handoff. */
> > > + while (ACCESS_ONCE(tq.cpu) != -1)
> > > + cpu_relax();
> > > +
> > > + /*
> > > +  * Remove our element from the queue.  If the queue is now empty,
> > > +  * update carefully so that the next acquisition will queue itself
> > > +  * at the head of the list.
> > > +  */
> > > + if (tq.next == NULL) {
> > 
> > This checks for that scenario.
> 
> Yep!
> 
> >As if the old task were to come out
> > spinning, the problem would only be if it was the last one on the list,
> > and its tq.next was NULL. But if that was the case, then we set spin to
> > NULL and do the next trick, where I thought I gotcha again...
> > 
> > 
> > > +
> > > + /* Mark the queue empty. */
> > > + tqhp->spin = NULL;
> > > +
> > > + /* Try to point the tail back at the head. */
> > > + if (cmpxchg(&tqhp->spin_tail,
> > > + &tq.next,
> > > + &tqhp->spin) == &tq.next)
> > 
> > Here, I was thinking, oh wait, what happens if this is called right
> > before the xchg() above. Then we would set spin_tail but not update the
> > old tq.next. But wait! look at what we assign spin_tail to. It's the
> > address of spin, which would be what oldtail would point to above, and
> > then above would set spin to the new tq!
> 
> Yep again!
> 
> > OK, I haven't found a issue here yet, but youss are beiing trickssy! We
> > don't like trickssy, and we must find preiouss!!!
> > 
> > 
> > This code is starting to make me look like Gollum :-p
> 
> Hmmm...  The time and effort to do this might almost have been worthwhile
> just to accomplish that!  ;-)
> 
> But yes, this would need better comments, design documentation, or
> maybe both.

And for whatever it might be worth, here is an attempted upgrade for
comments.

First, I upgrade the comment for the xchg() that does the enqueue:

/*
 * Add our element to the tail of the queue.  Note that if the
 * queue is empty, the ->spin_tail pointer will reference
 * the queue's head pointer, namely ->spin.
 */
oldtail = xchg(&tqhp->spin_tail, &tq.next);
ACCESS_ONCE(*oldtail) = &tq;

Next, I upgrade the comment for the dequeue operation:

/*
 * Remove our element from the queue.  If the 

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Lai Jiangshan
On Tue, Jun 11, 2013 at 10:48 PM, Lai Jiangshan  wrote:
> On Mon, Jun 10, 2013 at 3:36 AM, Paul E. McKenney
>  wrote:
>> Breaking up locks is better than implementing high-contention locks, but
>> if we must have high-contention locks, why not make them automatically
>> switch between light-weight ticket locks at low contention and queued
>> locks at high contention?
>>
>> This commit therefore allows ticket locks to automatically switch between
>> pure ticketlock and queued-lock operation as needed.  If too many CPUs
>> are spinning on a given ticket lock, a queue structure will be allocated
>> and the lock will switch to queued-lock operation.  When the lock becomes
>> free, it will switch back into ticketlock operation.  The low-order bit
>> of the head counter is used to indicate that the lock is in queued mode,
>> which forces an unconditional mismatch between the head and tail counters.
>> This approach means that the common-case code path under conditions of
>> low contention is very nearly that of a plain ticket lock.
>>
>> A fixed number of queueing structures is statically allocated in an
>> array.  The ticket-lock address is used to hash into an initial element,
>> but if that element is already in use, it moves to the next element.  If
>> the entire array is already in use, continue to spin in ticket mode.
>>
>> This has been only lightly tested in the kernel, though a userspace
>> implementation has survived substantial testing.
>>
>> Signed-off-by: Paul E. McKenney 
>>
>> diff --git a/arch/x86/include/asm/spinlock.h 
>> b/arch/x86/include/asm/spinlock.h
>> index 33692ea..b4a91b0 100644
>> --- a/arch/x86/include/asm/spinlock.h
>> +++ b/arch/x86/include/asm/spinlock.h
>> @@ -34,6 +34,8 @@
>>  # define UNLOCK_LOCK_PREFIX
>>  #endif
>>
>> +#ifndef CONFIG_TICKET_LOCK_QUEUED
>> +
>>  /*
>>   * Ticket locks are conceptually two parts, one indicating the current head 
>> of
>>   * the queue, and the other indicating the current tail. The lock is 
>> acquired
>> @@ -62,6 +64,25 @@ static __always_inline void 
>> __ticket_spin_lock(arch_spinlock_t *lock)
>> barrier();  /* make sure nothing creeps before the lock 
>> is taken */
>>  }
>>
>> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
>> +
>> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
>> +
>> +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>> +{
>> +   register struct __raw_tickets inc = { .tail = 2 };
>> +
>> +   inc = xadd(&lock->tickets, inc);
>> +   for (;;) {
>> +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
>> +   break;
>> +   inc.head = ACCESS_ONCE(lock->tickets.head);
>> +   }
>> +   barrier(); /* smp_mb() on Power or ARM. */
>> +}
>> +
>> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
>> +
>>  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
>>  {
>> arch_spinlock_t old, new;
>> @@ -70,17 +91,37 @@ static __always_inline int 
>> __ticket_spin_trylock(arch_spinlock_t *lock)
>> if (old.tickets.head != old.tickets.tail)
>> return 0;
>>
>> +#ifndef CONFIG_TICKET_LOCK_QUEUED
>> new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
>> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
>> +   new.head_tail = old.head_tail + (2 << TICKET_SHIFT);
>> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
>>
>> /* cmpxchg is a full barrier, so nothing can move before it */
>> return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
>> old.head_tail;
>>  }
>>
>> +#ifndef CONFIG_TICKET_LOCK_QUEUED
>> +
>>  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
>>  {
>> __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
>>  }
>>
>> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
>> +
>> +extern void tkt_q_do_wake(arch_spinlock_t *asp);
>> +
>> +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
>> +{
>> +   __ticket_t head = 2;
>> +
>> +   head = xadd(&lock->tickets.head, 2);
>> +   if (head & 0x1)
>> +   tkt_q_do_wake(lock);
>> +}
>> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
>> +
>>  static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
>>  {
>> struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
>> diff --git a/arch/x86/include/asm/spinlock_types.h 
>> b/arch/x86/include/asm/spinlock_types.h
>> index ad0ad07..cdaefdd 100644
>> --- a/arch/x86/include/asm/spinlock_types.h
>> +++ b/arch/x86/include/asm/spinlock_types.h
>> @@ -7,12 +7,18 @@
>>
>>  #include 
>>
>> -#if (CONFIG_NR_CPUS < 256)
>> +#if (CONFIG_NR_CPUS < 128)
>>  typedef u8  __ticket_t;
>>  typedef u16 __ticketpair_t;
>> -#else
>> +#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
>> +#elif (CONFIG_NR_CPUS < 32768)
>>  typedef u16 __ticket_t;
>>  typedef u32 __ticketpair_t;
>> +#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2 >= (unsigned s

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Lai Jiangshan
On Mon, Jun 10, 2013 at 3:36 AM, Paul E. McKenney
 wrote:
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?
>
> This commit therefore allows ticket locks to automatically switch between
> pure ticketlock and queued-lock operation as needed.  If too many CPUs
> are spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
>
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
>
> This has been only lightly tested in the kernel, though a userspace
> implementation has survived substantial testing.
>
> Signed-off-by: Paul E. McKenney 
>
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..b4a91b0 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,8 @@
>  # define UNLOCK_LOCK_PREFIX
>  #endif
>
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> +
>  /*
>   * Ticket locks are conceptually two parts, one indicating the current head 
> of
>   * the queue, and the other indicating the current tail. The lock is acquired
> @@ -62,6 +64,25 @@ static __always_inline void 
> __ticket_spin_lock(arch_spinlock_t *lock)
> barrier();  /* make sure nothing creeps before the lock 
> is taken */
>  }
>
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> +
> +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> +{
> +   register struct __raw_tickets inc = { .tail = 2 };
> +
> +   inc = xadd(&lock->tickets, inc);
> +   for (;;) {
> +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> +   break;
> +   inc.head = ACCESS_ONCE(lock->tickets.head);
> +   }
> +   barrier(); /* smp_mb() on Power or ARM. */
> +}
> +
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
>  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
>  {
> arch_spinlock_t old, new;
> @@ -70,17 +91,37 @@ static __always_inline int 
> __ticket_spin_trylock(arch_spinlock_t *lock)
> if (old.tickets.head != old.tickets.tail)
> return 0;
>
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +   new.head_tail = old.head_tail + (2 << TICKET_SHIFT);
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
>
> /* cmpxchg is a full barrier, so nothing can move before it */
> return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
> old.head_tail;
>  }
>
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> +
>  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
>  {
> __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
>  }
>
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
> +extern void tkt_q_do_wake(arch_spinlock_t *asp);
> +
> +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> +{
> +   __ticket_t head = 2;
> +
> +   head = xadd(&lock->tickets.head, 2);
> +   if (head & 0x1)
> +   tkt_q_do_wake(lock);
> +}
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
>  static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
>  {
> struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
> diff --git a/arch/x86/include/asm/spinlock_types.h 
> b/arch/x86/include/asm/spinlock_types.h
> index ad0ad07..cdaefdd 100644
> --- a/arch/x86/include/asm/spinlock_types.h
> +++ b/arch/x86/include/asm/spinlock_types.h
> @@ -7,12 +7,18 @@
>
>  #include 
>
> -#if (CONFIG_NR_CPUS < 256)
> +#if (CONFIG_NR_CPUS < 128)
>  typedef u8  __ticket_t;
>  typedef u16 __ticketpair_t;
> -#else
> +#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
> +#elif (CONFIG_NR_CPUS < 32768)
>  typedef u16 __ticket_t;
>  typedef u32 __ticketpair_t;
> +#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2 >= (unsigned short)((a) - (b)))
> +#else
> +typedef u32 __ticket_t;
> +typedef u64 __ticketpair_t;
> +#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2 >= (unsigned int)((a) - (b)))
>  #endif
>
>  #define TICKET_SH

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Tue, Jun 11, 2013 at 03:53:17PM +0800, Lai Jiangshan wrote:
> On 06/11/2013 08:51 AM, Linus Torvalds wrote:
> > On Mon, Jun 10, 2013 at 5:44 PM, Steven Rostedt  wrote:
> >>
> >> OK, I haven't found a issue here yet, but youss are beiing trickssy! We
> >> don't like trickssy, and we must find preiouss!!!
> > 
> > .. and I personally have my usual reservations. I absolutely hate
> > papering over scalability issues, and historically whenever people
> > have ever thought that we want complex spinlocks, the problem has
> > always been that the locking sucks.
> > 
> > So reinforced by previous events, I really feel that code that needs
> > this kind of spinlock is broken and needs to be fixed, rather than
> > actually introduce tricky spinlocks.
> > 
> > So in order to merge something like this, I want (a) numbers for real
> > loads and (b) explanations for why the spinlock users cannot be fixed.
> > 
> > Because "we might hit loads" is just not good enough. I would counter
> > with "hiding problems causes more of them".
> > 
> 
> Hi, all
> 
> Off-topic: although I have been in this community for several years,
> I am not exactly clear on this problem.
> 
> 1) In the general case, which lock is the most contended in the kernel? What 
> does it protect?
> 2) In which special case, which lock is the most contended in the kernel? 
> What does it protect?
> 3) In the general case, which list is the hottest list?
> 4) In which special case, which list is the hottest list?

Others would know better than I, but mmap_sem has been called out as a
prime offender for some workloads.  There is of course some debate as
to whether the fault lies with mmap_sem or with the workloads.  There have
been some efforts to solve this one on LKML, plus some in academia have
worked on this as well:

http://people.csail.mit.edu/nickolai/papers/clements-bonsai.pdf
http://pdos.csail.mit.edu/papers/radixvm:eurosys13.pdf

And IIRC this was the subject of a session at a recent minisummit.

There are a few locks within the RCU implementation that have popped
up from time to time on very large systems, but I have dealt with those
and have plans for each should it become a problem.  The plans probably
won't survive first contact with a real workload, but having thought
things through is very helpful.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Mon, Jun 10, 2013 at 05:51:14PM -0700, Linus Torvalds wrote:
> On Mon, Jun 10, 2013 at 5:44 PM, Steven Rostedt  wrote:
> >
> > OK, I haven't found a issue here yet, but youss are beiing trickssy! We
> > don't like trickssy, and we must find preiouss!!!

Heh!  You should see what it looks like if you make slightly different
design decisions.  For example, just you try switching back from queued
to ticket mode while there are still CPUs spinning on the lock!  ;-)

> .. and I personally have my usual reservations. I absolutely hate
> papering over scalability issues, and historically whenever people
> have ever thought that we want complex spinlocks, the problem has
> always been that the locking sucks.
> 
> So reinforced by previous events, I really feel that code that needs
> this kind of spinlock is broken and needs to be fixed, rather than
> actually introduce tricky spinlocks.

If the only effect of this patch submission is to give people a bit more
motivation to solve the underlying lock-contention problems, I am happy.

> So in order to merge something like this, I want (a) numbers for real
> loads and (b) explanations for why the spinlock users cannot be fixed.
> 
> Because "we might hit loads" is just not good enough. I would counter
> with "hiding problems causes more of them".

Agreed.  As I said in the first paragraph of the commit log:

... if we must have high-contention locks, why not make them
automatically switch between light-weight ticket locks at low
contention and queued locks at high contention?

The reason that I created this patch was that I was seeing people
arguing for locks optimized for high contention, and the ones that I
saw required the developer to predict which locks would encounter high
levels of contention.  Changes in workloads would of course invalidate
those predictions.

But again, if the only effect of this patch submission is to give people
a bit more motivation to solve the underlying lock-contention problems,
I am happy.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Mon, Jun 10, 2013 at 08:44:40PM -0400, Steven Rostedt wrote:
> On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:
> 
> > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> > +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> > +
> > +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> > +{
> > +   register struct __raw_tickets inc = { .tail = 2 };
> > +
> > +   inc = xadd(&lock->tickets, inc);
> > +   for (;;) {
> > +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> > +   break;
> > +   inc.head = ACCESS_ONCE(lock->tickets.head);
> > +   }
> > +   barrier(); /* smp_mb() on Power or ARM. */
> > +}
> > +
> > +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> >  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
> >  {
> > arch_spinlock_t old, new;
> > @@ -70,17 +91,37 @@ static __always_inline int 
> > __ticket_spin_trylock(arch_spinlock_t *lock)
> > if (old.tickets.head != old.tickets.tail)
> > return 0;
> >  
> > +#ifndef CONFIG_TICKET_LOCK_QUEUED
> > new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
> > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +   new.head_tail = old.head_tail + (2 << TICKET_SHIFT);
> > +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> >  
> > /* cmpxchg is a full barrier, so nothing can move before it */
> > return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
> > old.head_tail;
> >  }
> >  
> > +#ifndef CONFIG_TICKET_LOCK_QUEUED
> > +
> >  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> >  {
> > __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
> >  }
> >  
> > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> > +extern void tkt_q_do_wake(arch_spinlock_t *asp);
> > +
> > +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> > +{
> > +   __ticket_t head = 2;
> > +
> > +   head = xadd(&lock->tickets.head, 2);
> > +   if (head & 0x1)
> > +   tkt_q_do_wake(lock);
> > +}
> > +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> >  static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
> >  {
> > struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
> > diff --git a/arch/x86/include/asm/spinlock_types.h 
> > b/arch/x86/include/asm/spinlock_types.h
> > index ad0ad07..cdaefdd 100644
> > --- a/arch/x86/include/asm/spinlock_types.h
> > +++ b/arch/x86/include/asm/spinlock_types.h
> > @@ -7,12 +7,18 @@
> >  
> >  #include 
> >  
> > -#if (CONFIG_NR_CPUS < 256)
> > +#if (CONFIG_NR_CPUS < 128)
> >  typedef u8  __ticket_t;
> >  typedef u16 __ticketpair_t;
> > -#else
> > +#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
> > +#elif (CONFIG_NR_CPUS < 32768)
> >  typedef u16 __ticket_t;
> >  typedef u32 __ticketpair_t;
> > +#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2 >= (unsigned short)((a) - 
> > (b)))
> > +#else
> > +typedef u32 __ticket_t;
> > +typedef u64 __ticketpair_t;
> > +#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2 >= (unsigned int)((a) - (b)))
> >  #endif
> >  
> >  #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
> > @@ -21,7 +27,11 @@ typedef struct arch_spinlock {
> > union {
> > __ticketpair_t head_tail;
> > struct __raw_tickets {
> > +#ifdef __BIG_ENDIAN__
> > +   __ticket_t tail, head;
> > +#else /* #ifdef __BIG_ENDIAN__ */
> > __ticket_t head, tail;
> > +#endif /* #else #ifdef __BIG_ENDIAN__ */
> > } tickets;
> > };
> >  } arch_spinlock_t;
> > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > index e9ef6d6..816a87c 100644
> > --- a/include/linux/kernel.h
> > +++ b/include/linux/kernel.h
> > @@ -15,6 +15,7 @@
> >  #include 
> >  #include 
> >  
> > +#define UCHAR_MAX  ((u8)(~0U))
> >  #define USHRT_MAX  ((u16)(~0U))
> >  #define SHRT_MAX   ((s16)(USHRT_MAX>>1))
> >  #define SHRT_MIN   ((s16)(-SHRT_MAX - 1))
> > diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> > index 44511d1..ad9c67c 100644
> > --- a/kernel/Kconfig.locks
> > +++ b/kernel/Kconfig.locks
> > @@ -223,3 +223,21 @@ endif
> >  config MUTEX_SPIN_ON_OWNER
> > def_bool y
> > depends on SMP && !DEBUG_MUTEXES
> > +
> > +config TICKET_LOCK_QUEUED
> > +   bool "Dynamically switch between ticket and queued locking"
> > +   default n
> > +   ---help---
> > + Enable dynamic switching between ticketlock and queued locking
> > + on a per-lock basis.  This option will slow down low-contention
> > + acquisition and release very slightly (additional conditional
> > + in release path), but will provide more efficient operation at
> > + high levels of lock contention.  High-contention operation will
> > + not be quite as efficient as would be a pure queued lock, but
> > + this dynamic approach consumes less memory than queued locks
> > + and also runs faster at low levels of contention.
> > +
> > +   

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Paul E. McKenney
On Mon, Jun 10, 2013 at 09:04:09PM -0400, Steven Rostedt wrote:
> On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:
> > Breaking up locks is better than implementing high-contention locks, but
> > if we must have high-contention locks, why not make them automatically
> > switch between light-weight ticket locks at low contention and queued
> > locks at high contention?
> > 
> > This commit therefore allows ticket locks to automatically switch between
> > pure ticketlock and queued-lock operation as needed.  If too many CPUs
> > are spinning on a given ticket lock, a queue structure will be allocated
> > and the lock will switch to queued-lock operation.  When the lock becomes
> > free, it will switch back into ticketlock operation.  The low-order bit
> > of the head counter is used to indicate that the lock is in queued mode,
> > which forces an unconditional mismatch between the head and tail counters.
> > This approach means that the common-case code path under conditions of
> > low contention is very nearly that of a plain ticket lock.
> > 
> > A fixed number of queueing structures is statically allocated in an
> > array.  The ticket-lock address is used to hash into an initial element,
> > but if that element is already in use, it moves to the next element.  If
> > the entire array is already in use, continue to spin in ticket mode.
> > 
> > This has been only lightly tested in the kernel, though a userspace
> > implementation has survived substantial testing.
> 
> I guess the point of this patch is to lower the cache ping pong effect
> of spin locks. I believe Rik was doing something similar to this as
> well.

Yep, as was Michel Lespinasse, IIRC.

> Now, when we switch from ticket to queue, we basically blow away the old
> FIFO and all the tasks do a thundering herd to get on the queue. Even if
> the task was next to get the lock, it could end up being stuck at the
> back of the queue again and have to wait. When I put my real-time hat
> on, this bothers me. Even though it's still a bounded latency, as it only
> gets put to the end of the queue once, it just doubled the length of the
> worst-case grabbing of a lock.

Almost.

The size of the switch-to-queue thundering herd is limited by the
ticket gap that initiates the switch.  So what you are saying is that
in an RT kernel, you might want to tune down the ticket gap.  Which
reminds me -- I do need to make this tuning more explicit.
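
For instance, the threshold could become a boot-time tunable rather
than a compile-time constant.  Rough sketch only, with tkt_q_switch as
an invented name:

#include <linux/moduleparam.h>

/*
 * Sketch: make the queue-switch threshold a kernel parameter so an RT
 * build can run with a smaller gap.  The value is in ticket units, so
 * it should stay even, since the low-order bit of the head counter is
 * the queued-mode flag.
 */
static int tkt_q_switch = 16 * 2;
module_param(tkt_q_switch, int, 0444);

An RT configuration could then lower it at boot, via the usual
module-parameter syntax, without rebuilding.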

Thanx, Paul

> -- Steve
> 
> 
> 
> > 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/arch/x86/include/asm/spinlock.h 
> > b/arch/x86/include/asm/spinlock.h
> > index 33692ea..b4a91b0 100644
> > --- a/arch/x86/include/asm/spinlock.h
> > +++ b/arch/x86/include/asm/spinlock.h
> > @@ -34,6 +34,8 @@
> >  # define UNLOCK_LOCK_PREFIX
> >  #endif
> >  
> > +#ifndef CONFIG_TICKET_LOCK_QUEUED
> > +
> >  /*
> >   * Ticket locks are conceptually two parts, one indicating the current 
> > head of
> >   * the queue, and the other indicating the current tail. The lock is 
> > acquired
> > @@ -62,6 +64,25 @@ static __always_inline void 
> > __ticket_spin_lock(arch_spinlock_t *lock)
> > barrier();  /* make sure nothing creeps before the lock is 
> > taken */
> >  }
> >  
> > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> > +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> > +
> > +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> > +{
> > +   register struct __raw_tickets inc = { .tail = 2 };
> > +
> > +   inc = xadd(&lock->tickets, inc);
> > +   for (;;) {
> > +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> > +   break;
> > +   inc.head = ACCESS_ONCE(lock->tickets.head);
> > +   }
> > +   barrier(); /* smp_mb() on Power or ARM. */
> > +}
> > +
> > +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> >  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
> >  {
> > arch_spinlock_t old, new;
> > @@ -70,17 +91,37 @@ static __always_inline int 
> > __ticket_spin_trylock(arch_spinlock_t *lock)
> > if (old.tickets.head != old.tickets.tail)
> > return 0;
> >  
> > +#ifndef CONFIG_TICKET_LOCK_QUEUED
> > new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
> > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +   new.head_tail = old.head_tail + (2 << TICKET_SHIFT);
> > +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> >  
> > /* cmpxchg is a full barrier, so nothing can move before it */
> > return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
> > old.head_tail;
> >  }
> >  
> > +#ifndef CONFIG_TICKET_LOCK_QUEUED
> > +
> >  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> >  {
> > __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
> >  }
> >  
> > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> > +extern void tkt_q_do_wake(arch_spinlock_t *asp);
>

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-11 Thread Lai Jiangshan
On 06/11/2013 08:51 AM, Linus Torvalds wrote:
> On Mon, Jun 10, 2013 at 5:44 PM, Steven Rostedt  wrote:
>>
>> OK, I haven't found a issue here yet, but youss are beiing trickssy! We
>> don't like trickssy, and we must find preiouss!!!
> 
> .. and I personally have my usual reservations. I absolutely hate
> papering over scalability issues, and historically whenever people
> have ever thought that we want complex spinlocks, the problem has
> always been that the locking sucks.
> 
> So reinforced by previous events, I really feel that code that needs
> this kind of spinlock is broken and needs to be fixed, rather than
> actually introduce tricky spinlocks.
> 
> So in order to merge something like this, I want (a) numbers for real
> loads and (b) explanations for why the spinlock users cannot be fixed.
> 
> Because "we might hit loads" is just not good enough. I would counter
> with "hiding problems causes more of them".
> 

Hi, all

Off-topic: although I have been in this community for several years,
I am not exactly clear on this problem.

1) In the general case, which lock is the most contended in the kernel? What 
does it protect?
2) In which special case, which lock is the most contended in the kernel? 
What does it protect?
3) In the general case, which list is the hottest list?
4) In which special case, which list is the hottest list?

thanks,
Lai
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Steven Rostedt
On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?
> 
> This commit therefore allows ticket locks to automatically switch between
> pure ticketlock and queued-lock operation as needed.  If too many CPUs
> are spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
> 
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
> 
> This has been only lightly tested in the kernel, though a userspace
> implementation has survived substantial testing.

I guess the point of this patch is to lower the cache ping pong effect
of spin locks. I believe Rik was doing something similar to this as
well.

Now, when we switch from ticket to queue, we basically blow away the old
FIFO and all the tasks do a thundering herd to get on the queue. Even if
the task was next to get the lock, it could end up being stuck at the
back of the queue again and have to wait. When I put my real-time hat
on, this bothers me. Even though it's still a bounded latency, as it only
gets put to the end of the queue once, it just doubled the length of the
worst-case grabbing of a lock.
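
Roughly, with illustrative numbers only: if N CPUs are spinning, a pure
ticket FIFO bounds the wait at N - 1 critical sections; a waiter that
was near the head of the FIFO but lands at the tail of the new queue
during the switch can wait up to another N - 1, so its bound becomes
about 2 * (N - 1) critical sections -- still bounded, but twice as
long.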

-- Steve



> 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..b4a91b0 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,8 @@
>  # define UNLOCK_LOCK_PREFIX
>  #endif
>  
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> +
>  /*
>   * Ticket locks are conceptually two parts, one indicating the current head 
> of
>   * the queue, and the other indicating the current tail. The lock is acquired
> @@ -62,6 +64,25 @@ static __always_inline void 
> __ticket_spin_lock(arch_spinlock_t *lock)
>   barrier();  /* make sure nothing creeps before the lock is 
> taken */
>  }
>  
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> +
> +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> +{
> + register struct __raw_tickets inc = { .tail = 2 };
> +
> + inc = xadd(&lock->tickets, inc);
> + for (;;) {
> + if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> + break;
> + inc.head = ACCESS_ONCE(lock->tickets.head);
> + }
> + barrier(); /* smp_mb() on Power or ARM. */
> +}
> +
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
>  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
>  {
>   arch_spinlock_t old, new;
> @@ -70,17 +91,37 @@ static __always_inline int 
> __ticket_spin_trylock(arch_spinlock_t *lock)
>   if (old.tickets.head != old.tickets.tail)
>   return 0;
>  
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
>   new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> + new.head_tail = old.head_tail + (2 << TICKET_SHIFT);
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
>  
>   /* cmpxchg is a full barrier, so nothing can move before it */
>   return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
> old.head_tail;
>  }
>  
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> +
>  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
>  {
>   __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
>  }
>  
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
> +extern void tkt_q_do_wake(arch_spinlock_t *asp);
> +
> +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> +{
> + __ticket_t head = 2;
> +
> + head = xadd(&lock->tickets.head, 2);
> + if (head & 0x1)
> + tkt_q_do_wake(lock);
> +}
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
>  static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
>  {
>   struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
> diff --git a/arch/x86/include/asm/spinlock_types.h 
> b/arch/x86/include/asm/spinlock_types.h
> index ad0ad07..cdaefdd 100644
> --- a/arch/x86/include/asm/spinlock_types.h
> +++ b/arch/x86/include/asm/spinlock_typ

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Linus Torvalds
On Mon, Jun 10, 2013 at 5:44 PM, Steven Rostedt  wrote:
>
> OK, I haven't found a issue here yet, but youss are beiing trickssy! We
> don't like trickssy, and we must find preiouss!!!

.. and I personally have my usual reservations. I absolutely hate
papering over scalability issues, and historically whenever people
have ever thought that we want complex spinlocks, the problem has
always been that the locking sucks.

So reinforced by previous events, I really feel that code that needs
this kind of spinlock is broken and needs to be fixed, rather than
actually introduce tricky spinlocks.

So in order to merge something like this, I want (a) numbers for real
loads and (b) explanations for why the spinlock users cannot be fixed.

Because "we might hit loads" is just not good enough. I would counter
with "hiding problems causes more of them".

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Steven Rostedt
On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:

> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> +
> +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> +{
> + register struct __raw_tickets inc = { .tail = 2 };
> +
> + inc = xadd(&lock->tickets, inc);
> + for (;;) {
> + if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> + break;
> + inc.head = ACCESS_ONCE(lock->tickets.head);
> + }
> + barrier(); /* smp_mb() on Power or ARM. */
> +}
> +
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
>  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
>  {
>   arch_spinlock_t old, new;
> @@ -70,17 +91,37 @@ static __always_inline int 
> __ticket_spin_trylock(arch_spinlock_t *lock)
>   if (old.tickets.head != old.tickets.tail)
>   return 0;
>  
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
>   new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> + new.head_tail = old.head_tail + (2 << TICKET_SHIFT);
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
>  
>   /* cmpxchg is a full barrier, so nothing can move before it */
>   return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
> old.head_tail;
>  }
>  
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> +
>  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
>  {
>   __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
>  }
>  
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
> +extern void tkt_q_do_wake(arch_spinlock_t *asp);
> +
> +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> +{
> + __ticket_t head = 2;
> +
> + head = xadd(&lock->tickets.head, 2);
> + if (head & 0x1)
> + tkt_q_do_wake(lock);
> +}
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
>  static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
>  {
>   struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
> diff --git a/arch/x86/include/asm/spinlock_types.h 
> b/arch/x86/include/asm/spinlock_types.h
> index ad0ad07..cdaefdd 100644
> --- a/arch/x86/include/asm/spinlock_types.h
> +++ b/arch/x86/include/asm/spinlock_types.h
> @@ -7,12 +7,18 @@
>  
>  #include 
>  
> -#if (CONFIG_NR_CPUS < 256)
> +#if (CONFIG_NR_CPUS < 128)
>  typedef u8  __ticket_t;
>  typedef u16 __ticketpair_t;
> -#else
> +#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
> +#elif (CONFIG_NR_CPUS < 32768)
>  typedef u16 __ticket_t;
>  typedef u32 __ticketpair_t;
> +#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2 >= (unsigned short)((a) - (b)))
> +#else
> +typedef u32 __ticket_t;
> +typedef u64 __ticketpair_t;
> +#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2 >= (unsigned int)((a) - (b)))
>  #endif
>  
>  #define TICKET_SHIFT (sizeof(__ticket_t) * 8)
> @@ -21,7 +27,11 @@ typedef struct arch_spinlock {
>   union {
>   __ticketpair_t head_tail;
>   struct __raw_tickets {
> +#ifdef __BIG_ENDIAN__
> + __ticket_t tail, head;
> +#else /* #ifdef __BIG_ENDIAN__ */
>   __ticket_t head, tail;
> +#endif /* #else #ifdef __BIG_ENDIAN__ */
>   } tickets;
>   };
>  } arch_spinlock_t;
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index e9ef6d6..816a87c 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  
> +#define UCHAR_MAX((u8)(~0U))
>  #define USHRT_MAX((u16)(~0U))
>  #define SHRT_MAX ((s16)(USHRT_MAX>>1))
>  #define SHRT_MIN ((s16)(-SHRT_MAX - 1))
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index 44511d1..ad9c67c 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -223,3 +223,21 @@ endif
>  config MUTEX_SPIN_ON_OWNER
>   def_bool y
>   depends on SMP && !DEBUG_MUTEXES
> +
> +config TICKET_LOCK_QUEUED
> + bool "Dynamically switch between ticket and queued locking"
> + default n
> + ---help---
> +   Enable dynamic switching between ticketlock and queued locking
> +   on a per-lock basis.  This option will slow down low-contention
> +   acquisition and release very slightly (additional conditional
> +   in release path), but will provide more efficient operation at
> +   high levels of lock contention.  High-contention operation will
> +   not be quite as efficient as would be a pure queued lock, but
> +   this dynamic approach consumes less memory than queued locks
> +   and also runs faster at low levels of contention.
> +
> +   Say "Y" if you are running on a large system with a workload
> +   that is likely to result in high levels of contention.
> +
> +   Say "N" if you are unsure.
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 271fd31..70a91f7 100644
> --- a

Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Paul E. McKenney
On Mon, Jun 10, 2013 at 07:02:56PM -0400, Steven Rostedt wrote:
> On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:
> 
> > +/*
> > + * Return a pointer to the queue header associated with the specified lock,
> > + * or return NULL if there is no queue for the lock or if the lock's queue
> > + * is in transition.
> > + */
> > +static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
> 
> BTW, what does "asp" mean? arch_spinlock?

"arch_spinlock pointer", but yes.  Or I suppose a millenia-late warning
to Cleopatra.

>If so, can we just call it
> "lock" and be consistent with all the other spinlock calls in the
> kernel? Because I keep thinking this has something to do with Microsoft
> dynamic web pages.

Fair enough!

Thanx, Paul

> -- Steve
> 
> > +{
> > +   int i;
> > +   int start;
> > +
> > +   start = i = tkt_q_hash(asp);
> > +   do
> > +   if (tkt_q_heads[i].ref == asp)
> > +   return &tkt_q_heads[i];
> > +   while ((i = tkt_q_next_slot(i)) != start);
> > +   return NULL;
> > +}
> > +
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Steven Rostedt
On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:

> +/*
> + * Return a pointer to the queue header associated with the specified lock,
> + * or return NULL if there is no queue for the lock or if the lock's queue
> + * is in transition.
> + */
> +static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)

BTW, what does "asp" mean? arch_spinlock?  If so, can we just call it
"lock" and be consistent with all the other spinlock calls in the
kernel? Because I keep thinking this has something to do with Microsoft
dynamic web pages.

-- Steve

> +{
> + int i;
> + int start;
> +
> + start = i = tkt_q_hash(asp);
> + do
> + if (tkt_q_heads[i].ref == asp)
> + return &tkt_q_heads[i];
> + while ((i = tkt_q_next_slot(i)) != start);
> + return NULL;
> +}
> +


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Paul E. McKenney
On Mon, Jun 10, 2013 at 02:35:06PM -0700, Eric Dumazet wrote:
> On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:
> > Breaking up locks is better than implementing high-contention locks, but
> > if we must have high-contention locks, why not make them automatically
> > switch between light-weight ticket locks at low contention and queued
> > locks at high contention?
> > 
> > This commit therefore allows ticket locks to automatically switch between
> > pure ticketlock and queued-lock operation as needed.  If too many CPUs
> > are spinning on a given ticket lock, a queue structure will be allocated
> > and the lock will switch to queued-lock operation.  When the lock becomes
> > free, it will switch back into ticketlock operation.  The low-order bit
> > of the head counter is used to indicate that the lock is in queued mode,
> > which forces an unconditional mismatch between the head and tail counters.
> > This approach means that the common-case code path under conditions of
> > low contention is very nearly that of a plain ticket lock.
> > 
> > A fixed number of queueing structures is statically allocated in an
> > array.  The ticket-lock address is used to hash into an initial element,
> > but if that element is already in use, it moves to the next element.  If
> > the entire array is already in use, continue to spin in ticket mode.
> > 
> > This has been only lightly tested in the kernel, though a userspace
> > implementation has survived substantial testing.
> > 
> > Signed-off-by: Paul E. McKenney 
> 
> This looks a great idea ;)

Glad you like it!  Hopefully workloads like it as well.  ;-)

> > +
> > +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> > +{
> > +   __ticket_t head = 2;
> > +
> > +   head = xadd(&lock->tickets.head, 2);
> 
>   head = xadd(&lock->tickets.head, head);

Yikes!  Good catch, fixed.

> > +   if (head & 0x1)
> > +   tkt_q_do_wake(lock);
> > +}
> > +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> 
> > + */
> > +void tkt_q_do_wake(arch_spinlock_t *asp)
> > +{
> > +   struct tkt_q_head *tqhp;
> > +   struct tkt_q *tqp;
> > +
> > +   /* If the queue is still being set up, wait for it. */
> > +   while ((tqhp = tkt_q_find_head(asp)) == NULL)
> > +   cpu_relax();
> > +
> > +   for (;;) {
> > +
> > +   /* Find the first queue element. */
> > +   tqp = ACCESS_ONCE(tqhp->spin);
> > +   if (tqp != NULL)
> > +   break;  /* Element exists, hand off lock. */
> > +   if (tkt_q_try_unqueue(asp, tqhp))
> > +   return; /* No element, successfully removed queue. */
> > +   cpu_relax();
> > +   }
> > +   if (ACCESS_ONCE(tqhp->head_tkt) != -1)
> > +   ACCESS_ONCE(tqhp->head_tkt) = -1;
> > +   smp_mb(); /* Order pointer fetch and assignment against handoff. */
> > +   ACCESS_ONCE(tqp->cpu) = -1;
> > +}
> 
> EXPORT_SYMBOL(tkt_q_do_wake) ?

Good point, just in case we want to use spinlocks in modules.  ;-)
Same for tkt_spin_pass(), I guess.
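
I.e., something like this next to the definitions -- trivial sketch,
and whether to use the GPL-only variant is a separate question:

EXPORT_SYMBOL(tkt_spin_pass);
EXPORT_SYMBOL(tkt_q_do_wake);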

> Hmm, unfortunately I lack time this week to fully read the patch!

I suspect that there is very little danger of this patch going in this
week, so you should have some additional time.  ;-)

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Eric Dumazet
On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?
> 
> This commit therefore allows ticket locks to automatically switch between
> pure ticketlock and queued-lock operation as needed.  If too many CPUs
> are spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
> 
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
> 
> This has been only lightly tested in the kernel, though a userspace
> implementation has survived substantial testing.
> 
> Signed-off-by: Paul E. McKenney 
> 

This looks like a great idea ;)

> +
> +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> +{
> + __ticket_t head = 2;
> +
> + head = xadd(&lock->tickets.head, 2);

head = xadd(&lock->tickets.head, head);

> + if (head & 0x1)
> + tkt_q_do_wake(lock);
> +}
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */

> + */
> +void tkt_q_do_wake(arch_spinlock_t *asp)
> +{
> + struct tkt_q_head *tqhp;
> + struct tkt_q *tqp;
> +
> + /* If the queue is still being set up, wait for it. */
> + while ((tqhp = tkt_q_find_head(asp)) == NULL)
> + cpu_relax();
> +
> + for (;;) {
> +
> + /* Find the first queue element. */
> + tqp = ACCESS_ONCE(tqhp->spin);
> + if (tqp != NULL)
> + break;  /* Element exists, hand off lock. */
> + if (tkt_q_try_unqueue(asp, tqhp))
> + return; /* No element, successfully removed queue. */
> + cpu_relax();
> + }
> + if (ACCESS_ONCE(tqhp->head_tkt) != -1)
> + ACCESS_ONCE(tqhp->head_tkt) = -1;
> + smp_mb(); /* Order pointer fetch and assignment against handoff. */
> + ACCESS_ONCE(tqp->cpu) = -1;
> +}

EXPORT_SYMBOL(tkt_q_do_wake) ?

Hmm, unfortunately I lack time this week to fully read the patch!





Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Paul E. McKenney
On Mon, Jun 10, 2013 at 05:08:25PM -0400, Steven Rostedt wrote:
> On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:
> >  
> > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> > +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> > +
> > +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> > +{
> > +   register struct __raw_tickets inc = { .tail = 2 };
> > +
> > +   inc = xadd(&lock->tickets, inc);
> > +   for (;;) {
> > +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> > +   break;
> > +   inc.head = ACCESS_ONCE(lock->tickets.head);
> > +   }
> > +   barrier(); /* smp_mb() on Power or ARM. */
> > +}
> > +
> > +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> 
> To avoid the above code duplication, I would have this instead:

Nice!  I have updated accordingly.

Thanx, Paul

> #ifdef CONFIG_TICKET_LOCK_QUEUED
> 
> bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> #define __TKT_SPIN_INC 2
> 
> #else
> 
> static inline bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc)
> {
>   return false;
> }
> 
> #define __TKT_SPIN_INC 1
> 
> #endif
> 
> static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> {
>   register struct __raw_tickets inc = { .tail = __TKT_SPIN_INC };
> 
>   inc = xadd(&lock->tickets, inc);
> 
>   for (;;) {
>   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
>   break;
>   cpu_relax();
>   inc.head = ACCESS_ONCE(lock->tickets.head);
>   }
>   barrier();  /* make sure nothing creeps before the lock is taken */
> }
> 
> -- Steve
> 
> 



Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Paul E. McKenney
On Mon, Jun 10, 2013 at 11:01:50PM +0200, Thomas Gleixner wrote:
> On Mon, 10 Jun 2013, Steven Rostedt wrote:
> > On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:
> > > +#ifdef __BIG_ENDIAN__
> > 
> > Is there such a thing as a BIG_ENDIAN x86 box? This is in
> > arch/x86/include/asm/spinlock_types.h
> 
> That's just a habit for people who have been forced to deal with BE
> CPUs.
> 
> The sad thing is that BE CPUs were designed by folks who blindly
> expanded BCD computing without understanding the hardware
> implications.
> 
> Unfortunately, they managed to inflict BE on the network protocols,
> which in turn became the only excuse for keeping the BE nonsense
> alive.

Hey, we had to do -something-, after all, EBCDIC lost out to ASCII!!!  ;-)

Thanx, Paul



Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Steven Rostedt
On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:
>  
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> +
> +static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> +{
> + register struct __raw_tickets inc = { .tail = 2 };
> +
> + inc = xadd(&lock->tickets, inc);
> + for (;;) {
> + if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> + break;
> + inc.head = ACCESS_ONCE(lock->tickets.head);
> + }
> + barrier(); /* smp_mb() on Power or ARM. */
> +}
> +
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +

To avoid the above code duplication, I would have this instead:

#ifdef CONFIG_TICKET_LOCK_QUEUED

bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
#define __TKT_SPIN_INC 2

#else

static inline bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc)
{
return false;
}

#define __TKT_SPIN_INC 1

#endif

static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
register struct __raw_tickets inc = { .tail = __TKT_SPIN_INC };

inc = xadd(&lock->tickets, inc);

for (;;) {
if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
break;
cpu_relax();
inc.head = ACCESS_ONCE(lock->tickets.head);
}
	barrier();	/* make sure nothing creeps before the lock is taken */
}

-- Steve




Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Thomas Gleixner
On Mon, 10 Jun 2013, Steven Rostedt wrote:
> On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:
> > +#ifdef __BIG_ENDIAN__
> 
> Is there such a thing as a BIG_ENDIAN x86 box? This is in
> arch/x86/include/asm/spinlock_types.h

That's just a habit for people who have been forced to deal with BE
CPUs.

The sad thing is that BE CPUs were designed by folks who blindly
expanded BCD computing without understanding the hardware
implications.

Unfortunately, they managed to inflict BE on the network protocols,
which in turn became the only excuse for keeping the BE nonsense
alive.

Thanks,

tglx


Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Paul E. McKenney
On Mon, Jun 10, 2013 at 04:47:58PM -0400, Steven Rostedt wrote:
> On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:
> 
> > --- a/arch/x86/include/asm/spinlock_types.h
> > +++ b/arch/x86/include/asm/spinlock_types.h
> > @@ -7,12 +7,18 @@
> >  
> >  #include <linux/types.h>
> >  
> > -#if (CONFIG_NR_CPUS < 256)
> > +#if (CONFIG_NR_CPUS < 128)
> >  typedef u8  __ticket_t;
> >  typedef u16 __ticketpair_t;
> > -#else
> > +#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
> > +#elif (CONFIG_NR_CPUS < 32768)
> >  typedef u16 __ticket_t;
> >  typedef u32 __ticketpair_t;
> > +#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2 >= (unsigned short)((a) - (b)))
> > +#else
> > +typedef u32 __ticket_t;
> > +typedef u64 __ticketpair_t;
> > +#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2 >= (unsigned int)((a) - (b)))
> >  #endif
> >  
> >  #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
> > @@ -21,7 +27,11 @@ typedef struct arch_spinlock {
> > union {
> > __ticketpair_t head_tail;
> > struct __raw_tickets {
> > +#ifdef __BIG_ENDIAN__
> 
> Is there such a thing as a BIG_ENDIAN x86 box? This is in
> arch/x86/include/asm/spinlock_types.h

Nope.  Preparation work for moving this to common code.

Good point though, I should have skipped this #ifdef until I got everything
into common code.

Thanx, Paul

> -- Steve
> 
> > +   __ticket_t tail, head;
> > +#else /* #ifdef __BIG_ENDIAN__ */
> > __ticket_t head, tail;
> > +#endif /* #else #ifdef __BIG_ENDIAN__ */
> > } tickets;
> > };
> >  } arch_spinlock_t;
> 
> 



Re: [PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-10 Thread Steven Rostedt
On Sun, 2013-06-09 at 12:36 -0700, Paul E. McKenney wrote:

> --- a/arch/x86/include/asm/spinlock_types.h
> +++ b/arch/x86/include/asm/spinlock_types.h
> @@ -7,12 +7,18 @@
>  
>  #include <linux/types.h>
>  
> -#if (CONFIG_NR_CPUS < 256)
> +#if (CONFIG_NR_CPUS < 128)
>  typedef u8  __ticket_t;
>  typedef u16 __ticketpair_t;
> -#else
> +#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
> +#elif (CONFIG_NR_CPUS < 32768)
>  typedef u16 __ticket_t;
>  typedef u32 __ticketpair_t;
> +#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2 >= (unsigned short)((a) - (b)))
> +#else
> +typedef u32 __ticket_t;
> +typedef u64 __ticketpair_t;
> +#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2 >= (unsigned int)((a) - (b)))
>  #endif
>  
>  #define TICKET_SHIFT (sizeof(__ticket_t) * 8)
> @@ -21,7 +27,11 @@ typedef struct arch_spinlock {
>   union {
>   __ticketpair_t head_tail;
>   struct __raw_tickets {
> +#ifdef __BIG_ENDIAN__

Is there such a thing as a BIG_ENDIAN x86 box? This is in
arch/x86/include/asm/spinlock_types.h

-- Steve

> + __ticket_t tail, head;
> +#else /* #ifdef __BIG_ENDIAN__ */
>   __ticket_t head, tail;
> +#endif /* #else #ifdef __BIG_ENDIAN__ */
>   } tickets;
>   };
>  } arch_spinlock_t;




[PATCH RFC ticketlock] Auto-queued ticketlock

2013-06-09 Thread Paul E. McKenney
Breaking up locks is better than implementing high-contention locks, but
if we must have high-contention locks, why not make them automatically
switch between light-weight ticket locks at low contention and queued
locks at high contention?

This commit therefore allows ticket locks to automatically switch between
pure ticketlock and queued-lock operation as needed.  If too many CPUs
are spinning on a given ticket lock, a queue structure will be allocated
and the lock will switch to queued-lock operation.  When the lock becomes
free, it will switch back into ticketlock operation.  The low-order bit
of the head counter is used to indicate that the lock is in queued mode,
which forces an unconditional mismatch between the head and tail counters.
This approach means that the common-case code path under conditions of
low contention is very nearly that of a plain ticket lock.
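
To make the forced mismatch concrete (a worked sketch, not code taken from
the patch; the helper name is made up for illustration): tickets are handed
out in units of two, so both counters stay even during normal operation.
Setting bit 0 of the head counter makes it odd while every tail value
remains even, so the head == tail test can never succeed until that bit is
cleared, and arriving CPUs are steered into the queued slow path instead:

static inline bool tkt_lock_is_queued_sketch(arch_spinlock_t *lock)
{
	/* Odd head value means queued mode; even head means plain ticket mode. */
	return ACCESS_ONCE(lock->tickets.head) & 0x1;
}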

A fixed number of queueing structures is statically allocated in an
array.  The ticket-lock address is used to hash into an initial element,
but if that element is already in use, it moves to the next element.  If
the entire array is already in use, continue to spin in ticket mode.
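
A rough sketch of that hash-and-probe allocation follows.  It is written in
the spirit of the patch but with made-up names: TKT_Q_NQUEUES, tkt_q_heads,
the ->ref field, and the helper itself are illustrative assumptions rather
than the patch's actual identifiers.

#include <linux/atomic.h>	/* cmpxchg() */
#include <linux/hash.h>		/* hash_ptr() */
#include <linux/log2.h>		/* ilog2() */
#include <linux/spinlock.h>	/* arch_spinlock_t */

#define TKT_Q_NQUEUES 64	/* assumed size of the static pool */

struct tkt_q_head_sketch {
	arch_spinlock_t *ref;	/* lock using this queue; NULL means the slot is free */
	/* ... per-queue state (list of spinning CPUs, head ticket, ...) ... */
};

static struct tkt_q_head_sketch tkt_q_heads[TKT_Q_NQUEUES];

/* Hash the lock address to a starting slot, then probe linearly for a free one. */
static struct tkt_q_head_sketch *tkt_q_alloc_sketch(arch_spinlock_t *lock)
{
	int i = hash_ptr(lock, ilog2(TKT_Q_NQUEUES));
	int n;

	for (n = 0; n < TKT_Q_NQUEUES; n++) {
		struct tkt_q_head_sketch *tqhp = &tkt_q_heads[i];

		/* Claim an unused slot atomically; NULL means "free". */
		if (tqhp->ref == NULL && cmpxchg(&tqhp->ref, NULL, lock) == NULL)
			return tqhp;
		i = (i + 1) % TKT_Q_NQUEUES;	/* slot taken, move to the next one */
	}
	return NULL;	/* entire array in use: caller keeps spinning in ticket mode */
}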

This has been only lightly tested in the kernel, though a userspace
implementation has survived substantial testing.

Signed-off-by: Paul E. McKenney 

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..b4a91b0 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+#ifndef CONFIG_TICKET_LOCK_QUEUED
+
 /*
  * Ticket locks are conceptually two parts, one indicating the current head of
  * the queue, and the other indicating the current tail. The lock is acquired
@@ -62,6 +64,25 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
	barrier();	/* make sure nothing creeps before the lock is taken */
 }
 
+#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
+
+bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
+
+static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
+{
+   register struct __raw_tickets inc = { .tail = 2 };
+
+   inc = xadd(&lock->tickets, inc);
+   for (;;) {
+   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
+   break;
+   inc.head = ACCESS_ONCE(lock->tickets.head);
+   }
+   barrier(); /* smp_mb() on Power or ARM. */
+}
+
+#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
+
 static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
 {
arch_spinlock_t old, new;
@@ -70,17 +91,37 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
if (old.tickets.head != old.tickets.tail)
return 0;
 
+#ifndef CONFIG_TICKET_LOCK_QUEUED
new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
+#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
+   new.head_tail = old.head_tail + (2 << TICKET_SHIFT);
+#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
 
/* cmpxchg is a full barrier, so nothing can move before it */
	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
 }
 
+#ifndef CONFIG_TICKET_LOCK_QUEUED
+
 static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
 {
__add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
 }
 
+#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
+
+extern void tkt_q_do_wake(arch_spinlock_t *asp);
+
+static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
+{
+   __ticket_t head = 2;
+
+   head = xadd(&lock->tickets.head, 2);
+   if (head & 0x1)
+   tkt_q_do_wake(lock);
+}
+#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
+
 static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
 {
struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index ad0ad07..cdaefdd 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -7,12 +7,18 @@
 
 #include <linux/types.h>
 
-#if (CONFIG_NR_CPUS < 256)
+#if (CONFIG_NR_CPUS < 128)
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
-#else
+#define TICKET_T_CMP_GE(a, b) (UCHAR_MAX / 2 >= (unsigned char)((a) - (b)))
+#elif (CONFIG_NR_CPUS < 32768)
 typedef u16 __ticket_t;
 typedef u32 __ticketpair_t;
+#define TICKET_T_CMP_GE(a, b) (USHRT_MAX / 2 >= (unsigned short)((a) - (b)))
+#else
+typedef u32 __ticket_t;
+typedef u64 __ticketpair_t;
+#define TICKET_T_CMP_GE(a, b) (UINT_MAX / 2 >= (unsigned int)((a) - (b)))
 #endif
 
 #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
@@ -21,7 +27,11 @@ typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
struct __raw_tickets {
+#ifdef __BIG_ENDIAN__
+   __ticket_t tail, head;
+#else /* #ifdef __BIG_ENDIAN__ */