Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-07 Thread Nick Piggin
On Fri, Nov 02, 2007 at 04:33:32PM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > Anyway, if this can make its way to the x86 tree, I think it will get 
> > pulled into -mm (?) and get some exposure...
> 
> ok, we can certainly try it there.

Is there anything in particular I have to do to get it into the x86 tree? And
presumably that tree is going to be picked up by Andrew at some
point? (-mm exposure is the main objective here; the only reason I
didn't send it to Andrew directly is in the interest of doing the
right thing.)


> Your code is really nifty.

Thanks. One of the CPU engineers at Intel asked to use it as their reference
ticket lock code sequence, which was pretty cool :)


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Nick Piggin
On Fri, Nov 02, 2007 at 08:56:46PM -0400, Chuck Ebbert wrote:
> On 11/02/2007 07:01 PM, Nick Piggin wrote:
> > 
> > In the contended multi-threaded tight loop, the xchg lock is slower than the
> > inc lock but still beats the fair xadd lock; that's only because it is just
> > as unfair, if not more so, on this hardware (runtime difference of up to
> > about 10%).
> > 
> 
> I meant xchg for unlock, not lock.

That is for unlock. 2x the number of atomic operations ~= 2x the cost.
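
Concretely, here is a minimal illustrative sketch (GCC atomic builtins,
not the kernel's actual spinlock code) of the two release styles being
compared; the xchg variant adds a second bus-locked operation per
lock/unlock cycle:

/* Illustrative only.  A locked acquire plus a plain-store release is
 * one atomic (bus-locked) operation per lock/unlock cycle; releasing
 * with xchg makes it two. */
static volatile unsigned int slock;     /* 0 = unlocked, 1 = held */

static void release_mov(void)
{
        /* Compiles to a plain store (mov) -- no lock prefix. */
        __atomic_store_n(&slock, 0, __ATOMIC_RELEASE);
}

static void release_xchg(void)
{
        unsigned int old;
        /* Compiles to xchg, which is implicitly locked on x86, so the
         * release now costs a second atomic operation. */
        old = __atomic_exchange_n(&slock, 0, __ATOMIC_RELEASE);
        (void)old;
}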



Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Chuck Ebbert
On 11/02/2007 07:01 PM, Nick Piggin wrote:
> 
> In the contended multi-threaded tight loop, the xchg lock is slower than the
> inc lock but still beats the fair xadd lock; that's only because it is just
> as unfair, if not more so, on this hardware (runtime difference of up to
> about 10%).
> 

I meant xchg for unlock, not lock.



Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Nick Piggin
On Fri, Nov 02, 2007 at 09:51:27AM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 2 Nov 2007, Chuck Ebbert wrote:
> > 
> > There's also a very easy way to get better fairness with our current 
> > spinlocks:
> > use xchg to release the lock instead of mov.
> 
> That does nothing at all.
> 
> Yes, it slows the unlock down, which in turn on some machines will make it 
> easier for another core/socket to get it, but it's purely about the 
> slowdown, nothing else. 

Yeah, it's not such a good idea... it slows down the single-threaded case
like crazy. On my dual-core Core2:

_Single thread_
inc-lock in cache takes 21.94ns
xadd-lock in cache takes 22.64ns
xchg-lock in cache takes 35.21ns

inc-lock out of cache takes 140.73ns
xadd-lock out of cache takes 141.15ns
xchg-lock out of cache takes 155.13ns
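
For reference, numbers of this sort can be produced with a tight
single-threaded lock/unlock loop. The harness below is an assumed
reconstruction for illustration only (it is not the actual test that
generated the figures above) and times a simple exchange-based lock:

/* Assumed sketch of a single-thread "in cache" timing loop. */
#include <stdio.h>
#include <time.h>

static volatile unsigned int slock;             /* 0 = free, 1 = held */

static void test_lock(void)
{
        while (__atomic_exchange_n(&slock, 1, __ATOMIC_ACQUIRE))
                ;                               /* never spins here: single thread */
}

static void test_unlock(void)
{
        __atomic_store_n(&slock, 0, __ATOMIC_RELEASE);
}

int main(void)
{
        enum { N = 10 * 1000 * 1000 };
        struct timespec a, b;

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < N; i++) {
                test_lock();
                test_unlock();
        }
        clock_gettime(CLOCK_MONOTONIC, &b);

        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("lock+unlock in cache takes %.2fns\n", ns / N);
        return 0;
}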


In the contended multi-threaded tight loop, the xchg lock is slower than the
inc lock but still beats the fair xadd lock; that's only because it is just
as unfair, if not more so, on this hardware (runtime difference of up to
about 10%).
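
For readers following along, here is a minimal C-level sketch of the fair
xadd (ticket) scheme discussed above, written with GCC atomic builtins
instead of the patch's inline assembly; it is illustrative only, not the
posted implementation:

/* Illustrative ticket lock: each locker takes a ticket ("next") and
 * spins until "owner" reaches it, so the lock is granted in strict
 * FIFO order.  Not the posted x86 code. */
struct ticket_lock {
        unsigned short owner;   /* ticket currently being served */
        unsigned short next;    /* next ticket to hand out */
};

static void ticket_lock(struct ticket_lock *l)
{
        unsigned short me = __atomic_fetch_add(&l->next, 1, __ATOMIC_ACQUIRE);

        while (__atomic_load_n(&l->owner, __ATOMIC_ACQUIRE) != me)
                asm volatile("rep; nop" : : : "memory");  /* cpu_relax() */
}

static void ticket_unlock(struct ticket_lock *l)
{
        /* Only the holder advances owner, so a release store suffices. */
        __atomic_store_n(&l->owner, l->owner + 1, __ATOMIC_RELEASE);
}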


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Nick Piggin
On Fri, Nov 02, 2007 at 10:05:37AM -0400, Rik van Riel wrote:
> On Fri, 2 Nov 2007 07:42:20 +0100
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
> > > 
> > > 
> > > On Thu, 1 Nov 2007, Rik van Riel wrote:
> > > > 
> > > > Larry Woodman managed to wedge the VM into a state where, on his
> > > > 4x dual core system, only 2 cores (on the same CPU) could get the
> > > > zone->lru_lock overnight.  The other 6 cores on the system were
> > > > just spinning, without being able to get the lock.
> > 
> > That's quite incredible, considering that the CPUs actually _taking_
> > the locks also drop the locks and do quite a bit of work before taking
> > them again (ie. they take them to pull pages off the LRU, but then
> > do a reasonable amount of work to remove each one from pagecache
> > before refilling from the LRU).
> > 
> > That is possibly a *more* difficult case for the HW to
> > handle: once the CPU actually goes away and operates on other
> > cachelines, it may get a little more difficult to detect that it is
> > causing starvation issues.
> 
> In case of the zone->lru_lock, grabbing the spinlock does
> not mean that the process is letting go of the cacheline.
> 
> On the contrary, once the spinlock has been grabbed, the
> real cacheline prodding begins.

I didn't say that, though. Obviously the hardware can't do anything
about starvation until a lock is released.


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Linus Torvalds


On Fri, 2 Nov 2007, Chuck Ebbert wrote:
> 
> There's also a very easy way to get better fairness with our current 
> spinlocks:
> use xchg to release the lock instead of mov.

That does nothing at all.

Yes, it slows the unlock down, which in turn on some machines will make it 
easier for another core/socket to get it, but it's purely about the 
slowdown, nothing else. 

Linus


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Chuck Ebbert
On 11/01/2007 10:03 AM, Nick Piggin wrote:
> Introduce ticket lock spinlocks for x86 which are FIFO. The implementation
> is described in the comments. The straight-line lock/unlock instruction
> sequence is slightly slower than the dec-based locks on modern x86 CPUs;
> however, the difference is quite small on Core2 and Opteron when working out of
> cache, and becomes almost insignificant even on P4 when the lock misses cache.
> trylock is slower by a more significant margin, but trylocks are relatively rare.
> 
> On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticeable,
> with a userspace test having a difference of up to 2x runtime per thread, and
> some threads are starved or "unfairly" granted the lock up to 1 000 000 (!)
> times. After this patch, all threads appear to finish at exactly the same
> time.

There's also a very easy way to get better fairness with our current spinlocks:
use xchg to release the lock instead of mov.



Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Ingo Molnar

* Nick Piggin <[EMAIL PROTECTED]> wrote:

> Anyway, if this can make its way to the x86 tree, I think it will get 
> pulled into -mm (?) and get some exposure...

ok, we can certainly try it there. Your code is really nifty.

Ingo


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Gregory Haskins
Linus Torvalds wrote:
> 
> On Thu, 1 Nov 2007, Gregory Haskins wrote:
>> I had observed this phenomenon on some 8-ways here as well, but I didn't
>> have the bandwidth to code something up.  Thumbs up!
> 
> Can you test under interesting loads?

Sure thing.  I'll try this next week.

> 
> We're interested in:
>  - is the unfairness fix really noticeable (or does it just move the 
>    problem somewhere else, and there is no real change in behaviour)
>  - what is the performance impact?
> 
> In particular, unfair spinlocks have the potential to perform much better. 

I see where you are going here, and I mostly agree.  I think the key is
that "given equal contention, let the guy with the hottest cache win".
The problem with the current implementation is that the spinlocks have
no way to gauge the details of the contention.  They can only gauge
instantaneous snapshots of state as viewed by each TSL invocation, which
effectively resets your position each time.

On the flip side, Nick's patches take the opposite extreme.  If a lock
is contended, get in line. ;)  This has the desirable property of
avoiding starvation.  However, it will also tend to cause more bouncing
since you are virtually guaranteed not to re-win the contended lock, as
you point out next.

> Not so much because the spinlock itself acts all that differently, but 
> because being unfair also fundamentally tends to keep the data structures 
> that are *protected* by the spinlock on just one CPU.

My issue here is that this behavior can also be described as precisely
part of the problem being addressed:  That is, both CPUs presumably
*want/need* access to the data or they wouldn't be taking the spinlock
to begin with.  So it's really not a question of keeping the structures
on one CPU per se (at least, not for unbounded durations or the system
won't operate properly).

Rather, I think the key is to minimize the impact by bouncing things
intelligently. ;)  I.e., if all things are equal, favor the hottest task
so the data only bounces once instead of twice.  Outside of this
condition, operate strictly FIFO.  If we can reasonably calculate when
this optimization is possible, we will have the best of both worlds.  I
have some ideas about ways to extend Nick's algorithm to support this,
which I will submit ASAP.

I think the rest of what you said is very fair: prove that it's a
problem, that this concept helps, and that we don't make things worse ;)

Will do, ASAP.

Regards,
-Greg



Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Rik van Riel
On Fri, 2 Nov 2007 07:42:20 +0100
Nick Piggin <[EMAIL PROTECTED]> wrote:

> On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Thu, 1 Nov 2007, Rik van Riel wrote:
> > > 
> > > Larry Woodman managed to wedge the VM into a state where, on his
> > > 4x dual core system, only 2 cores (on the same CPU) could get the
> > > zone->lru_lock overnight.  The other 6 cores on the system were
> > > just spinning, without being able to get the lock.
> 
> That's quite incredible, considering that the CPUs actually _taking_
> the locks also drop the locks and do quite a bit of work before taking
> them again (ie. they take them to pull pages off the LRU, but then
> do a reasonable amount of work to remove each one from pagecache
> before refilling from the LRU).
> 
> That is possibly a *more* difficult case for the HW to
> handle: once the CPU actually goes away and operates on other
> cachelines, it may get a little more difficult to detect that it is
> causing starvation issues.

In case of the zone->lru_lock, grabbing the spinlock does
not mean that the process is letting go of the cacheline.

On the contrary, once the spinlock has been grabbed, the
real cacheline prodding begins.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 1 Nov 2007, Rik van Riel wrote:
> > 
> > Larry Woodman managed to wedge the VM into a state where, on his
> > 4x dual core system, only 2 cores (on the same CPU) could get the
> > zone->lru_lock overnight.  The other 6 cores on the system were
> > just spinning, without being able to get the lock.

That's quite incredible, considering that the CPUs actually _taking_
the locks also drop the locks and do quite a bit of work before taking
them again (ie. they take them to pull pages off the LRU, but then
do a reasonable amount of work to remove each one from pagecache before
refilling from the LRU).

That is possibly a *more* difficult case for the HW to handle:
once the CPU actually goes away and operates on other cachelines, it
may get a little more difficult to detect that it is causing starvation
issues.


> .. and this is almost always the result of a locking *bug*, not unfairness 
> per se. IOW, unfairness just ends up showing the bug in the first place.

I'd almost agree, but there are always going to be corner cases where
we get multiple contentions on a spinlock -- the fact that a lock is
needed at all obviously suggests that it can be contended. The LRU locking
could be improved, but you could have eg. scheduler runqueue lock starvation
if the planets lined up just right, and it is a little more difficult to
improve on runqueue locking.

Anyway, I also think this is partially a hardware issue, and as multiple
cores, threads, and sockets get more common, I hope it will improve (it
affects Intel CPUs as well as AMD). So it is possible to have an option
to switch between locks if the hardware is fairer, but I want to get
as much exposure with this locking as possible for now, to see if there
are any funny performance corner cases exposed (which quite possibly will
turn out to be caused by suboptimal locking itself).

Anyway, if this can make its way to the x86 tree, I think it will get
pulled into -mm (?) and get some exposure...



Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Rik van Riel
On Thu, 1 Nov 2007 18:19:41 -0700 (PDT)
Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Thu, 1 Nov 2007, Rik van Riel wrote:
> > 
> > Larry Woodman managed to wedge the VM into a state where, on his
> > 4x dual core system, only 2 cores (on the same CPU) could get the
> > zone->lru_lock overnight.  The other 6 cores on the system were
> > just spinning, without being able to get the lock.
> 
> .. and this is almost always the result of a locking *bug*, not
> unfairness per se. IOW, unfairness just ends up showing the bug in
> the first place.

No argument there.  If you have the kind of lock contention where
fairness matters, the contention is probably what needs to be fixed,
not the locking mechanism.

Having said that, making bugs like that less likely to totally wedge
a system would be a good thing for everybody who uses Linux in
production.  Exposing bugs is good for development, bad for business.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Linus Torvalds


On Thu, 1 Nov 2007, Rik van Riel wrote:
> 
> Larry Woodman managed to wedge the VM into a state where, on his
> 4x dual core system, only 2 cores (on the same CPU) could get the
> zone->lru_lock overnight.  The other 6 cores on the system were
> just spinning, without being able to get the lock.

.. and this is almost always the result of a locking *bug*, not unfairness 
per se. IOW, unfairness just ends up showing the bug in the first place.

Linus


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Rik van Riel
On Thu, 1 Nov 2007 09:38:22 -0700 (PDT)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> So "unfair" is obviously always bad. Except when it isn't.

Larry Woodman managed to wedge the VM into a state where, on his
4x dual core system, only 2 cores (on the same CPU) could get the
zone->lru_lock overnight.  The other 6 cores on the system were
just spinning, without being able to get the lock.

On the other hand, spinlock contention in the page replacement
code is just a symptom of the fact that we scan too many pages.
It can probably be fixed in other ways...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 04:01:45PM -0400, Chuck Ebbert wrote:
> On 11/01/2007 10:03 AM, Nick Piggin wrote:
> 
> [edited to show the resulting code]
> 
> > +   __asm__ __volatile__ (
> > +   LOCK_PREFIX "xaddw %w0, %1\n"
> > +   "1:\t"
> > +   "cmpb %h0, %b0\n\t"
> > +   "je 2f\n\t"
> > +   "rep ; nop\n\t"
> > +   "movb %1, %b0\n\t"
> > +   /* don't need lfence here, because loads are in-order */
> > "jmp 1b\n"
> > +   "2:"
> > +   :"+Q" (inc), "+m" (lock->slock)
> > +   :
> > +   :"memory", "cc");
> >  }
> 
> If you really thought you might get long queues, you could figure out
> how far back you are and use that to determine how long to wait before
> testing the lock again. That cmpb could become a subb without adding
> overhead to the fast path -- that would give you the queue length (or
> its complement anyway.)

Indeed. You can use this as a really nice input into a backoff
algorithm (eg. if you're next in line, don't back off, or at least
don't go into exponential backoff; if you've got people in front
of you, start throttling harder).

I think I'll leave that to SGI if they come up with a big x86 SSI ;)
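
As a purely hypothetical sketch of that idea, built on the illustrative
ticket_lock structure shown earlier in the thread (not part of the posted
patch), a waiter could scale its polling interval by its distance from
the head of the queue:

/* Hypothetical proportional backoff: the further back in the queue we
 * are, the less aggressively we poll, so the next-in-line waiter sees
 * the release soonest.  Uses the struct ticket_lock sketched above. */
static void ticket_lock_backoff(struct ticket_lock *l)
{
        unsigned short me = __atomic_fetch_add(&l->next, 1, __ATOMIC_ACQUIRE);

        for (;;) {
                unsigned short owner =
                        __atomic_load_n(&l->owner, __ATOMIC_ACQUIRE);
                unsigned short ahead = (unsigned short)(me - owner);

                if (!ahead)
                        return;                 /* our turn */

                /* Queue depth drives the wait: next in line barely
                 * backs off, deeper waiters throttle harder. */
                for (unsigned int i = 0; i < 64u * ahead; i++)
                        asm volatile("rep; nop" : : : "memory");
        }
}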


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Chuck Ebbert
On 11/01/2007 10:03 AM, Nick Piggin wrote:

[edited to show the resulting code]

> + __asm__ __volatile__ (
> + LOCK_PREFIX "xaddw %w0, %1\n"
> + "1:\t"
> + "cmpb %h0, %b0\n\t"
> + "je 2f\n\t"
> + "rep ; nop\n\t"
> + "movb %1, %b0\n\t"
> + /* don't need lfence here, because loads are in-order */
>   "jmp 1b\n"
> + "2:"
> + :"+Q" (inc), "+m" (lock->slock)
> + :
> + :"memory", "cc");
>  }

If you really thought you might get long queues, you could figure out
how far back you are and use that to determine how long to wait before
testing the lock again. That cmpb could become a subb without adding
overhead to the fast path -- that would give you the queue length (or
its complement anyway.)



Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Linus Torvalds


On Thu, 1 Nov 2007, Gregory Haskins wrote:
> 
> I had observed this phenomenon on some 8-ways here as well, but I didn't
> have the bandwidth to code something up.  Thumbs up!

Can you test under interesting loads?

We're interested in:
 - is the unfairness fix really noticeable (or does it just move the 
   problem somewhere else, and there is no real change in behaviour)
 - what is the performance impact?

In particular, unfair spinlocks have the potential to perform much better. 
Not so much because the spinlock itself acts all that differently, but 
because being unfair also fundmanetally tends to keep the data structures 
that are *protected* by the spinlock on just one CPU.

So "unfair" is obviously always bad. Except when it isn't. I'd personally 
like to merge the ticket spinlocks, but I'd really like to have people who 
have real loads where they matter actually also do some performance 
testing. Because I do think it will potentially be a performance downer. 

(I obviously hope it won't be, but..)

Linus


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Gregory Haskins
Nick Piggin wrote:
> Introduce ticket lock spinlocks for x86 which are FIFO. The implementation
> is described in the comments. The straight-line lock/unlock instruction
> sequence is slightly slower than the dec-based locks on modern x86 CPUs;
> however, the difference is quite small on Core2 and Opteron when working out of
> cache, and becomes almost insignificant even on P4 when the lock misses cache.
> trylock is slower by a more significant margin, but trylocks are relatively rare.
> 
> On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticeable,
> with a userspace test having a difference of up to 2x runtime per thread, and
> some threads are starved or "unfairly" granted the lock up to 1 000 000 (!)
> times. After this patch, all threads appear to finish at exactly the same
> time.

I had observed this phenomenon on some 8-ways here as well, but I didn't
have the bandwidth to code something up.  Thumbs up!

Regards,
-Greg