Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-02-03 Thread Chegu Vinod

On 1/25/2013 11:05 AM, Rik van Riel wrote:

Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the
uncontended case. After all, if there is contention, the
data structure can be improved to reduce or eliminate
lock contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks means scaling the time
between checks of the ticket by a delay factor times the number
of CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
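
For illustration, here is a minimal user-space sketch of the scheme
just described. This is not the actual patch: the ticket_lock_t type,
the single global delay factor, and the scaling below are assumptions
made for the example (the real patches' bookkeeping differs); it only
shows how the wait becomes proportional to queue position.

#include <stdatomic.h>

typedef struct {
    atomic_ushort head;               /* ticket now being served */
    atomic_ushort tail;               /* next ticket to hand out */
} ticket_lock_t;

static unsigned int delay = 50;       /* hypothetical delay factor */

static void ticket_lock(ticket_lock_t *lock)
{
    unsigned short me = atomic_fetch_add(&lock->tail, 1);

    for (;;) {
        unsigned short ahead = me - atomic_load(&lock->head);

        if (ahead == 0)
            return;                   /* our turn: lock acquired */

        /* Wait in proportion to our position in the queue: the CPU
         * next in line polls the lock word often, CPUs far back poll
         * rarely and leave the cache line to the lock holder. */
        for (unsigned int i = 0; i < (unsigned int)ahead * delay; i++)
            __builtin_ia32_pause();   /* x86 PAUSE, i.e. cpu_relax() */
    }
}

static void ticket_unlock(ticket_lock_t *lock)
{
    atomic_fetch_add(&lock->head, 1); /* serve the next ticket */
}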

The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.

Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.

However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads on a 24-CPU, 4-node test system,
performance is down to about 290k operations/second.
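
The test program itself is not part of this thread, so the following
is only a guess at its shape, using a POSIX semaphore; the names, the
operation count, and the use of sem_wait/sem_post are assumptions made
to make the workload concrete.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>

#define OPS 1000000                   /* hypothetical per-thread op count */

static sem_t sem;

static void *worker(void *arg)
{
    /* Every thread hammers the same semaphore, so the kernel-side
     * lock guarding its wait queue is pounded on by all CPUs. */
    for (int i = 0; i < OPS; i++) {
        sem_wait(&sem);
        sem_post(&sem);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int i, nthreads = argc > 1 ? atoi(argv[1]) : 1;
    pthread_t tid[24];                /* the test box had 24 CPUs */

    if (nthreads > 24)
        nthreads = 24;

    sem_init(&sem, 0, 1);
    for (i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    printf("%d threads done, %d ops each\n", nthreads, OPS);
    return 0;
}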

With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.
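
The cover letter doesn't spell the tuning rule out beyond "simple and
aggressive" (see the v3 note below), so the following is only a
plausible sketch, continuing the toy lock above; the names, the step
sizes, and the target of roughly two polls per queue position (the
"2x spinning" mentioned for the graph) are all assumptions.

/* Called once the lock has been acquired.  'polls' counts how many
 * times the wait loop checked the ticket; 'start_ahead' is how many
 * CPUs were queued ahead of us when we started waiting. */
static void tune_delay(unsigned int *delay,
                       unsigned int polls, unsigned int start_ahead)
{
    if (polls > 3 * start_ahead)
        *delay += 1;                  /* polled too often: back off harder */
    else if (polls < 2 * start_ahead)
        *delay -= *delay / 16;        /* barely polled: risk of oversleeping */
}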

The attached graph shows the performance of the multi-threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
as well as the v1 patch series with autotuning for 2x and 2.7x
spinning before the lock is obtained, and with the v2 series.

The v2 series integrates several ideas from Michel Lespinasse
and Eric Dumazet, which should result in better throughput and
nicer behaviour in situations with contention on multiple locks.

For the v3 series, I tried out all the ideas suggested by
Michel. They made perfect sense, but in the end it turned
out they did not work as well as the simple, aggressive
"try to make the delay longer" policy I have now. Several
small bug fixes and cleanups have been integrated.

For the v4 series, I added code to keep the maximum spinlock
delay to a small value when running in a virtual machine. That
should solve the performance regression seen in virtual machines.
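
A sketch of what such a cap could look like: the guest-side limit of
16 matches Raghavendra's note further down that patch 5 limits the
delay loops to 16, but the CPUID-based detection helper and the
bare-metal ceiling shown here are assumptions for the example.

#include <cpuid.h>                    /* GCC/clang __get_cpuid() */

#define MAX_DELAY_GUEST      16       /* patch 5 caps delay loops at 16 */
#define MAX_DELAY_BARE_METAL 16000    /* illustrative ceiling */

/* CPUID leaf 1, ECX bit 31 is set when running under a hypervisor;
 * Linux exposes this as X86_FEATURE_HYPERVISOR. */
static int running_in_guest(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 31) & 1;
}

static unsigned int max_spinlock_delay(void)
{
    /* Spinning longer is wasted in a guest, where the vcpu holding
     * the lock may itself have been preempted by the host. */
    return running_in_guest() ? MAX_DELAY_GUEST : MAX_DELAY_BARE_METAL;
}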

The performance issue observed with AIM7 is still a mystery.

Performance is within the margin of error of v2, so the graph
has not been updated.

Please let me know if you manage to break this code in any way,
so I can fix it...

I got back on the machine and re-ran the AIM7-highsystime
microbenchmark with 2000 users and 100 jobs per user
on 20-, 40-, and 80-vcpu guests, using the 3.7.5 kernel with and
without Rik's latest patches.


Host platform: 8-socket (80-core) Westmere with 1TB RAM.

Config 1 : 3.7.5 base running on host and in the guest

Config 2 : 3.7.5 + Rik's patches running on host and in the guest

(Note: I didn't change the PLE settings on the host... The severe drop
at 40-way and 80-way is due to the un-optimized PLE handler;
Raghu's PLE fixes should address that.)


          Config 1   Config 2
20 vcpu   170K       168K
40 vcpu   37K        37K
80 vcpu   10.5K      11.5K


Not much difference between the two configs (still need to test along
with Raghu's fixes).


BTW, I noticed that there were results posted earlier using the
AIM7-compute workload.  AIM7-highsystime is a lot more kernel-intensive.


FYI
Vinod


Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-27 Thread Raghavendra K T

On 01/26/2013 12:35 AM, Rik van Riel wrote:

Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the
uncontended case. After all, if there is contention, the
data structure can be improved to reduce or eliminate
lock contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks means scaling the time
between checks of the ticket by a delay factor times the number
of CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.

The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.

Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.

However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads on a 24-CPU, 4-node test system,
performance is down to about 290k operations/second.

With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.

The attached graph shows the performance of the multi-threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
as well as the v1 patch series with autotuning for 2x and 2.7x
spinning before the lock is obtained, and with the v2 series.

The v2 series integrates several ideas from Michel Lespinasse
and Eric Dumazet, which should result in better throughput and
nicer behaviour in situations with contention on multiple locks.

For the v3 series, I tried out all the ideas suggested by
Michel. They made perfect sense, but in the end it turned
out they did not work as well as the simple, aggressive
"try to make the delay longer" policy I have now. Several
small bug fixes and cleanups have been integrated.

For the v4 series, I added code to keep the maximum spinlock
delay to a small value when running in a virtual machine. That
should solve the performance regression seen in virtual machines.

The performance issue observed with AIM7 is still a mystery.

Performance is within the margin of error of v2, so the graph
has not been updated.

Please let me know if you manage to break this code in any way,
so I can fix it...



After the introduction of patch 5, which limits the delay loops to 16,
I am no longer seeing the degradation in virtual guests reported
earlier, but improvements.

For the whole series:
Reviewed-by: Raghavendra K T 


Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-26 Thread Mike Galbraith
On Sat, 2013-01-26 at 13:05 +0100, Ingo Molnar wrote: 
> * Mike Galbraith  wrote:
> 
> > On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:
> > 
> > > The performance issue observed with AIM7 is still a mystery.
> > 
> > Hm.  AIM7 mystery _may_ be the same crud I see on a 4-node, 40-core
> > box. Stock scheduler knobs are too preempt-happy and produce
> > unstable results. I twiddle them as below to stabilize
> > results.
> > 
> > I'm testing a load balancing series from Alex Shi with AIM7 
> > and whatnot, added your series on top of it and retested.  
> > What I see is improvement.
> 
> Since we want both series, that's good news, agreed?

Yes, both look good.

BTW, the low-end improvements of balance and powersaving in the Alex
set are not due to any select_idle_sibling() misbehavior, as one might
suspect when looking at the set; it's innocent (this time).  The gain
is from the balancing policies themselves.  I don't see where it comes
from; gotta be some turbo thingy methinks, but who cares, better is
better.

-Mike


Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-26 Thread Ingo Molnar

* Mike Galbraith  wrote:

> On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:
> 
> > The performance issue observed with AIM7 is still a mystery.
> 
> Hm.  AIM7 mystery _may_ be the same crud I see on a 4-node, 40-core
> box. Stock scheduler knobs are too preempt-happy and produce
> unstable results. I twiddle them as below to stabilize
> results.
> 
> I'm testing a load balancing series from Alex Shi with AIM7 
> and whatnot, added your series on top of it and retested.  
> What I see is improvement.

Since we want both series, that's good news, agreed?

Thanks,

Ingo


Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-25 Thread Mike Galbraith
On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:

> The performance issue observed with AIM7 is still a mystery.

Hm.  AIM7 mystery _may_ be the same crud I see on a 4-node, 40-core box.
Stock scheduler knobs are too preempt-happy and produce unstable results.
I twiddle them as below to stabilize results.

I'm testing a load balancing series from Alex Shi with AIM7 and whatnot,
added your series on top of it and retested.  What I see is
improvement. 

Oodles of numbers follow.  Sorry that your numbers are mixed in with my
numbers, but this is just an excerpt from my test log, and I'm too lazy
to reformat and filter.  You can save wear and tear on your eyeballs by
just poking 'D'.  There does appear to be evidence that your patch set
improved this load though, so in case you want to see numbers, here come
a bunch; a quick scroll-by may be worth it.

The very heavy load end did not improve, which seems odd, but whatever.
Numbers...

sched_latency_ns = 24ms
sched_min_granularity_ns = 8ms
sched_wakeup_granularity_ns = 10ms

aim7 compute
3.8.0-performance
Tasks    jobs/min  jti  jobs/min/task    real     cpu
    1      432.86  100       432.8571   14.00    3.99
    1      437.23  100       437.2294   13.86    3.85
    1      434.10  100       434.0974   13.96    3.95
    5     2400.95   99       480.1902   12.62   12.49
    5     2341.58   99       468.3153   12.94   13.95
    5     2350.66   99       470.1319   12.89   13.66
   10     4291.78   99       429.1785   14.12   40.14
   10     4334.76   99       433.4764   13.98   38.70
   10     4273.62   99       427.3625   14.18   40.29
   20     8487.39   94       424.3697   14.28   63.14
   20     8387.54   97       419.3772   14.45   77.01
   20     8713.16   95       435.6578   13.91   55.10
   40    16786.70   99       419.6676   14.44  170.08
   40    16728.78   99       418.2195   14.49  172.96
   40    16763.49   99       419.0871   14.46  171.42
   80    33024.52   98       412.8065   14.68  355.10
   80    33002.04   99       412.5255   14.69  356.27
   80    33047.03   99       413.0879   14.67  355.22
  160    64254.47   98       401.5905   15.09  391.30
  160    64468.09   98       402.9255   15.04  390.28
  160    64297.08   98       401.8568   15.08  389.45
  320   121579.94   98

3.8.0-balance
Tasks    jobs/min  jti  jobs/min/task    real     cpu
    1      433.48  100       433.4764   13.98    3.97
    1      436.60  100       436.5994   13.88    3.86
    1      436.29  100       436.2851   13.89    3.89
    5     2554.81   98       510.9612   11.86    7.55
    5     2578.72   99       515.7447   11.75    7.25
    5     2600.86   99       520.1717   11.65    7.09
   10     5334.51   99       533.4507   11.36   11.13
   10     5311.13   99       531.1131   11.41   11.23
   10     5287.96   99       528.7958   11.46   11.46
   20    10594.41   99       529.7203   11.44   23.72
   20    10575.92   98       528.7958   11.46   23.41
   20    10659.63   99       532.9815   11.37   24.17
   40    19469.88   98       486.7470   12.45   60.78
   40    19627.53   98       490.6883   12.35   65.26
   40    20033.06   98       500.8264   12.10   51.44
   80    33205.48   98       415.0685   14.60  336.90
   80    33949.58   96       424.3697   14.28  283.87
   80    33137.39   98       414.2174   14.63  338.92
  160    64000.00   98       400.0000   15.15  396.87
  160    64553.93   98       403.4621   15.02  389.49
  160    64856.19   98       405.3512   14.95  383.64

3.8.0-powersaving
Tasks    jobs/min  jti  jobs/min/task    real     cpu
    1      433.17  100       433.1665   13.99    3.98
    1      435.66  100       435.6578   13.91    3.90
    1      436.29  100       436.2851   13.89    3.87
    5     2487.68   98       497.5369   12.18    8.22
    5     2527.11   99       505.4212   11.99    7.90
    5     2508.28   98       501.6556   12.08    8.24
   10     5183.92   98       518.3918   11.69   12.15
   10     5215.15   99       521.5146   11.62   12.53
   10     5144.31   98       514.4312   11.78   12.32
   20    10575.92   99       528.7958   11.46   22.08
   20    10520.83   99       526.0417   11.52   21.88
   20    10539.13   99       526.9565   11.50   22.13
   40    19967.05   98       499.1763   12.14   51.40
   40    20386.88   98       509.6720   11.89   46.91
   40    20682.59   98       517.0648   11.72   42.45
   80    33690.06   97       421.1258   14.39  248.91
   80    33160.05   97       414.5007   14.62  264.85
   80    33526.97   97       419.0871   14.46  257.31
  160    65073.83   97       406.7114   14.90  371.09
  160    64640.00   98       404.0000   15.00  379.82
  160    64683.12   98       404.2695   14.99  379.43

[PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

2013-01-25 Thread Rik van Riel
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the
uncontended case. After all, if there is contention, the
data structure can be improved to reduce or eliminate
lock contention.

Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.

However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.

Proportional delay in ticket locks means scaling the time
between checks of the ticket by a delay factor times the number
of CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.

The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.

Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.

However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads on a 24-CPU, 4-node test system,
performance is down to about 290k operations/second.

With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.

Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.

The attached graph shows the performance of the multi-threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
as well as the v1 patch series with autotuning for 2x and 2.7x
spinning before the lock is obtained, and with the v2 series.

The v2 series integrates several ideas from Michel Lespinasse
and Eric Dumazet, which should result in better throughput and
nicer behaviour in situations with contention on multiple locks.

For the v3 series, I tried out all the ideas suggested by
Michel. They made perfect sense, but in the end it turned
out they did not work as well as the simple, aggressive
"try to make the delay longer" policy I have now. Several
small bug fixes and cleanups have been integrated.

For the v4 series, I added code to keep the maximum spinlock
delay to a small value when running in a virtual machine. That
should solve the performance regression seen in virtual machines.

The performance issue observed with AIM7 is still a mystery.

Performance is within the margin of error of v2, so the graph
has not been updated.

Please let me know if you manage to break this code in any way,
so I can fix it...

-- 
All rights reversed.

attachment: spinlock-backoff-v2.png
