Re: [PATCH RFC V10 0/18] Paravirtualized ticket spinlocks

2013-06-27 Thread Raghavendra K T

On 06/26/2013 02:03 PM, Raghavendra K T wrote:

On 06/24/2013 06:47 PM, Andrew Jones wrote:

On Mon, Jun 24, 2013 at 06:10:14PM +0530, Raghavendra K T wrote:


Results:
===
base = 3.10-rc2 kernel
patched = base + this series

The test was on a 32-core (model: Intel(R) Xeon(R) CPU X7560) machine,
HT disabled, with a 32-vcpu KVM guest with 8GB RAM.


Have you ever tried to get results with HT enabled?



+------+-------------+-----------+-------------+-----------+--------------+
                   ebizzy (records/sec), higher is better
+------+-------------+-----------+-------------+-----------+--------------+
          base          stdev      patched       stdev      %improvement
+------+-------------+-----------+-------------+-----------+--------------+
 1x      5574.9000    237.4997     5618.0000     94.0366       0.77311
 2x      2741.5000    561.3090     3332.0000    102.4738      21.53930
 3x      2146.2500    216.7718     2302.3333     76.3870       7.27237
 4x      1663.0000    141.9235     1753.7500     83.5220       5.45701
+------+-------------+-----------+-------------+-----------+--------------+


This looks good. Are your ebizzy results consistent run to run
though?


+------+-------------+-----------+-------------+-----------+--------------+
                dbench (Throughput, MB/sec), higher is better
+------+-------------+-----------+-------------+-----------+--------------+
          base          stdev      patched       stdev      %improvement
+------+-------------+-----------+-------------+-----------+--------------+
 1x     14111.5600    754.4525    14645.9900    114.3087       3.78718
 2x      2481.6270     71.2665     2667.1280     73.8193       7.47498
 3x      1510.2483     31.8634     1503.8792     36.0777      -0.42173
 4x      1029.4875     16.9166     1039.7069     43.8840       0.99267
+------+-------------+-----------+-------------+-----------+--------------+


Hmm, I wonder what 2.5x looks like. Also, the 3% improvement with
no overcommit is interesting. What's happening there? It makes
me wonder what < 1x looks like.



Hi Andrew,

I tried a sort of 2.5x case, where I used 3 guests with 27 vcpus each on a
32-core (HT disabled) machine, and here is the output. Almost no gain there.

           throughput avg        stdev
base:      1768.7458 MB/sec      54.044221
patched:   1772.5617 MB/sec      41.227689
gain:      0.226 %

I have yet to try the HT-enabled cases, which would give the 0.5x to 2x
performance results.



I have the results for the HT-enabled case now.
config: 64 cpus total (HT on), 32-vcpu guests.
I am seeing some inconsistency in the ebizzy results in this case (maybe
Drew had tried with HT on and observed the same in his ebizzy runs).


The patched-nople and base numbers in the 1.5x and 2x cases have also been
a little inconsistent for dbench. Overall I see the pvspinlock + PLE-on
case as the more stable one, and overall pvspinlock performance in the
HT-enabled case looks very impressive.


patched = pvspinv10_hton
+-------+------------+-----------+-------------+-----------+--------------+
                                  ebizzy
+-------+------------+-----------+-------------+-----------+--------------+
           base         stdev      patched       stdev      %improvement
+-------+------------+-----------+-------------+-----------+--------------+
 0.5x     6925.3000     74.4342     7317.0000     86.3018       5.65607
 1.0x     2379.8000    405.3519     3427.0000    574.8789      44.00370
 1.5x     1850.8333     97.8114     2733.4167    459.8016      47.68573
 2.0x     1477.6250    105.2411     2525.2500     97.5921      70.89925
+-------+------------+-----------+-------------+-----------+--------------+

+-------+------------+-----------+-------------+-----------+--------------+
                                  dbench
+-------+------------+-----------+-------------+-----------+--------------+
           base         stdev      patched       stdev      %improvement
+-------+------------+-----------+-------------+-----------+--------------+
 0.5x     9045.9950    463.1447    16482.7200     57.6017      82.21014
 1.0x     6251.1680    543.8219    11212.7600    380.7542      79.37064
 1.5x     3095.7475    231.1567     4308.8583    266.5873      39.18636
 2.0x     1219.1200     75.4294     1979.6750    134.6934      62.38557
+-------+------------+-----------+-------------+-----------+--------------+

patched = pvspinv10_hton_nople
+-------+------------+-----------+-------------+-----------+--------------+
                                  ebizzy
+-------+------------+-----------+-------------+-----------+--------------+
           base         stdev      patched       stdev      %improvement
+-------+------------+-----------+-------------+-----------+--------------+
 0.5x     6925.3000     74.4342     7473.8000    224.6344       7.92023
 1.0x     2379.8000    405.3519     6176.2000    417.1133     159.52601
 1.5x     1850.8333     97.8114     2214.1667    515.6875      19.63080
 2.0x     1477.6250    105.2411      758.0000    108.8131     -48.70146
+-------+------------+-----------+-------------+-----------+--------------+

+-------+------------+-----------+-------------+-----------+--------------+
                                  dbench
+-------+------------+-----------+-------------+-----------+--------------+
           base         stdev      patched       stdev      %improvement
+-------+------------+-----------+-------------+-----------+--------------+
 0.5x     9045.9950    463.1447

Re: [PATCH RFC V10 0/18] Paravirtualized ticket spinlocks

2013-06-26 Thread Raghavendra K T

On 06/24/2013 06:47 PM, Andrew Jones wrote:

On Mon, Jun 24, 2013 at 06:10:14PM +0530, Raghavendra K T wrote:


Results:
===
base = 3.10-rc2 kernel
patched = base + this series

The test was on a 32-core (model: Intel(R) Xeon(R) CPU X7560) machine,
HT disabled, with a 32-vcpu KVM guest with 8GB RAM.


Have you ever tried to get results with HT enabled?



+------+-------------+-----------+-------------+-----------+--------------+
                   ebizzy (records/sec), higher is better
+------+-------------+-----------+-------------+-----------+--------------+
          base          stdev      patched       stdev      %improvement
+------+-------------+-----------+-------------+-----------+--------------+
 1x      5574.9000    237.4997     5618.0000     94.0366       0.77311
 2x      2741.5000    561.3090     3332.0000    102.4738      21.53930
 3x      2146.2500    216.7718     2302.3333     76.3870       7.27237
 4x      1663.0000    141.9235     1753.7500     83.5220       5.45701
+------+-------------+-----------+-------------+-----------+--------------+


This looks good. Are your ebizzy results consistent run to run
though?


+------+-------------+-----------+-------------+-----------+--------------+
                dbench (Throughput, MB/sec), higher is better
+------+-------------+-----------+-------------+-----------+--------------+
          base          stdev      patched       stdev      %improvement
+------+-------------+-----------+-------------+-----------+--------------+
 1x     14111.5600    754.4525    14645.9900    114.3087       3.78718
 2x      2481.6270     71.2665     2667.1280     73.8193       7.47498
 3x      1510.2483     31.8634     1503.8792     36.0777      -0.42173
 4x      1029.4875     16.9166     1039.7069     43.8840       0.99267
+------+-------------+-----------+-------------+-----------+--------------+


Hmm, I wonder what 2.5x looks like. Also, the 3% improvement with
no overcommit is interesting. What's happening there? It makes
me wonder what < 1x looks like.



Hi Andrew,

I tried a sort of 2.5x case, where I used 3 guests with 27 vcpus each on a
32-core (HT disabled) machine, and here is the output. Almost no gain there.

           throughput avg        stdev
base:      1768.7458 MB/sec      54.044221
patched:   1772.5617 MB/sec      41.227689
gain:      0.226 %

I have yet to try the HT-enabled cases, which would give the 0.5x to 2x
performance results.



[PATCH RFC V10 0/18] Paravirtualized ticket spinlocks

2013-06-24 Thread Raghavendra K T

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.

Changes in V10:
Addressed Konrad's review comments:
- Added a break in patch 5, since we now know the exact cpu to wake up
- Dropped patch 12; Konrad needs to revert two patches to enable xen on hvm
  (70dd4998, f10cd522c)
- Removed TIMEOUT and corrected spacing in patch 15
- Fixed spelling and corrected spacing in patches 17, 18

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that were
  causing undercommit degradation (after the PLE handler improvement);
  see the sketch below
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized the halt exit path to use the PLE handler
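
For context, the threshold above is the fastpath's per-acquisition spin
budget before the slowpath pvop is invoked. A minimal sketch, assuming the
constant lives in the x86 spinlock header as it does in mainline
(32k = 1 << 15):

	/*
	 * Per-acquisition spin budget: a waiter spins this many
	 * iterations on the ticket head before __ticket_lock_spinning()
	 * is called to halt the vcpu. (Sketch; the value reflects the
	 * 32k mentioned above.)
	 */
	#define SPIN_THRESHOLD	(1 << 15)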

V8 of PV spinlock was posted last year. Following Avi's suggestion to look
at PLE handler improvements, various optimizations in PLE handling were
tried.

With this series we see that we can get a little more improvement on top
of those.

Ticket locks have an inherent problem in a virtualized case, because
the vCPUs are scheduled rather than running concurrently (ignoring
gang scheduled vCPUs).  This can result in catastrophic performance
collapses when the vCPU scheduler doesn't schedule the correct next
vCPU, and ends up scheduling a vCPU which burns its entire timeslice
spinning.  (Note that this is not the same problem as lock-holder
preemption, which this series also addresses; that's also a problem,
but not catastrophic).

(See Thomas Friebel's talk "Prevent Guests from Spinning Around",
http://www.xen.org/files/xensummitboston08/LHP.pdf, for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer
of indirection in front of all the spinlock functions, and defining a
completely new implementation for Xen (and for other pvops users, but
there are none at present).

PV ticketlocks keep the existing ticketlock implementation (fastpath)
as-is, but add a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
  iterations, then call out to the __ticket_lock_spinning() pvop,
  which allows a backend to block the vCPU rather than spinning.  This
  pvop can set the lock into slowpath state.

- When releasing a lock, if it is in slowpath state, then call
  __ticket_unlock_kick() to kick the next vCPU in line awake.  If the
  lock is no longer in contention, it also clears the slowpath flag.
  (A sketch of the shape of these two hooks follows.)
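
A minimal sketch of the two hooks as an ops structure (simplified: in the
actual series lock_spinning is wrapped in a callee-save thunk, so treat
the exact field types here as assumptions):

	/* Sketch of the slowpath pvops. */
	struct pv_lock_ops {
		/* Called after SPIN_THRESHOLD spin iterations; the backend
		 * may block the vCPU (KVM halts it) instead of letting it
		 * burn its timeslice. */
		void (*lock_spinning)(struct arch_spinlock *lock,
				      __ticket_t ticket);

		/* Called on unlock when the lock is in slowpath state, to
		 * wake whichever vCPU waits on the given ticket. */
		void (*unlock_kick)(struct arch_spinlock *lock,
				    __ticket_t ticket);
	};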

The slowpath state is stored in the LSB of the lock's tail ticket.  This
has the effect of halving the maximum number of CPUs (so a small ticket
can deal with 128 CPUs, and a large ticket with 32768).
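
Concretely, with the flag in bit 0 of the tail, tickets advance in steps
of two. A sketch of the encoding (constant names as in the series; the
helper below is illustrative; __ticket_t is u8 or u16 depending on
NR_CPUS). This is also why the disassembly further below xadds $0x200,
i.e. a tail increment of 2, where the non-PV build uses $0x100:

	/* Bit 0 of the tail carries the slowpath flag, so each new
	 * ticket bumps head/tail by 2 rather than 1. */
	#define TICKET_SLOWPATH_FLAG	((__ticket_t)1)
	#define TICKET_LOCK_INC		((__ticket_t)2)

	static inline bool ticket_in_slowpath(__ticket_t tail)
	{
		return (tail & TICKET_SLOWPATH_FLAG) != 0;
	}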

For KVM, one hypercall is introduced in the hypervisor that allows a vcpu
to kick another vcpu out of halt state.  The blocking of the vcpu is done
using halt() in the (lock_spinning) slowpath.
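
A heavily simplified sketch of the guest-side KVM backend (the real
lock_spinning must disable interrupts and re-check the ticket to close
the wakeup race, which is omitted here; the per-cpu variable name is
illustrative):

	/* Sketch: block by halting; wake via the new kick hypercall. */
	static void kvm_lock_spinning(struct arch_spinlock *lock,
				      __ticket_t want)
	{
		/* Advertise which lock/ticket we wait on, so an unlocker
		 * can find and kick us. */
		__this_cpu_write(lock_waiting.lock, lock);
		__this_cpu_write(lock_waiting.want, want);

		halt();		/* vcpu sleeps until kicked */
	}

	static void kvm_kick_cpu(int cpu)
	{
		/* The new hypercall wakes a halted vcpu by APIC id. */
		kvm_hypercall2(KVM_HC_KICK_CPU, 0,
			       per_cpu(x86_cpu_to_apicid, cpu));
	}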

Overall, it results in a large reduction in code, it makes the native
and virtualized cases closer, and it removes a layer of indirection
around all the spinlock functions.

The fast path (taking an uncontended lock which isn't in slowpath
state) is optimal, identical to the non-paravirtualized case.

The inner part of the ticket lock code becomes:

	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;

	for (;;) {
		unsigned count = SPIN_THRESHOLD;

		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);

		__ticket_lock_spinning(lock, inc.tail);
	}
out:	barrier();
which results in:
push   %rbp
mov    %rsp,%rbp

mov    $0x200,%eax
lock xadd %ax,(%rdi)
movzbl %ah,%edx
cmp    %al,%dl
jne    1f   # Slowpath if lock in contention

pop    %rbp
retq

### SLOWPATH START
1:  and    $-2,%edx
    movzbl %dl,%esi

2:  mov    $0x800,%eax
    jmp    4f

3:  pause
    sub    $0x1,%eax
    je     5f

4:  movzbl (%rdi),%ecx
    cmp    %cl,%dl
    jne    3b

    pop    %rbp
    retq

5:  callq  *__ticket_lock_spinning
    jmp    2b
### SLOWPATH END

With CONFIG_PARAVIRT_SPINLOCKS=n, the code changes only slightly: the
fastpath case is straight through (taking the lock without contention),
and the spin loop is out of line:

push   %rbp
mov    %rsp,%rbp

mov    $0x100,%eax
lock xadd %ax,(%rdi)
movzbl %ah,%edx
cmp    %al,%dl
jne    1f

pop    %rbp
retq

### SLOWPATH START
1:  pause
    movzbl (%rdi),%eax
    cmp    %dl,%al
    jne    1b

    pop    %rbp
    retq
### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's
head and fetch the slowpath flag from the tail.
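
A sketch of what that unlock ends up looking like (simplified from the
series; the pre-unlock snapshot is what lets it pair the head bump with a
consistent view of the tail's flag):

	static inline void arch_spin_unlock(arch_spinlock_t *lock)
	{
		arch_spinlock_t prev = *lock;	/* tail + slowpath flag */

		/* Release: bump the head; add_smp() is a full barrier. */
		add_smp(&lock->tickets.head, TICKET_LOCK_INC);

		/* Kick path only if some waiter set the slowpath flag. */
		if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
			__ticket_unlock_slowpath(lock, prev);
	}

__ticket_unlock_slowpath() then works out which ticket is next and calls
the __ticket_unlock_kick() pvop.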

Re: [PATCH RFC V10 0/18] Paravirtualized ticket spinlocks

2013-06-24 Thread Andrew Jones
On Mon, Jun 24, 2013 at 06:10:14PM +0530, Raghavendra K T wrote:
 
 Results:
 ===
 base = 3.10-rc2 kernel
 patched = base + this series
 
 The test was on a 32-core (model: Intel(R) Xeon(R) CPU X7560) machine,
 HT disabled, with a 32-vcpu KVM guest with 8GB RAM.

Have you ever tried to get results with HT enabled?

 
 +------+-------------+-----------+-------------+-----------+--------------+
                    ebizzy (records/sec), higher is better
 +------+-------------+-----------+-------------+-----------+--------------+
           base          stdev      patched       stdev      %improvement
 +------+-------------+-----------+-------------+-----------+--------------+
  1x      5574.9000    237.4997     5618.0000     94.0366       0.77311
  2x      2741.5000    561.3090     3332.0000    102.4738      21.53930
  3x      2146.2500    216.7718     2302.3333     76.3870       7.27237
  4x      1663.0000    141.9235     1753.7500     83.5220       5.45701
 +------+-------------+-----------+-------------+-----------+--------------+

This looks good. Are your ebizzy results consistent run to run
though?

 +------+-------------+-----------+-------------+-----------+--------------+
                 dbench (Throughput, MB/sec), higher is better
 +------+-------------+-----------+-------------+-----------+--------------+
           base          stdev      patched       stdev      %improvement
 +------+-------------+-----------+-------------+-----------+--------------+
  1x     14111.5600    754.4525    14645.9900    114.3087       3.78718
  2x      2481.6270     71.2665     2667.1280     73.8193       7.47498
  3x      1510.2483     31.8634     1503.8792     36.0777      -0.42173
  4x      1029.4875     16.9166     1039.7069     43.8840       0.99267
 +------+-------------+-----------+-------------+-----------+--------------+

Hmm, I wonder what 2.5x looks like. Also, the 3% improvement with
no overcommit is interesting. What's happening there? It makes
me wonder what < 1x looks like.

thanks,
drew


Re: [PATCH RFC V10 0/18] Paravirtualized ticket spinlocks

2013-06-24 Thread Raghavendra K T

On 06/24/2013 06:47 PM, Andrew Jones wrote:

On Mon, Jun 24, 2013 at 06:10:14PM +0530, Raghavendra K T wrote:


Results:
===
base = 3.10-rc2 kernel
patched = base + this series

The test was on a 32-core (model: Intel(R) Xeon(R) CPU X7560) machine,
HT disabled, with a 32-vcpu KVM guest with 8GB RAM.


Have you ever tried to get results with HT enabled?



I have not done it yet with the latest. I will get that result.



+------+-------------+-----------+-------------+-----------+--------------+
                   ebizzy (records/sec), higher is better
+------+-------------+-----------+-------------+-----------+--------------+
          base          stdev      patched       stdev      %improvement
+------+-------------+-----------+-------------+-----------+--------------+
 1x      5574.9000    237.4997     5618.0000     94.0366       0.77311
 2x      2741.5000    561.3090     3332.0000    102.4738      21.53930
 3x      2146.2500    216.7718     2302.3333     76.3870       7.27237
 4x      1663.0000    141.9235     1753.7500     83.5220       5.45701
+------+-------------+-----------+-------------+-----------+--------------+


This looks good. Are your ebizzy results consistent run to run
though?



Yes, ebizzy looked more consistent.


+------+-------------+-----------+-------------+-----------+--------------+
                dbench (Throughput, MB/sec), higher is better
+------+-------------+-----------+-------------+-----------+--------------+
          base          stdev      patched       stdev      %improvement
+------+-------------+-----------+-------------+-----------+--------------+
 1x     14111.5600    754.4525    14645.9900    114.3087       3.78718
 2x      2481.6270     71.2665     2667.1280     73.8193       7.47498
 3x      1510.2483     31.8634     1503.8792     36.0777      -0.42173
 4x      1029.4875     16.9166     1039.7069     43.8840       0.99267
+------+-------------+-----------+-------------+-----------+--------------+


Hmm, I wonder what 2.5x looks like. Also, the 3% improvement with
no overcommit is interesting. What's happening there? It makes
me wonder what < 1x looks like.



I'll try to get 0.5x and 2.5x runs for dbench.


thanks,
drew




