Re: [PATCH RFC V10 0/18] Paravirtualized ticket spinlocks
On 06/26/2013 02:03 PM, Raghavendra K T wrote:
> On 06/24/2013 06:47 PM, Andrew Jones wrote:
>> On Mon, Jun 24, 2013 at 06:10:14PM +0530, Raghavendra K T wrote:
>>> Results:
>>> ========
>>> base = 3.10-rc2 kernel
>>> patched = base + this series
>>>
>>> The test was on 32 core (model: Intel(R) Xeon(R) CPU X7560), HT
>>> disabled, with a 32 vcpu KVM guest, 8GB RAM.
>>
>> Have you ever tried to get results with HT enabled?
>>
>>> +-----------+-----------+-----------+-----------+-----------+
>>>              ebizzy (records/sec) higher is better
>>> +-----------+-----------+-----------+-----------+-----------+
>>>          base      stdev    patched      stdev  %improvement
>>> +-----------+-----------+-----------+-----------+-----------+
>>> 1x  5574.9000   237.4997  5618.0000    94.0366     0.77311
>>> 2x  2741.5000   561.3090  3332.0000   102.4738    21.53930
>>> 3x  2146.2500   216.7718  2302.3333    76.3870     7.27237
>>> 4x  1663.0000   141.9235  1753.7500    83.5220     5.45701
>>> +-----------+-----------+-----------+-----------+-----------+
>>
>> This looks good. Are your ebizzy results consistent run to run though?
>>
>>> +-----------+-----------+-----------+-----------+-----------+
>>>              dbench (Throughput) higher is better
>>> +-----------+-----------+-----------+-----------+-----------+
>>>          base      stdev    patched      stdev  %improvement
>>> +-----------+-----------+-----------+-----------+-----------+
>>> 1x 14111.5600   754.4525 14645.9900   114.3087     3.78718
>>> 2x  2481.6270    71.2665  2667.1280    73.8193     7.47498
>>> 3x  1510.2483    31.8634  1503.8792    36.0777    -0.42173
>>> 4x  1029.4875    16.9166  1039.7069    43.8840     0.99267
>>> +-----------+-----------+-----------+-----------+-----------+
>>
>> Hmm, I wonder what 2.5x looks like. Also, the 3% improvement with no
>> overcommit is interesting. What's happening there? It makes me wonder
>> what < 1x looks like.
>
> Hi Andrew, I tried a 2.5x-ish case, where I used 3 guests with 27 vcpus
> each (3 x 27 = 81 vcpus on 32 cores) on the 32 core (HT disabled)
> machine, and here is the output. Almost no gain there.
>
>              throughput avg       stdev
> base:        1768.7458 MB/sec     54.044221
> patched:     1772.5617 MB/sec     41.227689
> gain %:      0.226
>
> I am yet to try the HT enabled cases that would give 0.5x to 2x
> performance results.

I have the result of the HT enabled case now.

config: total 64 cpus (HT on), 32 vcpu guests.

I am seeing some inconsistency in the ebizzy results in this case (maybe
Drew had tried with HT on and had observed the same in his ebizzy runs).
The patched-nople and base performance in the 1.5x and 2x cases have been
a little inconsistent for dbench too.

Overall I see the pvspinlock + PLE-on case as more stable, and overall
pvspinlock performance seems very impressive in the HT enabled case.

patched = pvspinv10_hton

+-----------+-----------+-----------+-----------+-----------+
                           ebizzy
+-----------+-----------+-----------+-----------+-----------+
           base      stdev    patched      stdev  %improvement
+-----------+-----------+-----------+-----------+-----------+
0.5x  6925.3000    74.4342  7317.0000    86.3018     5.65607
1.0x  2379.8000   405.3519  3427.0000   574.8789    44.00370
1.5x  1850.8333    97.8114  2733.4167   459.8016    47.68573
2.0x  1477.6250   105.2411  2525.2500    97.5921    70.89925
+-----------+-----------+-----------+-----------+-----------+

+-----------+-----------+-----------+-----------+-----------+
                           dbench
+-----------+-----------+-----------+-----------+-----------+
           base      stdev    patched      stdev  %improvement
+-----------+-----------+-----------+-----------+-----------+
0.5x  9045.9950   463.1447 16482.7200    57.6017    82.21014
1.0x  6251.1680   543.8219 11212.7600   380.7542    79.37064
1.5x  3095.7475   231.1567  4308.8583   266.5873    39.18636
2.0x  1219.1200    75.4294  1979.6750   134.6934    62.38557
+-----------+-----------+-----------+-----------+-----------+

patched = pvspinv10_hton_nople

+-----------+-----------+-----------+-----------+-----------+
                           ebizzy
+-----------+-----------+-----------+-----------+-----------+
           base      stdev    patched      stdev  %improvement
+-----------+-----------+-----------+-----------+-----------+
0.5x  6925.3000    74.4342  7473.8000   224.6344     7.92023
1.0x  2379.8000   405.3519  6176.2000   417.1133   159.52601
1.5x  1850.8333    97.8114  2214.1667   515.6875    19.63080
2.0x  1477.6250   105.2411   758.0000   108.8131   -48.70146
+-----------+-----------+-----------+-----------+-----------+

+-----------+-----------+-----------+-----------+-----------+
                           dbench
+-----------+-----------+-----------+-----------+-----------+
           base      stdev    patched      stdev  %improvement
+-----------+-----------+-----------+-----------+-----------+
0.5x  9045.9950   463.1447
Re: [PATCH RFC V10 0/18] Paravirtualized ticket spinlocks
On 06/24/2013 06:47 PM, Andrew Jones wrote:
> On Mon, Jun 24, 2013 at 06:10:14PM +0530, Raghavendra K T wrote:
>> Results:
>> ========
>> base = 3.10-rc2 kernel
>> patched = base + this series
>>
>> The test was on 32 core (model: Intel(R) Xeon(R) CPU X7560), HT
>> disabled, with a 32 vcpu KVM guest, 8GB RAM.
>
> Have you ever tried to get results with HT enabled?
>
>> +-----------+-----------+-----------+-----------+-----------+
>>              ebizzy (records/sec) higher is better
>> +-----------+-----------+-----------+-----------+-----------+
>>          base      stdev    patched      stdev  %improvement
>> +-----------+-----------+-----------+-----------+-----------+
>> 1x  5574.9000   237.4997  5618.0000    94.0366     0.77311
>> 2x  2741.5000   561.3090  3332.0000   102.4738    21.53930
>> 3x  2146.2500   216.7718  2302.3333    76.3870     7.27237
>> 4x  1663.0000   141.9235  1753.7500    83.5220     5.45701
>> +-----------+-----------+-----------+-----------+-----------+
>
> This looks good. Are your ebizzy results consistent run to run though?
>
>> +-----------+-----------+-----------+-----------+-----------+
>>              dbench (Throughput) higher is better
>> +-----------+-----------+-----------+-----------+-----------+
>>          base      stdev    patched      stdev  %improvement
>> +-----------+-----------+-----------+-----------+-----------+
>> 1x 14111.5600   754.4525 14645.9900   114.3087     3.78718
>> 2x  2481.6270    71.2665  2667.1280    73.8193     7.47498
>> 3x  1510.2483    31.8634  1503.8792    36.0777    -0.42173
>> 4x  1029.4875    16.9166  1039.7069    43.8840     0.99267
>> +-----------+-----------+-----------+-----------+-----------+
>
> Hmm, I wonder what 2.5x looks like. Also, the 3% improvement with no
> overcommit is interesting. What's happening there? It makes me wonder
> what < 1x looks like.

Hi Andrew, I tried a 2.5x-ish case, where I used 3 guests with 27 vcpus
each (3 x 27 = 81 vcpus on 32 cores) on the 32 core (HT disabled)
machine, and here is the output. Almost no gain there.

             throughput avg       stdev
base:        1768.7458 MB/sec     54.044221
patched:     1772.5617 MB/sec     41.227689
gain %:      0.226

I am yet to try the HT enabled cases that would give 0.5x to 2x
performance results.
[PATCH RFC V10 0/18] Paravirtualized ticket spinlocks
This series replaces the existing paravirtualized spinlock mechanism with a
paravirtualized ticketlock mechanism. The series provides an implementation
for both Xen and KVM.

Changes in V10:
Addressed Konrad's review comments:
- Added a break in patch 5, since now we know the exact cpu to wake up
- Dropped patch 12; Konrad needs to revert two patches (70dd4998,
  f10cd522c) to enable Xen on HVM
- Removed TIMEOUT and corrected spacing in patch 15
- Fixed spelling and corrected spacing in patches 17, 18

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that were
  causing undercommit degradation (after the PLE handler improvement)
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized the halt exit path to use the PLE handler

V8 of PV spinlock was posted last year. After Avi's suggestion to look at
the PLE handler's improvements, various optimizations in PLE handling have
been tried. With this series we see that we can get a little more
improvement on top of that.

Ticket locks have an inherent problem in a virtualized case, because the
vCPUs are scheduled rather than running concurrently (ignoring gang
scheduled vCPUs). This can result in catastrophic performance collapses
when the vCPU scheduler doesn't schedule the correct "next" vCPU, and ends
up scheduling a vCPU which burns its entire timeslice spinning. (Note that
this is not the same problem as lock-holder preemption, which this series
also addresses; that's also a problem, but not catastrophic.)

(See Thomas Friebel's talk "Prevent Guests from Spinning Around",
http://www.xen.org/files/xensummitboston08/LHP.pdf, for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer of
indirection in front of all the spinlock functions, and defines a
completely new implementation for Xen (and for other pvops users, but
there are none at present).

PV ticketlocks keep the existing ticketlock implementation (the fastpath)
as-is, but add a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD iterations,
  then call out to the __ticket_lock_spinning() pvop, which allows a
  backend to block the vCPU rather than spinning. This pvop can set the
  lock into "slowpath state".

- When releasing a lock, if it is in "slowpath state", then call
  __ticket_unlock_kick() to kick the next vCPU in line awake. If the lock
  is no longer in contention, it also clears the slowpath flag.

The "slowpath state" is stored in the LSB of the lock's tail ticket. This
has the effect of reducing the max number of CPUs by half (so a small
ticket can deal with 128 CPUs, and a large ticket with 32768).

For KVM, one hypercall is introduced in the hypervisor that allows a vcpu
to kick another vcpu out of halt state. The blocking of the vcpu is done
using halt() in the (lock_spinning) slowpath.

Overall, it results in a large reduction in code, it makes the native and
virtualized cases closer, and it removes a layer of indirection around all
the spinlock functions.

The fast path (taking an uncontended lock which isn't in "slowpath" state)
is optimal, identical to the non-paravirtualized case.
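To make the tail-LSB encoding concrete, here is a minimal sketch of the
lock layout under this scheme. It is illustrative rather than the patch
text itself; the names follow the kernel's ticketlock conventions, and the
8-bit "small ticket" case is shown:

	/* Sketch: slowpath flag lives in the LSB of the tail ticket,
	 * so tickets must advance in steps of 2. */
	typedef unsigned char __ticket_t;          /* small ticket */

	#define TICKET_SLOWPATH_FLAG  ((__ticket_t)1)  /* LSB of tail */
	#define TICKET_LOCK_INC       ((__ticket_t)2)  /* step over flag */

	typedef struct arch_spinlock {
		union {
			unsigned short head_tail;
			struct {
				__ticket_t head, tail;
			} tickets;
		};
	} arch_spinlock_t;

	/* Since tickets advance by 2, only 256/2 = 128 CPUs can queue on
	 * a small-ticket lock (65536/2 = 32768 on a 16-bit ticket). */
	static inline int lock_in_slowpath(const arch_spinlock_t *lock)
	{
		return lock->tickets.tail & TICKET_SLOWPATH_FLAG;
	}

And a sketch of the guest side of the KVM kick: the releasing vcpu
translates the waiter's cpu number to an apicid and asks the host, via the
new hypercall, to wake it out of halt. kvm_hypercall2() and the per-cpu
apicid map exist in the kernel; KVM_HC_KICK_CPU is the hypercall this
series adds, and the body below is a simplified rendering, not a quote of
the patch:

	/* Guest-side kick (kernel context, simplified sketch). */
	static void kvm_kick_cpu(int cpu)
	{
		unsigned long flags = 0;
		int apicid = per_cpu(x86_cpu_to_apicid, cpu);

		kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
	}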
The inner part of the ticket lock code becomes:

	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;
	for (;;) {
		unsigned count = SPIN_THRESHOLD;
		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:	barrier();

which results in:

	push   %rbp
	mov    %rsp,%rbp

	mov    $0x200,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f	# Slowpath if lock in contention

	pop    %rbp
	retq

	### SLOWPATH START
1:	and    $-2,%edx
	movzbl %dl,%esi

2:	mov    $0x800,%eax
	jmp    4f

3:	pause
	sub    $0x1,%eax
	je     5f

4:	movzbl (%rdi),%ecx
	cmp    %cl,%dl
	jne    3b

	pop    %rbp
	retq

5:	callq  *__ticket_lock_spinning
	jmp    2b
	### SLOWPATH END

With CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly: the
fastpath case is straight through (taking the lock without contention),
and the spin loop is out of line:

	push   %rbp
	mov    %rsp,%rbp

	mov    $0x100,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f

	pop    %rbp
	retq

	### SLOWPATH START
1:	pause
	movzbl (%rdi),%eax
	cmp    %dl,%al
	jne    1b

	pop    %rbp
	retq
	### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's
"head" and fetch the slowpath flag from "tail".
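As a rough illustration of how that tension is resolved on the unlock
side: release the lock with a locked add on head, which doubles as a full
memory barrier, and only then test tail for the slowpath flag, kicking the
next waiter if it is set. Helper names here (add_smp(),
__ticket_unlock_slowpath(), the paravirt_ticketlocks_enabled static key)
follow the series, but this is a simplified sketch rather than the patch
text:

	static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
	{
		if (TICKET_SLOWPATH_FLAG &&
		    static_key_false(&paravirt_ticketlocks_enabled)) {
			arch_spinlock_t prev;

			prev = *lock;	/* snapshot before releasing */
			add_smp(&lock->tickets.head, TICKET_LOCK_INC);

			/* add_smp() is a full mb(), so the flag read below
			 * cannot be reordered before the release. */
			if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
				__ticket_unlock_slowpath(lock, prev);
		} else
			__add(&lock->tickets.head, TICKET_LOCK_INC,
			      UNLOCK_LOCK_PREFIX);
	}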
Re: [PATCH RFC V10 0/18] Paravirtualized ticket spinlocks
On Mon, Jun 24, 2013 at 06:10:14PM +0530, Raghavendra K T wrote:
> Results:
> ========
> base = 3.10-rc2 kernel
> patched = base + this series
>
> The test was on 32 core (model: Intel(R) Xeon(R) CPU X7560), HT
> disabled, with a 32 vcpu KVM guest, 8GB RAM.

Have you ever tried to get results with HT enabled?

> +-----------+-----------+-----------+-----------+-----------+
>              ebizzy (records/sec) higher is better
> +-----------+-----------+-----------+-----------+-----------+
>          base      stdev    patched      stdev  %improvement
> +-----------+-----------+-----------+-----------+-----------+
> 1x  5574.9000   237.4997  5618.0000    94.0366     0.77311
> 2x  2741.5000   561.3090  3332.0000   102.4738    21.53930
> 3x  2146.2500   216.7718  2302.3333    76.3870     7.27237
> 4x  1663.0000   141.9235  1753.7500    83.5220     5.45701
> +-----------+-----------+-----------+-----------+-----------+

This looks good. Are your ebizzy results consistent run to run though?

> +-----------+-----------+-----------+-----------+-----------+
>              dbench (Throughput) higher is better
> +-----------+-----------+-----------+-----------+-----------+
>          base      stdev    patched      stdev  %improvement
> +-----------+-----------+-----------+-----------+-----------+
> 1x 14111.5600   754.4525 14645.9900   114.3087     3.78718
> 2x  2481.6270    71.2665  2667.1280    73.8193     7.47498
> 3x  1510.2483    31.8634  1503.8792    36.0777    -0.42173
> 4x  1029.4875    16.9166  1039.7069    43.8840     0.99267
> +-----------+-----------+-----------+-----------+-----------+

Hmm, I wonder what 2.5x looks like. Also, the 3% improvement with no
overcommit is interesting. What's happening there? It makes me wonder
what < 1x looks like.

thanks,
drew
Re: [PATCH RFC V10 0/18] Paravirtualized ticket spinlocks
On 06/24/2013 06:47 PM, Andrew Jones wrote:
> On Mon, Jun 24, 2013 at 06:10:14PM +0530, Raghavendra K T wrote:
>> Results:
>> ========
>> base = 3.10-rc2 kernel
>> patched = base + this series
>>
>> The test was on 32 core (model: Intel(R) Xeon(R) CPU X7560), HT
>> disabled, with a 32 vcpu KVM guest, 8GB RAM.
>
> Have you ever tried to get results with HT enabled?

I have not done it yet with the latest. I will get that result.

>> +-----------+-----------+-----------+-----------+-----------+
>>              ebizzy (records/sec) higher is better
>> +-----------+-----------+-----------+-----------+-----------+
>>          base      stdev    patched      stdev  %improvement
>> +-----------+-----------+-----------+-----------+-----------+
>> 1x  5574.9000   237.4997  5618.0000    94.0366     0.77311
>> 2x  2741.5000   561.3090  3332.0000   102.4738    21.53930
>> 3x  2146.2500   216.7718  2302.3333    76.3870     7.27237
>> 4x  1663.0000   141.9235  1753.7500    83.5220     5.45701
>> +-----------+-----------+-----------+-----------+-----------+
>
> This looks good. Are your ebizzy results consistent run to run though?

Yes, ebizzy looked more consistent.

>> +-----------+-----------+-----------+-----------+-----------+
>>              dbench (Throughput) higher is better
>> +-----------+-----------+-----------+-----------+-----------+
>>          base      stdev    patched      stdev  %improvement
>> +-----------+-----------+-----------+-----------+-----------+
>> 1x 14111.5600   754.4525 14645.9900   114.3087     3.78718
>> 2x  2481.6270    71.2665  2667.1280    73.8193     7.47498
>> 3x  1510.2483    31.8634  1503.8792    36.0777    -0.42173
>> 4x  1029.4875    16.9166  1039.7069    43.8840     0.99267
>> +-----------+-----------+-----------+-----------+-----------+
>
> Hmm, I wonder what 2.5x looks like. Also, the 3% improvement with no
> overcommit is interesting. What's happening there? It makes me wonder
> what < 1x looks like.

I'll try to get 0.5x and 2.5x runs for dbench.

> thanks,
> drew