Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/30/2012 04:56 PM, Raghavendra K T wrote:
On 05/16/2012 08:49 AM, Raghavendra K T wrote:
On 05/14/2012 12:15 AM, Raghavendra K T wrote:
On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come up with pv-flush results (also Nikunj had clarified that the result was on non-PLE).

I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.

[...]

To summarise, with a 32 vcpu guest and nr_threads=32 we get around 27% improvement. In very low/undercommitted systems we may see a very small improvement, or a small acceptable degradation (which is to be expected).

For large guests, the current SPIN_THRESHOLD value, along with ple_window, needed some research/experimentation. [Thanks to Jeremy/Nikunj for inputs and help in result analysis.]

I started with the debugfs spinlock histograms, and ran experiments with 32 and 64 vcpu guests for spin thresholds of 2k, 4k, 8k, 16k and 32k, with 1vm/2vm/4vm, for kernbench, sysbench, ebizzy and hackbench. [The spinlock histogram gives a logarithmic view of lock-wait times.]

Machine: PLE machine with 32 cores.

Here is the result summary. The summary includes two parts:
(1) %improvement w.r.t. the 2K spin threshold,
(2) improvement w.r.t. the sum of the histogram numbers in debugfs (which gives a rough indication of contention / cpu time wasted).

For example, 98% for the 4k threshold, kbench, 1 vm would imply there is a 98% reduction in sigma(histogram values) compared to the 2k case.

Result for 32 vcpu guest
========================
+----------------+------+------+------+------+
| Base-2k        |   4k |   8k |  16k |  32k |
+----------------+------+------+------+------+
| kbench-1vm     |   44 |   50 |   46 |   41 |
| SPINHisto-1vm  |   98 |   99 |   99 |   99 |
| kbench-2vm     |   25 |   45 |   49 |   45 |
| SPINHisto-2vm  |   31 |   91 |   99 |   99 |
| kbench-4vm     |  -13 |  -27 |   -2 |   -4 |
| SPINHisto-4vm  |   29 |   66 |   95 |   99 |
+----------------+------+------+------+------+
| ebizzy-1vm     |  954 |  942 |  913 |  915 |
| SPINHisto-1vm  |   96 |   99 |   99 |   99 |
| ebizzy-2vm     |  158 |  135 |  123 |  106 |
| SPINHisto-2vm  |   90 |   98 |   99 |   99 |
| ebizzy-4vm     |  -13 |  -28 |  -33 |  -37 |
| SPINHisto-4vm  |   83 |   98 |   99 |   99 |
+----------------+------+------+------+------+
| hbench-1vm     |   48 |   56 |   52 |   64 |
| SPINHisto-1vm  |   92 |   95 |   99 |   99 |
| hbench-2vm     |   32 |   40 |   39 |   21 |
| SPINHisto-2vm  |   74 |   96 |   99 |   99 |
| hbench-4vm     |   27 |   15 |    3 |  -57 |
| SPINHisto-4vm  |   68 |   88 |   94 |   97 |
+----------------+------+------+------+------+
| sysbnch-1vm    |    0 |    0 |    1 |    0 |
| SPINHisto-1vm  |   76 |   98 |   99 |   99 |
| sysbnch-2vm    |   -1 |    3 |   -1 |   -4 |
| SPINHisto-2vm  |   82 |   94 |   96 |   99 |
| sysbnch-4vm    |    0 |   -2 |   -8 |  -14 |
| SPINHisto-4vm  |   57 |   79 |   88 |   95 |
+----------------+------+------+------+------+

Result for 64 vcpu guest
========================
+----------------+------+------+------+------+
| Base-2k        |   4k |   8k |  16k |  32k |
+----------------+------+------+------+------+
| kbench-1vm     |    1 |  -11 |  -25 |   31 |
| SPINHisto-1vm  |    3 |   10 |   47 |   99 |
| kbench-2vm     |   15 |   -9 |  -66 |  -15 |
| SPINHisto-2vm  |    2 |   11 |   19 |   90 |
+----------------+------+------+------+------+
| ebizzy-1vm     |  784 | 1097 |  978 |  930 |
| SPINHisto-1vm  |   74 |   97 |   98 |   99 |
| ebizzy-2vm     |   43 |   48 |   56 |   32 |
| SPINHisto-2vm  |   58 |   93 |   97 |   98 |
+----------------+------+------+------+------+
| hbench-1vm     |    8 |   55 |   56 |   62 |
| SPINHisto-1vm  |   18 |   69 |   96 |   99 |
| hbench-2vm     |   13 |  -14 |  -75 |  -29 |
| SPINHisto-2vm  |   57 |   74 |   80 |   97 |
+----------------+------+------+------+------+
| sysbnch-1vm    |    9 |   11 |   15 |   10 |
| SPINHisto-1vm  |   80 |   93 |   98 |   99 |
| sysbnch-2vm    |    3 |    3 |    4 |    2 |
| SPINHisto-2vm  |   72 |   89 |   94 |   97 |
+----------------+------+------+------+------+

From this, a value around the 4k-8k threshold seems to be the optimal one. [This is almost in line with the ple_window default.] (The lower the spin threshold, the smaller the fraction of spinlock waits we cover, which results in more halt exits/wakeups.)

[ www.xen.org/files/xensummitboston08/LHP.pdf also has good graphical detail on covering spinlock waits ]

After the 8k threshold we see no more contention, but that would mean we have wasted a lot of cpu time in busy waits.

Will get a PLE machine again, and I'll continue experimenting with further tuning of SPIN_THRESHOLD.

Sorry for the delayed response; was doing too much analysis and experimenting. I continued my experiments with the spin threshold, but unfortunately could not settle on which of the 4k/8k thresholds is better, since it depends on the load and type of workload.

Here is the result for a 32 vcpu guest, for sysbench and kernbench, for four 8GB-RAM VMs on the same PLE machine, with:
1x: benchmark running on 1 guest
2x: same benchmark running on 2 guests
and so on.

The 1x run is taken over 8*3 run averages, the 2x run over 4*3 runs, the 3x run over 6*3, and the 4x run over 4*3.

kernbench = total j
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/14/2012 12:15 AM, Raghavendra K T wrote:
On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come up with pv-flush results (also Nikunj had clarified that the result was on non-PLE).

I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.

3 guests with 8GB RAM, 1 used for kernbench (kernbench -f -H -M -o 20), the others for cpuhog (a shell script looping "while true; do hackbench; done").

1x: no hogs
2x: 8 hogs in one guest
3x: 8 hogs each in two guests

kernbench on PLE:
Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU, 32 cores, with 8 online cpus and 4*64GB RAM.

The average is taken over 4 iterations with 3 runs each (4*3=12), and stdev is calculated over the mean reported in each run.

A): 8 vcpu guest
            BASE                   BASE+patch             %improvement w.r.t.
            mean (sd)              mean (sd)              patched kernel time
case 1*1x:  61.7075 (1.17872)      60.93 (1.475625)       1.27605
case 1*2x:  107.2125 (1.3821349)   97.506675 (1.3461878)  9.95401
case 1*3x:  144.3515 (1.8203927)   138.9525 (0.58309319)  3.8855

B): 16 vcpu guest
            BASE                   BASE+patch             %improvement w.r.t.
            mean (sd)              mean (sd)              patched kernel time
case 2*1x:  70.524 (1.5941395)     69.68866 (1.9392529)   1.19867
case 2*2x:  133.0738 (1.4558653)   124.8568 (1.4544986)   6.58114
case 2*3x:  206.0094 (1.3437359)   181.4712 (2.9134116)   13.5218

C): 32 vcpu guest
            BASE                   BASE+patch             %improvement w.r.t.
            mean (sd)              mean (sd)              patched kernel time
case 4*1x:  100.61046 (2.7603485)  85.48734 (2.6035035)   17.6905

It seems that while we do not see any improvement in the low contention case, the benefit becomes evident with overcommit and large guests. I am continuing the analysis with other benchmarks (now with pgbench, to check whether it has acceptable improvement/degradation in the low contention case).

Here are the results for pgbench and sysbench. These results are on a single guest.

Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU, 32 cores, with 8 online cpus and 4*64GB RAM.
Guest config: 8GB RAM

pgbench
=======
unit = tps (higher is better)
pgbench based on pgsql 9.2-dev: http://www.postgresql.org/ftp/snapshot/dev/ (link given by Attilo)
tool used to collect the benchmark: git://git.postgresql.org/git/pgbench-tools.git
config: MAX_WORKER=16 SCALE=32, run for NRCLIENTS = 1, 8, 64
Average taken over 10 iterations.

8 vcpu guest
N    base     patch    improvement
1    5271     5235     -0.687679
8    37953    38202    0.651798
64   37546    37774    0.60359

16 vcpu guest
N    base     patch    improvement
1    5229     5239     0.190876
8    34908    36048    3.16245
64   51796    52852    1.99803

sysbench
========
sysbench 0.4.12 configured for the postgres driver, run with
sysbench --num-threads=8/16/32 --max-requests=10 --test=oltp --oltp-table-size=50 --db-driver=pgsql --oltp-read-only
and analysed with ministat, with x = patch and + = base.

8 vcpu guest
------------
1) num_threads = 8
    N   Min      Max      Median    Avg       Stddev
x  10   20.7805  21.55    20.9667   21.03502  0.22682186
+  10   21.025   22.3122  21.29535  21.41793  0.39542349
Difference at 98.0% confidence: 1.82035% +/- 1.74892%

2) num_threads = 16
    N   Min      Max      Median    Avg       Stddev
x  10   20.8786  21.3967  21.1566   21.14441  0.15490983
+  10   21.3992  21.9437  21.46235  21.58724  0.2089425
Difference at 98.0% confidence: 2.09431% +/- 0.992732%

3) num_threads = 32
    N   Min      Max      Median    Avg       Stddev
x  10   21.1329  21.3726  21.33415  21.2893   0.08324195
+  10   21.5692  21.8966  21.6441   21.65679  0.093430003
Difference at 98.0% confidence: 1.72617% +/- 0.474343%

16 vcpu guest
-------------
1) num_threads = 8
    N   Min      Max      Median    Avg       Stddev
x  10   23.5314  25.6118  24.76145  24.64517  0.74856264
+  10   22.2675  26.6204  22.9131   23.50554  1.345386
No difference proven at 98.0% confidence

2) num_threads = 16
    N   Min      Max      Median    Avg       Stddev
x  10   12.0095  12.2305  12.15575  12.13926  0.070872722
+  10   11.413   11.6986  11.4817   11.493    0.080007819
Difference at 98.0% confidence: -5.32372% +/- 0.710561%

3) num_threads = 32
    N   Min      Max      Median    Avg       Stddev
x  10   12.1378  12.3567  12.21675  12.22703  0.0670695
+  10   11.573   11.7438  11.6306   11.64905  0.062780221
Difference at 98.0% confidence: -4.72707% +/- 0.606349%

32 vcpu guest
-------------
1) num_threads = 8
    N   Min      Max      Median    Avg       Stddev
x  10   30.5602  41.4756
Re: [Xen-devel] [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
>>> On 07.05.12 at 19:25, Ingo Molnar wrote:

(apologies for the late reply, the mail just now made it to my inbox via xen-devel)

> I'll hold off on the whole thing - frankly, we don't want this
> kind of Xen-only complexity. If KVM can make use of PLE then Xen
> ought to be able to do it as well.

It does - for fully virtualized guests. For para-virtualized ones, it can't (as the hardware feature is an extension to VMX/SVM).

> If both Xen and KVM makes good use of it then that's a different
> matter.

I saw in a later reply that you're now tending towards trying it out at least - thanks.

Jan
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/14/2012 10:27 AM, Nikunj A Dadhania wrote:
On Mon, 14 May 2012 00:15:30 +0530, Raghavendra K T wrote:
On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come up with pv-flush results (also Nikunj had clarified that the result was on non-PLE).

Did you see any issues on PLE?

No, I did not see issues in the setup, but I have not had time to check that out yet.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/14/2012 01:08 PM, Jeremy Fitzhardinge wrote:
On 05/13/2012 11:45 AM, Raghavendra K T wrote:
On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come up with pv-flush results (also Nikunj had clarified that the result was on non-PLE).

I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.

3 guests with 8GB RAM, 1 used for kernbench (kernbench -f -H -M -o 20), the others for cpuhog (a shell script looping "while true; do hackbench; done").

1x: no hogs
2x: 8 hogs in one guest
3x: 8 hogs each in two guests

kernbench on PLE:
Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU, 32 cores, with 8 online cpus and 4*64GB RAM.

The average is taken over 4 iterations with 3 runs each (4*3=12), and stdev is calculated over the mean reported in each run.

A): 8 vcpu guest
            BASE                   BASE+patch             %improvement w.r.t.
            mean (sd)              mean (sd)              patched kernel time
case 1*1x:  61.7075 (1.17872)      60.93 (1.475625)       1.27605
case 1*2x:  107.2125 (1.3821349)   97.506675 (1.3461878)  9.95401
case 1*3x:  144.3515 (1.8203927)   138.9525 (0.58309319)  3.8855

B): 16 vcpu guest
            BASE                   BASE+patch             %improvement w.r.t.
            mean (sd)              mean (sd)              patched kernel time
case 2*1x:  70.524 (1.5941395)     69.68866 (1.9392529)   1.19867
case 2*2x:  133.0738 (1.4558653)   124.8568 (1.4544986)   6.58114
case 2*3x:  206.0094 (1.3437359)   181.4712 (2.9134116)   13.5218

C): 32 vcpu guest
            BASE                   BASE+patch             %improvement w.r.t.
            mean (sd)              mean (sd)              patched kernel time
case 4*1x:  100.61046 (2.7603485)  85.48734 (2.6035035)   17.6905

What does the "4*1x" notation mean? Do these workloads have overcommit of the PCPU resources?

When I measured it, even quite small amounts of overcommit lead to large performance drops with non-pv ticket locks (on the order of 10% improvements when there were 5 busy VCPUs on a 4 cpu system). I never tested it on larger machines, but I guess that represents around 25% overcommit, or 40 busy VCPUs on a 32-PCPU system.

All the above measurements are on the PLE machine. It is a 32 vcpu single guest on 8 pcpus.

(PS: One problem I saw in my kernbench run itself is that the number of threads spawned = 20 instead of 2 * number of vcpus. I'll correct that during the next measurement.)

"even quite small amounts of overcommit lead to large performance drops with non-pv ticket locks": this is very much true on a non-PLE machine; compilation probably takes even a day vs. just one hour (with just 1:3x overcommit I had got a 25x speedup).
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/13/2012 11:45 AM, Raghavendra K T wrote:
> On 05/07/2012 08:22 PM, Avi Kivity wrote:
>
> I could not come up with pv-flush results (also Nikunj had clarified
> that the result was on non-PLE)
>
>> I'd like to see those numbers, then.
>>
>> Ingo, please hold on the kvm-specific patches, meanwhile.
>
> 3 guests with 8GB RAM, 1 used for kernbench
> (kernbench -f -H -M -o 20), the others for cpuhog (a shell script
> looping "while true; do hackbench; done")
>
> 1x: no hogs
> 2x: 8 hogs in one guest
> 3x: 8 hogs each in two guests
>
> kernbench on PLE:
> Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32
> core, with 8 online cpus and 4*64GB RAM.
>
> The average is taken over 4 iterations with 3 runs each (4*3=12), and
> stdev is calculated over the mean reported in each run.
>
> A): 8 vcpu guest
>             BASE                   BASE+patch             %improvement w.r.t.
>             mean (sd)              mean (sd)              patched kernel time
> case 1*1x:  61.7075 (1.17872)      60.93 (1.475625)       1.27605
> case 1*2x:  107.2125 (1.3821349)   97.506675 (1.3461878)  9.95401
> case 1*3x:  144.3515 (1.8203927)   138.9525 (0.58309319)  3.8855
>
> B): 16 vcpu guest
>             BASE                   BASE+patch             %improvement w.r.t.
>             mean (sd)              mean (sd)              patched kernel time
> case 2*1x:  70.524 (1.5941395)     69.68866 (1.9392529)   1.19867
> case 2*2x:  133.0738 (1.4558653)   124.8568 (1.4544986)   6.58114
> case 2*3x:  206.0094 (1.3437359)   181.4712 (2.9134116)   13.5218
>
> C): 32 vcpu guest
>             BASE                   BASE+patch             %improvement w.r.t.
>             mean (sd)              mean (sd)              patched kernel time
> case 4*1x:  100.61046 (2.7603485)  85.48734 (2.6035035)   17.6905

What does the "4*1x" notation mean? Do these workloads have overcommit of the PCPU resources?

When I measured it, even quite small amounts of overcommit lead to large performance drops with non-pv ticket locks (on the order of 10% improvements when there were 5 busy VCPUs on a 4 cpu system). I never tested it on larger machines, but I guess that represents around 25% overcommit, or 40 busy VCPUs on a 32-PCPU system.

J
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On Mon, 14 May 2012 00:15:30 +0530, Raghavendra K T wrote:
> On 05/07/2012 08:22 PM, Avi Kivity wrote:
>
> I could not come up with pv-flush results (also Nikunj had clarified that
> the result was on non-PLE)

Did you see any issues on PLE?

Regards,
Nikunj
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 08:22 PM, Avi Kivity wrote:

I could not come up with pv-flush results (also Nikunj had clarified that the result was on non-PLE).

I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.

3 guests with 8GB RAM, 1 used for kernbench (kernbench -f -H -M -o 20), the others for cpuhog (a shell script looping "while true; do hackbench; done").

1x: no hogs
2x: 8 hogs in one guest
3x: 8 hogs each in two guests

kernbench on PLE:
Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU, 32 cores, with 8 online cpus and 4*64GB RAM.

The average is taken over 4 iterations with 3 runs each (4*3=12), and stdev is calculated over the mean reported in each run.

A): 8 vcpu guest
            BASE                   BASE+patch             %improvement w.r.t.
            mean (sd)              mean (sd)              patched kernel time
case 1*1x:  61.7075 (1.17872)      60.93 (1.475625)       1.27605
case 1*2x:  107.2125 (1.3821349)   97.506675 (1.3461878)  9.95401
case 1*3x:  144.3515 (1.8203927)   138.9525 (0.58309319)  3.8855

B): 16 vcpu guest
            BASE                   BASE+patch             %improvement w.r.t.
            mean (sd)              mean (sd)              patched kernel time
case 2*1x:  70.524 (1.5941395)     69.68866 (1.9392529)   1.19867
case 2*2x:  133.0738 (1.4558653)   124.8568 (1.4544986)   6.58114
case 2*3x:  206.0094 (1.3437359)   181.4712 (2.9134116)   13.5218

C): 32 vcpu guest
            BASE                   BASE+patch             %improvement w.r.t.
            mean (sd)              mean (sd)              patched kernel time
case 4*1x:  100.61046 (2.7603485)  85.48734 (2.6035035)   17.6905

It seems that while we do not see any improvement in the low contention case, the benefit becomes evident with overcommit and large guests. I am continuing the analysis with other benchmarks (now with pgbench, to check whether it has acceptable improvement/degradation in the low contention case).

Avi,
Can the patch series go ahead for inclusion into the tree, for the following reasons?

The patch series brings fairness with the ticketlock (and hence predictability, since during contention a vcpu trying to acquire the lock is sure that it gets its turn in less than the total number of vcpus contending for the lock), which is very much desired irrespective of its low benefit/degradation (if any) in low contention scenarios.

Of course, ticketlocks had the undesirable effect of exploding the LHP problem, and the series addresses that with an improvement in scheduling, and by sleeping instead of burning cpu time.

Finally, a less famous one: it brings an almost PLE-equivalent capability to all the non-PLE hardware (TBH I always preferred my experiment kernel to be compiled in my pv guest; that saves more than 30 min of time for each run).

It would be nice to see any results if somebody got benefited/suffered with the patchset.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 05:36 PM, Avi Kivity wrote:
On 05/07/2012 01:58 PM, Raghavendra K T wrote:
On 05/07/2012 02:02 PM, Avi Kivity wrote:
On 05/07/2012 11:29 AM, Ingo Molnar wrote:

(Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).

            BASE                BASE+patch          %improvement
            mean (sd)           mean (sd)
case 1x:    66.0566 (74.0304)   61.3233 (68.8299)   7.16552
case 2x:    1253.2 (1795.74)    131.606 (137.358)   89.4984
case 3x:    3431.04 (5297.26)   134.964 (149.861)   96.0664

You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster.

The speedup calculation is clear. I think the confusion for me was more because of the types of benchmarks. I always did |(patch - base)| * 100 / base.

So, for:

(1) "less is better" benchmarks, the improvement calculation would be
    |(patched - base)| * 100 / patched
    e.g. for kernbench, suppose base = 150 sec and patched = 100 sec:
    improvement = 50% (= 33% degradation of base)

(2) "higher is better" benchmarks, the improvement calculation would be
    |(patched - base)| * 100 / base
    e.g. for pgbench/ebizzy, base = 100 tps (transactions per sec) and patched = 150 tps:
    improvement = 50% of the patched kernel (or 33% degradation of base)

Is this what is generally done? Just wanted to be on the same page before publishing benchmark results other than kernbench.
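For concreteness, here is a small stand-alone C snippet (illustrative only, not part of the thread) that plugs the "case 3x" kernbench numbers quoted above into the three conventions being discussed: the reduction relative to the base run (what the posted table reported), the gain relative to the patched run, and the plain speedup factor Avi quoted (~25x).

#include <stdio.h>

int main(void)
{
	/* "case 3x" kernbench times from the mail above (seconds, lower is better) */
	double base = 3431.04, patched = 134.964;

	/* what the posted table reported: reduction relative to the base run (~96%) */
	printf("reduction vs base : %.2f%%\n", (base - patched) * 100.0 / base);

	/* the "w.r.t. patched" convention described above (~2400%) */
	printf("gain vs patched   : %.2f%%\n", (base - patched) * 100.0 / patched);

	/* the speedup factor Avi quoted (~25x) */
	printf("speedup factor    : %.1fx\n", base / patched);

	return 0;
}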
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On Mon, 7 May 2012 22:42:30 +0200 (CEST), Thomas Gleixner wrote:
> On Mon, 7 May 2012, Ingo Molnar wrote:
> > * Avi Kivity wrote:
> > > > PS: Nikunj had experimented that pv-flush tlb +
> > > > paravirt-spinlock is a win on PLE where only one of them
> > > > alone could not prove the benefit.

I do not have PLE numbers yet for pvflush and pvspinlock. I have seen, on non-PLE with the pvflush and pvspinlock patches, that kernbench, ebizzy, specjbb, hackbench and dbench all improved. I am currently chasing a race on the pv-flush path; it is causing file-system corruption. I will post these numbers along with my v2 post.

> > > I'd like to see those numbers, then.
> > >
> > > Ingo, please hold on the kvm-specific patches, meanwhile.
> >
> > I'll hold off on the whole thing - frankly, we don't want this
> > kind of Xen-only complexity. If KVM can make use of PLE then Xen
> > ought to be able to do it as well.
> >
> > If both Xen and KVM makes good use of it then that's a different
> > matter.
>
> Aside of that, it's kinda strange that a dude named "Nikunj" is
> referenced in the argument chain, but I can't find him on the CC list.

/me waves my hand

Regards,
Nikunj
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/08/2012 02:15 AM, Jeremy Fitzhardinge wrote:
> On 05/07/2012 06:49 AM, Avi Kivity wrote:
> > On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote:
> >> * Raghavendra K T [2012-05-07 19:08:51]:
> >>
> >>> I'll get hold of a PLE mc and come up with the numbers soon. but I
> >>> 'll expect the improvement around 1-3% as it was in last version.
> >> Deferring preemption (when vcpu is holding lock) may give us better than
> >> 1-3% results on PLE hardware. Something worth trying IMHO.
> > Is the improvement so low, because PLE is interfering with the patch, or
> > because PLE already does a good job?
>
> How does PLE help with ticket scheduling on unlock? I thought it would
> just help with the actual spin loops.

PLE yields to a random vcpu, hoping it is the lock holder. This patchset wakes up the right vcpu. For small vcpu counts the difference is a few bad wakeups (and even a bad wakeup sometimes works, since it can put the spinner to sleep for a bit). I expect that large vcpu counts would show a greater difference.

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 08:22 PM, Avi Kivity wrote:
On 05/07/2012 05:47 PM, Raghavendra K T wrote:

Not good. Solving a problem in software that is already solved by hardware? It's okay if there are no costs involved, but here we're introducing a new ABI that we'll have to maintain for a long time.

Hmm, agree that being a step ahead of mighty hardware (and just an improvement of 1-3%) is no good for the long term (where PLE is the future).

PLE is the present, not the future. It was introduced on later Nehalems and is present on all Westmeres. Two more processor generations have passed meanwhile. The AMD equivalent was also introduced around that timeframe.

Having said that, it is hard for me to resist saying: the bottleneck is somewhere else on the PLE machine, and IMHO the answer would be a combination of paravirt-spinlock + pv-flush-tlb. But I need to come up with good numbers to argue in favour of the claim.

PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a win on PLE where only one of them alone could not prove the benefit.

I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.

Hmm. I think I messed up the fact while saying 1-3% improvement on PLE. Going by what I had posted in https://lkml.org/lkml/2012/4/5/73 (with correct calculation):

1x   70.475 (85.6979)    63.5033 (72.7041)   15.7%
2x   110.971 (132.829)   105.099 (128.738)   5.56%
3x   150.265 (184.766)   138.341 (172.69)    8.62%

It was around 12% with the optimization patch posted separately with that (that one needs more experiments though). But anyway, I will come up with results for the current patch series.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/08/2012 04:45 AM, Jeremy Fitzhardinge wrote:
On 05/07/2012 06:49 AM, Avi Kivity wrote:
On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote:
* Raghavendra K T [2012-05-07 19:08:51]:

I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement to be around 1-3% as it was in the last version.

Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO.

Is the improvement so low, because PLE is interfering with the patch, or because PLE already does a good job?

How does PLE help with ticket scheduling on unlock? I thought it would just help with the actual spin loops.

Hmm, this suggests something to me. I think I should replace the "while 1" hog with some *real job* to measure the overcommit case. I hope to see greater improvements because of the fairness and scheduling of the patch-set. Maybe all this while I was measuring something equal to the 1x case.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 06:49 AM, Avi Kivity wrote:
> On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote:
>> * Raghavendra K T [2012-05-07 19:08:51]:
>>
>>> I 'll get hold of a PLE mc and come up with the numbers soon. but I
>>> 'll expect the improvement around 1-3% as it was in last version.
>> Deferring preemption (when vcpu is holding lock) may give us better than
>> 1-3% results on PLE hardware. Something worth trying IMHO.
> Is the improvement so low, because PLE is interfering with the patch, or
> because PLE already does a good job?

How does PLE help with ticket scheduling on unlock? I thought it would just help with the actual spin loops.

J
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On Mon, 7 May 2012, Ingo Molnar wrote:
> * Avi Kivity wrote:
>
> > > PS: Nikunj had experimented that pv-flush tlb +
> > > paravirt-spinlock is a win on PLE where only one of them
> > > alone could not prove the benefit.
> >
> > I'd like to see those numbers, then.
> >
> > Ingo, please hold on the kvm-specific patches, meanwhile.
>
> I'll hold off on the whole thing - frankly, we don't want this
> kind of Xen-only complexity. If KVM can make use of PLE then Xen
> ought to be able to do it as well.
>
> If both Xen and KVM makes good use of it then that's a different
> matter.

Aside of that, it's kinda strange that a dude named "Nikunj" is referenced in the argument chain, but I can't find him on the CC list.

Thanks,

	tglx
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Avi Kivity wrote:

> > PS: Nikunj had experimented that pv-flush tlb +
> > paravirt-spinlock is a win on PLE where only one of them
> > alone could not prove the benefit.
>
> I'd like to see those numbers, then.
>
> Ingo, please hold on the kvm-specific patches, meanwhile.

I'll hold off on the whole thing - frankly, we don't want this kind of Xen-only complexity. If KVM can make use of PLE then Xen ought to be able to do it as well.

If both Xen and KVM makes good use of it then that's a different matter.

Thanks,

	Ingo
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 05:47 PM, Raghavendra K T wrote:
>> Not good. Solving a problem in software that is already solved by
>> hardware? It's okay if there are no costs involved, but here we're
>> introducing a new ABI that we'll have to maintain for a long time.
>
> Hmm agree that being a step ahead of mighty hardware (and just an
> improvement of 1-3%) is no good for long term (where PLE is future).

PLE is the present, not the future. It was introduced on later Nehalems and is present on all Westmeres. Two more processor generations have passed meanwhile. The AMD equivalent was also introduced around that timeframe.

> Having said that, it is hard for me to resist saying:
> the bottleneck is somewhere else on the PLE machine and IMHO the answer would be
> a combination of paravirt-spinlock + pv-flush-tlb.
>
> But I need to come up with good numbers to argue in favour of the claim.
>
> PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a
> win on PLE where only one of them alone could not prove the benefit.

I'd like to see those numbers, then.

Ingo, please hold on the kvm-specific patches, meanwhile.

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 05:52 PM, Avi Kivity wrote:
> > Having said that, it is hard for me to resist saying:
> > the bottleneck is somewhere else on the PLE machine and IMHO the answer would be
> > a combination of paravirt-spinlock + pv-flush-tlb.
> >
> > But I need to come up with good numbers to argue in favour of the claim.
> >
> > PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a
> > win on PLE where only one of them alone could not prove the benefit.
>
> I'd like to see those numbers, then.

Note: it's probably best to try very wide guests, where the overhead of iterating on all vcpus begins to show.

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 07:28 PM, Avi Kivity wrote:
On 05/07/2012 04:53 PM, Raghavendra K T wrote:

Is the improvement so low, because PLE is interfering with the patch, or because PLE already does a good job?

It is because PLE already does a good job (of not burning cpu). The 1-3% improvement is because the patchset knows at least who is next to hold the lock, which is lacking in PLE.

Not good. Solving a problem in software that is already solved by hardware? It's okay if there are no costs involved, but here we're introducing a new ABI that we'll have to maintain for a long time.

Hmm, agree that being a step ahead of mighty hardware (and just an improvement of 1-3%) is no good for the long term (where PLE is the future).

Having said that, it is hard for me to resist saying: the bottleneck is somewhere else on the PLE machine, and IMHO the answer would be a combination of paravirt-spinlock + pv-flush-tlb.

But I need to come up with good numbers to argue in favour of the claim.

PS: Nikunj had experimented that pv-flush tlb + paravirt-spinlock is a win on PLE where only one of them alone could not prove the benefit.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 04:53 PM, Raghavendra K T wrote:
>> Is the improvement so low, because PLE is interfering with the patch, or
>> because PLE already does a good job?
>
> It is because PLE already does a good job (of not burning cpu). The
> 1-3% improvement is because the patchset knows at least who is next to hold
> the lock, which is lacking in PLE.

Not good. Solving a problem in software that is already solved by hardware? It's okay if there are no costs involved, but here we're introducing a new ABI that we'll have to maintain for a long time.

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 07:16 PM, Srivatsa Vaddagiri wrote:
* Raghavendra K T [2012-05-07 19:08:51]:

I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement to be around 1-3% as it was in the last version.

Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO.

Yes, sure. I'll take this up, and any further scalability improvement possible.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Avi Kivity [2012-05-07 16:49:25]:

> > Deferring preemption (when vcpu is holding lock) may give us better than
> > 1-3% results on PLE hardware. Something worth trying IMHO.
>
> Is the improvement so low, because PLE is interfering with the patch, or
> because PLE already does a good job?

I think it's the latter (PLE already doing a good job).

- vatsa
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 07:19 PM, Avi Kivity wrote:
On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote:
* Raghavendra K T [2012-05-07 19:08:51]:

I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement to be around 1-3% as it was in the last version.

Deferring preemption (when the vcpu is holding a lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO.

Is the improvement so low, because PLE is interfering with the patch, or because PLE already does a good job?

It is because PLE already does a good job (of not burning cpu). The 1-3% improvement is because the patchset knows at least who is next to hold the lock, which is lacking in PLE.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote:
> * Raghavendra K T [2012-05-07 19:08:51]:
>
> > I 'll get hold of a PLE mc and come up with the numbers soon. but I
> > 'll expect the improvement around 1-3% as it was in last version.
>
> Deferring preemption (when vcpu is holding lock) may give us better than 1-3%
> results on PLE hardware. Something worth trying IMHO.

Is the improvement so low, because PLE is interfering with the patch, or because PLE already does a good job?

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
* Raghavendra K T [2012-05-07 19:08:51]:

> I 'll get hold of a PLE mc and come up with the numbers soon. but I
> 'll expect the improvement around 1-3% as it was in last version.

Deferring preemption (when vcpu is holding lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO.

- vatsa
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 06:52 PM, Avi Kivity wrote:
On 05/07/2012 04:20 PM, Raghavendra K T wrote:
On 05/07/2012 05:36 PM, Avi Kivity wrote:
On 05/07/2012 01:58 PM, Raghavendra K T wrote:
On 05/07/2012 02:02 PM, Avi Kivity wrote:
On 05/07/2012 11:29 AM, Ingo Molnar wrote:

This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree?

No objections, instead an
Acked-by: Avi Kivity

[...]

(Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).

            BASE                BASE+patch          %improvement
            mean (sd)           mean (sd)
case 1x:    66.0566 (74.0304)   61.3233 (68.8299)   7.16552
case 2x:    1253.2 (1795.74)    131.606 (137.358)   89.4984
case 3x:    3431.04 (5297.26)   134.964 (149.861)   96.0664

You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster.

You are right, my %improvement was intended to be like:
1) base takes 100 sec ==> patch takes 93 sec
2) base takes 100 sec ==> patch takes 11 sec
3) base takes 100 sec ==> patch takes 4 sec

The above is more confusing (and incorrect!).

Better is what you told, which boils down to 10x and 25x improvement in case 2 and case 3. And IMO this *really* gives the feeling of the magnitude of improvement with the patches. I'll change the script to report it that way :).

btw, this is on non-PLE hardware, right? What are the numbers for PLE?

Sure. I'll get hold of a PLE machine and come up with the numbers soon, but I'll expect the improvement to be around 1-3% as it was in the last version.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 04:20 PM, Raghavendra K T wrote:
> On 05/07/2012 05:36 PM, Avi Kivity wrote:
>> On 05/07/2012 01:58 PM, Raghavendra K T wrote:
>>> On 05/07/2012 02:02 PM, Avi Kivity wrote:
>>>> On 05/07/2012 11:29 AM, Ingo Molnar wrote:
>>>>> This is looking pretty good and complete now - any objections
>>>>> from anyone to trying this out in a separate x86 topic tree?
>>>> No objections, instead an
>>>> Acked-by: Avi Kivity
>>> [...]
>>>
>>> (Less is better. Below is time elapsed in sec for x86_64_defconfig
>>> (3+3 runs)).
>>>
>>>             BASE                BASE+patch          %improvement
>>>             mean (sd)           mean (sd)
>>> case 1x:    66.0566 (74.0304)   61.3233 (68.8299)   7.16552
>>> case 2x:    1253.2 (1795.74)    131.606 (137.358)   89.4984
>>> case 3x:    3431.04 (5297.26)   134.964 (149.861)   96.0664
>>
>> You're calculating the improvement incorrectly. In the last case, it's
>> not 96%, rather it's 2400% (25x). Similarly the second case is about
>> 900% faster.
>
> You are right,
> my %improvement was intended to be like
> if
> 1) base takes 100 sec ==> patch takes 93 sec
> 2) base takes 100 sec ==> patch takes 11 sec
> 3) base takes 100 sec ==> patch takes 4 sec
>
> The above is more confusing (and incorrect!).
>
> Better is what you told which boils to 10x and 25x improvement in case
> 2 and case 3. And IMO, this *really* gives the feeling of magnitude of
> improvement with patches.
>
> I ll change script to report that way :).

btw, this is on non-PLE hardware, right? What are the numbers for PLE?

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 05:36 PM, Avi Kivity wrote:
On 05/07/2012 01:58 PM, Raghavendra K T wrote:
On 05/07/2012 02:02 PM, Avi Kivity wrote:
On 05/07/2012 11:29 AM, Ingo Molnar wrote:

This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree?

No objections, instead an
Acked-by: Avi Kivity

[...]

(Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).

            BASE                BASE+patch          %improvement
            mean (sd)           mean (sd)
case 1x:    66.0566 (74.0304)   61.3233 (68.8299)   7.16552
case 2x:    1253.2 (1795.74)    131.606 (137.358)   89.4984
case 3x:    3431.04 (5297.26)   134.964 (149.861)   96.0664

You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster.

You are right, my %improvement was intended to be like:
1) base takes 100 sec ==> patch takes 93 sec
2) base takes 100 sec ==> patch takes 11 sec
3) base takes 100 sec ==> patch takes 4 sec

The above is more confusing (and incorrect!).

Better is what you told, which boils down to 10x and 25x improvement in case 2 and case 3. And IMO this *really* gives the feeling of the magnitude of improvement with the patches.

I'll change the script to report it that way :).
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 01:58 PM, Raghavendra K T wrote:
> On 05/07/2012 02:02 PM, Avi Kivity wrote:
>> On 05/07/2012 11:29 AM, Ingo Molnar wrote:
>>> This is looking pretty good and complete now - any objections
>>> from anyone to trying this out in a separate x86 topic tree?
>>
>> No objections, instead an
>>
>> Acked-by: Avi Kivity
>
> Thank you.
>
> Here is a benchmark result with the patches.
>
> 3 guests with 8VCPU, 8GB RAM, 1 used for kernbench
> (kernbench -f -H -M -o 20) other for cpuhog (shell script while
> true with an instruction)
>
> unpinned scenario
> 1x: no hogs
> 2x: 8hogs in one guest
> 3x: 8hogs each in two guest
>
> BASE: 3.4-rc4 vanilla with CONFIG_PARAVIRT_SPINLOCK=n
> BASE+patch: 3.4-rc4 + debugfs + pv patches with
> CONFIG_PARAVIRT_SPINLOCK=y
>
> Machine : IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU (Non
> PLE) with 8 core , 64GB RAM
>
> (Less is better. Below is time elapsed in sec for x86_64_defconfig
> (3+3 runs)).
>
>             BASE                BASE+patch          %improvement
>             mean (sd)           mean (sd)
> case 1x:    66.0566 (74.0304)   61.3233 (68.8299)   7.16552
> case 2x:    1253.2 (1795.74)    131.606 (137.358)   89.4984
> case 3x:    3431.04 (5297.26)   134.964 (149.861)   96.0664

You're calculating the improvement incorrectly. In the last case, it's not 96%, rather it's 2400% (25x). Similarly the second case is about 900% faster.

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 02:02 PM, Avi Kivity wrote:
On 05/07/2012 11:29 AM, Ingo Molnar wrote:

This is looking pretty good and complete now - any objections from anyone to trying this out in a separate x86 topic tree?

No objections, instead an
Acked-by: Avi Kivity

Thank you.

Here is a benchmark result with the patches.

3 guests with 8 VCPUs, 8GB RAM, 1 used for kernbench (kernbench -f -H -M -o 20), the others for cpuhog (a shell script looping "while true" with an instruction).

unpinned scenario
1x: no hogs
2x: 8 hogs in one guest
3x: 8 hogs each in two guests

BASE: 3.4-rc4 vanilla with CONFIG_PARAVIRT_SPINLOCK=n
BASE+patch: 3.4-rc4 + debugfs + pv patches with CONFIG_PARAVIRT_SPINLOCK=y

Machine: IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU (non-PLE), 8 cores, 64GB RAM

(Less is better. Below is time elapsed in sec for x86_64_defconfig (3+3 runs)).

            BASE                BASE+patch          %improvement
            mean (sd)           mean (sd)
case 1x:    66.0566 (74.0304)   61.3233 (68.8299)   7.16552
case 2x:    1253.2 (1795.74)    131.606 (137.358)   89.4984
case 3x:    3431.04 (5297.26)   134.964 (149.861)   96.0664

Will be working on further analysis with other benchmarks (pgbench/sysbench/ebizzy...) and further optimization.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 11:29 AM, Ingo Molnar wrote:
> This is looking pretty good and complete now - any objections
> from anyone to trying this out in a separate x86 topic tree?

No objections, instead an

Acked-by: Avi Kivity

--
error compiling committee.c: too many arguments to function
[PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementations for both Xen and KVM (targeted for the 3.5 window).

Note: This needs the debugfs changes patch that should be in Xen / linux-next:
https://lkml.org/lkml/2012/3/30/687

Changes in V8:
 - Rebased patches to 3.4-rc4
 - Combined the KVM changes with ticketlock + Xen changes (Ingo)
 - Removed CAP_PV_UNHALT since it is redundant (Avi). But note that we
   need a newer qemu which uses the KVM_GET_SUPPORTED_CPUID ioctl.
 - Rewrite GET_MP_STATE condition (Avi)
 - Make pv_unhalt = bool (Avi)
 - Move out reset pv_unhalt code to vcpu_run from vcpu_block (Gleb)
 - Documentation changes (Rob Landley)
 - Have a printk to recognize that paravirt spinlock is enabled (Nikunj)
 - Move the kick hypercall out of CONFIG_PARAVIRT_SPINLOCK now
   so that it can be used for other optimizations such as
   flush_tlb_ipi_others etc. (Nikunj)

Ticket locks have an inherent problem in a virtualized case, because the vCPUs are scheduled rather than running concurrently (ignoring gang scheduled vCPUs). This can result in catastrophic performance collapses when the vCPU scheduler doesn't schedule the correct "next" vCPU, and ends up scheduling a vCPU which burns its entire timeslice spinning. (Note that this is not the same problem as lock-holder preemption, which this series also addresses; that's also a problem, but not catastrophic).

(See Thomas Friebel's talk "Prevent Guests from Spinning Around", http://www.xen.org/files/xensummitboston08/LHP.pdf, for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer of indirection in front of all the spinlock functions, and defines a completely new implementation for Xen (and for other pvops users, but there are none at present).

PV ticketlocks keep the existing ticketlock implementation (fastpath) as-is, but add a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD iterations, then call out to the __ticket_lock_spinning() pvop, which allows a backend to block the vCPU rather than spinning. This pvop can set the lock into "slowpath state".

- When releasing a lock, if it is in "slowpath state", call __ticket_unlock_kick() to kick the next vCPU in line awake. If the lock is no longer in contention, it also clears the slowpath flag.

The "slowpath state" is stored in the LSB of the lock tail ticket. This has the effect of reducing the max number of CPUs by half (so a "small ticket" can deal with 128 CPUs, and a "large ticket" with 32768).

For KVM, one hypercall is introduced in the hypervisor that allows a vcpu to kick another vcpu out of halt state. The blocking of the vcpu is done using halt() in the (lock_spinning) slowpath.

Overall, it results in a large reduction in code, it makes the native and virtualized cases closer, and it removes a layer of indirection around all the spinlock functions.

The fast path (taking an uncontended lock which isn't in "slowpath" state) is optimal, identical to the non-paravirtualized case.
The inner part of the ticket lock code becomes:

	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;
	for (;;) {
		unsigned count = SPIN_THRESHOLD;
		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:	barrier();

which results in:

	push   %rbp
	mov    %rsp,%rbp

	mov    $0x200,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f	# Slowpath if lock in contention

	pop    %rbp
	retq

	### SLOWPATH START
1:	and    $-2,%edx
	movzbl %dl,%esi

2:	mov    $0x800,%eax
	jmp    4f

3:	pause
	sub    $0x1,%eax
	je     5f

4:	movzbl (%rdi),%ecx
	cmp    %cl,%dl
	jne    3b

	pop    %rbp
	retq

5:	callq  *__ticket_lock_spinning
	jmp    2b
	### SLOWPATH END

with CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly, where the fastpath case is straight through (taking the lock without contention), and the spin loop is out of line:

	push   %rbp
	mov    %rsp,%rbp

	mov    $0x100,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f

	pop    %rbp
	retq

	### SLOWPATH START
1:	pause
	movzbl (%rdi),%eax
	cmp    %dl,%al
	jne    1b

	pop    %rbp
	retq
	### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's "head" and fetch the slowpath flag from "tail". Th
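To make the lock/unlock interplay described above concrete, here is a small, self-contained C model of it. This is an illustrative sketch only, not the patch's actual code: TICKET_LOCK_INC, the stub function names and the simplified flag-clearing condition are assumptions made for the demo; in the real series the stubs correspond to pvops that halt() the waiting vcpu and kick it via a hypercall, and the flag is cleared when the lock is found to be uncontended.

/* pv_ticketlock_model.c -- illustrative model only, see note above */
#include <stdio.h>
#include <stdint.h>

#define TICKET_LOCK_INC       2u  /* tickets advance by 2 so the LSB of tail stays free */
#define TICKET_SLOWPATH_FLAG  1u  /* "slowpath state", kept in the LSB of the tail ticket */

struct tickets { uint8_t head, tail; };

/* Stand-ins for the pvops: the real backends halt the vcpu / kick it via hypercall. */
static void lock_spinning_stub(struct tickets *t, uint8_t want)
{
	t->tail |= TICKET_SLOWPATH_FLAG;  /* waiter marks the lock contended, then would halt() */
	printf("waiter for ticket %u would halt here\n", (unsigned)want);
}

static void unlock_kick_stub(uint8_t next)
{
	printf("unlocker would kick the vcpu waiting on ticket %u\n", (unsigned)next);
}

static void ticket_unlock(struct tickets *t)
{
	t->head += TICKET_LOCK_INC;
	if (t->tail & TICKET_SLOWPATH_FLAG) {
		/* someone blocked while waiting: wake the next ticket holder */
		unlock_kick_stub(t->head);
		/* simplified: drop the flag once the lock is completely free */
		if (t->head == (uint8_t)(t->tail & ~TICKET_SLOWPATH_FLAG))
			t->tail &= ~TICKET_SLOWPATH_FLAG;
	}
}

int main(void)
{
	struct tickets lk = { 0, 0 };

	/* uncontended: take ticket 0 (fast path), release it -- no flag, no kick */
	lk.tail += TICKET_LOCK_INC;
	ticket_unlock(&lk);

	/* contended: owner takes ticket 2, a waiter takes ticket 4 and gives up spinning */
	lk.tail += TICKET_LOCK_INC;
	lk.tail += TICKET_LOCK_INC;
	lock_spinning_stub(&lk, 4);
	ticket_unlock(&lk);  /* head advances to 4; flag is set, so ticket 4 gets kicked */
	return 0;
}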