Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On Tue, Jun 19, 2012 at 04:51:04PM -0400, Rik van Riel wrote: On Wed, 20 Jun 2012 01:50:50 +0530 Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote: In ple handler code, the last_boosted_vcpu (lbv) variable is serving as the reference point to start from when we enter. Also statistical analysis (below) is showing lbv is not very well distributed with the current approach. You are the second person to spot this bug today (yes, today). Due to time zones, the first person has not had a chance yet to test the patch below, which might fix the issue... Please let me know how it goes. ---8<--- If last_boosted_vcpu == 0, then we fall through all test cases and may end up with all VCPUs pouncing on vcpu 0. With a large enough guest, this can result in enormous runqueue lock contention, which can prevent vcpu0 from running, leading to a livelock. Changing < to <= makes sure we properly handle that case. Signed-off-by: Rik van Riel <r...@redhat.com> Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On Mon, 2012-07-02 at 10:49 -0400, Rik van Riel wrote: On 06/28/2012 06:55 PM, Vinod, Chegu wrote: Hello, I am just catching up on this email thread... Perhaps one of you may be able to help answer this query.. preferably along with some data. [BTW, I do understand the basic intent behind PLE in a typical [sweet spot] use case where there is over subscription etc. and the need to optimize the PLE handler in the host etc. ] In a use case where the host has fewer but much larger guests (say 40 VCPUs and higher) and there is no over subscription (i.e. # of vcpus across guests <= physical cpus in the host and perhaps each guest has their vcpus pinned to specific physical cpus for other reasons), I would like to understand if/how the PLE really helps ? For these use cases would it be ok to turn PLE off (ple_gap=0) since there is no real need to take an exit and find some other VCPU to yield to ? Yes, that should be ok. On a related note, I wonder if we should increase the ple_gap significantly. After all, 4096 cycles of spinning is not that much, when you consider how much time is spent doing the subsequent vmexit, scanning the other VCPU's status (200 cycles per cache miss), deciding what to do, maybe poking another CPU, and eventually a vmenter. A factor 4 increase in ple_gap might be what it takes to get the amount of time spent spinning equal to the amount of time spent on the host side doing KVM stuff... I was recently thinking the same thing as I have observed over 180,000 exits/sec from a 40-way VM on an 80-way host, where there should be no cpu overcommit. Also, the number of directed yields for this was only 1800/sec, so we have a 1% usefulness for our exits. I am wondering if the ple_window should be similar to the host scheduler task switching granularity, and not what we think a typical max cycles should be for holding a lock. BTW, I have a patch to add a couple PLE stats to kvmstat which I will send out shortly.
-Andrew
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/28/2012 06:55 PM, Vinod, Chegu wrote: Hello, I am just catching up on this email thread... Perhaps one of you may be able to help answer this query.. preferably along with some data. [BTW, I do understand the basic intent behind PLE in a typical [sweet spot] use case where there is over subscription etc. and the need to optimize the PLE handler in the host etc. ] In a use case where the host has fewer but much larger guests (say 40 VCPUs and higher) and there is no over subscription (i.e. # of vcpus across guests <= physical cpus in the host and perhaps each guest has their vcpus pinned to specific physical cpus for other reasons), I would like to understand if/how the PLE really helps ? For these use cases would it be ok to turn PLE off (ple_gap=0) since there is no real need to take an exit and find some other VCPU to yield to ? Yes, that should be ok. On a related note, I wonder if we should increase the ple_gap significantly. After all, 4096 cycles of spinning is not that much, when you consider how much time is spent doing the subsequent vmexit, scanning the other VCPU's status (200 cycles per cache miss), deciding what to do, maybe poking another CPU, and eventually a vmenter. A factor 4 increase in ple_gap might be what it takes to get the amount of time spent spinning equal to the amount of time spent on the host side doing KVM stuff... -- All rights reversed
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 07/02/2012 08:19 PM, Rik van Riel wrote: On 06/28/2012 06:55 PM, Vinod, Chegu wrote: Hello, I am just catching up on this email thread... Perhaps one of you may be able to help answer this query.. preferably along with some data. [BTW, I do understand the basic intent behind PLE in a typical [sweet spot] use case where there is over subscription etc. and the need to optimize the PLE handler in the host etc. ] In a use case where the host has fewer but much larger guests (say 40 VCPUs and higher) and there is no over subscription (i.e. # of vcpus across guests <= physical cpus in the host and perhaps each guest has their vcpus pinned to specific physical cpus for other reasons), I would like to understand if/how the PLE really helps ? For these use cases would it be ok to turn PLE off (ple_gap=0) since there is no real need to take an exit and find some other VCPU to yield to ? Yes, that should be ok. I think this should be true when we have ple_window tuned to the correct value for the guest (same as what you raised). But otherwise, IMO, it is a very tricky question to answer. PLE is currently benefiting even flush_tlb_ipi etc. apart from spinlock. Having a properly tuned value for all types of workload (+load) is really complicated. Coming back to the ple_handler, IMHO, if we have a slight increase in run_queue length, having directed yield may worsen the scenario. (In the case Vinod explained, even though we will succeed in setting the other vcpu task as next_buddy, the caller itself gets scheduled out, so the ganging effect reduces. On top of this we always have the question: have we chosen the right guy OR a really bad guy for yielding?) On a related note, I wonder if we should increase the ple_gap significantly. Did you mean ple_window? After all, 4096 cycles of spinning is not that much, when you consider how much time is spent doing the subsequent vmexit, scanning the other VCPU's status (200 cycles per cache miss), deciding what to do, maybe poking another CPU, and eventually a vmenter. 
A factor 4 increase in ple_gap might be what it takes to get the amount of time spent spinning equal to the amount of time spent on the host side doing KVM stuff... I agree. I am experimenting with all these things left and right, along with several optimization ideas I have. Hope to come back on the experiments soon.
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
- Original Message - In summary, current PV has huge benefit on non-PLE machine. On PLE machine, the results become very sensitive to load, type of workload and SPIN_THRESHOLD. Also PLE interference has significant effect on them. But still it has slight edge over non PV. Hi Raghu, sorry for my slow response. I'm on vacation right now (until the 9th of July) and I have limited access to mail. Also, thanks for continuing the benchmarking. Question, when you compare PLE vs. non-PLE, are you using different machines (one with and one without), or are you disabling its use by loading the kvm module with the ple_gap=0 modparam as I did? Drew
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/28/2012 09:30 PM, Andrew Jones wrote: - Original Message - In summary, current PV has huge benefit on non-PLE machine. On PLE machine, the results become very sensitive to load, type of workload and SPIN_THRESHOLD. Also PLE interference has significant effect on them. But still it has slight edge over non PV. Hi Raghu, sorry for my slow response. I'm on vacation right now (until the 9th of July) and I have limited access to mail. Ok. Happy Vacation :) Also, thanks for continuing the benchmarking. Question, when you compare PLE vs. non-PLE, are you using different machines (one with and one without), or are you disabling its use by loading the kvm module with the ple_gap=0 modparam as I did? Yes, I am doing the same when I say with PLE disabled and comparing the benchmarks (i.e loading kvm module with ple_gap=0). But older non-PLE results were on a different machine altogether. (I had limited access to PLE machine).
RE: [PATCH] kvm: handle last_boosted_vcpu = 0 case
Hello, I am just catching up on this email thread... Perhaps one of you may be able to help answer this query.. preferably along with some data. [BTW, I do understand the basic intent behind PLE in a typical [sweet spot] use case where there is over subscription etc. and the need to optimize the PLE handler in the host etc. ] In a use case where the host has fewer but much larger guests (say 40 VCPUs and higher) and there is no over subscription (i.e. # of vcpus across guests <= physical cpus in the host and perhaps each guest has their vcpus pinned to specific physical cpus for other reasons), I would like to understand if/how the PLE really helps ? For these use cases would it be ok to turn PLE off (ple_gap=0) since there is no real need to take an exit and find some other VCPU to yield to ? Thanks Vinod -----Original Message----- From: Raghavendra K T [mailto:raghavendra...@linux.vnet.ibm.com] Sent: Thursday, June 28, 2012 9:22 AM To: Andrew Jones Cc: Rik van Riel; Marcelo Tosatti; Srikar; Srivatsa Vaddagiri; Peter Zijlstra; Nikunj A. Dadhania; KVM; LKML; Gleb Natapov; Vinod, Chegu; Jeremy Fitzhardinge; Avi Kivity; Ingo Molnar Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case On 06/28/2012 09:30 PM, Andrew Jones wrote: - Original Message - In summary, current PV has huge benefit on non-PLE machine. On PLE machine, the results become very sensitive to load, type of workload and SPIN_THRESHOLD. Also PLE interference has significant effect on them. But still it has slight edge over non PV. Hi Raghu, sorry for my slow response. I'm on vacation right now (until the 9th of July) and I have limited access to mail. Ok. Happy Vacation :) Also, thanks for continuing the benchmarking. Question, when you compare PLE vs. non-PLE, are you using different machines (one with and one without), or are you disabling its use by loading the kvm module with the ple_gap=0 modparam as I did? 
Yes, I am doing the same when I say with PLE disabled and comparing the benchmarks (i.e loading kvm module with ple_gap=0). But older non-PLE results were on a different machine altogether. (I had limited access to PLE machine).
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/24/2012 12:04 AM, Raghavendra K T wrote: On 06/23/2012 02:30 AM, Raghavendra K T wrote: On 06/22/2012 08:41 PM, Andrew Jones wrote: [...] My run for other benchmarks did not have Rik's patches, so re-spinning everything with that now. Here is the detailed info on env and benchmark I am currently trying. Let me know if you have any comments.

===
kernel 3.5.0-rc1 with Rik's PLE handler fix as base
Machine: Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 numa node, 256GB RAM, 32 core machine
Host: enterprise linux, gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC), with test kernels
Guest: fedora 16 with different built-in kernels from the same source tree. 32 vcpus, 8GB memory. (configs not changed with patches except for CONFIG_PARAVIRT_SPINLOCK)
Note: for PV patches, SPIN_THRESHOLD is set to 4k

Benchmarks:
1) kernbench: kernbench-0.50
   cmd: echo 3 > /proc/sys/vm/drop_caches; ccache -C; kernbench -f -H -M -o 2*vcpu
   Very first run in kernbench is omitted.
2) dbench: dbench version 4.00
   cmd: dbench --warmup=30 -t 120 2*vcpu
3) hackbench: https://build.opensuse.org/package/files?package=hackbench&project=benchmark
   hackbench.c modified with loops=1
   used hackbench with num-threads = 2*vcpu
4) Specjbb: specjbb2000-1.02
   Input Properties: ramp_up_seconds = 30, measurement_seconds = 120, forcegc = true, starting_number_warehouses = 1, increment_number_warehouses = 1, ending_number_warehouses = 8
5) sysbench: 0.4.12
   sysbench --test=oltp --db-driver=pgsql prepare
   sysbench --num-threads=2*vcpu --max-requests=10 --test=oltp --oltp-table-size=50 --db-driver=pgsql --oltp-read-only run
   Note that the db driver for this is pgsql.
6) ebizzy: release 0.3
   cmd: ebizzy -S 120

- specjbb ran for 1x and 2x, others mostly for 1x, 2x, 3x overcommit.
- overcommit of 2x means same benchmark running on 2 guests. 
- sample for each overcommit is mostly 8

Note: I ran kernbench with old kernbench-0.50; maybe I can try kcbench with ramfs if necessary. Will soon come with detailed results.

With the above env, here is the result I have for 4k SPIN_THRESHOLD.

Lower is better for the following benchmarks:
kernbench: (time in sec)
hackbench: (time in sec)
sysbench : (time in sec)

Higher is better for the following benchmarks:
specjbb: score (Throughput)
dbench : Throughput in MB/sec
ebizzy : records/sec

In summary, current PV has huge benefit on non-PLE machine. On PLE machine, the results become very sensitive to load, type of workload and SPIN_THRESHOLD. Also PLE interference has significant effect on them. But still it has slight edge over non PV. Overall, specjbb, sysbench, kernbench seem to do well with PV. dbench has been a little unreliable (same reason I have not published 2x, 3x results, but experimental values are included in the tarball) but seems to be on par with PV. hackbench non-overcommit case is better and ebizzy overcommit case is better. [ebizzy seems to be very sensitive w.r.t. SPIN_THRESHOLD]. I have still not experimented with SPIN_THRESHOLD of 2k/8k and w/, w/o PLE after having Rik's fix. 
specjbb
+-------------+-------------+-------------+-------------+----------+
|    value    |    stdev    |    value    |    stdev    | %improve |
+-------------+-------------+-------------+-------------+----------+
| 114232.2500 | 21774.0660  | 122591.     | 18239.0900  |  7.31733 |
| 112154.5000 | 19696.6860  | 113386.2500 | 22262.5890  |  1.09826 |
+-------------+-------------+-------------+-------------+----------+

kernbench
+-------------+-------------+-------------+-------------+----------+
|    value    |    stdev    |    value    |    stdev    | %improve |
+-------------+-------------+-------------+-------------+----------+
|     48.9150 |      0.8608 |     48.5550 |      0.7372 |  0.74143 |
|     96.3691 |      7.9724 |     96.6367 |      1.6938 | -0.27691 |
|    192.6972 |      9.1881 |    188.3195 |      8.1267 |  2.32461 |
|    320.6500 |     29.6892 |    302.1225 |     16.0515 |  6.13245 |
+-------------+-------------+-------------+-------------+----------+

sysbench
+-------------+-------------+-------------+-------------+----------+
|    value    |    stdev    |    value    |    stdev    | %improve |
+-------------+-------------+-------------+-------------+----------+
|     12.4082 |      0.2370 |     12.2797 |      0.1037 |  1.04644 |
|     14.1705 |      0.4272 |     14.0300 |      1.1478 |  1.00143 |
|     19.3769 |      1.0833 |     18.9745 |      0.0560 |  2.12074 |
|     24.5373 |      1.3237 |     22.3078 |      0.8999 |  9.99426 |
+-------------+-------------+-------------+-------------+----------+

hackbench
+-------------+-------------+-------------+-------------+----------+
|    value    |    stdev    |    value    |    stdev    | %improve |
+-------------+-------------+-------------+-------------+----------+
|     73.2627 |     11.2413 |     67.5125 |
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case with benchmark detail attachment
On 06/28/2012 01:57 AM, Raghavendra K T wrote: On 06/24/2012 12:04 AM, Raghavendra K T wrote: On 06/23/2012 02:30 AM, Raghavendra K T wrote: On 06/22/2012 08:41 PM, Andrew Jones wrote: [...] (benchmark values will be attached in reply to this mail) [Attachment: pv_benchmark_summary.bz2 (application/bzip)]
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/21/2012 12:13 PM, Gleb Natapov wrote: On Tue, Jun 19, 2012 at 04:51:04PM -0400, Rik van Riel wrote: On Wed, 20 Jun 2012 01:50:50 +0530 Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote: In ple handler code, the last_boosted_vcpu (lbv) variable is serving as the reference point to start from when we enter. Also statistical analysis (below) is showing lbv is not very well distributed with the current approach. You are the second person to spot this bug today (yes, today). Due to time zones, the first person has not had a chance yet to test the patch below, which might fix the issue... Please let me know how it goes. ---8<--- If last_boosted_vcpu == 0, then we fall through all test cases and may end up with all VCPUs pouncing on vcpu 0. With a large enough guest, this can result in enormous runqueue lock contention, which can prevent vcpu0 from running, leading to a livelock. Changing < to <= makes sure we properly handle that case. Signed-off-by: Rik van Riel <r...@redhat.com>
---
 virt/kvm/kvm_main.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
 			} else if (pass && i > last_boosted_vcpu)

Looks correct. We can simplify this by introducing something like:

#define kvm_for_each_vcpu_from(idx, n, vcpup, kvm) \
	for (n = atomic_read(&kvm->online_vcpus); \
	     n && (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
	     n--, idx = (idx+1) % atomic_read(&kvm->online_vcpus))

Gleb, Rik, any updates on this or Rik's patch status? I can come up with the above suggested cleanup patch with Gleb's From/Signed-off-by. Please let me know. 
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/23/2012 02:30 AM, Raghavendra K T wrote: On 06/22/2012 08:41 PM, Andrew Jones wrote: On Thu, Jun 21, 2012 at 04:56:08PM +0530, Raghavendra K T wrote: Here are the results from kernbench. PS: I think we have to only take that both the patches perform better, rather than reading into actual numbers, since I am seeing more variance especially in 3x. Maybe I can test with some more stable benchmark if somebody points [...] can we agree like, for kernbench 1x = -j (2*#vcpu) in 1 vm. 1.5x = -j (2*#vcpu) in 1 vm and -j (#vcpu) in other.. and so on. also a SPIN_THRESHOLD of 4k? Please forget about 1.5x above. I am not too sure on that. Any ideas on benchmarks are welcome from all. My run for other benchmarks did not have Rik's patches, so re-spinning everything with that now. Here is the detailed info on env and benchmark I am currently trying. Let me know if you have any comments.

===
kernel 3.5.0-rc1 with Rik's PLE handler fix as base
Machine: Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 numa node, 256GB RAM, 32 core machine
Host: enterprise linux, gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC), with test kernels
Guest: fedora 16 with different built-in kernels from the same source tree. 32 vcpus, 8GB memory. (configs not changed with patches except for CONFIG_PARAVIRT_SPINLOCK)
Note: for PV patches, SPIN_THRESHOLD is set to 4k

Benchmarks:
1) kernbench: kernbench-0.50
   cmd: echo 3 > /proc/sys/vm/drop_caches; ccache -C; kernbench -f -H -M -o 2*vcpu
   Very first run in kernbench is omitted. 
2) dbench: dbench version 4.00
   cmd: dbench --warmup=30 -t 120 2*vcpu
3) hackbench: https://build.opensuse.org/package/files?package=hackbench&project=benchmark
   hackbench.c modified with loops=1
   used hackbench with num-threads = 2*vcpu
4) Specjbb: specjbb2000-1.02
   Input Properties: ramp_up_seconds = 30, measurement_seconds = 120, forcegc = true, starting_number_warehouses = 1, increment_number_warehouses = 1, ending_number_warehouses = 8
5) sysbench: 0.4.12
   sysbench --test=oltp --db-driver=pgsql prepare
   sysbench --num-threads=2*vcpu --max-requests=10 --test=oltp --oltp-table-size=50 --db-driver=pgsql --oltp-read-only run
   Note that the db driver for this is pgsql.
6) ebizzy: release 0.3
   cmd: ebizzy -S 120

- specjbb ran for 1x and 2x, others mostly for 1x, 2x, 3x overcommit.
- overcommit of 2x means same benchmark running on 2 guests.
- sample for each overcommit is mostly 8

Note: I ran kernbench with old kernbench-0.50; maybe I can try kcbench with ramfs if necessary. Will soon come with detailed results.

- Raghu
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On Thu, Jun 21, 2012 at 04:56:08PM +0530, Raghavendra K T wrote: Here are the results from kernbench. PS: I think we have to only take that both the patches perform better, rather than reading into actual numbers, since I am seeing more variance especially in 3x. Maybe I can test with some more stable benchmark if somebody points Hi Raghu, I wonder if we should back up and try to determine the best benchmark/test environment first. I think kernbench is good, but I wonder about how to simulate the overcommit, and to what degree (1x, 3x, ??). What are you currently running to simulate overcommit now? Originally we were running kernbench in one VM and cpu hogs (bash infinite loops) in other VMs. Then we added vcpus and infinite loops to get up to the desired overcommit. I saw later that you've experimented with running kernbench in the other VMs as well, rather than cpu hogs. Is that still the case? I started playing with benchmarking these proposals myself, but so far have stuck to the cpu hog, since I wanted to keep variability limited. However, when targeting a reasonable host loadavg with a bunch of cpu hog vcpus, it limits the overcommit too much. I certainly haven't tried 3x this way. So I'm inclined to throw out the cpu hog approach as well. The question is, what to replace it with? It appears that the performance of the PLE and pvticketlock proposals is quite dependent on the level of overcommit, so we should choose a target overcommit level and also a constraint on the host loadavg first, then determine how to setup a test environment that fits it and yields results with low variance. Here are results from my 1.125x overcommit test environment using cpu hogs. 
kcbench (a.k.a kernbench) results; 'mean-time (stddev)'

base-noPLE:           235.730 (25.932)
base-PLE:             238.820 (11.199)
rand_start-PLE:       283.193 (23.262)
pvticketlocks-noPLE:  244.987 (7.562)
pvticketlocks-PLE:    247.597 (17.200)

base kernel:          3.5.0-rc3 + Rik's new last_boosted patch
rand_start kernel:    3.5.0-rc3 + Raghu's proposed random start patch
pvticketlocks kernel: 3.5.0-rc3 + Rik's new last_boosted patch + Raghu's pvticketlock series

The relative standard deviations are as high as 11%. So I'm not real pleased with the results, and they show degradation everywhere. Below are the details of the benchmarking. Everything is there except the kernel config, but our benchmarking should be reproducible with nearly random configs anyway.

Drew

= host =
- Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
- 64 cpus, 4 nodes, 64G mem
- Fedora 17 with test kernels (see tests)

= benchmark =
- one cpu hog F17 VM
  - 64 vcpus, 8G mem
  - all vcpus run a bash infinite loop
  - kernel: 3.5.0-rc3
- one kcbench (a.k.a kernbench) F17 VM
  - 8 vcpus, 8G mem
  - 'kcbench -d /mnt/ram', /mnt/ram is 1G ramfs
  - kcbench-0.3-8.1.noarch, kcbench-data-2.6.38-0.1-9.fc17.noarch, kcbench-data-0.1-9.fc17.noarch
  - gcc (GCC) 4.7.0 20120507 (Red Hat 4.7.0-5)
  - kernel: same test kernel as host

= test 1: base, PLE disabled (ple_gap=0) =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch
Run 1 (-j 16): 4211 (e:237.43 P:637% U:697.98 S:815.46 F:0)
Run 2 (-j 16): 3834 (e:260.77 P:631% U:729.69 S:917.56 F:0)
Run 3 (-j 16): 4784 (e:208.99 P:644% U:638.17 S:708.63 F:0)
mean: 235.730 stddev: 25.932

= test 2: base, PLE enabled =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch
Run 1 (-j 16): 4335 (e:230.67 P:639% U:657.74 S:818.28 F:0)
Run 2 (-j 16): 4269 (e:234.20 P:647% U:743.43 S:772.52 F:0)
Run 3 (-j 16): 3974 (e:251.59 P:639% U:724.29 S:884.21 F:0)
mean: 238.820 stddev: 11.199

= test 3: rand_start, PLE enabled =
- kernel: 3.5.0-rc3 + Raghu's random start patch
Run 1 (-j 16): 3898 (e:256.52 P:639% U:756.14 S:884.63 F:0)
Run 2 (-j 16): 3341 
(e:299.27 P:633% U:857.49 S:1039.62 F:0)
Run 3 (-j 16): 3403 (e:293.79 P:635% U:857.21 S:1008.83 F:0)
mean: 283.193 stddev: 23.262

= test 4: pvticketlocks, PLE disabled (ple_gap=0) =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series + PARAVIRT_SPINLOCKS=y config change
Run 1 (-j 16): 3963 (e:252.29 P:647% U:736.43 S:897.16 F:0)
Run 2 (-j 16): 4216 (e:237.19 P:650% U:706.68 S:837.42 F:0)
Run 3 (-j 16): 4073 (e:245.48 P:649% U:709.46 S:884.68 F:0)
mean: 244.987 stddev: 7.562

= test 5: pvticketlocks, PLE enabled =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series + PARAVIRT_SPINLOCKS=y config change
Run 1 (-j 16): 3978 (e:251.32 P:629% U:758.86 S:824.29 F:0)
Run 2 (-j 16): 4369 (e:228.84 P:634% U:708.32 S:743.71 F:0)
Run 3 (-j 16): 3807 (e:262.63 P:626% U:767.03 S:877.96 F:0)
mean: 247.597 stddev: 17.200
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/22/2012 08:41 PM, Andrew Jones wrote: On Thu, Jun 21, 2012 at 04:56:08PM +0530, Raghavendra K T wrote: Here are the results from kernbench. PS: I think we have to only take that both the patches perform better, rather than reading into actual numbers, since I am seeing more variance especially in 3x. Maybe I can test with some more stable benchmark if somebody points Hi Raghu, First of all, thank you for your test and for raising valid points. It also made the avenue for discussion of all the different experiments done over a month (apart from tuning/benchmarking), which may bring more feedback and precious ideas from the community to optimize the performance further. I shall discuss in reply to this mail separately. I wonder if we should back up and try to determine the best benchmark/test environment first. I agree, we have to be able to produce similar results independently. So far sysbench (even pgbench) has been consistent; currently trying whether other benchmarks like hackbench (modified #loops) and ebizzy/dbench have low variance. [ but they too are dependent on #client/threads etc ] I think kernbench is good, but Yes, kernbench at least helped me to tune SPIN_THRESHOLD to a good extent. But Jeremy also had pointed out that kernbench is a little inconsistent. I wonder about how to simulate the overcommit, and to what degree (1x, 3x, ??). What are you currently running to simulate overcommit now? Originally we were running kernbench in one VM and cpu hogs (bash infinite loops) in other VMs. Then we added vcpus and infinite loops to get up to the desired overcommit. I saw later that you've experimented with running kernbench in the other VMs as well, rather than cpu hogs. Is that still the case? Yes, I am now running the same benchmark on all the guests. On non-PLE, while(1) cpu hogs played a good role of simulating LHP, but on a PLE machine it did not seem to be the case. 
I started playing with benchmarking these proposals myself, but so far have stuck to the cpu hog, since I wanted to keep variability limited. However, when targeting a reasonable host loadavg with a bunch of cpu hog vcpus, it limits the overcommit too much. I certainly haven't tried 3x this way. So I'm inclined to throw out the cpu hog approach as well. The question is, what to replace it with? It appears that the performance of the PLE and pvticketlock proposals is quite dependent on the level of overcommit, so we should choose a target overcommit level and also a constraint on the host loadavg first, then determine how to setup a test environment that fits it and yields results with low variance. Here are results from my 1.125x overcommit test environment using cpu hogs. At first, the result seemed backward, but after seeing individual runs and variations, it seems that, except for rand_start, all the results should converge to zero difference. So if we run the same again we may get a completely different result. IMO, on a 64 vcpu guest, running -j16 may not represent 1x load, so I believe it has resulted in more of an under-commit/nearly 1x commit result. Maybe we should try at least #threads = #vcpu or 2*#vcpu. kcbench (a.k.a kernbench) results; 'mean-time (stddev)' base-noPLE: 235.730 (25.932) base-PLE: 238.820 (11.199) rand_start-PLE: 283.193 (23.262) The problem currently, as we know, is that in the PLE handler we may end up choosing the same VCPU that was in the spinloop, which would unfortunately result in more cpu burning. And with randomizing start_vcpu, we are making that probability higher. We need logic to not choose a vcpu that has recently PL-exited, since it cannot be a lock-holder; and the next eligible lock-holder can be picked up easily with the PV patches. 
pvticketlocks-noPLE: 244.987 (7.562) pvticketlocks-PLE: 247.597 (17.200) base kernel: 3.5.0-rc3 + Rik's new last_boosted patch rand_start kernel: 3.5.0-rc3 + Raghu's proposed random start patch pvticketlocks kernel: 3.5.0-rc3 + Rik's new last_boosted patch + Raghu's pvticketlock series Ok, I believe SPIN_THRESHOLD was 2k, right? What I had observed is with a 2k THRESHOLD we see halt exit overheads; currently I am trying with mostly 4k. The relative standard deviations are as high as 11%. So I'm not real pleased with the results, and they show degradation everywhere. Below are the details of the benchmarking. Everything is there except the kernel config, but our benchmarking should be reproducible with nearly random configs anyway. Drew = host = - Intel(R) Xeon(R) CPU X7560 @ 2.27GHz - 64 cpus, 4 nodes, 64G mem - Fedora 17 with test kernels (see tests) = benchmark = - one cpu hog F17 VM - 64 vcpus, 8G mem - all vcpus run a bash infinite loop - kernel: 3.5.0-rc3 - one kcbench (a.k.a kernbench) F17 VM - 8 vcpus, 8G mem - 'kcbench -d /mnt/ram', /mnt/ram is 1G ramfs maybe we have to check whether 1GB RAM is ok when we have
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On Tue, Jun 19, 2012 at 04:51:04PM -0400, Rik van Riel wrote: On Wed, 20 Jun 2012 01:50:50 +0530 Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote: In ple handler code, the last_boosted_vcpu (lbv) variable is serving as the reference point to start from when we enter. Also statistical analysis (below) is showing lbv is not very well distributed with the current approach. You are the second person to spot this bug today (yes, today). Due to time zones, the first person has not had a chance yet to test the patch below, which might fix the issue... Please let me know how it goes. ---8<--- If last_boosted_vcpu == 0, then we fall through all test cases and may end up with all VCPUs pouncing on vcpu 0. With a large enough guest, this can result in enormous runqueue lock contention, which can prevent vcpu0 from running, leading to a livelock. Changing < to <= makes sure we properly handle that case. Signed-off-by: Rik van Riel <r...@redhat.com>
---
 virt/kvm/kvm_main.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
 			} else if (pass && i > last_boosted_vcpu)

Looks correct. We can simplify this by introducing something like:

#define kvm_for_each_vcpu_from(idx, n, vcpup, kvm) \
	for (n = atomic_read(&kvm->online_vcpus); \
	     n && (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
	     n--, idx = (idx+1) % atomic_read(&kvm->online_vcpus))

-- Gleb.
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/21/2012 12:13 PM, Gleb Natapov wrote:

On Tue, Jun 19, 2012 at 04:51:04PM -0400, Rik van Riel wrote:

On Wed, 20 Jun 2012 01:50:50 +0530 Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:

In ple handler code, last_boosted_vcpu (lbv) variable is serving as reference point to start when we enter. Also statistical analysis (below) is showing lbv is not very well distributed with current approach.

You are the second person to spot this bug today (yes, today). Due to time zones, the first person has not had a chance yet to test the patch below, which might fix the issue... Please let me know how it goes.

-- 8< --

If last_boosted_vcpu == 0, then we fall through all test cases and may end up with all VCPUs pouncing on vcpu 0. With a large enough guest, this can result in enormous runqueue lock contention, which can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.

Signed-off-by: Rik van Riel r...@redhat.com
---
 virt/kvm/kvm_main.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
 			} else if (pass && i > last_boosted_vcpu)

Looks correct. We can simplify this by introducing something like:

#define kvm_for_each_vcpu_from(idx, n, vcpup, kvm) \
	for (n = atomic_read(&kvm->online_vcpus); \
	     n && (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
	     n--, idx = (idx + 1) % atomic_read(&kvm->online_vcpus))

Thumbs up for this simplification. This really helps in all the places where we want to start iterating from the middle.
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/21/2012 01:42 AM, Raghavendra K T wrote:

On 06/20/2012 02:21 AM, Rik van Riel wrote:

On Wed, 20 Jun 2012 01:50:50 +0530 Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: [...]

Please let me know how it goes.

Yes, have got the result today, too tired to summarize. Got a better performance result too. Will come back again tomorrow morning. Have to post the randomized start point patch also, which I discussed, to get opinions.

Here are the results from kernbench.

PS: I think we should only take away that both patches perform better, rather than reading into the actual numbers, since I am seeing more variance, especially in the 3x case. Maybe I can test with a more stable benchmark if somebody points one out.

+-----------+------------+-----------+--------------+-----------+
| base      | Rik patch  | % improve | Random patch | % improve |
+-----------+------------+-----------+--------------+-----------+
| 49.98     | 49.935     | 0.0901172 | 49.924286    | 0.111597  |
| 106.0051  | 89.25806   | 18.7625   | 88.122217    | 20.2933   |
| 189.82067 | 175.58783  | 8.10582   | 166.99989    | 13.6651   |
+-----------+------------+-----------+--------------+-----------+

I have also posted the result of the randomized starting point patch. I agree that Rik's fix should ideally go into git ASAP, and when the above patches go into git, feel free to add:

Tested-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com

But I still see some questions unanswered.

1) Why can't we move the setting of last_boosted_vcpu up? It gives more randomness. (As I said earlier, it gave degradation in the 1x case because of violent yields, but a performance benefit in the 3x case. Degradation because most of them yield back to the same spinning guy, increasing busy-wait; but it gives huge benefit with ple_window set to higher values such as 32k/64k. That is a different issue altogether.)

2) Having the update of last_boosted_vcpu after yield_to does not seem to be entirely correct, and having a common variable as the starting point may not be that good either. Also RR is a little slower. Suppose we have a 64 vcpu guest, and 4 vcpus enter the ple_handler: all of them jumping on the same guy to yield may not be good.
Rather, I personally feel each of them starting at a different point would be a good idea. But this alone will not help; we need more filtering of eligible VCPUs. For example, in the first pass don't choose a VCPU that has recently done a PL exit. (Thanks Vatsa for brainstorming this.) Maybe Peter/Avi/Rik/Vatsa can give more ideas in this area (I mean, how can we identify that a vcpu had done a PL exit, or exited from spinlock context, etc.). Another idea may be something like identifying the next eligible lock-holder (which is already possible with the PV patches), and doing a yield-to to him.

Here is the stat from the randomized starting point patch. We can see that the patch has amazing fairness w.r.t. the starting point. IMO, this would be great only after we add more eligibility criteria to the target vcpus (of yield_to).

Randomizing start index
=======================

snapshot1
PLE handler yield stat :
218416 176802 164554 141184 148495 154709 159871 145157
135476 158025 139997 247638 152498 18 122774 248228
158469 121825 138542 113351 164988 120432 136391 129855
172764 214015 158710 133049 83485 112134 81651 190878

PLE handler start stat :
547772 547725 547545 547931 547836 548656 548272 547849
548879 549012 547285 548185 548700 547132 548310 547286
547236 547307 548328 548059 547842 549152 547870 548340
548170 546996 546678 547842 547716 548096 547918 547546

snapshot2
PLE handler yield stat :
310690 222992 275829 156876 187354 185373 187584 155534
151578 205994 223731 320894 194995 167011 153415 286910
181290 143653 173988 181413 194505 170330 194455 181617
251108 226577 192070 143843 137878 166393 131405 250657

PLE handler start stat :
781335 782388 781837 782942 782025 781357 781950 781695
783183 783312 782004 782804 783766 780825 783232 781013
781587 781228 781642 781595 781665 783530 781546 781950
782268 781443 781327 781666 781907 781593 782105 781073

Sorry for attaching the patch inline; I am using a dumb client. Will post it separately if needed.

-- 8< --

Currently the PLE handler uses a per-VM variable as the starting point.
Get rid of the variable and use a randomized starting point. Thanks Vatsa for scheduler related clarifications.

Suggested-by: Srikar sri...@linux.vnet.ibm.com
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c446435..9799cab 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -275,7 +275,6 @@ struct kvm {
 #endif
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	atomic_t online_vcpus;
-	int last_boosted_vcpu;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/20/2012 02:21 AM, Rik van Riel wrote:

On Wed, 20 Jun 2012 01:50:50 +0530 Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:

In ple handler code, last_boosted_vcpu (lbv) variable is serving as reference point to start when we enter. Also statistical analysis (below) is showing lbv is not very well distributed with current approach.

You are the second person to spot this bug today (yes, today).

Oh! Really interesting.

Due to time zones, the first person has not had a chance yet to test the patch below, which might fix the issue...

Maybe his timezone also falls near mine. I am also pretty late now. :)

Please let me know how it goes.

Yes, have got the result today, too tired to summarize. Got a better performance result too. Will come back again tomorrow morning. Have to post the randomized start point patch also, which I discussed, to get opinions.

-- 8< --

If last_boosted_vcpu == 0, then we fall through all test cases and may end up with all VCPUs pouncing on vcpu 0. With a large enough guest, this can result in enormous runqueue lock contention, which can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.

Analysis shows the distribution is more flattened now than before.
Here are the snapshots:

snapshot1
PLE handler yield stat :
66447 13 75510 65875 121298 92543 111267 79523
118134 105366 116441 114195 107493 6 86779 87733
84415 105778 94210 73197 55626 93036 112959 92035
95742 78558 72190 101719 94667 108593 63832 81580

PLE handler start stat :
334301 687807 384077 344917 504917 343988 439810 371389
466908 415509 394304 484276 376510 292821 370478 363727
366989 423441 392949 309706 292115 437900 413763 346135
364181 323031 348405 399593 336714 373995 302301 347383

snapshot2
PLE handler yield stat :
320547 267528 264316 164213 249246 182014 246468 225386
277179 310659 349767 310281 238680 187645 225791 266290
216202 316974 231077 216586 151679 356863 266031 213047
306229 182629 229334 241204 275975 265086 282218 242207

PLE handler start stat :
1335370 1378184 1252001 925414 1196973 951298 1219835 1108788
1265427 1290362 1308553 1271066 1107575 980036 1077210 1278611
1110779 1365130 1151200 1049859 937159 1577830 1209099 993391
1173766 987307 1144775 1102960 1100082 1177134 1207862 1119551

Signed-off-by: Rik van Riel r...@redhat.com
---
 virt/kvm/kvm_main.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && i <= last_boosted_vcpu) {

Hmmm, true. Great catch. It was partial towards zero earlier.

 				i = last_boosted_vcpu;
 				continue;
 			} else if (pass && i > last_boosted_vcpu)
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On 06/20/2012 04:12 PM, Raghavendra K T wrote:

On 06/20/2012 02:21 AM, Rik van Riel wrote:

Please let me know how it goes.

Yes, have got result today, too tired to summarize. got better performance result too. will come back again tomorrow morning. have to post, randomized start point patch also, which I discussed to know the opinion.

The other person's problem has also gone away with this patch. Avi, could I convince you to apply this obvious bugfix to kvm.git? :)

-- 8< --

If last_boosted_vcpu == 0, then we fall through all test cases and may end up with all VCPUs pouncing on vcpu 0. With a large enough guest, this can result in enormous runqueue lock contention, which can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.

Signed-off-by: Rik van Riel r...@redhat.com
---
 virt/kvm/kvm_main.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
 			} else if (pass && i > last_boosted_vcpu)

--
All rights reversed
[PATCH] kvm: handle last_boosted_vcpu = 0 case
On Wed, 20 Jun 2012 01:50:50 +0530 Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote:

In ple handler code, last_boosted_vcpu (lbv) variable is serving as reference point to start when we enter. Also statistical analysis (below) is showing lbv is not very well distributed with current approach.

You are the second person to spot this bug today (yes, today). Due to time zones, the first person has not had a chance yet to test the patch below, which might fix the issue... Please let me know how it goes.

-- 8< --

If last_boosted_vcpu == 0, then we fall through all test cases and may end up with all VCPUs pouncing on vcpu 0. With a large enough guest, this can result in enormous runqueue lock contention, which can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.

Signed-off-by: Rik van Riel r...@redhat.com
---
 virt/kvm/kvm_main.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
 			} else if (pass && i > last_boosted_vcpu)