Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
On Thu, 10 Aug 2017 09:18:09 -0400, Eric Farman wrote:

> On 08/08/2017 04:14 AM, Longpeng (Mike) wrote:
> >
> > On 2017/8/8 15:41, Cornelia Huck wrote:
> >
> >> On Tue, 8 Aug 2017 12:05:31 +0800 "Longpeng(Mike)" wrote:
> >>
> >>> This is a simple optimization for kvm_vcpu_on_spin; the
> >>> main idea is described in patch-1's commit msg.
> >>
> >> I think this generally looks good now.
> >>
> >>> I did some tests based on the RFC version; the results show
> >>> that it can improve the performance slightly.
> >>
> >> Did you re-run tests on this version?
> >
> > Hi Cornelia,
> >
> > I didn't re-run tests on V2. But the major difference between RFC and V2
> > is that V2 only caches the result for X86 (s390/arm don't need it), and V2
> > saves an expensive operation (440-1400 cycles on my test machine) for X86/VMX.
> >
> > So I think V2's performance is at least the same as the RFC's, or even
> > slightly better. :)
> >
> >> I would also like to see some s390 numbers; unfortunately I only have a
> >> z/VM environment and any performance numbers would be nearly useless
> >> there. Maybe somebody within IBM with a better setup can run a quick
> >> test?
>
> Won't swear I didn't screw something up, but here's some quick numbers.
> Host was 4.12.0 with and without this series, running QEMU 2.10.0-rc0.
> Created 4 guests, each with 4 CPUs (unpinned) and 4GB RAM. VM1 did full
> kernel compiles with kernbench, which took averages of 5 runs of
> different job sizes (I threw away the "-j 1" numbers). VM2-VM4 ran cpu
> burners on 2 of their 4 cpus.
>
> Numbers from VM1 kernbench output, and the delta between runs:
>
> load -j 3           before    after     delta
> Elapsed Time        183.178   182.58    -0.598
> User Time           534.19    531.52    -2.67
> System Time         32.538    33.37     0.832
> Percent CPU         308.8     309       0.2
> Context Switches    98484.6   99001     516.4
> Sleeps              227347    228752    1405
>
> load -j 16          before    after     delta
> Elapsed Time        153.352   147.59    -5.762
> User Time           545.829   533.41    -12.419
> System Time         34.289    34.85     0.561
> Percent CPU         347.6     348       0.4
> Context Switches    160518    159120    -1398
> Sleeps              240740    240536    -204

Thanks a lot, Eric! The decreases in elapsed time look nice, and we probably
should not care about the increases reported.
Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
On 2017/8/10 21:18, Eric Farman wrote:

> On 08/08/2017 04:14 AM, Longpeng (Mike) wrote:
>>
>> On 2017/8/8 15:41, Cornelia Huck wrote:
>>
>>> On Tue, 8 Aug 2017 12:05:31 +0800 "Longpeng(Mike)" wrote:
>>>
>>>> This is a simple optimization for kvm_vcpu_on_spin; the
>>>> main idea is described in patch-1's commit msg.
>>>
>>> I think this generally looks good now.
>>>
>>>> I did some tests based on the RFC version; the results show
>>>> that it can improve the performance slightly.
>>>
>>> Did you re-run tests on this version?
>>
>> Hi Cornelia,
>>
>> I didn't re-run tests on V2. But the major difference between RFC and V2
>> is that V2 only caches the result for X86 (s390/arm don't need it), and V2
>> saves an expensive operation (440-1400 cycles on my test machine) for X86/VMX.
>>
>> So I think V2's performance is at least the same as the RFC's, or even
>> slightly better. :)
>>
>>> I would also like to see some s390 numbers; unfortunately I only have a
>>> z/VM environment and any performance numbers would be nearly useless
>>> there. Maybe somebody within IBM with a better setup can run a quick
>>> test?
>
> Won't swear I didn't screw something up, but here's some quick numbers.
> Host was 4.12.0 with and without this series, running QEMU 2.10.0-rc0.
> Created 4 guests, each with 4 CPUs (unpinned) and 4GB RAM. VM1 did full
> kernel compiles with kernbench, which took averages of 5 runs of
> different job sizes (I threw away the "-j 1" numbers). VM2-VM4 ran cpu
> burners on 2 of their 4 cpus.
>
> Numbers from VM1 kernbench output, and the delta between runs:
>
> load -j 3           before    after     delta
> Elapsed Time        183.178   182.58    -0.598
> User Time           534.19    531.52    -2.67
> System Time         32.538    33.37     0.832
> Percent CPU         308.8     309       0.2
> Context Switches    98484.6   99001     516.4
> Sleeps              227347    228752    1405
>
> load -j 16          before    after     delta
> Elapsed Time        153.352   147.59    -5.762
> User Time           545.829   533.41    -12.419
> System Time         34.289    34.85     0.561
> Percent CPU         347.6     348       0.4
> Context Switches    160518    159120    -1398
> Sleeps              240740    240536    -204

Thanks Eric! The `Elapsed Time` is smaller with this series; the result matches
the numbers in my cover letter.

> - Eric

--
Regards,
Longpeng(Mike)
Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
On 08/08/2017 04:14 AM, Longpeng (Mike) wrote:

> On 2017/8/8 15:41, Cornelia Huck wrote:
>
>> On Tue, 8 Aug 2017 12:05:31 +0800 "Longpeng(Mike)" wrote:
>>
>>> This is a simple optimization for kvm_vcpu_on_spin; the
>>> main idea is described in patch-1's commit msg.
>>
>> I think this generally looks good now.
>>
>>> I did some tests based on the RFC version; the results show
>>> that it can improve the performance slightly.
>>
>> Did you re-run tests on this version?
>
> Hi Cornelia,
>
> I didn't re-run tests on V2. But the major difference between RFC and V2
> is that V2 only caches the result for X86 (s390/arm don't need it), and V2
> saves an expensive operation (440-1400 cycles on my test machine) for X86/VMX.
>
> So I think V2's performance is at least the same as the RFC's, or even
> slightly better. :)
>
>> I would also like to see some s390 numbers; unfortunately I only have a
>> z/VM environment and any performance numbers would be nearly useless
>> there. Maybe somebody within IBM with a better setup can run a quick
>> test?

Won't swear I didn't screw something up, but here's some quick numbers.
Host was 4.12.0 with and without this series, running QEMU 2.10.0-rc0.
Created 4 guests, each with 4 CPUs (unpinned) and 4GB RAM. VM1 did full
kernel compiles with kernbench, which took averages of 5 runs of
different job sizes (I threw away the "-j 1" numbers). VM2-VM4 ran cpu
burners on 2 of their 4 cpus.

Numbers from VM1 kernbench output, and the delta between runs:

load -j 3           before    after     delta
Elapsed Time        183.178   182.58    -0.598
User Time           534.19    531.52    -2.67
System Time         32.538    33.37     0.832
Percent CPU         308.8     309       0.2
Context Switches    98484.6   99001     516.4
Sleeps              227347    228752    1405

load -j 16          before    after     delta
Elapsed Time        153.352   147.59    -5.762
User Time           545.829   533.41    -12.419
System Time         34.289    34.85     0.561
Percent CPU         347.6     348       0.4
Context Switches    160518    159120    -1398
Sleeps              240740    240536    -204

- Eric
Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
On 08.08.2017 13:49, Longpeng (Mike) wrote:

> On 2017/8/8 19:25, David Hildenbrand wrote:
>
>> On 08.08.2017 06:05, Longpeng(Mike) wrote:
>>> This is a simple optimization for kvm_vcpu_on_spin; the
>>> main idea is described in patch-1's commit msg.
>>>
>>> [...]
>>
>> I am curious, is there any architecture that allows kvm_vcpu_on_spin(vcpu)
>> to be triggered while _not_ in kernel mode?
>
> IIUC, X86/SVM will trap to the host due to the PAUSE insn no matter whether
> the vcpu is in kernel mode or user mode.
>
>> I would have guessed that user space should never be allowed to make
>> cpu-wide decisions (giving up the CPU to the hypervisor).
>>
>> E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
>> only valid from kernel space.
>
> X86/VMX has "PAUSE exiting" and "PAUSE-loop exiting" (PLE). KVM only uses
> PLE, which is, as you said, "only valid from kernel space".
>
> However, "PAUSE exiting" can cause a user-mode vcpu exit too.

Thanks Longpeng and Christoffer!

--
Thanks,
David
Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
On 2017/8/8 19:25, David Hildenbrand wrote:

> On 08.08.2017 06:05, Longpeng(Mike) wrote:
>> This is a simple optimization for kvm_vcpu_on_spin; the
>> main idea is described in patch-1's commit msg.
>>
>> [...]
>
> I am curious, is there any architecture that allows kvm_vcpu_on_spin(vcpu)
> to be triggered while _not_ in kernel mode?

IIUC, X86/SVM will trap to the host due to the PAUSE insn no matter whether
the vcpu is in kernel mode or user mode.

> I would have guessed that user space should never be allowed to make
> cpu-wide decisions (giving up the CPU to the hypervisor).
>
> E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
> only valid from kernel space.

X86/VMX has "PAUSE exiting" and "PAUSE-loop exiting" (PLE). KVM only uses PLE,
which is, as you said, "only valid from kernel space".

However, "PAUSE exiting" can cause a user-mode vcpu exit too.

> I.o.w. do we need a parameter to kvm_vcpu_on_spin(vcpu) at all, or is
> "me_in_kernel" basically always true?

--
Regards,
Longpeng(Mike)
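The distinction Longpeng is drawing is between two VMX controls: plain "PAUSE
exiting" makes every guest PAUSE cause a VM exit regardless of CPL, while
"PAUSE-loop exiting" (PLE) only fires after a tight PAUSE loop exceeds the PLE
gap/window, and only applies at CPL 0. Below is a rough sketch, not the exact
code from this series, of how a PLE exit handler could forward the in-kernel
hint to the common code; handle_pause(), vmx_get_cpl() and
kvm_skip_emulated_instruction() follow the upstream naming, but the body is
illustrative.

/*
 * Rough sketch (not the exact diff from this series): a VMX PLE exit
 * handler forwarding an "in guest kernel mode" hint to the common code.
 * With PAUSE-loop exiting the guest was at CPL 0 by definition, so the
 * CPL check below is effectively always true for PLE exits; it would
 * only matter if plain PAUSE exiting were used instead.
 */
static int handle_pause(struct kvm_vcpu *vcpu)
{
        kvm_vcpu_on_spin(vcpu, vmx_get_cpl(vcpu) == 0);
        return kvm_skip_emulated_instruction(vcpu);
}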
Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
On Tue, Aug 8, 2017 at 1:25 PM, David Hildenbrand wrote:

> On 08.08.2017 06:05, Longpeng(Mike) wrote:
>> This is a simple optimization for kvm_vcpu_on_spin; the
>> main idea is described in patch-1's commit msg.
>>
>> [...]
>
> I am curious, is there any architecture that allows kvm_vcpu_on_spin(vcpu)
> to be triggered while _not_ in kernel mode?
>
> I would have guessed that user space should never be allowed to make
> cpu-wide decisions (giving up the CPU to the hypervisor).
>
> E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
> only valid from kernel space.
>
> I.o.w. do we need a parameter to kvm_vcpu_on_spin(vcpu) at all, or is
> "me_in_kernel" basically always true?

ARM can be configured to not trap WFE in userspace.

Thanks,
-Christoffer
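Some context for the ARM remark: WFE trapping is under hypervisor control
(HCR_EL2.TWE), and when it is enabled a WFE executed by guest user space traps
just like one from the guest kernel, which is why the handler has to look at
the saved mode. Below is an arm64-flavoured sketch of how the WFx trap path
could pass that hint; vcpu_mode_priv() is the existing KVM/arm helper for the
privilege check, but the handler body here is illustrative rather than the
exact code in patch 4.

/*
 * Illustrative sketch of the arm64 WFx trap path.  With HCR_EL2.TWE set,
 * a guest WFE traps regardless of the exception level the guest was
 * running at, so the saved PSTATE mode decides whether the spinning vCPU
 * counts as "in kernel mode".
 */
static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
{
        if (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
                /* WFE: the guest is spinning, try a directed yield. */
                kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
        } else {
                /* WFI: block until a wakeup event arrives. */
                kvm_vcpu_block(vcpu);
        }

        kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
        return 1;
}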
Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
On 08.08.2017 06:05, Longpeng(Mike) wrote:

> This is a simple optimization for kvm_vcpu_on_spin; the
> main idea is described in patch-1's commit msg.
>
> [...]

I am curious, is there any architecture that allows kvm_vcpu_on_spin(vcpu)
to be triggered while _not_ in kernel mode?

I would have guessed that user space should never be allowed to make
cpu-wide decisions (giving up the CPU to the hypervisor).

E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
only valid from kernel space.

I.o.w. do we need a parameter to kvm_vcpu_on_spin(vcpu) at all, or is
"me_in_kernel" basically always true?

--
Thanks,
David
Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
On 2017/8/8 15:41, Cornelia Huck wrote:

> On Tue, 8 Aug 2017 12:05:31 +0800 "Longpeng(Mike)" wrote:
>
>> This is a simple optimization for kvm_vcpu_on_spin; the
>> main idea is described in patch-1's commit msg.
>
> I think this generally looks good now.
>
>> I did some tests based on the RFC version; the results show
>> that it can improve the performance slightly.
>
> Did you re-run tests on this version?

Hi Cornelia,

I didn't re-run tests on V2. But the major difference between RFC and V2
is that V2 only caches the result for X86 (s390/arm don't need it), and V2
saves an expensive operation (440-1400 cycles on my test machine) for X86/VMX.

So I think V2's performance is at least the same as the RFC's, or even
slightly better. :)

> I would also like to see some s390 numbers; unfortunately I only have a
> z/VM environment and any performance numbers would be nearly useless
> there. Maybe somebody within IBM with a better setup can run a quick
> test?

--
Regards,
Longpeng(Mike)
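The expensive operation mentioned above is presumably the guest-CPL read (a
VMCS/segment access on VMX). The caching idea is to sample the guest mode once,
when the vCPU is scheduled out, so that the common candidate loop can query
every other vCPU cheaply on each PLE exit. A minimal sketch of that shape is
below; the preempted_in_kernel field name is illustrative, not necessarily what
the series uses.

/*
 * Minimal sketch of the x86 caching idea (field name is illustrative):
 * record whether the vCPU was in guest kernel mode at sched-out time so
 * that later kvm_arch_vcpu_in_kernel() queries do not need the expensive
 * CPL read for every directed-yield candidate.
 */
static void record_preempted_mode(struct kvm_vcpu *vcpu)
{
        /* The costly read (kvm_x86_ops->get_cpl) happens once, here. */
        vcpu->arch.preempted_in_kernel = kvm_x86_ops->get_cpl(vcpu) == 0;
}

bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
{
        /* Cheap query used by the common kvm_vcpu_on_spin() loop. */
        return vcpu->arch.preempted_in_kernel;
}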
Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
On Tue, 8 Aug 2017 12:05:31 +0800, "Longpeng(Mike)" wrote:

> This is a simple optimization for kvm_vcpu_on_spin; the
> main idea is described in patch-1's commit msg.

I think this generally looks good now.

> I did some tests based on the RFC version; the results show
> that it can improve the performance slightly.

Did you re-run tests on this version?

I would also like to see some s390 numbers; unfortunately I only have a
z/VM environment and any performance numbers would be nearly useless
there. Maybe somebody within IBM with a better setup can run a quick
test?
[PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
This is a simple optimization for kvm_vcpu_on_spin; the main idea is
described in patch-1's commit msg.

I did some tests based on the RFC version; the results show that it can
improve the performance slightly.

== Geekbench-3.4.1 ==
VM1:         8U, 4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19),
             running Geekbench-3.4.1 *10 runs*
VM2/VM3/VM4: configuration is the same as VM1,
             stress each vcpu's usage (as seen by top in the guest) to 40%

The comparison of each testcase's score (higher is better):

                   before    after     improve
Integer
  single           1176.7    1179.0    0.2%
  multi            3459.5    3426.5    -0.9%
Float
  single           1150.5    1150.9    0.0%
  multi            3364.5    3391.9    0.8%
Memory (stream)
  single           1768.7    1773.1    0.2%
  multi            2511.6    2557.2    1.8%
Overall
  single           1284.2    1286.2    0.2%
  multi            3231.4    3238.4    0.2%

== kernbench-0.42 ==
VM1:         8U, 12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19),
             running "kernbench -n 10"
VM2/VM3/VM4: configuration is the same as VM1,
             stress each vcpu's usage (as seen by top in the guest) to 40%

The comparison of 'Elapsed Time' (lower is better):

                   before    after     improve
load -j4           12.762    12.751    0.1%
load -j32          9.743     8.955     8.1%
load -j            9.688     9.229     4.7%

Physical Machine:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
Stepping:              7
CPU MHz:               2799.902
BogoMIPS:              5004.67
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23

---
Changes since V1:
 - split the implementation of s390 & arm. [David]
 - refactor the impls according to the suggestion. [Paolo]

Changes since RFC:
 - only cache result for X86. [David & Cornelia & Paolo]
 - add performance numbers. [David]
 - impls arm/s390. [Christoffer & David]
 - refactor the impls. [me]

---
Longpeng(Mike) (4):
  KVM: add spinlock optimization framework
  KVM: X86: implement the logic for spinlock optimization
  KVM: s390: implements the kvm_arch_vcpu_in_kernel()
  KVM: arm: implements the kvm_arch_vcpu_in_kernel()

 arch/arm/kvm/handle_exit.c      |  2 +-
 arch/arm64/kvm/handle_exit.c    |  2 +-
 arch/mips/kvm/mips.c            |  6 ++++++
 arch/powerpc/kvm/powerpc.c      |  6 ++++++
 arch/s390/kvm/diag.c            |  2 +-
 arch/s390/kvm/kvm-s390.c        |  6 ++++++
 arch/x86/include/asm/kvm_host.h |  5 +++++
 arch/x86/kvm/hyperv.c           |  2 +-
 arch/x86/kvm/svm.c              | 10 +++++++++-
 arch/x86/kvm/vmx.c              | 16 +++++++++++++++-
 arch/x86/kvm/x86.c              | 11 +++++++++++
 include/linux/kvm_host.h        |  3 ++-
 virt/kvm/arm/arm.c              |  5 +++++
 virt/kvm/kvm_main.c             |  4 +++-
 14 files changed, 72 insertions(+), 8 deletions(-)

--
1.8.3.1
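For reference, a simplified sketch of what the "spinlock optimization
framework" in patch 1 amounts to on the common-code side: kvm_vcpu_on_spin()
gains a hint saying whether the spinning vCPU itself was in kernel mode, and
candidates known to be in guest user mode are skipped, since they cannot hold
the kernel spinlock being waited on. This is a sketch of the idea, not the
actual diff; the real kvm_vcpu_on_spin() keeps its existing boosting
heuristics (last_boosted_vcpu, directed-yield eligibility, and so on).

/*
 * Simplified sketch of the common-code change (not the actual diff).
 */
void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
{
        struct kvm *kvm = me->kvm;
        struct kvm_vcpu *vcpu;
        int i;

        kvm_for_each_vcpu(i, vcpu, kvm) {
                if (vcpu == me)
                        continue;
                if (!READ_ONCE(vcpu->preempted))
                        continue;
                /* New in this series: skip vCPUs sitting in guest user mode. */
                if (yield_to_kernel_mode && !kvm_arch_vcpu_in_kernel(vcpu))
                        continue;
                if (kvm_vcpu_yield_to(vcpu) > 0)
                        break;
        }
}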