Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-11 Thread Cornelia Huck
On Thu, 10 Aug 2017 09:18:09 -0400
Eric Farman  wrote:

> On 08/08/2017 04:14 AM, Longpeng (Mike) wrote:
> > 
> > 
> > On 2017/8/8 15:41, Cornelia Huck wrote:
> >   
> >> On Tue, 8 Aug 2017 12:05:31 +0800
> >> "Longpeng(Mike)"  wrote:
> >>  
> >>> This is a simple optimization for kvm_vcpu_on_spin, the
> >>> main idea is described in patch-1's commit msg.  
> >>
> >> I think this generally looks good now.
> >>  
> >>>
> >>> I did some tests base on the RFC version, the result shows
> >>> that it can improves the performance slightly.  
> >>
> >> Did you re-run tests on this version?  
> > 
> > 
> > Hi Cornelia,
> > 
> > I didn't re-run tests on V2. But the major difference between RFC and V2
> > is that V2 only cache result for X86 (s390/arm needn't) and V2 saves a
> > expensive operation ( 440-1400 cycles on my test machine ) for X86/VMX.
> > 
> > So I think V2's performance is at least the same as RFC or even slightly
> > better. :)
> >   
> >>
> >> I would also like to see some s390 numbers; unfortunately I only have a
> >> z/VM environment and any performance numbers would be nearly useless
> >> there. Maybe somebody within IBM with a better setup can run a quick
> >> test?  
> 
> Won't swear I didn't screw something up, but here's some quick numbers. 
> Host was 4.12.0 with and without this series, running QEMU 2.10.0-rc0. 
> Created 4 guests, each with 4 CPU (unpinned) and 4GB RAM.  VM1 did full 
> kernel compiles with kernbench, which took averages of 5 runs of 
> different job sizes (I threw away the "-j 1" numbers). VM2-VM4 ran cpu 
> burners on 2 of their 4 cpus.
> 
> Numbers from VM1 kernbench output, and the delta between runs:
> 
> load -j 3 before  after   delta
> Elapsed Time  183.178 182.58  -0.598
> User Time 534.19  531.52  -2.67
> System Time   32.538  33.37   0.832
> Percent CPU   308.8   309 0.2
> Context Switches  98484.6 99001   516.4
> Sleeps227347  228752  1405
> 
> load -j 16before  after   delta
> Elapsed Time  153.352 147.59  -5.762
> User Time 545.829 533.41  -12.419
> System Time   34.289  34.85   0.561
> Percent CPU   347.6   348 0.4
> Context Switches  160518  159120  -1398
> Sleeps240740  240536  -204

Thanks a lot, Eric!

The decreases in elapsed time look nice, and we probably should not
care about the increases reported.


Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-10 Thread Longpeng (Mike)


On 2017/8/10 21:18, Eric Farman wrote:

> 
> 
> On 08/08/2017 04:14 AM, Longpeng (Mike) wrote:
>>
>>
>> On 2017/8/8 15:41, Cornelia Huck wrote:
>>
>>> On Tue, 8 Aug 2017 12:05:31 +0800
>>> "Longpeng(Mike)"  wrote:
>>>
>>>> This is a simple optimization for kvm_vcpu_on_spin, the
>>>> main idea is described in patch-1's commit msg.
>>>
>>> I think this generally looks good now.
>>>

>>>> I did some tests base on the RFC version, the result shows
>>>> that it can improves the performance slightly.
>>>
>>> Did you re-run tests on this version?
>>
>>
>> Hi Cornelia,
>>
>> I didn't re-run tests on V2. But the major difference between RFC and V2
>> is that V2 only cache result for X86 (s390/arm needn't) and V2 saves a
>> expensive operation ( 440-1400 cycles on my test machine ) for X86/VMX.
>>
>> So I think V2's performance is at least the same as RFC or even slightly
>> better. :)
>>
>>>
>>> I would also like to see some s390 numbers; unfortunately I only have a
>>> z/VM environment and any performance numbers would be nearly useless
>>> there. Maybe somebody within IBM with a better setup can run a quick
>>> test?
> 
> Won't swear I didn't screw something up, but here's some quick numbers.
> Host was 4.12.0 with and without this series, running QEMU 2.10.0-rc0.
> Created 4 guests, each with 4 CPU (unpinned) and 4GB RAM.  VM1 did full
> kernel compiles with kernbench, which took averages of 5 runs of different
> job sizes (I threw away the "-j 1" numbers). VM2-VM4 ran cpu burners on 2
> of their 4 cpus.
> 
> Numbers from VM1 kernbench output, and the delta between runs:
> 
> load -j 3         before  after   delta
> Elapsed Time      183.178 182.58  -0.598
> User Time         534.19  531.52  -2.67
> System Time       32.538  33.37   0.832
> Percent CPU       308.8   309     0.2
> Context Switches  98484.6 99001   516.4
> Sleeps            227347  228752  1405
> 
> load -j 16        before  after   delta
> Elapsed Time      153.352 147.59  -5.762
> User Time         545.829 533.41  -12.419
> System Time       34.289  34.85   0.561
> Percent CPU       347.6   348     0.4
> Context Switches  160518  159120  -1398
> Sleeps            240740  240536  -204
> 


Thanks Eric!

The `Elapsed Time` is smaller with this series; the result is consistent with
the numbers in my cover letter.

> 
>  - Eric
> 
> 
> .
> 


-- 
Regards,
Longpeng(Mike)



Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-10 Thread Eric Farman



On 08/08/2017 04:14 AM, Longpeng (Mike) wrote:



> On 2017/8/8 15:41, Cornelia Huck wrote:
>
>> On Tue, 8 Aug 2017 12:05:31 +0800
>> "Longpeng(Mike)"  wrote:
>>
>>> This is a simple optimization for kvm_vcpu_on_spin, the
>>> main idea is described in patch-1's commit msg.
>>
>> I think this generally looks good now.
>>
>>> I did some tests base on the RFC version, the result shows
>>> that it can improves the performance slightly.
>>
>> Did you re-run tests on this version?
>
> Hi Cornelia,
>
> I didn't re-run tests on V2. But the major difference between RFC and V2
> is that V2 only cache result for X86 (s390/arm needn't) and V2 saves a
> expensive operation ( 440-1400 cycles on my test machine ) for X86/VMX.
>
> So I think V2's performance is at least the same as RFC or even slightly
> better. :)
>
>> I would also like to see some s390 numbers; unfortunately I only have a
>> z/VM environment and any performance numbers would be nearly useless
>> there. Maybe somebody within IBM with a better setup can run a quick
>> test?


Won't swear I didn't screw something up, but here's some quick numbers. 
Host was 4.12.0 with and without this series, running QEMU 2.10.0-rc0. 
Created 4 guests, each with 4 CPU (unpinned) and 4GB RAM.  VM1 did full 
kernel compiles with kernbench, which took averages of 5 runs of 
different job sizes (I threw away the "-j 1" numbers). VM2-VM4 ran cpu 
burners on 2 of their 4 cpus.


Numbers from VM1 kernbench output, and the delta between runs:

load -j 3   before  after   delta
Elapsed Time183.178 182.58  -0.598
User Time   534.19  531.52  -2.67
System Time 32.538  33.37   0.832
Percent CPU 308.8   309 0.2
Context Switches98484.6 99001   516.4
Sleeps  227347  228752  1405

load -j 16  before  after   delta
Elapsed Time153.352 147.59  -5.762
User Time   545.829 533.41  -12.419
System Time 34.289  34.85   0.561
Percent CPU 347.6   348 0.4
Context Switches160518  159120  -1398
Sleeps  240740  240536  -204


 - Eric



Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-08 Thread David Hildenbrand
On 08.08.2017 13:49, Longpeng (Mike) wrote:
> 
> 
> On 2017/8/8 19:25, David Hildenbrand wrote:
> 
>> On 08.08.2017 06:05, Longpeng(Mike) wrote:
>>> This is a simple optimization for kvm_vcpu_on_spin, the
>>> main idea is described in patch-1's commit msg.
>>>
>>> I did some tests base on the RFC version, the result shows
>>> that it can improves the performance slightly.
>>>
>>> == Geekbench-3.4.1 ==
>>> VM1:8U,4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>>> running Geekbench-3.4.1 *10 truns*
>>> VM2/VM3/VM4: configure is the same as VM1
>>> stress each vcpu usage(seed by top in guest) to 40%
>>>
>>> The comparison of each testcase's score:
>>> (higher is better)
>>> before  after   improve
>>> Inter
>>>  single 1176.7  1179.0  0.2%
>>>  multi  3459.5  3426.5  -0.9%
>>> Float
>>>  single 1150.5  1150.9  0.0%
>>>  multi  3364.5  3391.9  0.8%
>>> Memory(stream)
>>>  single 1768.7  1773.1  0.2%
>>>  multi  2511.6  2557.2  1.8%
>>> Overall
>>>  single 1284.2  1286.2  0.2%
>>>  multi  3231.4  3238.4  0.2%
>>>
>>>
>>> == kernbench-0.42 ==
>>> VM1:8U,12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>>> running "kernbench -n 10"
>>> VM2/VM3/VM4: configure is the same as VM1
>>> stress each vcpu usage(seed by top in guest) to 40%
>>>
>>> The comparison of 'Elapsed Time':
>>> (sooner is better)
>>> before  after   improve
>>> load -j4    12.762  12.751  0.1%
>>> load -j32   9.743   8.955   8.1%
>>> load -j 9.688   9.229   4.7%
>>>
>>>
>>> Physical Machine:
>>>   Architecture:  x86_64
>>>   CPU op-mode(s):32-bit, 64-bit
>>>   Byte Order:Little Endian
>>>   CPU(s):24
>>>   On-line CPU(s) list:   0-23
>>>   Thread(s) per core:2
>>>   Core(s) per socket:6
>>>   Socket(s): 2
>>>   NUMA node(s):  2
>>>   Vendor ID: GenuineIntel
>>>   CPU family:6
>>>   Model: 45
>>>   Model name:Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
>>>   Stepping:  7
>>>   CPU MHz:   2799.902
>>>   BogoMIPS:  5004.67
>>>   Virtualization:VT-x
>>>   L1d cache: 32K
>>>   L1i cache: 32K
>>>   L2 cache:  256K
>>>   L3 cache:  15360K
>>>   NUMA node0 CPU(s): 0-5,12-17
>>>   NUMA node1 CPU(s): 6-11,18-23
>>>
>>> ---
>>> Changes since V1:
>>>  - split the implementation of s390 & arm. [David]
>>>  - refactor the impls according to the suggestion. [Paolo]
>>>
>>> Changes since RFC:
>>>  - only cache result for X86. [David & Cornlia & Paolo]
>>>  - add performance numbers. [David]
>>>  - impls arm/s390. [Christoffer & David]
>>>  - refactor the impls. [me]
>>>
>>> ---
>>> Longpeng(Mike) (4):
>>>   KVM: add spinlock optimization framework
>>>   KVM: X86: implement the logic for spinlock optimization
>>>   KVM: s390: implements the kvm_arch_vcpu_in_kernel()
>>>   KVM: arm: implements the kvm_arch_vcpu_in_kernel()
>>>
>>>  arch/arm/kvm/handle_exit.c  |  2 +-
>>>  arch/arm64/kvm/handle_exit.c|  2 +-
>>>  arch/mips/kvm/mips.c|  6 ++
>>>  arch/powerpc/kvm/powerpc.c  |  6 ++
>>>  arch/s390/kvm/diag.c|  2 +-
>>>  arch/s390/kvm/kvm-s390.c|  6 ++
>>>  arch/x86/include/asm/kvm_host.h |  5 +
>>>  arch/x86/kvm/hyperv.c   |  2 +-
>>>  arch/x86/kvm/svm.c  | 10 +-
>>>  arch/x86/kvm/vmx.c  | 16 +++-
>>>  arch/x86/kvm/x86.c  | 11 +++
>>>  include/linux/kvm_host.h|  3 ++-
>>>  virt/kvm/arm/arm.c  |  5 +
>>>  virt/kvm/kvm_main.c |  4 +++-
>>>  14 files changed, 72 insertions(+), 8 deletions(-)
>>>
>>
>> I am curious, is there any architecture that allows to trigger
>> kvm_vcpu_on_spin(vcpu); while _not_ in kernel mode?
> 
> 
> IIUC, X86/SVM will trap to host due to PAUSE insn no matter the vcpu is in
> kernel-mode or user-mode.
> 
>>
>> I would have guessed that user space should never be allowed to make cpu
>> wide decisions (giving up the CPU to the hypervisor).
>>
>> E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
>> only valid from kernel space.
> 
> 
> X86/VMX has "PAUSE exiting" and "PAUSE-loop exiting"(PLE). KVM only uses PLE,
> this is as you said "only valid from kernel space"
> 
> However, the "PAUSE exiting" can cause user-mode vcpu exit too.

Thanks Longpeng and Christoffer!

-- 

Thanks,

David


Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-08 Thread Longpeng (Mike)


On 2017/8/8 19:25, David Hildenbrand wrote:

> On 08.08.2017 06:05, Longpeng(Mike) wrote:
>> This is a simple optimization for kvm_vcpu_on_spin, the
>> main idea is described in patch-1's commit msg.
>>
>> I did some tests base on the RFC version, the result shows
>> that it can improves the performance slightly.
>>
>> == Geekbench-3.4.1 ==
>> VM1: 8U,4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>>  running Geekbench-3.4.1 *10 truns*
>> VM2/VM3/VM4: configure is the same as VM1
>>  stress each vcpu usage(seed by top in guest) to 40%
>>
>> The comparison of each testcase's score:
>> (higher is better)
>>  before  after   improve
>> Inter
>>  single  1176.7  1179.0  0.2%
>>  multi   3459.5  3426.5  -0.9%
>> Float
>>  single  1150.5  1150.9  0.0%
>>  multi   3364.5  3391.9  0.8%
>> Memory(stream)
>>  single  1768.7  1773.1  0.2%
>>  multi   2511.6  2557.2  1.8%
>> Overall
>>  single  1284.2  1286.2  0.2%
>>  multi   3231.4  3238.4  0.2%
>>
>>
>> == kernbench-0.42 ==
>> VM1:8U,12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>> running "kernbench -n 10"
>> VM2/VM3/VM4: configure is the same as VM1
>> stress each vcpu usage(seed by top in guest) to 40%
>>
>> The comparison of 'Elapsed Time':
>> (sooner is better)
>>  before  after   improve
>> load -j4 12.762  12.751  0.1%
>> load -j32 9.743   8.955   8.1%
>> load -j  9.688   9.229   4.7%
>>
>>
>> Physical Machine:
>>   Architecture:  x86_64
>>   CPU op-mode(s):32-bit, 64-bit
>>   Byte Order:Little Endian
>>   CPU(s):24
>>   On-line CPU(s) list:   0-23
>>   Thread(s) per core:2
>>   Core(s) per socket:6
>>   Socket(s): 2
>>   NUMA node(s):  2
>>   Vendor ID: GenuineIntel
>>   CPU family:6
>>   Model: 45
>>   Model name:Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
>>   Stepping:  7
>>   CPU MHz:   2799.902
>>   BogoMIPS:  5004.67
>>   Virtualization:VT-x
>>   L1d cache: 32K
>>   L1i cache: 32K
>>   L2 cache:  256K
>>   L3 cache:  15360K
>>   NUMA node0 CPU(s): 0-5,12-17
>>   NUMA node1 CPU(s): 6-11,18-23
>>
>> ---
>> Changes since V1:
>>  - split the implementation of s390 & arm. [David]
>>  - refactor the impls according to the suggestion. [Paolo]
>>
>> Changes since RFC:
>>  - only cache result for X86. [David & Cornlia & Paolo]
>>  - add performance numbers. [David]
>>  - impls arm/s390. [Christoffer & David]
>>  - refactor the impls. [me]
>>
>> ---
>> Longpeng(Mike) (4):
>>   KVM: add spinlock optimization framework
>>   KVM: X86: implement the logic for spinlock optimization
>>   KVM: s390: implements the kvm_arch_vcpu_in_kernel()
>>   KVM: arm: implements the kvm_arch_vcpu_in_kernel()
>>
>>  arch/arm/kvm/handle_exit.c  |  2 +-
>>  arch/arm64/kvm/handle_exit.c|  2 +-
>>  arch/mips/kvm/mips.c|  6 ++
>>  arch/powerpc/kvm/powerpc.c  |  6 ++
>>  arch/s390/kvm/diag.c|  2 +-
>>  arch/s390/kvm/kvm-s390.c|  6 ++
>>  arch/x86/include/asm/kvm_host.h |  5 +
>>  arch/x86/kvm/hyperv.c   |  2 +-
>>  arch/x86/kvm/svm.c  | 10 +-
>>  arch/x86/kvm/vmx.c  | 16 +++-
>>  arch/x86/kvm/x86.c  | 11 +++
>>  include/linux/kvm_host.h|  3 ++-
>>  virt/kvm/arm/arm.c  |  5 +
>>  virt/kvm/kvm_main.c |  4 +++-
>>  14 files changed, 72 insertions(+), 8 deletions(-)
>>
> 
> I am curious, is there any architecture that allows to trigger
> kvm_vcpu_on_spin(vcpu); while _not_ in kernel mode?


IIUC, X86/SVM will trap to the host due to the PAUSE insn no matter whether the
vcpu is in kernel mode or user mode.

> 
> I would have guessed that user space should never be allowed to make cpu
> wide decisions (giving up the CPU to the hypervisor).
> 
> E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
> only valid from kernel space.


X86/VMX has "PAUSE exiting" and "PAUSE-loop exiting"(PLE). KVM only uses PLE,
this is as you said "only valid from kernel space"

However, the "PAUSE exiting" can cause user-mode vcpu exit too.

> 
> I.o.w. do we need a parameter to kvm_vcpu_on_spin(vcpu); at all, or is
> "me_in_kernel" basically always true?
> 
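
(For illustration, here is a small self-contained C sketch of the idea this
sub-thread is discussing. The names and structure are invented for the example
and are not the actual series: the spinning vCPU passes down whether it was
itself in kernel mode, and candidates that are not in kernel mode are skipped
when picking a yield target.)

#include <stdbool.h>
#include <stdio.h>

/* Toy model only; not kernel code. */
struct toy_vcpu {
        int id;
        bool in_kernel;   /* what kvm_arch_vcpu_in_kernel() would report */
        bool preempted;   /* a candidate worth yielding to */
};

/*
 * Pick a yield target the way the series proposes: if the spinner was in
 * kernel mode (e.g. a VMX PLE exit), skip candidates running in user mode,
 * since they cannot be holding the spinlock the spinner is waiting for.
 */
static struct toy_vcpu *pick_yield_target(struct toy_vcpu *vcpus, int n,
                                          int me, bool me_in_kernel)
{
        for (int i = 0; i < n; i++) {
                struct toy_vcpu *v = &vcpus[i];

                if (i == me || !v->preempted)
                        continue;
                if (me_in_kernel && !v->in_kernel)
                        continue;       /* the filter this series adds */
                return v;
        }
        return NULL;
}

int main(void)
{
        struct toy_vcpu vcpus[] = {
                { .id = 0, .in_kernel = true },                     /* the spinner */
                { .id = 1, .in_kernel = false, .preempted = true },
                { .id = 2, .in_kernel = true,  .preempted = true },
        };
        struct toy_vcpu *t = pick_yield_target(vcpus, 3, 0, vcpus[0].in_kernel);

        printf("yield to vcpu %d\n", t ? t->id : -1);   /* prints "yield to vcpu 2" */
        return 0;
}

With the filter in place, the spinner skips vcpu 1 (user mode) and yields to
vcpu 2, which is in kernel mode and therefore might hold the contended lock.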


-- 
Regards,
Longpeng(Mike)



Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-08 Thread Christoffer Dall
On Tue, Aug 8, 2017 at 1:25 PM, David Hildenbrand  wrote:
> On 08.08.2017 06:05, Longpeng(Mike) wrote:
>> This is a simple optimization for kvm_vcpu_on_spin, the
>> main idea is described in patch-1's commit msg.
>>
>> I did some tests base on the RFC version, the result shows
>> that it can improves the performance slightly.
>>
>> == Geekbench-3.4.1 ==
>> VM1:  8U,4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>>   running Geekbench-3.4.1 *10 truns*
>> VM2/VM3/VM4: configure is the same as VM1
>>   stress each vcpu usage(seed by top in guest) to 40%
>>
>> The comparison of each testcase's score:
>> (higher is better)
>>   before  after   improve
>> Inter
>>  single   1176.7  1179.0  0.2%
>>  multi3459.5  3426.5  -0.9%
>> Float
>>  single   1150.5  1150.9  0.0%
>>  multi3364.5  3391.9  0.8%
>> Memory(stream)
>>  single   1768.7  1773.1  0.2%
>>  multi2511.6  2557.2  1.8%
>> Overall
>>  single   1284.2  1286.2  0.2%
>>  multi3231.4  3238.4  0.2%
>>
>>
>> == kernbench-0.42 ==
>> VM1:8U,12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>> running "kernbench -n 10"
>> VM2/VM3/VM4: configure is the same as VM1
>> stress each vcpu usage(seed by top in guest) to 40%
>>
>> The comparison of 'Elapsed Time':
>> (sooner is better)
>>   before  after   improve
>> load -j4  12.762  12.751  0.1%
>> load -j32 9.743   8.955   8.1%
>> load -j   9.688   9.229   4.7%
>>
>>
>> Physical Machine:
>>   Architecture:  x86_64
>>   CPU op-mode(s):32-bit, 64-bit
>>   Byte Order:Little Endian
>>   CPU(s):24
>>   On-line CPU(s) list:   0-23
>>   Thread(s) per core:2
>>   Core(s) per socket:6
>>   Socket(s): 2
>>   NUMA node(s):  2
>>   Vendor ID: GenuineIntel
>>   CPU family:6
>>   Model: 45
>>   Model name:Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
>>   Stepping:  7
>>   CPU MHz:   2799.902
>>   BogoMIPS:  5004.67
>>   Virtualization:VT-x
>>   L1d cache: 32K
>>   L1i cache: 32K
>>   L2 cache:  256K
>>   L3 cache:  15360K
>>   NUMA node0 CPU(s): 0-5,12-17
>>   NUMA node1 CPU(s): 6-11,18-23
>>
>> ---
>> Changes since V1:
>>  - split the implementation of s390 & arm. [David]
>>  - refactor the impls according to the suggestion. [Paolo]
>>
>> Changes since RFC:
>>  - only cache result for X86. [David & Cornlia & Paolo]
>>  - add performance numbers. [David]
>>  - impls arm/s390. [Christoffer & David]
>>  - refactor the impls. [me]
>>
>> ---
>> Longpeng(Mike) (4):
>>   KVM: add spinlock optimization framework
>>   KVM: X86: implement the logic for spinlock optimization
>>   KVM: s390: implements the kvm_arch_vcpu_in_kernel()
>>   KVM: arm: implements the kvm_arch_vcpu_in_kernel()
>>
>>  arch/arm/kvm/handle_exit.c  |  2 +-
>>  arch/arm64/kvm/handle_exit.c|  2 +-
>>  arch/mips/kvm/mips.c|  6 ++
>>  arch/powerpc/kvm/powerpc.c  |  6 ++
>>  arch/s390/kvm/diag.c|  2 +-
>>  arch/s390/kvm/kvm-s390.c|  6 ++
>>  arch/x86/include/asm/kvm_host.h |  5 +
>>  arch/x86/kvm/hyperv.c   |  2 +-
>>  arch/x86/kvm/svm.c  | 10 +-
>>  arch/x86/kvm/vmx.c  | 16 +++-
>>  arch/x86/kvm/x86.c  | 11 +++
>>  include/linux/kvm_host.h|  3 ++-
>>  virt/kvm/arm/arm.c  |  5 +
>>  virt/kvm/kvm_main.c |  4 +++-
>>  14 files changed, 72 insertions(+), 8 deletions(-)
>>
>
> I am curious, is there any architecture that allows to trigger
> kvm_vcpu_on_spin(vcpu); while _not_ in kernel mode?
>
> I would have guessed that user space should never be allowed to make cpu
> wide decisions (giving up the CPU to the hypervisor).
>
> E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
> only valid from kernel space.
>
> I.o.w. do we need a parameter to kvm_vcpu_on_spin(vcpu); at all, or is
> "me_in_kernel" basically always true?
>
ARM can be configured to not trap WFE in userspace.

Thanks,
-Christoffer


Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-08 Thread David Hildenbrand
On 08.08.2017 06:05, Longpeng(Mike) wrote:
> This is a simple optimization for kvm_vcpu_on_spin, the
> main idea is described in patch-1's commit msg.
> 
> I did some tests base on the RFC version, the result shows
> that it can improves the performance slightly.
> 
> == Geekbench-3.4.1 ==
> VM1:  8U,4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>   running Geekbench-3.4.1 *10 truns*
> VM2/VM3/VM4: configure is the same as VM1
>   stress each vcpu usage(seed by top in guest) to 40%
> 
> The comparison of each testcase's score:
> (higher is better)
>   before  after   improve
> Inter
>  single   1176.7  1179.0  0.2%
>  multi3459.5  3426.5  -0.9%
> Float
>  single   1150.5  1150.9  0.0%
>  multi3364.5  3391.9  0.8%
> Memory(stream)
>  single   1768.7  1773.1  0.2%
>  multi2511.6  2557.2  1.8%
> Overall
>  single   1284.2  1286.2  0.2%
>  multi3231.4  3238.4  0.2%
> 
> 
> == kernbench-0.42 ==
> VM1:8U,12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
> running "kernbench -n 10"
> VM2/VM3/VM4: configure is the same as VM1
> stress each vcpu usage(seed by top in guest) to 40%
> 
> The comparison of 'Elapsed Time':
> (sooner is better)
>   before  after   improve
> load -j4  12.762  12.751  0.1%
> load -j32 9.743   8.955   8.1%
> load -j   9.688   9.229   4.7%
> 
> 
> Physical Machine:
>   Architecture:  x86_64
>   CPU op-mode(s):32-bit, 64-bit
>   Byte Order:Little Endian
>   CPU(s):24
>   On-line CPU(s) list:   0-23
>   Thread(s) per core:2
>   Core(s) per socket:6
>   Socket(s): 2
>   NUMA node(s):  2
>   Vendor ID: GenuineIntel
>   CPU family:6
>   Model: 45
>   Model name:Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
>   Stepping:  7
>   CPU MHz:   2799.902
>   BogoMIPS:  5004.67
>   Virtualization:VT-x
>   L1d cache: 32K
>   L1i cache: 32K
>   L2 cache:  256K
>   L3 cache:  15360K
>   NUMA node0 CPU(s): 0-5,12-17
>   NUMA node1 CPU(s): 6-11,18-23
> 
> ---
> Changes since V1:
>  - split the implementation of s390 & arm. [David]
>  - refactor the impls according to the suggestion. [Paolo]
> 
> Changes since RFC:
>  - only cache result for X86. [David & Cornlia & Paolo]
>  - add performance numbers. [David]
>  - impls arm/s390. [Christoffer & David]
>  - refactor the impls. [me]
> 
> ---
> Longpeng(Mike) (4):
>   KVM: add spinlock optimization framework
>   KVM: X86: implement the logic for spinlock optimization
>   KVM: s390: implements the kvm_arch_vcpu_in_kernel()
>   KVM: arm: implements the kvm_arch_vcpu_in_kernel()
> 
>  arch/arm/kvm/handle_exit.c  |  2 +-
>  arch/arm64/kvm/handle_exit.c|  2 +-
>  arch/mips/kvm/mips.c|  6 ++
>  arch/powerpc/kvm/powerpc.c  |  6 ++
>  arch/s390/kvm/diag.c|  2 +-
>  arch/s390/kvm/kvm-s390.c|  6 ++
>  arch/x86/include/asm/kvm_host.h |  5 +
>  arch/x86/kvm/hyperv.c   |  2 +-
>  arch/x86/kvm/svm.c  | 10 +-
>  arch/x86/kvm/vmx.c  | 16 +++-
>  arch/x86/kvm/x86.c  | 11 +++
>  include/linux/kvm_host.h|  3 ++-
>  virt/kvm/arm/arm.c  |  5 +
>  virt/kvm/kvm_main.c |  4 +++-
>  14 files changed, 72 insertions(+), 8 deletions(-)
> 

I am curious: is there any architecture that allows triggering
kvm_vcpu_on_spin(vcpu) while _not_ in kernel mode?

I would have guessed that user space should never be allowed to make
CPU-wide decisions (giving up the CPU to the hypervisor).

E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
only valid from kernel space.

I.o.w. do we need a parameter to kvm_vcpu_on_spin(vcpu) at all, or is
"me_in_kernel" basically always true?

-- 

Thanks,

David


Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-08 Thread Longpeng (Mike)


On 2017/8/8 15:41, Cornelia Huck wrote:

> On Tue, 8 Aug 2017 12:05:31 +0800
> "Longpeng(Mike)"  wrote:
> 
>> This is a simple optimization for kvm_vcpu_on_spin, the
>> main idea is described in patch-1's commit msg.
> 
> I think this generally looks good now.
> 
>>
>> I did some tests base on the RFC version, the result shows
>> that it can improves the performance slightly.
> 
> Did you re-run tests on this version?


Hi Cornelia,

I didn't re-run the tests on V2. But the major difference between RFC and V2
is that V2 only caches the result for X86 (s390/arm don't need it) and V2 saves
an expensive operation (440-1400 cycles on my test machine) for X86/VMX.

So I think V2's performance is at least the same as the RFC's, or even slightly
better. :)
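
(As a rough illustration of what "cache result for X86" means here, the
self-contained sketch below is invented for the example and is not the actual
patch: the expensive read of the guest's privilege level is done at most once
per exit, and later queries reuse the cached value.)

#include <stdbool.h>
#include <stdio.h>

/* Toy illustration; field and function names are made up. */
struct toy_vcpu {
        bool cpl_valid;      /* would be cleared on every VM exit */
        bool cpl_is_kernel;  /* cached answer */
};

/* Stand-in for the expensive operation (e.g. a VMCS read of the guest's CPL,
 * reported above as 440-1400 cycles on the author's machine). */
static bool read_guest_cpl_is_kernel(void)
{
        puts("expensive guest-state read");
        return true;
}

static bool toy_vcpu_in_kernel(struct toy_vcpu *v)
{
        if (!v->cpl_valid) {
                v->cpl_is_kernel = read_guest_cpl_is_kernel();
                v->cpl_valid = true;
        }
        return v->cpl_is_kernel;        /* later callers hit the cache */
}

int main(void)
{
        struct toy_vcpu v = { 0 };

        toy_vcpu_in_kernel(&v);         /* pays the cost once */
        toy_vcpu_in_kernel(&v);         /* served from the cache */
        return 0;
}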

> 
> I would also like to see some s390 numbers; unfortunately I only have a
> z/VM environment and any performance numbers would be nearly useless
> there. Maybe somebody within IBM with a better setup can run a quick
> test?
> 
> .
> 


-- 
Regards,
Longpeng(Mike)



Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-08 Thread Cornelia Huck
On Tue, 8 Aug 2017 12:05:31 +0800
"Longpeng(Mike)"  wrote:

> This is a simple optimization for kvm_vcpu_on_spin, the
> main idea is described in patch-1's commit msg.

I think this generally looks good now.

> 
> I did some tests base on the RFC version, the result shows
> that it can improves the performance slightly.

Did you re-run tests on this version?

I would also like to see some s390 numbers; unfortunately I only have a
z/VM environment and any performance numbers would be nearly useless
there. Maybe somebody within IBM with a better setup can run a quick
test?


[PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

2017-08-07 Thread Longpeng(Mike)
This is a simple optimization for kvm_vcpu_on_spin; the main idea is
described in patch 1's commit message.

I did some tests based on the RFC version; the results show that it
improves performance slightly.

== Geekbench-3.4.1 ==
VM1: 8U,4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
running Geekbench-3.4.1 *10 runs*
VM2/VM3/VM4: configuration is the same as VM1
stress each vcpu usage (seen by top in the guest) to 40%

The comparison of each testcase's score:
(higher is better)
before  after   improve
Integer
 single 1176.7  1179.0  0.2%
 multi  3459.5  3426.5  -0.9%
Float
 single 1150.5  1150.9  0.0%
 multi  3364.5  3391.9  0.8%
Memory(stream)
 single 1768.7  1773.1  0.2%
 multi  2511.6  2557.2  1.8%
Overall
 single 1284.2  1286.2  0.2%
 multi  3231.4  3238.4  0.2%


== kernbench-0.42 ==
VM1: 8U,12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
running "kernbench -n 10"
VM2/VM3/VM4: configuration is the same as VM1
stress each vcpu usage (seen by top in the guest) to 40%

The comparison of 'Elapsed Time':
(lower is better)
before  after   improve
load -j4    12.762  12.751  0.1%
load -j32    9.743   8.955  8.1%
load -j      9.688   9.229  4.7%
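
(For reference, the "improve" column appears to be the relative change against
the "before" value; e.g. for load -j32, (9.743 - 8.955) / 9.743 is about 8.1%.)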


Physical Machine:
  Architecture:          x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
  CPU(s):                24
  On-line CPU(s) list:   0-23
  Thread(s) per core:    2
  Core(s) per socket:    6
  Socket(s):             2
  NUMA node(s):          2
  Vendor ID:             GenuineIntel
  CPU family:            6
  Model:                 45
  Model name:            Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
  Stepping:              7
  CPU MHz:               2799.902
  BogoMIPS:              5004.67
  Virtualization:        VT-x
  L1d cache:             32K
  L1i cache:             32K
  L2 cache:              256K
  L3 cache:              15360K
  NUMA node0 CPU(s):     0-5,12-17
  NUMA node1 CPU(s):     6-11,18-23

---
Changes since V1:
 - split the implementation of s390 & arm. [David]
 - refactor the impls according to the suggestion. [Paolo]

Changes since RFC:
 - only cache the result for X86. [David & Cornelia & Paolo]
 - add performance numbers. [David]
 - impls arm/s390. [Christoffer & David]
 - refactor the impls. [me]

---
Longpeng(Mike) (4):
  KVM: add spinlock optimization framework
  KVM: X86: implement the logic for spinlock optimization
  KVM: s390: implements the kvm_arch_vcpu_in_kernel()
  KVM: arm: implements the kvm_arch_vcpu_in_kernel()

 arch/arm/kvm/handle_exit.c      |  2 +-
 arch/arm64/kvm/handle_exit.c    |  2 +-
 arch/mips/kvm/mips.c            |  6 ++
 arch/powerpc/kvm/powerpc.c      |  6 ++
 arch/s390/kvm/diag.c            |  2 +-
 arch/s390/kvm/kvm-s390.c        |  6 ++
 arch/x86/include/asm/kvm_host.h |  5 +
 arch/x86/kvm/hyperv.c           |  2 +-
 arch/x86/kvm/svm.c              | 10 +-
 arch/x86/kvm/vmx.c              | 16 +++-
 arch/x86/kvm/x86.c              | 11 +++
 include/linux/kvm_host.h        |  3 ++-
 virt/kvm/arm/arm.c              |  5 +
 virt/kvm/kvm_main.c             |  4 +++-
 14 files changed, 72 insertions(+), 8 deletions(-)

-- 
1.8.3.1



