On Mon, Feb 3, 2014 at 9:31 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 03/02/2014 18:06, Xin Tong wrote:
>
>>         /.../qemu-system-x86_64 TID 2537 [TID 2537] (877 ticks/71.24%)
>
>
> This is the CPU thread (calls into the KVM modules).

This is the thread that calls kvm_cpu_exec(CPUArchState *env) and issues the KVM_RUN ioctl.
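
For reference, a minimal sketch of what I understand that loop to look like (illustrative only, not the actual QEMU source; the vcpu fd setup and the mmap'ed kvm_run area are omitted):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Simplified sketch of the CPU thread: loop issuing the KVM_RUN ioctl
 * on the vcpu fd and handle the exit reason once the kernel returns
 * to user space.  'run' is the mmap'ed struct kvm_run shared with the
 * kernel. */
static int cpu_thread_loop(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
            if (errno == EINTR) {
                continue;            /* interrupted by a signal */
            }
            return -errno;
        }
        switch (run->exit_reason) {
        case KVM_EXIT_IO:            /* port I/O emulated in this thread */
        case KVM_EXIT_MMIO:          /* MMIO emulated in this thread */
            /* device emulation would go here */
            break;
        default:
            break;
        }
    }
}
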
>
>
>>           /.../vmlinux (395 ticks/45.04%)
>>           /kvm (198 ticks/22.58%)
>>           /kvm_intel (153 ticks/17.45%)
>>           /.../qemu-system-x86_64 (71 ticks/8.10%)
>>           /.../libc-2.15.so (47 ticks/5.36%)
>>           /.../libpthread-2.15.so (12 ticks/1.37%)
>>           /libahci (1 ticks/0.11%)
>>         /.../qemu-system-x86_64 TID 2658 [TID 2658] (200 ticks/16.25%)
>
>
> This is probably the VNC thread.
>
>
>>           /.../vmlinux (190 ticks/95.00%)
>>           /.../libpthread-2.15.so (6 ticks/3.00%)
>>           /.../libc-2.15.so (2 ticks/1.00%)
>>           /.../qemu-system-x86_64 (1 ticks/0.50%)
>>           [vdso] (tgid:2534 range:0x7fff10588000-0x7fff10589fff) (1 ticks/0.50%)
>>         /.../qemu-system-x86_64 TID 2534 [TID 2534] (154 ticks/12.51%)
>
>
> This is the main thread (lowest TID of all).

Is this the thread that executes main_loop_wait() and handles I/O emulation?
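
(To make sure I am asking about the right thing, this is roughly the shape I have in mind -- a generic poll-and-dispatch sketch, illustrative only and not QEMU's actual main_loop_wait(); fd registration, timers and locking are all omitted:)

#include <poll.h>

struct fd_handler {
    int fd;
    void (*read_cb)(void *opaque);   /* called when fd becomes readable */
    void *opaque;
};

/* One iteration of a generic event loop: wait on the registered file
 * descriptors, then dispatch the handler of each fd that became ready.
 * Assumes n <= 64 for brevity. */
static void loop_iteration(struct fd_handler *h, int n, int timeout_ms)
{
    struct pollfd pfds[64];
    for (int i = 0; i < n; i++) {
        pfds[i].fd = h[i].fd;
        pfds[i].events = POLLIN;
    }
    if (poll(pfds, n, timeout_ms) <= 0) {
        return;                      /* timeout or error */
    }
    for (int i = 0; i < n; i++) {
        if (pfds[i].revents & POLLIN) {
            h[i].read_cb(h[i].opaque);
        }
    }
}
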
>
>
>>     /ksmd [PID 53] (83 ticks/5.36%)
>>     /.../top [PID 2617] (43 ticks/2.78%)
>>     /.../unity-2d-shell [PID 1807] (41 ticks/2.65%)
>
>
> Now, going to your questions:
>
>
>> 2. Why is qemu-system-x86_64 only taking 8.10% of the time?  I imagined
>> most of the time would be spent in qemu-system-x86_64, since 403.gcc does
>> not do much I/O.
>
>
> This is the same question I already answered: the time spent running the guest
> is in the context of the qemu-system-x86_64 process, but it is not running code
> from the qemu executable.  In fact, 8.10% spent in qemu-system-x86_64 is a lot.
> I would expect much less.
>
>
>> 3. Why is so much time spent in vmlinux?  The symbols for vmlinux are
>> listed below.
>> 4. Why is so much time spent in kvm-intel and kvm?  The symbols for
>> both are listed below.
>
>
> Again, I think I already answered this.  The time spent in vmx_vcpu_run is
> the actual time spent in the guest (17.45*60.13% = 10% roughly).

I see, so this 10% is the actual useful GCC work done from the
perspective of the guest, and the 8.10% is the time spent running code
from the qemu-system-x86_64 executable itself.
>
> Everything else is overhead introduced by either virtualization or
> profiling.
>
> What SPEC scores are you getting from bare-metal, KVM without oprofile and
> KVM with oprofile?  Is the profile substantially different if you use perf
> instead of oprofile?
>
> Can you run "scripts/kvm/kvm_stat" (from the QEMU tree) while your guest is
> running (under oprofile) and paste the output from that tool?
>

The data below is from a new 403.gcc run on a Haswell Linux machine with
the ref input data set.  The profile was taken over the entire GCC run.
Host Linux kernel (compiled myself): Linux aristotle 3.13.1+ #3 SMP Mon
Feb 3 11:16:00 EST 2014 x86_64 x86_64 x86_64 GNU/Linux


Command used to run kvm:  kvm ~/disks/ubuntu-natty.qcow2  -nographic
Commands used for oprofile:
sudo operf -s --separate-thread --event=CPU_CLK_UNHALTED:5000000 --vmlinux=/home/xtong/xtong-kernel/vmlinux
sudo opreport --image-path /home/xtong/xtong-kernel -l -g -d --xml -o current.opm


Profiled KVM performance (lower is better): 1005 seconds
Non-profiled KVM performance (lower is better): 998 seconds
Non-profiled run on the host machine (lower is better): 686 seconds (so the
virtualization overhead is still quite large, roughly (998 - 686) / 686 ≈ 45%,
in spite of the hardware extensions!)


Running sudo scripts/kvm/kvm_stat gave:
kvm statistics

 kvm_exit                                       105      16
 kvm_entry                                      105      16
 kvm_apic                                        70      10
 kvm_emulate_insn                                35       5
 kvm_eoi                                         35       5
 kvm_mmio                                        35       5
 kvm_inj_virq                                    35       5
 kvm_apic_accept_irq                             35       5


Overall:
      Core0 (53973 ticks/100.00%)
        /.../qemu-system-x86_64 TID 8861 [TID 8861] (38676 ticks/71.66%)
          /.../vmlinux (17426 ticks/45.06%)
          /kvm (7768 ticks/20.08%)
          /kvm_intel (6882 ticks/17.79%)
          /.../qemu-system-x86_64 (3535 ticks/9.14%)
          /.../libc-2.15.so (2384 ticks/6.16%)
          /.../libpthread-2.15.so (537 ticks/1.39%)
          /.../librt-2.15.so (82 ticks/0.21%)
          /.../libglib-2.0.so.0.3200.4 (41 ticks/0.11%)
          [vdso] (tgid:8858 range:0x7fffe31fe000-0x7fffe31fffff) (16 ticks/0.04%)
          /libahci (4 ticks/0.01%)
          /r8169 (1 ticks/0.00%)
        /.../qemu-system-x86_64 TID 8858 [TID 8858] (6745 ticks/12.50%)
        /.../qemu-system-x86_64 TID 8880 [TID 8880] (2945 ticks/5.46%)
        /.../qemu-system-x86_64 TID 8882 [TID 8882] (2835 ticks/5.25%)
        /.../qemu-system-x86_64 TID 8860 [TID 8860] (2772 ticks/5.14%)


vmlinux (45.06%)
CPU_CLK_UNHALTED %     Symbol/Functions
4583             26.30 native_write_msr_safe
2067             11.86 native_read_msr_safe
1148             6.59 __srcu_read_lock
852             4.89 __srcu_read_unlock
758             4.35 native_load_tr_desc
527             3.02 native_load_gdt
492             2.82 fget_light
313             1.80 __set_current_blocked
270             1.55 _raw_spin_lock_irq
198             1.14 recalc_sigpending
194             1.11 system_call
174             1.00 memcpy
164             0.94 rcu_note_context_switch
164             0.94 fput
160             0.92 user_return_notifier_unregister
158             0.91 SyS_ioctl


kvm (20.08%)
CPU_CLK_UNHALTED %     Symbol/Functions
1949             25.09 kvm_arch_vcpu_ioctl_run
421             5.42 kvm_set_shared_msr
394             5.07 paging64_walk_addr_generic
271             3.49 kvm_on_user_return
256             3.30 x86_emulate_instruction
249             3.21 x86_emulate_insn
221             2.85 em_in
212             2.73 x86_decode_insn
156             2.01 init_emulate_ctxt
145             1.87 emulator_pio_in_out.isra.104



kvm-intel (17.79%)
CPU_CLK_UNHALTED %     Symbol/Functions
4006             58.21 vmx_vcpu_run
540             7.85 vmcs_writel
472             6.86 vmx_save_host_state
275             4.00 __vmx_load_host_state.part.43
151             2.19 vmx_handle_exit
149             2.17 vmcs_clear_bits
130             1.89 clear_atomic_switch_msr
109             1.58 vmx_set_interrupt_shadow
109             1.58 vmx_handle_external_intr
102             1.48 vmx_get_rflags


So 58.21% * 17.79% ≈ 10.3% of the time is spent executing code from
403.gcc (i.e. doing useful work), is this correct?  If that is the
case, why don't I see a ~10x performance degradation?  Also, I see a lot
of time spent in native_write_msr_safe and native_read_msr_safe in
vmlinux; what are these for?

Thanks,

Xin

> Thanks,
>
> Paolo
>
>>
>>
>>  /.../vmlinux (395 ticks/45.04%):
>> CPU_CLK_UNHALTED %     Symbol/Functions
>> 82               20.76 native_write_msr_safe
>> 46               11.65 native_read_msr_safe
>> 25               6.33 __srcu_read_lock
>> 16               4.05 native_load_tr_desc
>> 12               3.04 __srcu_read_unlock
>> 9               2.28 native_load_gdt
>> 8               2.03 fget_light
>> 8               2.03 _raw_spin_lock_irq
>> 7               1.77 memcpy
>> 6               1.52 _raw_spin_lock
>> 6               1.52 __set_current_blocked
>> 5               1.27 user_return_notifier_unregister
>> 5               1.27 update_cfs_shares
>> 5               1.27 page_fault
>> 5               1.27 bsearch
>>
>> /kvm (198 ticks/22.58%)
>> CPU_CLK_UNHALTED %     Symbol/Functions
>> 42               21.21 kvm_arch_vcpu_ioctl_run
>> 9               4.55 kvm_set_shared_msr
>> 7               3.54 x86_emulate_insn
>> 7               3.54 paging64_walk_addr_generic
>> 7               3.54 kvm_on_user_return
>> 7               3.54 do_insn_fetch
>> 6               3.03 gfn_to_memslot
>> 6               3.03 decode_operand
>> 5               2.53 kvm_io_bus_sort_cmp
>> 5               2.53 kvm_arch_vcpu_load
>> 5               2.53 kvm_apic_accept_pic_intr
>> 4               2.02 kvm_fetch_guest_virt
>> 4               2.02 kvm_arch_vcpu_runnable
>> 4               2.02 emulator_read_gpr
>> 4               2.02 apic_has_pending_timer
>> 3               1.52 x86_decode_insn
>>
>> /kvm_intel (153 ticks/17.45%)
>> CPU_CLK_UNHALTED %     Symbol/Functions
>> 92               60.13 vmx_vcpu_run
>> 9               5.88 vmx_save_host_state
>> 7               4.58 vmx_handle_external_intr
>> 7               4.58 vmcs_writel
>> 6               3.92 vmx_handle_exit
>> 6               3.92 __vmx_load_host_state.part.43
>> 4               2.61 clear_atomic_switch_msr
>> 3               1.96 vmx_vm_has_apicv
>> 3               1.96 vmx_get_rflags
>> 2               1.31 vmx_read_guest_seg_selector
>> 2               1.31 vmcs_clear_bits
>> 2               1.31 skip_emulated_instruction
>> 1               0.65 vmx_set_rflags
>> 1               0.65 vmx_vcpu_load
>>
>> Thank you,
>> Xin
>>
>>
>>
>> On Mon, Feb 3, 2014 at 2:11 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>>
>>> On 02/02/2014 16:47, Xin Tong wrote:
>>>
>>>> On Sun, Feb 2, 2014 at 5:37 AM, Paolo Bonzini <pbonz...@redhat.com>
>>>> wrote:
>>>>>
>>>>>
>>>>>> On 02/02/2014 03:08, Xin Tong wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am getting very weird profile results by running operf on Linux on
>>>>>> the host and profiling a KVM virtual machine running the DaCapo
>>>>>> eclipse benchmark.  I expected a lot of the time to be spent in
>>>>>> qemu-system-x86_64, as the instructions from the eclipse benchmark
>>>>>> would be attributed to the qemu-system-x86_64 process, but the
>>>>>> results say otherwise.  Any suggestions?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Most of the time should be spent running the guest.  This is in the
>>>>> context
>>>>> of process qemu-system-x86_64, but it is not running code from the qemu
>>>>> executable.
>>>>
>>>>
>>>>
>>>> That is what I was trying to say; you said it better.
>>>>>
>>>>>
>>>>>
>>>>> What likely happens is that when the profiling counter expires, it
>>>>> causes
>>>>> the VM to exit before the profiling interrupt is delivered.  The sample
>>>>> should then be associated to the kvm_intel module.
>>>>
>>>>
>>>>
>>>> Could the kvm module not read the counters and inject a counter
>>>> overflow interrupt into the guest?  What are some ways to make the
>>>> profile more accurate?  This was collected on an Intel Haswell host
>>>> machine.
>>>
>>>
>>>
>>> Yes, it can.  But then you have to run perf/operf in the guest, not in
>>> the host.
>>>
>>> For this to work, you need to specify "-cpu host" on the QEMU command
>>> line.
>>>
>>> Paolo
>>
>